This patch is to add sockopt SCTP_ENABLE_STREAM_RESET to get/set
strreset_enable to indicate which reconf request type it supports,
which is described in rfc6525 section 6.3.1.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is to add reconf_enable field in all of asoc ep and netns
to indicate if they support stream reset.
When initializing, asoc reconf_enable get the default value from ep
reconf_enable which is from netns netns reconf_enable by default.
It is also to add reconf_capable in asoc peer part to know if peer
supports reconf_enable, the value is set if ext params have reconf
chunk support when processing init chunk, just as rfc6525 section
5.1.1 demands.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is to add a primitive based on sctp primitive frame for
sending stream reconf request. It works as the other primitives,
and create a SCTP_CMD_REPLY command to send the request chunk out.
sctp_primitive_RECONF would be the api to send a reconf request
chunk.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is to add a per transport timer based on sctp timer frame
for stream reconf chunk retransmission. It would start after sending
a reconf request chunk, and stop after receiving the response chunk.
If the timer expires, besides retransmitting the reconf request chunk,
it would also do the same thing with data RTO timer. like to increase
the appropriate error counts, and perform threshold management, possibly
destroying the asoc if sctp retransmission thresholds are exceeded, just
as section 5.1.1 describes.
This patch is also to add asoc strreset_chunk, it is used to save the
reconf request chunk, so that it can be retransmitted, and to check if
the response is really for this request by comparing the information
inside with the response chunk as well.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is to add asoc strreset_outseq and strreset_inseq for
saving the reconf request sequence, initialize them when create
assoc and process init, and also to define Incoming and Outgoing
SSN Reset Request Parameter described in rfc6525 section 4.1 and
4.2, As they can be in one same chunk as section rfc6525 3.1-3
describes, it makes them in one function.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we have non reuseport sockets on a tb we will set tb->fastreuseport to 0 and
never set it again. Which means that in the future if we end up adding a bunch
of reuseport sk's to that tb we'll have to do the expensive scan every time.
Instead add the ipv4/ipv6 saddr fields to the bind bucket, as well as the family
so we know what comparison to make, and the ipv6 only setting so we can make
sure to compare with new sockets appropriately. Once one sk has made it onto
the list we know that there are no potential bind conflicts on the owners list
that match that sk's rcv_addr. So copy the sk's information into our bind
bucket and set tb->fastruseport to FASTREUSESOCK_STRICT so we know we have to do
an extra check for subsequent reuseport sockets and skip the expensive bind
conflict check.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In inet_csk_get_port we seem to be using smallest_port to figure out where the
best place to look for a SO_REUSEPORT sk that matches with an existing set of
SO_REUSEPORT's. However if we get to the logic
if (smallest_size != -1) {
port = smallest_port;
goto have_port;
}
we will do a useless search, because we would have already done the
inet_csk_bind_conflict for that port and it would have returned 1, otherwise we
would have gone to found_tb and succeeded. Since this logic makes us do yet
another trip through inet_csk_bind_conflict for a port we know won't work just
delete this code and save us the time.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The only difference between inet6_csk_bind_conflict and inet_csk_bind_conflict
is how they check the rcv_saddr, so delete this call back and simply
change inet_csk_bind_conflict to call inet_rcv_saddr_equal.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We pass these per-protocol equal functions around in various places, but
we can just have one function that checks the sk->sk_family and then do
the right comparison function. I've also changed the ipv4 version to
not cast to inet_sock since it is unneeded.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the functionality for including address-family-specific per-link
stats in RTM_GETSTATS messages. This is done through adding a new
IFLA_STATS_AF_SPEC attribute under which address family attributes are
nested and then the AF-specific attributes can be further nested. This
follows the model of IFLA_AF_SPEC on RTM_*LINK messages and it has the
advantage of presenting an easily extended hierarchy. The rtnl_af_ops
structure is extended to provide AFs with the opportunity to fill and
provide the size of their stats attributes.
One alternative would have been to provide AFs with the ability to add
attributes directly into the RTM_GETSTATS message without a nested
hierarchy. I discounted this approach as it increases the rate at
which the 32 attribute number space is used up and it makes
implementation a little more tricky for stats dump resuming (at the
moment the order in which attributes are added to the message has to
match the numeric order of the attributes).
Another alternative would have been to register per-AF RTM_GETSTATS
handlers. I discounted this approach as I perceived a common use-case
to be getting all the stats for an interface and this approach would
necessitate multiple requests/dumps to retrieve them all.
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch tries to avoid skb_cow_data on esp4.
On the encrypt side we add the IPsec tailbits
to the linear part of the buffer if there is
space on it. If there is no space on the linear
part, we add a page fragment with the tailbits to
the buffer and use separate src and dst scatterlists.
On the decrypt side, we leave the buffer as it is
if it is not cloned.
With this, we can avoid a linearization of the buffer
in most of the cases.
Joint work with:
Sowmini Varadhan <sowmini.varadhan@oracle.com>
Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Currently, we check the existing rtable in PREROUTING hook, if RTCF_LOCAL
is set, we assume that the packet is loopback.
But this assumption is incorrect, for example, a packet encapsulated
in ipsec transport mode was received and routed to local, after
decapsulation, it would be delivered to local again, and the rtable
was not dropped, so RTCF_LOCAL check would trigger. But actually, the
packet was not loopback.
So for these normal loopback packets, we can check whether the in device
is IFF_LOOPBACK or not. For these locally generated broadcast/multicast,
we can check whether the skb->pkt_type is PACKET_LOOPBACK or not.
Finally, there's a subtle difference between nft fib expr and xtables
rpfilter extension, user can add the following nft rule to do strict
rpfilter check:
# nft add rule x y meta iif eth0 fib saddr . iif oif != eth0 drop
So when the packet is loopback, it's better to store the in device
instead of the LOOPBACK_IFINDEX, otherwise, after adding the above
nft rule, locally generated broad/multicast packets will be dropped
incorrectly.
Fixes: f83a7ea207 ("netfilter: xt_rpfilter: skip locally generated broadcast/multicast, too")
Fixes: f6d0cbcf09 ("netfilter: nf_tables: add fib expression")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* socket owner support for connections, so when the wifi
manager (e.g. wpa_supplicant) is killed, connections are
torn down - wpa_supplicant is critical to managing certain
operations, and can opt in to this where applicable
* minstrel & minstrel_ht updates to be more efficient (time and space)
* set wifi_acked/wifi_acked_valid for skb->destructor use in the
kernel, which was already available to userspace
* don't indicate new mesh peers that might be used if there's no
room to add them
* multicast-to-unicast support in mac80211, for better medium usage
(since unicast frames can use *much* higher rates, by ~3 orders of
magnitude)
* add API to read channel (frequency) limitations from DT
* add infrastructure to allow randomizing public action frames for
MAC address privacy (still requires driver support)
* many cleanups and small improvements/fixes across the board
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJYeKu7AAoJEGt7eEactAAdwjEP/RA4bXFMfkC7qUJ++cLrMMwY
yCvjb8+ULWL2wbCzpfY37acbGJgot3DNoQJzrO2jMQPqyM9nRlTMg5aF49cI7t62
gU6daNKJaGBe/0yeG7lTJ4n5UtVCDtN45hGc06Yert+ewb9njiJf+XYrtCWetsIJ
5bOLYQKPWOz/7UyMH7uJ25zrPFaiA3y7XnXKPEudagG/EwEq9ZuUpSSfLwEAEBPi
6i/2w4fLj32vXRsQMvQT0sU6mjd+1ub8Is7w5l2F06iWwNYPzdSM0IbU+E+ie2tk
sE6RA70c4ILrp8KisTAz2lJPa4XEpFkLhI3lzRRy8CVzjyyo/OJen92zvr2R7TVb
/uZG9qfRQ3UitQmgeKd+wS8PsbRAyWUR/xhNxD2r7zARH2vliwyneU+zEpXLeGA1
Y4PrN1+Fk45Ye4/4XSbPO4cf1MHX7qinN4rjrpsJKPwoYD/gQ1cZvef4AbaKPvq6
oCKRVrwNoUuSB8NTcMLPqze3WCfhnJyVUhCZTyzHeW4uG81qrHwrvBvM25vcWGcm
CcSWFktFIpuGML4FCU3byZfb0NkmJtpCD4n7P98WFPGjvsWIEVCMckqlC8x1F7B7
BqqjGS2mGA17Xy0uLfmN/JempesQJnZhnAnFERdyX1S1YQuKhLwEu7OsYegnStDL
Cn1wFw2/qcgeTkJfBICB
=UToW
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2017-01-13' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
For 4.11, we seem to have more than in the past few releases:
* socket owner support for connections, so when the wifi
manager (e.g. wpa_supplicant) is killed, connections are
torn down - wpa_supplicant is critical to managing certain
operations, and can opt in to this where applicable
* minstrel & minstrel_ht updates to be more efficient (time and space)
* set wifi_acked/wifi_acked_valid for skb->destructor use in the
kernel, which was already available to userspace
* don't indicate new mesh peers that might be used if there's no
room to add them
* multicast-to-unicast support in mac80211, for better medium usage
(since unicast frames can use *much* higher rates, by ~3 orders of
magnitude)
* add API to read channel (frequency) limitations from DT
* add infrastructure to allow randomizing public action frames for
MAC address privacy (still requires driver support)
* many cleanups and small improvements/fixes across the board
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we need to change the implementation, stop exposing internals.
Provide kref_read() to read the current reference count; typically
used for debug messages.
Kills two anti-patterns:
atomic_read(&kref->refcount)
kref->refcount.counter
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch removes the support of RFC5827 early retransmit (i.e.,
fast recovery on small inflight with <3 dupacks) because it is
subsumed by the new RACK loss detection. More specifically when
RACK receives DUPACKs, it'll arm a reordering timer to start fast
recovery after a quarter of (min)RTT, hence it covers the early
retransmit except RACK does not limit itself to specific inflight
or dupack numbers.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes two things:
1. Start fast recovery with RACK in addition to other heuristics
(e.g., DUPACK threshold, FACK). Prior to this change RACK
is enabled to detect losses only after the recovery has
started by other algorithms.
2. Disable TCP early retransmit. RACK subsumes the early retransmit
with the new reordering timer feature. A latter patch in this
series removes the early retransmit code.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The packets inside a jumbo skb (e.g., TSO) share the same skb
timestamp, even though they are sent sequentially on the wire. Since
RACK is based on time, it can not detect some packets inside the
same skb are lost. However, we can leverage the packet sequence
numbers as extended timestamps to detect losses. Therefore, when
RACK timestamp is identical to skb's timestamp (i.e., one of the
packets of the skb is acked or sacked), we use the sequence numbers
of the acked and unacked packets to break ties.
We can use the same sequence logic to advance RACK xmit time as
well to detect more losses and avoid timeout.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes RACK install a reordering timer when it suspects
some packets might be lost, but wants to delay the decision
a little bit to accomodate reordering.
It does not create a new timer but instead repurposes the existing
RTO timer, because both are meant to retransmit packets.
Specifically it arms a timer ICSK_TIME_REO_TIMEOUT when
the RACK timing check fails. The wait time is set to
RACK.RTT + RACK.reo_wnd - (NOW - Packet.xmit_time) + fudge
This translates to expecting a packet (Packet) should take
(RACK.RTT + RACK.reo_wnd + fudge) to deliver after it was sent.
When there are multiple packets that need a timer, we use one timer
with the maximum timeout. Therefore the timer conservatively uses
the maximum window to expire N packets by one timeout, instead of
N timeouts to expire N packets sent at different times.
The fudge factor is 2 jiffies to ensure when the timer fires, all
the suspected packets would exceed the deadline and be marked lost
by tcp_rack_detect_loss(). It has to be at least 1 jiffy because the
clock may tick between calling icsk_reset_xmit_timer(timeout) and
actually hang the timer. The next jiffy is to lower-bound the timeout
to 2 jiffies when reo_wnd is < 1ms.
When the reordering timer fires (tcp_rack_reo_timeout): If we aren't
in Recovery we'll enter fast recovery and force fast retransmit.
This is very similar to the early retransmit (RFC5827) except RACK
is not constrained to only enter recovery for small outstanding
flights.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Record the most recent RTT in RACK. It is often identical to the
"ca_rtt_us" values in tcp_clean_rtx_queue. But when the packet has
been retransmitted, RACK choses to believe the ACK is for the
(latest) retransmitted packet if the RTT is over minimum RTT.
This requires passing the arrival time of the most recent ACK to
RACK routines. The timestamp is now recorded in the "ack_time"
in tcp_sacktag_state during the ACK processing.
This patch does not change the RACK algorithm itself. It only adds
the RTT variable to prepare the next main patch.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create a new helper tcp_rack_detect_loss to prepare the upcoming
RACK reordering timer patch.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function documentation for cfg80211_connect_bss() and
cfg80211_connect_result() was still claiming that they are used only for
a success case while these functions can now be used to report both
success and various failure cases. The actual use cases were already
described in the connect() documentation.
Update the function specific comments to note the failure cases and also
describe how the special status == -1 case is used in
cfg80211_connect_bss() to indicate a connection timeout based on the
internal implementation in cfg80211_connect_timeout().
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
[use tabs for indentation]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This enhances the connect timeout API to also carry the reason for the
timeout. These reason codes for the connect time out are represented by
enum nl80211_timeout_reason and are passed to user space through a new
attribute NL80211_ATTR_TIMEOUT_REASON (u32).
Signed-off-by: Purushottam Kushwaha <pkushwah@qti.qualcomm.com>
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
[keep gfp_t argument last]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Enhance sched scan to support option of finding a better BSS while in
connected state. Firmware scans the medium and reports when it finds a
known BSS which has better RSSI than the current connected BSS. New
attributes to specify the relative RSSI (compared to the current BSS)
are added to the sched scan to implement this.
Signed-off-by: vamsi krishna <vamsin@qti.qualcomm.com>
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
With 78, 111 and 85 bytes respectively (on x86-64), the
functions iwe_stream_add_event(), iwe_stream_add_point()
and iwe_stream_add_value() really shouldn't be inlines.
It appears that at least my compiler already decided
the same, and created a single instance of each one
of them for each file using it, but that's still a
number of instances in the system overall, which this
reduces.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add a flag that indicates that the WEP ICV was stripped from an
RX packet, allowing the device to not transfer that if it's
already checked.
Signed-off-by: David Spinadel <david.spinadel@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
gcc-7 complains that wl3501_cs passes NULL into a function that
then uses the argument as the input for memcpy:
drivers/net/wireless/wl3501_cs.c: In function 'wl3501_get_scan':
include/net/iw_handler.h:559:3: error: argument 2 null where non-null expected [-Werror=nonnull]
memcpy(stream + point_len, extra, iwe->u.data.length);
This works fine here because iwe->u.data.length is guaranteed to be 0
and the memcpy doesn't actually have an effect.
Making the length check explicit avoids the warning and should have
no other effect here.
Also check the pointer itself, since otherwise we get warnings
elsewhere in the code.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Allow dissection of (R)ARP operation hardware and protocol addresses
for Ethernet hardware and IPv4 protocol addresses.
There are currently no users of FLOW_DISSECTOR_KEY_ARP.
A follow-up patch will allow FLOW_DISSECTOR_KEY_ARP to be used by the
flower classifier.
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
xfrm_init_tempstate is always called from within rcu read side section.
We can thus use a simpler function that doesn't call rcu_read_lock
again.
While at it, also make xfrm_init_tempstate return value void, the
return value was never tested.
A followup patch will replace remaining callers of xfrm_state_get_afinfo
with xfrm_state_afinfo_get_rcu variant and then remove the 'old'
get_afinfo interface.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Support for SMC socket monitoring via netlink sockets of protocol
NETLINK_SOCK_DIAG.
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Direct call of tcp_set_keepalive() function from protocol-agnostic
sock_setsockopt() function in net/core/sock.c violates network
layering. And newly introduced protocol (SMC-R) will need its own
keepalive function. Therefore, add "keepalive" function pointer
to "struct proto", and call it from sock_setsockopt() via this pointer.
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Reviewed-by: Utz Bacher <utz.bacher@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we have properly encapsulated and made drivers utilize exported
functions, we can switch dsa_switch_ops to be a annotated with const.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for making struct dsa_switch_ops const, encapsulate it
within a dsa_switch_driver which has a list pointer and a pointer to
dsa_switch_ops. This allows us to take the list_head pointer out of
dsa_switch_ops, which is written to by {un,}register_switch_driver.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Disconnect or deauthenticate when the owning socket is closed if this
flag is supplied to CMD_CONNECT or CMD_ASSOCIATE. This may be used
to ensure userspace daemon doesn't leave an unmanaged connection behind.
In some situations it would be possible to account for that, to some
degree, in the deamon restart code or in the up/down scripts without
the use of this attribute. But there will be systems where the daemon
can go away for varying periods without a warning due to local resource
management.
Signed-off-by: Andrew Zaborowski <andrew.zaborowski@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The tc_from field fulfills two roles. It encodes whether a packet was
redirected by an act_mirred device and, if so, whether act_mirred was
called on ingress or egress. Split it into separate fields.
The information is needed by the special IFB loop, where packets are
taken out of the normal path by act_mirred, forwarded to IFB, then
reinjected at their original location (ingress or egress) by IFB.
The IFB device cannot use skb->tc_at_ingress, because that may have
been overwritten as the packet travels from act_mirred to ifb_xmit,
when it passes through tc_classify on the IFB egress path. Cache this
value in skb->tc_from_ingress.
That field is valid only if a packet arriving at ifb_xmit came from
act_mirred. Other packets can be crafted to reach ifb_xmit. These
must be dropped. Set tc_redirected on redirection and drop all packets
that do not have this bit set.
Both fields are set only on cloned skbs in tc actions, so original
packet sources do not have to clear the bit when reusing packets
(notably, pktgen and octeon).
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Field tc_at is used only within tc actions to distinguish ingress from
egress processing. A single bit is sufficient for this purpose.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extract the remaining two fields from tc_verd and remove the __u16
completely. TC_AT and TC_FROM are converted to equivalent two-bit
integer fields tc_at and tc_from. Where possible, use existing
helper skb_at_tc_ingress when reading tc_at. Introduce helper
skb_reset_tc to clear fields.
Not documenting tc_from and tc_at, because they will be replaced
with single bit fields in follow-on patches.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Packets sent by the IFB device skip subsequent tc classification.
A single bit governs this state. Move it out of tc_verd in
anticipation of removing that __u16 completely.
The new bitfield tc_skip_classify temporarily uses one bit of a
hole, until tc_verd is removed completely in a follow-up patch.
Remove the bit hole comment. It could be 2, 3, 4 or 5 bits long.
With that many options, little value in documenting it.
Introduce a helper function to deduplicate the logic in the two
sites that check this bit.
The field tc_skip_classify is set only in IFB on skbs cloned in
act_mirred, so original packet sources do not have to clear the
bit when reusing packets (notably, pktgen and octeon).
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The network device operation for reading statistics is only called
in one place, and it ignores the return value. Having a structure
return value is potentially confusing because some future driver could
incorrectly assume that the return value was used.
Fix all drivers with ndo_get_stats64 to have a void function.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sctp stream reconf, described in RFC 6525, needs a structure to
save per stream information in assoc, like stream state.
In the future, sctp stream scheduler also needs it to save some
stream scheduler params and queues.
This patchset is to prepare the stream array in assoc for stream
reconf. It defines sctp_stream that includes stream arrays inside
to replace ssnmap.
Note that we use different structures for IN and OUT streams, as
the members in per OUT stream will get more and more different
from per IN stream.
v1->v2:
- put these patches into a smaller group.
v2->v3:
- define sctp_stream to contain stream arrays, and create stream.c
to put stream-related functions.
- merge 3 patches into 1, as new sctp_stream has the same name
with before.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fib_select_default has a single caller within the same file.
Make it static.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a helper for reading that new property and applying
limitations of supported channels specified this way.
It is used with devices that normally support a wide wireless band but
in a given config are limited to some part of it (usually due to board
design). For example a dual-band chipset may be able to support one band
only because of used antennas.
It's also common that tri-band routers have separated radios for lower
and higher part of 5 GHz band and it may be impossible to say which is
which without a DT info.
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
[add new function to documentation, fix link]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
udplite was copied from udp, they are virtually 100% identical.
This adds udplite tracker to udp instead, removes udplite module,
and then makes the udplite tracker builtin.
udplite will then simply re-use udp timeout settings.
It makes little sense to add separate sysctls, nowadays we have
fine-grained timeout policy support via the CT target.
old:
text data bss dec hex filename
1633 672 0 2305 901 nf_conntrack_proto_udp.o
1756 672 0 2428 97c nf_conntrack_proto_udplite.o
69526 17937 268 87731 156b3 nf_conntrack.ko
new:
text data bss dec hex filename
2442 1184 0 3626 e2a nf_conntrack_proto_udp.o
68565 17721 268 86554 1521a nf_conntrack.ko
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Different namespace application might require different maximal
number of remembered connection requests.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Different namespace application might require fast recycling
TIME-WAIT sockets independently of the host.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no reason to use this cascading. It doesn't add anything.
Let's remove it and simplify.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Various ipvlan fixes from Eric Dumazet and Mahesh Bandewar.
The most important is to not assume the packet is RX just because
the destination address matches that of the device. Such an
assumption causes problems when an interface is put into loopback
mode.
2) If we retry when creating a new tc entry (because we dropped the
RTNL mutex in order to load a module, for example) we end up with
-EAGAIN and then loop trying to replay the request. But we didn't
reset some state when looping back to the top like this, and if
another thread meanwhile inserted the same tc entry we were trying
to, we re-link it creating an enless loop in the tc chain. Fix from
Daniel Borkmann.
3) There are two different WRITE bits in the MDIO address register for
the stmmac chip, depending upon the chip variant. Due to a bug we
could set them both, fix from Hock Leong Kweh.
4) Fix mlx4 bug in XDP_TX handling, from Tariq Toukan.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
net: stmmac: fix incorrect bit set in gmac4 mdio addr register
r8169: add support for RTL8168 series add-on card.
net: xdp: remove unused bfp_warn_invalid_xdp_buffer()
openvswitch: upcall: Fix vlan handling.
ipv4: Namespaceify tcp_tw_reuse knob
net: korina: Fix NAPI versus resources freeing
net, sched: fix soft lockup in tc_classify
net/mlx4_en: Fix user prio field in XDP forward
tipc: don't send FIN message from connectionless socket
ipvlan: fix multicast processing
ipvlan: fix various issues in ipvlan_process_multicast()
Different namespaces might have different requirements to reuse
TIME-WAIT sockets for new connections. This might be required in
cases where different namespace applications are in place which
require TIME_WAIT socket connections to be reduced independently
of the host.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ktime is a union because the initial implementation stored the time in
scalar nanoseconds on 64 bit machine and in a endianess optimized timespec
variant for 32bit machines. The Y2038 cleanup removed the timespec variant
and switched everything to scalar nanoseconds. The union remained, but
become completely pointless.
Get rid of the union and just keep ktime_t as simple typedef of type s64.
The conversion was done with coccinelle and some manual mopping up.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
This was entirely automated, using the script by Al:
PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)
to do the replacement at the end of the merge window.
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking fixes and cleanups from David Miller:
1) Revert bogus nla_ok() change, from Alexey Dobriyan.
2) Various bpf validator fixes from Daniel Borkmann.
3) Add some necessary SET_NETDEV_DEV() calls to hsis_femac and hip04
drivers, from Dongpo Li.
4) Several ethtool ksettings conversions from Philippe Reynes.
5) Fix bugs in inet port management wrt. soreuseport, from Tom Herbert.
6) XDP support for virtio_net, from John Fastabend.
7) Fix NAT handling within a vrf, from David Ahern.
8) Endianness fixes in dpaa_eth driver, from Claudiu Manoil
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (63 commits)
net: mv643xx_eth: fix build failure
isdn: Constify some function parameters
mlxsw: spectrum: Mark split ports as such
cgroup: Fix CGROUP_BPF config
qed: fix old-style function definition
net: ipv6: check route protocol when deleting routes
r6040: move spinlock in r6040_close as SOFTIRQ-unsafe lock order detected
irda: w83977af_ir: cleanup an indent issue
net: sfc: use new api ethtool_{get|set}_link_ksettings
net: davicom: dm9000: use new api ethtool_{get|set}_link_ksettings
net: cirrus: ep93xx: use new api ethtool_{get|set}_link_ksettings
net: chelsio: cxgb3: use new api ethtool_{get|set}_link_ksettings
net: chelsio: cxgb2: use new api ethtool_{get|set}_link_ksettings
bpf: fix mark_reg_unknown_value for spilled regs on map value marking
bpf: fix overflow in prog accounting
bpf: dynamically allocate digest scratch buffer
gtp: Fix initialization of Flags octet in GTPv1 header
gtp: gtp_check_src_ms_ipv4() always return success
net/x25: use designated initializers
isdn: use designated initializers
...
A user may call listen with binding an explicit port with the intent
that the kernel will assign an available port to the socket. In this
case inet_csk_get_port does a port scan. For such sockets, the user may
also set soreuseport with the intent a creating more sockets for the
port that is selected. The problem is that the initial socket being
opened could inadvertently choose an existing and unreleated port
number that was already created with soreuseport.
This patch adds a boolean parameter to inet_bind_conflict that indicates
rather soreuseport is allowed for the check (in addition to
sk->sk_reuseport). In calls to inet_bind_conflict from inet_csk_get_port
the argument is set to true if an explicit port is being looked up (snum
argument is nonzero), and is false if port scan is done.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull vfs updates from Al Viro:
- more ->d_init() stuff (work.dcache)
- pathname resolution cleanups (work.namei)
- a few missing iov_iter primitives - copy_from_iter_full() and
friends. Either copy the full requested amount, advance the iterator
and return true, or fail, return false and do _not_ advance the
iterator. Quite a few open-coded callers converted (and became more
readable and harder to fuck up that way) (work.iov_iter)
- several assorted patches, the big one being logfs removal
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
logfs: remove from tree
vfs: fix put_compat_statfs64() does not handle errors
namei: fold should_follow_link() with the step into not-followed link
namei: pass both WALK_GET and WALK_MORE to should_follow_link()
namei: invert WALK_PUT logics
namei: shift interpretation of LOOKUP_FOLLOW inside should_follow_link()
namei: saner calling conventions for mountpoint_last()
namei.c: get rid of user_path_parent()
switch getfrag callbacks to ..._full() primitives
make skb_add_data,{_nocache}() and skb_copy_to_page_nocache() advance only on success
[iov_iter] new primitives - copy_from_iter_full() and friends
don't open-code file_inode()
ceph: switch to use of ->d_init()
ceph: unify dentry_operations instances
lustre: switch to use of ->d_init()
Commit 4f7df337fe
"netlink: 2-clause nla_ok()" is BROKEN.
First clause tests if "->nla_len" could even be accessed at all,
it can not possibly be omitted.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The comment on the name indirection suggested an issue but turned out
to be untrue. Digging in older kernel version showed issue with ipw2x00
but that is no longer true so get rid on the name indirection.
Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Pull smp hotplug updates from Thomas Gleixner:
"This is the final round of converting the notifier mess to the state
machine. The removal of the notifiers and the related infrastructure
will happen around rc1, as there are conversions outstanding in other
trees.
The whole exercise removed about 2000 lines of code in total and in
course of the conversion several dozen bugs got fixed. The new
mechanism allows to test almost every hotplug step standalone, so
usage sites can exercise all transitions extensively.
There is more room for improvement, like integrating all the
pointlessly different architecture mechanisms of synchronizing,
setting cpus online etc into the core code"
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
tracing/rb: Init the CPU mask on allocation
soc/fsl/qbman: Convert to hotplug state machine
soc/fsl/qbman: Convert to hotplug state machine
zram: Convert to hotplug state machine
KVM/PPC/Book3S HV: Convert to hotplug state machine
arm64/cpuinfo: Convert to hotplug state machine
arm64/cpuinfo: Make hotplug notifier symmetric
mm/compaction: Convert to hotplug state machine
iommu/vt-d: Convert to hotplug state machine
mm/zswap: Convert pool to hotplug state machine
mm/zswap: Convert dst-mem to hotplug state machine
mm/zsmalloc: Convert to hotplug state machine
mm/vmstat: Convert to hotplug state machine
mm/vmstat: Avoid on each online CPU loops
mm/vmstat: Drop get_online_cpus() from init_cpu_node_state/vmstat_cpu_dead()
tracing/rb: Convert to hotplug state machine
oprofile/nmi timer: Convert to hotplug state machine
net/iucv: Use explicit clean up labels in iucv_init()
x86/pci/amd-bus: Convert to hotplug state machine
x86/oprofile/nmi: Convert to hotplug state machine
...
* fix a logic bug introduced by a previous cleanup
* fix nl80211 attribute confusing (trying to use
a single attribute for two purposes)
* fix a long-standing BSS leak that happens when an
association attempt is abandoned
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJYSpxnAAoJEGt7eEactAAd3hEP/0RzU5BLTe3FD39i2ESo4fQo
q2Wnaa+ES1Ul473rCuSmPLGzlSjh0GciltHXRu7UEf5zXAjwuQtilrKsI9DizVR8
hgTV4Jp0TDLuDudgxEPlpLxcFWALDaK0AlKuL1dY/FSI1BnNnToEeX8Bum6/otqe
2wLQ11+70HrdNHJjvBEHP/kE/2D55easydmkCS30WYlFrd0BEFtGZ6Leb8deIAzL
qQpanf26jBYVTm7ls+j0bt4mYbb0RLcsLrOS8EgyIYhCsbJHbaC2OpYGTbGxR6ob
KKx01PGVnzytaKXCx/m70923V2mwWZWwa7IgDfoj2IzvsTnfmCgekGdSCiY+DJjE
1jiDYWVK3KgTJQqXRnE1BCbF/FPK6ABKoPgmJBAAiLC48VpmrQwG0OLLQmYVTdp9
KLrQztvZAVV1adA32fGpJHecDyQMMZ2xp7TZn9YY3qAiP4APU8IUscKuSXALmKN9
kMBUBhwkk7QuHZXkry0QFBpFXpOgYjX3vt/gBh8EAmGfyRIklTKtGsmftkuQbWR9
9BN4TbPznEJECqVy/BCL8llHNkfsJgcz3noFOePUjwa4FCAxJst/NFya+IkkqOQ5
eAOj5cjsDfxsrdJFGxIsxXrtGZI1MjwKZf3w6jmu/VVL6BMryxYwtWnwrwcBsit7
nXjitThBO0V2l3Iaf09m
=HvKt
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2016-12-09' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
Three fixes:
* fix a logic bug introduced by a previous cleanup
* fix nl80211 attribute confusing (trying to use
a single attribute for two purposes)
* fix a long-standing BSS leak that happens when an
association attempt is abandoned
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When mac80211 abandons an association attempt, it may free
all the data structures, but inform cfg80211 and userspace
about it only by sending the deauth frame it received, in
which case cfg80211 has no link to the BSS struct that was
used and will not cfg80211_unhold_bss() it.
Fix this by providing a way to inform cfg80211 of this with
the BSS entry passed, so that it can clean up properly, and
use this ability in the appropriate places in mac80211.
This isn't ideal: some code is more or less duplicated and
tracing is missing. However, it's a fairly small change and
it's thus easier to backport - cleanups can come later.
Cc: stable@vger.kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
sk_drops can be an often written field, do not read it unless
application showed interest.
Note that sk_drops can be read via inet_diag, so applications
can avoid getting this info from every received packet.
In the future, 'reading' sk_drops might require folding per node or per
cpu fields, and thus become even more expensive than today.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFS is not commonly used, so add a jump label to avoid some conditionals
in fast path.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow dissection of ICMP(V6) type and code. This should only occur
if a packet is ICMP(V6) and the dissector has FLOW_DISSECTOR_KEY_ICMP set.
There are currently no users of FLOW_DISSECTOR_KEY_ICMP.
A follow-up patch will allow FLOW_DISSECTOR_KEY_ICMP to be used by
the flower classifier.
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains a large Netfilter update for net-next,
to summarise:
1) Add support for stateful objects. This series provides a nf_tables
native alternative to the extended accounting infrastructure for
nf_tables. Two initial stateful objects are supported: counters and
quotas. Objects are identified by a user-defined name, you can fetch
and reset them anytime. You can also use a maps to allow fast lookups
using any arbitrary key combination. More info at:
http://marc.info/?l=netfilter-devel&m=148029128323837&w=2
2) On-demand registration of nf_conntrack and defrag hooks per netns.
Register nf_conntrack hooks if we have a stateful ruleset, ie.
state-based filtering or NAT. The new nf_conntrack_default_on sysctl
enables this from newly created netnamespaces. Default behaviour is not
modified. Patches from Florian Westphal.
3) Allocate 4k chunks and then use these for x_tables counter allocation
requests, this improves ruleset load time and also datapath ruleset
evaluation, patches from Florian Westphal.
4) Add support for ebpf to the existing x_tables bpf extension.
From Willem de Bruijn.
5) Update layer 4 checksum if any of the pseudoheader fields is updated.
This provides a limited form of 1:1 stateless NAT that make sense in
specific scenario, eg. load balancing.
6) Add support to flush sets in nf_tables. This series comes with a new
set->ops->deactivate_one() indirection given that we have to walk
over the list of set elements, then deactivate them one by one.
The existing set->ops->deactivate() performs an element lookup that
we don't need.
7) Two patches to avoid cloning packets, thus speed up packet forwarding
via nft_fwd from ingress. From Florian Westphal.
8) Two IPVS patches via Simon Horman: Decrement ttl in all modes to
prevent infinite loops, patch from Dwip Banerjee. And one minor
refactoring from Gao feng.
9) Revisit recent log support for nf_tables netdev families: One patch
to ensure that we correctly handle non-ethernet packets. Another
patch to add missing logger definition for netdev. Patches from
Liping Zhang.
10) Three patches for nft_fib, one to address insufficient register
initialization and another to solve incorrect (although harmless)
byteswap operation. Moreover update xt_rpfilter and nft_fib to match
lbcast packets with zeronet as source, eg. DHCP Discover packets
(0.0.0.0 -> 255.255.255.255). Also from Liping Zhang.
11) Built-in DCCP, SCTP and UDPlite conntrack and NAT support, from
Davide Caratti. While DCCP is rather hopeless lately, and UDPlite has
been broken in many-cast mode for some little time, let's give them a
chance by placing them at the same level as other existing protocols.
Thus, users don't explicitly have to modprobe support for this and
NAT rules work for them. Some people point to the lack of support in
SOHO Linux-based routers that make deployment of new protocols harder.
I guess other middleboxes outthere on the Internet are also to blame.
Anyway, let's see if this has any impact in the midrun.
12) Skip software SCTP software checksum calculation if the NIC comes
with SCTP checksum offload support. From Davide Caratti.
13) Initial core factoring to prepare conversion to hook array. Three
patches from Aaron Conole.
14) Gao Feng made a wrong conversion to switch in the xt_multiport
extension in a patch coming in the previous batch. Fix it in this
batch.
15) Get vmalloc call in sync with kmalloc flags to avoid a warning
and likely OOM killer intervention from x_tables. From Marcelo
Ricardo Leitner.
16) Update Arturo Borrero's email address in all source code headers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo noticed a cache line miss in UDP recvmsg() to access
sk_rxhash, sharing a cache line with sk_drops.
sk_drops might be heavily incremented by cpus handling a flood targeting
this socket.
We might place sk_drops on a separate cache line, but lets try
to avoid wasting 64 bytes per socket just for this, since we have
other bottlenecks to take care of.
sock_rps_record_flow() should only access sk_rxhash for connected
flows.
Testing sk_state for TCP_ESTABLISHED covers most of the cases for
connected sockets, for a zero cost, since system calls using
sock_rps_record_flow() also access sk->sk_prot which is on the
same cache line.
A follow up patch will provide a static_key (Jump Label) since most
hosts do not even use RFS.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for set flushing, that consists of walking over
the set elements if the NFTA_SET_ELEM_LIST_ELEMENTS attribute is set.
This patch requires the following changes:
1) Add set->ops->deactivate_one() operation: This allows us to
deactivate an element from the set element walk path, given we can
skip the lookup that happens in ->deactivate().
2) Add a new nft_trans_alloc_gfp() function since we need to allocate
transactions using GFP_ATOMIC given the set walk path happens with
held rcu_read_lock.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch allows you to refer to stateful objects from set elements.
This provides the infrastructure to create maps where the right hand
side of the mapping is a stateful object.
This allows us to build dictionaries of stateful objects, that you can
use to perform fast lookups using any arbitrary key combination.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Notify on depleted quota objects. The NFT_QUOTA_F_DEPLETED flag
indicates we have reached overquota.
Add pointer to table from nft_object, so we can use it when sending the
depletion notification to userspace.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Introduce nf_tables_obj_notify() to notify internal state changes in
stateful objects. This is used by the quota object to report depletion
in a follow up patch.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds a new NFT_MSG_GETOBJ_RESET command perform an atomic
dump-and-reset of the stateful object. This also comes with add support
for atomic dump and reset for counter and quota objects.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch augments nf_tables to support stateful objects. This new
infrastructure allows you to create, dump and delete stateful objects,
that are identified by a user-defined name.
This patch adds the generic infrastructure, follow up patches add
support for two stateful objects: counters and quotas.
This patch provides a native infrastructure for nf_tables to replace
nfacct, the extended accounting infrastructure for iptables.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
... so we can use current skb instead of working with a clone.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds a new flag that signals the kernel to update layer 4
checksum if the packet field belongs to the layer 4 pseudoheader. This
implicitly provides stateless NAT 1:1 that is useful under very specific
usecases.
Since rules mangling layer 3 fields that are part of the pseudoheader
may potentially convey any layer 4 packet, we have to deal with the
layer 4 checksum adjustment using protocol specific code.
This patch adds support for TCP, UDP and ICMPv6, since they include the
pseudoheader in the layer 4 checksum calculation. ICMP doesn't, so we
can skip it.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nf_defrag modules for ipv4 and ipv6 export an empty stub function.
Any module that needs the defragmentation hooks registered simply 'calls'
this empty function to create a phony module dependency -- modprobe will
then load the defrag module too.
This extends netfilter ipv4/ipv6 defragmentation modules to delay the hook
registration until the functionality is requested within a network namespace
instead of module load time for all namespaces.
Hooks are only un-registered on module unload or when a namespace that used
such defrag functionality exits.
We have to use struct net for this as the register hooks can be called
before netns initialization here from the ipv4/ipv6 conntrack module
init path.
There is no unregister functionality support, defrag will always be
active once it was requested inside a net namespace.
The reason is that defrag has impact on nft and iptables rulesets
(without defrag we might see framents).
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
1) Old code was hard to maintain, due to complex lock chains.
(We probably will be able to remove some kfree_rcu() in callers)
2) Using a single timer to update all estimators does not scale.
3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity
is not supposed to work well)
In this rewrite :
- I removed the RB tree that had to be scanned in
gen_estimator_active(). qdisc dumps should be much faster.
- Each estimator has its own timer.
- Estimations are maintained in net_rate_estimator structure,
instead of dirtying the qdisc. Minor, but part of the simplification.
- Reading the estimator uses RCU and a seqcount to provide proper
support for 32bit kernels.
- We reduce memory need when estimators are not used, since
we store a pointer, instead of the bytes/packets counters.
- xt_rateest_mt() no longer has to grab a spinlock.
(In the future, xt_rateest_tg() could be switched to per cpu counters)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2016-12-03
Here's a set of Bluetooth & 802.15.4 patches for net-next (i.e. 4.10
kernel):
- Fix for a potential NULL deref in the ieee802154 netlink code
- Fix for the ED values of the at86rf2xx driver
- Documentation updates to ieee802154
- Cleanups to u8 vs __u8 usage
- Timer API usage cleanups in HCI drivers
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Group fields used in TX path, and keep some cache lines mostly read
to permit sharing among cpus.
Gained two 4 bytes holes on 64bit arches.
Added a place holder for tcp tsq_flags, next to sk_wmem_alloc
to speed up tcp_wfree() in the following patch.
I have not added ____cacheline_aligned_in_smp, this might be done later.
I prefer doing this once inet and tcp/udp sockets reorg is also done.
Tested with both TCP and UDP.
UDP receiver performance under flood increased by ~20 % :
Accessing sk_filter/sk_wq/sk_napi_id no longer stalls because sk_drops
was moved away from a critical cache line, now mostly read and shared.
/* --- cacheline 4 boundary (256 bytes) --- */
unsigned int sk_napi_id; /* 0x100 0x4 */
int sk_rcvbuf; /* 0x104 0x4 */
struct sk_filter * sk_filter; /* 0x108 0x8 */
union {
struct socket_wq * sk_wq; /* 0x8 */
struct socket_wq * sk_wq_raw; /* 0x8 */
}; /* 0x110 0x8 */
struct xfrm_policy * sk_policy[2]; /* 0x118 0x10 */
struct dst_entry * sk_rx_dst; /* 0x128 0x8 */
struct dst_entry * sk_dst_cache; /* 0x130 0x8 */
atomic_t sk_omem_alloc; /* 0x138 0x4 */
int sk_sndbuf; /* 0x13c 0x4 */
/* --- cacheline 5 boundary (320 bytes) --- */
int sk_wmem_queued; /* 0x140 0x4 */
atomic_t sk_wmem_alloc; /* 0x144 0x4 */
long unsigned int sk_tsq_flags; /* 0x148 0x8 */
struct sk_buff * sk_send_head; /* 0x150 0x8 */
struct sk_buff_head sk_write_queue; /* 0x158 0x18 */
__s32 sk_peek_off; /* 0x170 0x4 */
int sk_write_pending; /* 0x174 0x4 */
long int sk_sndtimeo; /* 0x178 0x8 */
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This switch (default on) can be used to disable automatic registration
of connection tracking functionality in newly created network
namespaces.
This means that when net namespace goes down (or the tracker protocol
module is unloaded) we *might* have to unregister the hooks.
We can either add another per-netns variable that tells if
the hooks got registered by default, or, alternatively, just call
the protocol _put() function and have the callee deal with a possible
'extra' put() operation that doesn't pair with a get() one.
This uses the latter approach, i.e. a put() without a get has no effect.
Conntrack is still enabled automatically regardless of the new sysctl
setting if the new net namespace requires connection tracking, e.g. when
NAT rules are created.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This makes use of nf_ct_netns_get/put added in previous patch.
We add get/put functions to nf_conntrack_l3proto structure, ipv4 and ipv6
then implement use-count to track how many users (nft or xtables modules)
have a dependency on ipv4 and/or ipv6 connection tracking functionality.
When count reaches zero, the hooks are unregistered.
This delays activation of connection tracking inside a namespace until
stateful firewall rule or nat rule gets added.
This patch breaks backwards compatibility in the sense that connection
tracking won't be active anymore when the protocol tracker module is
loaded. This breaks e.g. setups that ctnetlink for flow accounting and
the like, without any '-m conntrack' packet filter rules.
Followup patch restores old behavour and makes new delayed scheme
optional via sysctl.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
currently aliased to try_module_get/_put.
Will be changed in next patch when we add functions to make use of ->net
argument to store usercount per l3proto tracker.
This is needed to avoid registering the conntrack hooks in all netns and
later only enable connection tracking in those that need conntrack.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
since adf0516845 ("netfilter: remove ip_conntrack* sysctl compat code")
the only user (ipv4 tracker) sets this to an empty stub function.
After this change nf_ct_l3proto_pernet_register() is also empty,
but this will change in a followup patch to add conditional register
of the hooks.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
CONFIG_NF_CT_PROTO_UDPLITE is no more a tristate. When set to y,
connection tracking support for UDPlite protocol is built-in into
nf_conntrack.ko.
footprint test:
$ ls -l net/netfilter/nf_conntrack{_proto_udplite,}.ko \
net/ipv4/netfilter/nf_conntrack_ipv4.ko \
net/ipv6/netfilter/nf_conntrack_ipv6.ko
(builtin)|| udplite| ipv4 | ipv6 |nf_conntrack
---------++--------+--------+--------+--------------
none || 432538 | 828755 | 828676 | 6141434
UDPlite || - | 829649 | 829362 | 6498204
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
CONFIG_NF_CT_PROTO_SCTP is no more a tristate. When set to y, connection
tracking support for SCTP protocol is built-in into nf_conntrack.ko.
footprint test:
$ ls -l net/netfilter/nf_conntrack{_proto_sctp,}.ko \
net/ipv4/netfilter/nf_conntrack_ipv4.ko \
net/ipv6/netfilter/nf_conntrack_ipv6.ko
(builtin)|| sctp | ipv4 | ipv6 | nf_conntrack
---------++--------+--------+--------+--------------
none || 498243 | 828755 | 828676 | 6141434
SCTP || - | 829254 | 829175 | 6547872
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
CONFIG_NF_CT_PROTO_DCCP is no more a tristate. When set to y, connection
tracking support for DCCP protocol is built-in into nf_conntrack.ko.
footprint test:
$ ls -l net/netfilter/nf_conntrack{_proto_dccp,}.ko \
net/ipv4/netfilter/nf_conntrack_ipv4.ko \
net/ipv6/netfilter/nf_conntrack_ipv6.ko
(builtin)|| dccp | ipv4 | ipv6 | nf_conntrack
---------++--------+--------+--------+--------------
none || 469140 | 828755 | 828676 | 6141434
DCCP || - | 830566 | 829935 | 6533526
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In netdev family, we will handle non ethernet packets, so using
eth_hdr(skb)->h_proto is incorrect.
Meanwhile, we can use socket(AF_PACKET...) to sending packets, so
skb->protocol is not always set in bridge family.
Add an extra parameter into nf_log_l2packet to solve this issue.
Fixes: 1fddf4bad0 ("netfilter: nf_log: add packet logging for netdev family")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
CONFIG_NF_NAT_PROTO_UDPLITE is no more a tristate. When set to y, NAT
support for UDPlite protocol is built-in into nf_nat.ko.
footprint test:
(nf_nat_proto_) |udplite || nf_nat
--------------------------+--------++--------
no builtin | 408048 || 2241312
UDPLITE builtin | - || 2577256
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
CONFIG_NF_NAT_PROTO_SCTP is no more a tristate. When set to y, NAT
support for SCTP protocol is built-in into nf_nat.ko.
footprint test:
(nf_nat_proto_) | sctp || nf_nat
--------------------------+--------++--------
no builtin | 428344 || 2241312
SCTP builtin | - || 2597032
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
CONFIG_NF_NAT_PROTO_DCCP is no more a tristate. When set to y, NAT
support for DCCP protocol is built-in into nf_nat.ko.
footprint test:
(nf_nat_proto_) | dccp || nf_nat
--------------------------+--------++--------
no builtin | 409800 || 2241312
DCCP builtin | - || 2578968
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Implemented RFC7527 Enhanced DAD.
IPv6 duplicate address detection can fail if there is some temporary
loopback of Ethernet frames. RFC7527 solves this by including a random
nonce in the NS messages used for DAD, and if an NS is received with the
same nonce it is assumed to be a looped back DAD probe and is ignored.
RFC7527 is enabled by default. Can be disabled by setting both of
conf/{all,interface}/enhanced_dad to zero.
Signed-off-by: Erik Nordmark <nordmark@arista.com>
Signed-off-by: Bob Gilligan <gilligan@arista.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit b90eb75494 ("fib: introduce FIB notification infrastructure")
introduced a new notification chain to notify listeners (f.e., switchdev
drivers) about addition and deletion of routes.
However, upon registration to the chain the FIB tables can already be
populated, which means potential listeners will have an incomplete view
of the tables.
Solve that by dumping the FIB tables and replaying the events to the
passed notification block. The dump itself is done using RCU in order
not to starve consumers that need RTNL to make progress.
The integrity of the dump is ensured by reading the FIB change sequence
counter before and after the dump under RTNL. This allows us to avoid
the problematic situation in which the dumping process sends a ENTRY_ADD
notification following ENTRY_DEL generated by another process holding
RTNL.
Callers of the registration function may pass a callback that is
executed in case the dump was inconsistent with current FIB tables.
The number of retries until a consistent dump is achieved is set to a
fixed number to prevent callers from looping for long periods of time.
In case current limit proves to be problematic in the future, it can be
easily converted to be configurable using a sysctl.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The next patch will enable listeners of the FIB notification chain to
request a dump of the FIB tables. However, since RTNL isn't taken during
the dump, it's possible for the FIB tables to change mid-dump, which
will result in inconsistency between the listener's table and the
kernel's.
Allow listeners to know about changes that occurred mid-dump, by adding
a change sequence counter to each net namespace. The counter is
incremented just before a notification is sent in the FIB chain.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As explained in the previous commit, modules are going to need to take a
reference on fib info and then drop it using fib_info_put().
Add the fib_info_hold() helper to make the code more readable and also
symmetric with fib_info_put().
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Suggested-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net_generic() function is both a) inline and b) used ~600 times.
It has the following code inside
...
ptr = ng->ptr[id - 1];
...
"id" is never compile time constant so compiler is forced to subtract 1.
And those decrements or LEA [r32 - 1] instructions add up.
We also start id'ing from 1 to catch bugs where pernet sybsystem id
is not initialized and 0. This is quite pointless idea (nothing will
work or immediate interference with first registered subsystem) in
general but it hints what needs to be done for code size reduction.
Namely, overlaying allocation of pointer array and fixed part of
structure in the beginning and using usual base-0 addressing.
Ids are just cookies, their exact values do not matter, so lets start
with 3 on x86_64.
Code size savings (oh boy): -4.2 KB
As usual, ignore the initial compiler stupidity part of the table.
add/remove: 0/0 grow/shrink: 12/670 up/down: 89/-4297 (-4208)
function old new delta
tipc_nametbl_insert_publ 1250 1270 +20
nlmclnt_lookup_host 686 703 +17
nfsd4_encode_fattr 5930 5941 +11
nfs_get_client 1050 1061 +11
register_pernet_operations 333 342 +9
tcf_mirred_init 843 849 +6
tcf_bpf_init 1143 1149 +6
gss_setup_upcall 990 994 +4
idmap_name_to_id 432 434 +2
ops_init 274 275 +1
nfsd_inject_forget_client 259 260 +1
nfs4_alloc_client 612 613 +1
tunnel_key_walker 164 163 -1
...
tipc_bcbase_select_primary 392 360 -32
mac80211_hwsim_new_radio 2808 2767 -41
ipip6_tunnel_ioctl 2228 2186 -42
tipc_bcast_rcv 715 672 -43
tipc_link_build_proto_msg 1140 1089 -51
nfsd4_lock 3851 3796 -55
tipc_mon_rcv 1012 956 -56
Total: Before=156643951, After=156639743, chg -0.00%
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is precursor to fixing "[id - 1]" bloat inside net_generic().
Name "s" is chosen to complement name "u" often used for dummy unions.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nla_ok() consists of 3 clauses:
1) int rem >= (int)sizeof(struct nlattr)
2) u16 nla_len >= sizeof(struct nlattr)
3) u16 nla_len <= int rem
The statement is that clause (1) is redundant.
What it does is ensuring that "rem" is a positive number,
so that in clause (3) positive number will be compared to positive number
with no problems.
However, "u16" fully fits into "int" and integers do not change value
when upcasting even to signed type. Negative integers will be rejected
by clause (3) just fine. Small positive integers will be rejected
by transitivity of comparison operator.
NOTE: all of the above DOES NOT apply to nlmsg_ok() where ->nlmsg_len is
u32(!), so 3 clauses AND A CAST TO INT are necessary.
Obligatory space savings report: -1.6 KB
$ ./scripts/bloat-o-meter ../vmlinux-000* ../vmlinux-001*
add/remove: 0/0 grow/shrink: 3/63 up/down: 35/-1692 (-1657)
function old new delta
validate_scan_freqs 142 155 +13
tcf_em_tree_validate 867 879 +12
dcbnl_ieee_del 328 338 +10
netlbl_cipsov4_add_common.isra 218 215 -3
...
ovs_nla_put_actions 888 806 -82
netlbl_cipsov4_add_std 1648 1566 -82
nl80211_parse_sched_scan 2889 2780 -109
ip_tun_from_nlattr 3086 2945 -141
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Couple conflicts resolved here:
1) In the MACB driver, a bug fix to properly initialize the
RX tail pointer properly overlapped with some changes
to support variable sized rings.
2) In XGBE we had a "CONFIG_PM" --> "CONFIG_PM_SLEEP" fix
overlapping with a reorganization of the driver to support
ACPI, OF, as well as PCI variants of the chip.
3) In 'net' we had several probe error path bug fixes to the
stmmac driver, meanwhile a lot of this code was cleaned up
and reorganized in 'net-next'.
4) The cls_flower classifier obtained a helper function in
'net-next' called __fl_delete() and this overlapped with
Daniel Borkamann's bug fix to use RCU for object destruction
in 'net'. It also overlapped with Jiri's change to guard
the rhashtable_remove_fast() call with a check against
tc_skip_sw().
5) In mlx4, a revert bug fix in 'net' overlapped with some
unrelated changes in 'net-next'.
6) In geneve, a stale header pointer after pskb_expand_head()
bug fix in 'net' overlapped with a large reorganization of
the same code in 'net-next'. Since the 'net-next' code no
longer had the bug in question, there was nothing to do
other than to simply take the 'net-next' hunks.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add socket family, type and protocol to bpf_sock allowing bpf programs
read-only access.
Add __sk_flags_offset[0] to struct sock before the bitfield to
programmtically determine the offset of the unsigned int containing
protocol and type.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to support hardware offloading when the device given by the tc
rule is different from the Hardware underline device, extract the mirred
(egress) device from the tc action when a filter is added, using the new
tc_action_ops, get_dev().
Flower caches the information about the mirred device and use it for
calling ndo_setup_tc in filter change, update stats and delete.
Calling ndo_setup_tc of the mirred (egress) device instead of the
ingress device will allow a resolution between the software ingress
device and the underline hardware device.
The resolution will take place inside the offloading driver using
'egress_device' flag added to tc_to_netdev struct which is provided to
the offloading driver.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adding support to a new tc_action_ops.
get_dev is a general option which allows to get the underline
device when trying to offload a tc rule.
In case of mirred action the returned device is the mirred (egress)
device.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Creating a difference between two possible cases:
1. Not offloading tc rule since the user sets 'skip_hw' flag.
2. Not offloading tc rule since the device doesn't support offloading.
This patch doesn't add any new functionality.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
jiffies based timestamps allow for easy inference of number of devices
behind NAT translators and also makes tracking of hosts simpler.
commit ceaa1fef65 ("tcp: adding a per-socket timestamp offset")
added the main infrastructure that is needed for per-connection ts
randomization, in particular writing/reading the on-wire tcp header
format takes the offset into account so rest of stack can use normal
tcp_time_stamp (jiffies).
So only two items are left:
- add a tsoffset for request sockets
- extend the tcp isn generator to also return another 32bit number
in addition to the ISN.
Re-use of ISN generator also means timestamps are still monotonically
increasing for same connection quadruple, i.e. PAWS will still work.
Includes fixes from Eric Dumazet.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
This is a large batch of Netfilter fixes for net, they are:
1) Three patches to fix NAT conversion to rhashtable: Switch to rhlist
structure that allows to have several objects with the same key.
Moreover, fix wrong comparison logic in nf_nat_bysource_cmp() as this is
expecting a return value similar to memcmp(). Change location of
the nat_bysource field in the nf_conn structure to avoid zeroing
this as it breaks interaction with SLAB_DESTROY_BY_RCU and lead us
to crashes. From Florian Westphal.
2) Don't allow malformed fragments go through in IPv6, drop them,
otherwise we hit GPF, patch from Florian Westphal.
3) Fix crash if attributes are missing in nft_range, from Liping Zhang.
4) Fix arptables 32-bits userspace 64-bits kernel compat, from Hongxu Jia.
5) Two patches from David Ahern to fix netfilter interaction with vrf.
From David Ahern.
6) Fix element timeout calculation in nf_tables, we take milliseconds
from userspace, but we use jiffies from kernelspace. Patch from
Anders K. Pedersen.
7) Missing validation length netlink attribute for nft_hash, from
Laura Garcia.
8) Fix nf_conntrack_helper documentation, we don't default to off
anymore for a bit of time so let's get this in sync with the code.
I know is late but I think these are important, specifically the NAT
bits, as they are mostly addressing fallout from recent changes. I also
read there are chances to have -rc8, if that is the case, that would
also give us a bit more time to test this.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Socket flags aren't updated atomically, so the socket must be locked
while reading the SOCK_ZAPPED flag.
This issue exists for both l2tp_ip and l2tp_ip6. For IPv6, this patch
also brings error handling for __ip6_datagram_connect() failures.
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch measures TCP busy time, which is defined as the period
of time when sender has data (or FIN) to send. The time starts when
data is buffered and stops when the write queue is flushed by ACKs
or error events.
Note the busy time does not include SYN time, unless data is
included in SYN (i.e. Fast Open). It does include FIN time even
if the FIN carries no payload. Excluding pure FIN is possible but
would incur one additional test in the fast path, which may not
be worth it.
Signed-off-by: Francis Yan <francisyyan@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements the skeleton of the TCP chronograph
instrumentation on sender side limits:
1) idle (unspec)
2) busy sending data other than 3-4 below
3) rwnd-limited
4) sndbuf-limited
The limits are enumerated 'tcp_chrono'. Since a connection in
theory can idle forever, we do not track the actual length of this
uninteresting idle period. For the rest we track how long the sender
spends in each limit. At any point during the life time of a
connection, the sender must be in one of the four states.
If there are multiple conditions worthy of tracking in a chronograph
then the highest priority enum takes precedence over
the other conditions. So that if something "more interesting"
starts happening, stop the previous chrono and start a new one.
The time unit is jiffy(u32) in order to save space in tcp_sock.
This implies application must sample the stats no longer than every
49 days of 1ms jiffy.
Signed-off-by: Francis Yan <francisyyan@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bluetooth.h is not part of user API, so __ variants are not neccessary
here.
Signed-off-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
udplite conflict is resolved by taking what 'net-next' did
which removed the backlog receive method assignment, since
it is no longer necessary.
Two entries were added to the non-priv ethtool operations
switch statement, one in 'net' and one in 'net-next, so
simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Some HWs need the VF driver to put part of the packet headers on the
TX descriptor so the e-switch can do proper matching and steering.
The supported modes: none, link, network, transport.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stas Nichiporovich reports oops in nf_nat_bysource_cmp(), trying to
access nf_conn struct at address 0xffffffffffffff50.
This is the result of fetching a null rhash list (struct embedded at
offset 176; 0 - 176 gets us ...fff50).
The problem is that conntrack entries are allocated from a
SLAB_DESTROY_BY_RCU cache, i.e. entries can be free'd and reused
on another cpu while nf nat bysource hash access the same conntrack entry.
Freeing is fine (we hold rcu read lock); zeroing rhlist_head isn't.
-> Move the rhlist struct outside of the memset()-inited area.
Fixes: 7c96643519 ("netfilter: move nat hlist_head to nf_conn")
Reported-by: Stas Nichiporovich <stasn77@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
As Liping Zhang reports, after commit a8b1e36d0d ("netfilter: nft_dynset:
fix element timeout for HZ != 1000"), priv->timeout was stored in jiffies,
while set->timeout was stored in milliseconds. This is inconsistent and
incorrect.
Firstly, we already call msecs_to_jiffies in nft_set_elem_init, so
priv->timeout will be converted to jiffies twice.
Secondly, if the user did not specify the NFTA_DYNSET_TIMEOUT attr,
set->timeout will be used, but we forget to call msecs_to_jiffies
when do update elements.
Fix this by using jiffies internally for traditional sets and doing the
conversions to/from msec when interacting with userspace - as dynset
already does.
This is preferable to doing the conversions, when elements are inserted or
updated, because this can happen very frequently on busy dynsets.
Fixes: a8b1e36d0d ("netfilter: nft_dynset: fix element timeout for HZ != 1000")
Reported-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Anders K. Pedersen <akp@cohaesio.com>
Acked-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
I got offlist bug report about failing connections and high cpu usage.
This happens because we hit 'elasticity' checks in rhashtable that
refuses bucket list exceeding 16 entries.
The nat bysrc hash unfortunately needs to insert distinct objects that
share same key and are identical (have same source tuple), this cannot
be avoided.
Switch to the rhlist interface which is designed for this.
The nulls_base is removed here, I don't think its needed:
A (unlikely) false positive results in unneeded port clash resolution,
a false negative results in packet drop during conntrack confirmation,
when we try to insert the duplicate into main conntrack hash table.
Tested by adding multiple ip addresses to host, then adding
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
... and then creating multiple connections, from same source port but
different addresses:
for i in $(seq 2000 2032);do nc -p 1234 192.168.7.1 $i > /dev/null & done
(all of these then get hashed to same bysource slot)
Then, to test that nat conflict resultion is working:
nc -s 10.0.0.1 -p 1234 192.168.7.1 2000
nc -s 10.0.0.2 -p 1234 192.168.7.1 2000
tcp .. src=10.0.0.1 dst=192.168.7.1 sport=1234 dport=2000 src=192.168.7.1 dst=192.168.7.10 sport=2000 dport=1024 [ASSURED]
tcp .. src=10.0.0.2 dst=192.168.7.1 sport=1234 dport=2000 src=192.168.7.1 dst=192.168.7.10 sport=2000 dport=1025 [ASSURED]
tcp .. src=192.168.7.10 dst=192.168.7.1 sport=1234 dport=2000 src=192.168.7.1 dst=192.168.7.10 sport=2000 dport=1234 [ASSURED]
tcp .. src=192.168.7.10 dst=192.168.7.1 sport=1234 dport=2001 src=192.168.7.1 dst=192.168.7.10 sport=2001 dport=1234 [ASSURED]
[..]
-> nat altered source ports to 1024 and 1025, respectively.
This can also be confirmed on destination host which shows
ESTAB 0 0 192.168.7.1:2000 192.168.7.10:1024
ESTAB 0 0 192.168.7.1:2000 192.168.7.10:1025
ESTAB 0 0 192.168.7.1:2000 192.168.7.10:1234
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Fixes: 870190a9ec ("netfilter: nat: convert nat bysrc hash to rhashtable")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The hci_get_route() API is used to look up local HCI devices, however
so far it has been incapable of dealing with anything else than the
public address of HCI devices. This completely breaks with LE-only HCI
devices that do not come with a public address, but use a static
random address instead.
This patch exteds the hci_get_route() API with a src_type parameter
that's used for comparing with the right address of each HCI device.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
All conflicts were simple overlapping changes except perhaps
for the Thunder driver.
That driver has a change_mtu method explicitly for sending
a message to the hardware. If that fails it returns an
error.
Normally a driver doesn't need an ndo_change_mtu method becuase those
are usually just range changes, which are now handled generically.
But since this extra operation is needed in the Thunder driver, it has
to stay.
However, if the message send fails we have to restore the original
MTU before the change because the entire call chain expects that if
an error is thrown by ndo_change_mtu then the MTU did not change.
Therefore code is added to nicvf_change_mtu to remember the original
MTU, and to restore it upon nicvf_update_hw_max_frs() failue.
Signed-off-by: David S. Miller <davem@davemloft.net>
The undo_cwnd fallback in the stack doubles cwnd based on ssthresh,
which un-does reno halving behaviour.
It seems more appropriate to let congctl algorithms pair .ssthresh
and .undo_cwnd properly. Add a 'tcp_reno_undo_cwnd' function and wire it
up for all congestion algorithms that used to rely on the fallback.
Cc: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The argument to get_net_ns_by_fd() is a /proc/$PID/ns/net file
descriptor not a pid. Fix the typo.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Rami Rosen <roszenrami@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make struct pernet_operations::id unsigned.
There are 2 reasons to do so:
1)
This field is really an index into an zero based array and
thus is unsigned entity. Using negative value is out-of-bound
access by definition.
2)
On x86_64 unsigned 32-bit data which are mixed with pointers
via array indexing or offsets added or subtracted to pointers
are preffered to signed 32-bit data.
"int" being used as an array index needs to be sign-extended
to 64-bit before being used.
void f(long *p, int i)
{
g(p[i]);
}
roughly translates to
movsx rsi, esi
mov rdi, [rsi+...]
call g
MOVSX is 3 byte instruction which isn't necessary if the variable is
unsigned because x86_64 is zero extending by default.
Now, there is net_generic() function which, you guessed it right, uses
"int" as an array index:
static inline void *net_generic(const struct net *net, int id)
{
...
ptr = ng->ptr[id - 1];
...
}
And this function is used a lot, so those sign extensions add up.
Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
messing with code generation):
add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
Unfortunately some functions actually grow bigger.
This is a semmingly random artefact of code generation with register
allocator being used differently. gcc decides that some variable
needs to live in new r8+ registers and every access now requires REX
prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
used which is longer than [r8]
However, overall balance is in negative direction:
add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
function old new delta
nfsd4_lock 3886 3959 +73
tipc_link_build_proto_msg 1096 1140 +44
mac80211_hwsim_new_radio 2776 2808 +32
tipc_mon_rcv 1032 1058 +26
svcauth_gss_legacy_init 1413 1429 +16
tipc_bcbase_select_primary 379 392 +13
nfsd4_exchange_id 1247 1260 +13
nfsd4_setclientid_confirm 782 793 +11
...
put_client_renew_locked 494 480 -14
ip_set_sockfn_get 730 716 -14
geneve_sock_add 829 813 -16
nfsd4_sequence_done 721 703 -18
nlmclnt_lookup_host 708 686 -22
nfsd4_lockt 1085 1063 -22
nfs_get_client 1077 1050 -27
tcf_bpf_init 1106 1076 -30
nfsd4_encode_fattr 5997 5930 -67
Total: Before=154856051, After=154854321, chg -0.00%
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP busy polling is restricted to connected UDP sockets.
This is because sk_busy_loop() only takes care of one NAPI context.
There are cases where it could be extended.
1) Some hosts receive traffic on a single NIC, with one RX queue.
2) Some applications use SO_REUSEPORT and associated BPF filter
to split the incoming traffic on one UDP socket per RX
queue/thread/cpu
3) Some UDP sockets are used to send/receive traffic for one flow, but
they do not bother with connect()
This patch records the napi_id of first received skb, giving more
reach to busy polling.
Tested:
lpaa23:~# echo 70 >/proc/sys/net/core/busy_read
lpaa24:~# echo 70 >/proc/sys/net/core/busy_read
lpaa23:~# for f in `seq 1 10`; do ./super_netperf 1 -H lpaa24 -t UDP_RR -l 5; done
Before patch :
27867 28870 37324 41060 41215
36764 36838 44455 41282 43843
After patch :
73920 73213 70147 74845 71697
68315 68028 75219 70082 73707
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now sctp transport rhashtable uses hash(lport, dport, daddr) as the key
to hash a node to one chain. If in one host thousands of assocs connect
to one server with the same lport and different laddrs (although it's
not a normal case), all the transports would be hashed into the same
chain.
It may cause to keep returning -EBUSY when inserting a new node, as the
chain is too long and sctp inserts a transport node in a loop, which
could even lead to system hangs there.
The new rhlist interface works for this case that there are many nodes
with the same key in one chain. It puts them into a list then makes this
list be as a node of the chain.
This patch is to replace rhashtable_ interface with rhltable_ interface.
Since a chain would not be too long and it would not return -EBUSY with
this fix when inserting a node, the reinsert loop is also removed here.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes the lwtunnel_headroom() function which is called
in ipv4_mtu() and ip6_mtu(), to also return the correct headroom
value when the lwtunnel state is OUTPUT_REDIRECT.
This patch enables e.g. SR-IPv6 encapsulations to work without
manually setting the route mtu.
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now sk_busy_loop() can schedule by itself, we can remove
need_resched() check from sk_can_busy_loop()
Also add a const to its struct sock parameter.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Adam Belay <abelay@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Yuval Mintz <Yuval.Mintz@cavium.com>
Cc: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch that removed the FIB offload infrastructure was a bit too
aggressive and also removed code needed to clean up us splitting the table
if additional rules were added. Specifically the function
fib_trie_flush_external was called at the end of a new rule being added to
flush the foreign trie entries from the main trie.
I updated the code so that we only call fib_trie_flush_external on the main
table so that we flush the entries for local from main. This way we don't
call it for every rule change which is what was happening previously.
Fixes: 347e3b28c1 ("switchdev: remove FIB offload infrastructure")
Reported-by: Eric Dumazet <edumazet@google.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rolf Neugebauer reported very long delays at netns dismantle.
Eric W. Biederman was kind enough to look at this problem
and noticed synchronize_net() occurring from netif_napi_del() that was
added in linux-4.5
Busy polling makes no sense for tunnels NAPI.
If busy poll is used for sessions over tunnels, the poller will need to
poll the physical device queue anyway.
netif_tx_napi_add() could be used here, but function name is misleading,
and renaming it is not stable material, so set NAPI_STATE_NO_BUSY_POLL
bit directly.
This will avoid inserting gro_cells napi structures in napi_hash[]
and avoid the problematic synchronize_net() (per possible cpu) that
Rolf reported.
Fixes: 93d05d4a32 ("net: provide generic busy polling to all NAPI drivers")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Rolf Neugebauer <rolf.neugebauer@docker.com>
Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Tested-by: Rolf Neugebauer <rolf.neugebauer@docker.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit 850cbaddb5 ("udp: use it's own memory accounting schema")
assumes that the socket proto has memory accounting enabled,
but this is not the case for UDPLITE.
Fix it enabling memory accounting for UDPLITE and performing
fwd allocated memory reclaiming on socket shutdown.
UDP and UDPLITE share now the same memory accounting limits.
Also drop the backlog receive operation, since is no more needed.
Fixes: 850cbaddb5 ("udp: use it's own memory accounting schema")
Reported-by: Andrei Vagin <avagin@gmail.com>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to commit 14135f30e3 ("inet: fix sleeping inside inet_wait_for_connect()"),
sk_wait_event() needs to fix too, because release_sock() is blocking,
it changes the process state back to running after sleep, which breaks
the previous prepare_to_wait().
Switch to the new wait API.
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains a second batch of Netfilter updates for
your net-next tree. This includes a rework of the core hook
infrastructure that improves Netfilter performance by ~15% according to
synthetic benchmarks. Then, a large batch with ipset updates, including
a new hash:ipmac set type, via Jozsef Kadlecsik. This also includes a
couple of assorted updates.
Regarding the core hook infrastructure rework to improve performance,
using this simple drop-all packets ruleset from ingress:
nft add table netdev x
nft add chain netdev x y { type filter hook ingress device eth0 priority 0\; }
nft add rule netdev x y drop
And generating traffic through Jesper Brouer's
samples/pktgen/pktgen_bench_xmit_mode_netif_receive.sh script using -i
option. perf report shows nf_tables calls in its top 10:
17.30% kpktgend_0 [nf_tables] [k] nft_do_chain
15.75% kpktgend_0 [kernel.vmlinux] [k] __netif_receive_skb_core
10.39% kpktgend_0 [nf_tables_netdev] [k] nft_do_chain_netdev
I'm measuring here an improvement of ~15% in performance with this
patchset, so we got +2.5Mpps more. I have used my old laptop Intel(R)
Core(TM) i5-3320M CPU @ 2.60GHz 4-cores.
This rework contains more specifically, in strict order, these patches:
1) Remove compile-time debugging from core.
2) Remove obsolete comments that predate the rcu era. These days it is
well known that a Netfilter hook always runs under rcu_read_lock().
3) Remove threshold handling, this is only used by br_netfilter too.
We already have specific code to handle this from br_netfilter,
so remove this code from the core path.
4) Deprecate NF_STOP, as this is only used by br_netfilter.
5) Place nf_state_hook pointer into xt_action_param structure, so
this structure fits into one single cacheline according to pahole.
This also implicit affects nftables since it also relies on the
xt_action_param structure.
6) Move state->hook_entries into nf_queue entry. The hook_entries
pointer is only required by nf_queue(), so we can store this in the
queue entry instead.
7) use switch() statement to handle verdict cases.
8) Remove hook_entries field from nf_hook_state structure, this is only
required by nf_queue, so store it in nf_queue_entry structure.
9) Merge nf_iterate() into nf_hook_slow() that results in a much more
simple and readable function.
10) Handle NF_REPEAT away from the core, so far the only client is
nf_conntrack_in() and we can restart the packet processing using a
simple goto to jump back there when the TCP requires it.
This update required a second pass to fix fallout, fix from
Arnd Bergmann.
11) Set random seed from nft_hash when no seed is specified from
userspace.
12) Simplify nf_tables expression registration, in a much smarter way
to save lots of boiler plate code, by Liping Zhang.
13) Simplify layer 4 protocol conntrack tracker registration, from
Davide Caratti.
14) Missing CONFIG_NF_SOCKET_IPV4 dependency for udp4_lib_lookup, due
to recent generalization of the socket infrastructure, from Arnd
Bergmann.
15) Then, the ipset batch from Jozsef, he describes it as it follows:
* Cleanup: Remove extra whitespaces in ip_set.h
* Cleanup: Mark some of the helpers arguments as const in ip_set.h
* Cleanup: Group counter helper functions together in ip_set.h
* struct ip_set_skbinfo is introduced instead of open coded fields
in skbinfo get/init helper funcions.
* Use kmalloc() in comment extension helper instead of kzalloc()
because it is unnecessary to zero out the area just before
explicit initialization.
* Cleanup: Split extensions into separate files.
* Cleanup: Separate memsize calculation code into dedicated function.
* Cleanup: group ip_set_put_extensions() and ip_set_get_extensions()
together.
* Add element count to hash headers by Eric B Munson.
* Add element count to all set types header for uniform output
across all set types.
* Count non-static extension memory into memsize calculation for
userspace.
* Cleanup: Remove redundant mtype_expire() arguments, because
they can be get from other parameters.
* Cleanup: Simplify mtype_expire() for hash types by removing
one level of intendation.
* Make NLEN compile time constant for hash types.
* Make sure element data size is a multiple of u32 for the hash set
types.
* Optimize hash creation routine, exit as early as possible.
* Make struct htype per ipset family so nets array becomes fixed size
and thus simplifies the struct htype allocation.
* Collapse same condition body into a single one.
* Fix reported memory size for hash:* types, base hash bucket structure
was not taken into account.
* hash:ipmac type support added to ipset by Tomasz Chilinski.
* Use setup_timer() and mod_timer() instead of init_timer()
by Muhammad Falak R Wani, individually for the set type families.
16) Remove useless connlabel field in struct netns_ct, patch from
Florian Westphal.
17) xt_find_table_lock() doesn't return ERR_PTR() anymore, so simplify
{ip,ip6,arp}tables code that uses this.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
since 23014011ba ('netfilter: conntrack: support a fixed size of 128 distinct labels')
this isn't needed anymore.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
With syzkaller help, Marco Grassi found a bug in TCP stack,
crashing in tcp_collapse()
Root cause is that sk_filter() can truncate the incoming skb,
but TCP stack was not really expecting this to happen.
It probably was expecting a simple DROP or ACCEPT behavior.
We first need to make sure no part of TCP header could be removed.
Then we need to adjust TCP_SKB_CB(skb)->end_seq
Many thanks to syzkaller team and Marco for giving us a reproducer.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Marco Grassi <marco.gra@gmail.com>
Reported-by: Vladis Dronov <vdronov@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The idr_alloc(), idr_remove(), et al. routines all expect IDs to be
signed integers. Therefore make the genl_family member 'id' signed
too.
Signed-off-by: David S. Miller <davem@davemloft.net>
tc_act macro addressed a non existing field, and was not used in the
kernel source.
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch prepares for insertion of SRH through setsockopt().
The new source address argument is used when an HMAC field is
present in the SRH, which must be filled. The HMAC signature
process requires the source address as input text.
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the necessary functions to compute and check the HMAC signature
of an SR-enabled packet. Two HMAC algorithms are supported: hmac(sha1) and
hmac(sha256).
In order to avoid dynamic memory allocation for each HMAC computation,
a per-cpu ring buffer is allocated for this purpose.
A new per-interface sysctl called seg6_require_hmac is added, allowing a
user-defined policy for processing HMAC-signed SR-enabled packets.
A value of -1 means that the HMAC field will always be ignored.
A value of 0 means that if an HMAC field is present, its validity will
be enforced (the packet is dropped is the signature is incorrect).
Finally, a value of 1 means that any SR-enabled packet that does not
contain an HMAC signature or whose signature is incorrect will be dropped.
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch creates a new type of interfaceless lightweight tunnel (SEG6),
enabling the encapsulation and injection of SRH within locally emitted
packets and forwarded packets.
>From a configuration viewpoint, a seg6 tunnel would be configured as follows:
ip -6 ro ad fc00::1/128 encap seg6 mode encap segs fc42::1,fc42::2,fc42::3 dev eth0
Any packet whose destination address is fc00::1 would thus be encapsulated
within an outer IPv6 header containing the SRH with three segments, and would
actually be routed to the first segment of the list. If `mode inline' was
specified instead of `mode encap', then the SRH would be directly inserted
after the IPv6 header without outer encapsulation.
The inline mode is only available if CONFIG_IPV6_SEG6_INLINE is enabled. This
feature was made configurable because direct header insertion may break
several mechanisms such as PMTUD or IPSec AH.
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the necessary hooks and structures to provide support
for SR-IPv6 control plane, essentially the Generic Netlink commands
that will be used for userspace control over the Segment Routing
kernel structures.
The genetlink commands provide control over two different structures:
tunnel source and HMAC data. The tunnel source is the source address
that will be used by default when encapsulating packets into an
outer IPv6 header + SRH. If the tunnel source is set to :: then an
address of the outgoing interface will be selected as the source.
The HMAC commands currently just return ENOTSUPP and will be implemented
in a future patch.
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement minimal support for processing of SR-enabled packets
as described in
https://tools.ietf.org/html/draft-ietf-6man-segment-routing-header-02.
This patch implements the following operations:
- Intermediate segment endpoint: incrementation of active segment and rerouting.
- Egress for SR-encapsulated packets: decapsulation of outer IPv6 header + SRH
and routing of inner packet.
- Cleanup flag support for SR-inlined packets: removal of SRH if we are the
penultimate segment endpoint.
A per-interface sysctl seg6_enabled is provided, to accept/deny SR-enabled
packets. Default is deny.
This patch does not provide support for HMAC-signed packets.
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains a larger than usual batch of Netfilter
fixes for your net tree. This series contains a mixture of old bugs and
recently introduced bugs, they are:
1) Fix a crash when using nft_dynset with nft_set_rbtree, which doesn't
support the set element updates from the packet path. From Liping
Zhang.
2) Fix leak when nft_expr_clone() fails, from Liping Zhang.
3) Fix a race when inserting new elements to the set hash from the
packet path, also from Liping.
4) Handle segmented TCP SIP packets properly, basically avoid that the
INVITE in the allow header create bogus expectations by performing
stricter SIP message parsing, from Ulrich Weber.
5) nft_parse_u32_check() should return signed integer for errors, from
John Linville.
6) Fix wrong allocation instead of connlabels, allocate 16 instead of
32 bytes, from Florian Westphal.
7) Fix compilation breakage when building the ip_vs_sync code with
CONFIG_OPTIMIZE_INLINING on x86, from Arnd Bergmann.
8) Destroy the new set if the transaction object cannot be allocated,
also from Liping Zhang.
9) Use device to route duplicated packets via nft_dup only when set by
the user, otherwise packets may not follow the right route, again
from Liping.
10) Fix wrong maximum genetlink attribute definition in IPVS, from
WANG Cong.
11) Ignore untracked conntrack objects from xt_connmark, from Florian
Westphal.
12) Allow to use conntrack helpers that are registered NFPROTO_UNSPEC
via CT target, otherwise we cannot use the h.245 helper, from
Florian.
13) Revisit garbage collection heuristic in the new workqueue-based
timer approach for conntrack to evict objects earlier, again from
Florian.
14) Fix crash in nf_tables when inserting an element into a verdict map,
from Liping Zhang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
modify registration and deregistration of layer-4 protocol trackers to
facilitate inclusion of new elements into the current list of builtin
protocols. Both builtin (TCP, UDP, ICMP) and non-builtin (DCCP, GRE, SCTP,
UDPlite) layer-4 protocol trackers usually register/deregister themselves
using consecutive calls to nf_ct_l4proto_{,pernet}_{,un}register(...).
This sequence is interrupted and rolled back in case of error; in order to
simplify addition of builtin protocols, the input of the above functions
has been modified to allow registering/unregistering multiple protocols.
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Install the callbacks via the state machine. Use multi state support to avoid
custom list handling for the multiple instances.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: netdev@vger.kernel.org
Cc: rt@linutronix.de
Cc: "David S. Miller" <davem@davemloft.net>
Link: http://lkml.kernel.org/r/20161103145021.28528-10-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Some basic expressions are built into nf_tables.ko, such as nft_cmp,
nft_lookup, nft_range and so on. But these basic expressions' init
routine is a little ugly, too many goto errX labels, and we forget
to call nft_range_module_exit in the exit routine, although it is
harmless.
Acctually, the init and exit routines of these basic expressions
are same, i.e. do nft_register_expr in the init routine and do
nft_unregister_expr in the exit routine.
So it's better to arrange them into an array and deal with them
together.
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add dst port parameter to __ip_tun_set_dst and __ipv6_tun_set_dst
utility functions.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current IP tunneling classification supports only IP addresses and key.
Enhance UDP based IP tunneling classification parameters by adding UDP
src and dst port.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
New encapsulation keys were added to the flower classifier, which allow
classification according to outer (encapsulation) headers attributes
such as key and IP addresses.
In order to expose those attributes outside flower, add
corresponding enums in the flow dissector.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Needed for drivers to pick the relevant action when offloading tunnel
key act.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The default TX queue length of Ethernet devices have been a magic
constant of 1000, ever since the initial git import.
Looking back in historical trees[1][2] the value used to be 100,
with the same comment "Ethernet wants good queues". The commit[3]
that changed this from 100 to 1000 didn't describe why, but from
conversations with Robert Olsson it seems that it was changed
when Ethernet devices went from 100Mbit/s to 1Gbit/s, because the
link speed increased x10 the queue size were also adjusted. This
value later caused much heartache for the bufferbloat community.
This patch merely moves the value into a defined constant.
[1] https://git.kernel.org/cgit/linux/kernel/git/davem/netdev-vger-cvs.git/
[2] https://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/
[3] https://git.kernel.org/tglx/history/c/98921832c232
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A new argument is added to __skb_recv_datagram to provide
an explicit skb destructor, invoked under the receive queue
lock.
The UDP protocol uses such argument to perform memory
reclaiming on dequeue, so that the UDP protocol does not
set anymore skb->desctructor.
Instead explicit memory reclaiming is performed at close() time and
when skbs are removed from the receive queue.
The in kernel UDP protocol users now need to call a
skb_recv_udp() variant instead of skb_recv_datagram() to
properly perform memory accounting on dequeue.
Overall, this allows acquiring only once the receive queue
lock on dequeue.
Tested using pktgen with random src port, 64 bytes packet,
wire-speed on a 10G link as sender and udp_sink as the receiver,
using an l4 tuple rxhash to stress the contention, and one or more
udp_sink instances with reuseport.
nr sinks vanilla patched
1 440 560
3 2150 2300
6 3650 3800
9 4450 4600
12 6250 6450
v1 -> v2:
- do rmem and allocated memory scheduling under the receive lock
- do bulk scheduling in first_packet_length() and in udp_destruct_sock()
- avoid the typdef for the dequeue callback
Suggested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
So that we can use it even after orphaining the skbuff.
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Use the UID in routing lookups made by protocol connect() and
sendmsg() functions.
- Make sure that routing lookups triggered by incoming packets
(e.g., Path MTU discovery) take the UID of the socket into
account.
- For packets not associated with a userspace socket, (e.g., ping
replies) use UID 0 inside the user namespace corresponding to
the network namespace the socket belongs to. This allows
all namespaces to apply routing and iptables rules to
kernel-originated traffic in that namespaces by matching UID 0.
This is better than using the UID of the kernel socket that is
sending the traffic, because the UID of kernel sockets created
at namespace creation time (e.g., the per-processor ICMP and
TCP sockets) is the UID of the user that created the socket,
which might not be mapped in the namespace.
Tested: compiles allnoconfig, allyesconfig, allmodconfig
Tested: https://android-review.googlesource.com/253302
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Define a new FIB rule attributes, FRA_UID_RANGE, to describe a
range of UIDs.
- Define a RTA_UID attribute for per-UID route lookups and dumps.
- Support passing these attributes to and from userspace via
rtnetlink. The value INVALID_UID indicates no UID was
specified.
- Add a UID field to the flow structures.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Protocol sockets (struct sock) don't have UIDs, but most of the
time, they map 1:1 to userspace sockets (struct socket) which do.
Various operations such as the iptables xt_owner match need
access to the "UID of a socket", and do so by following the
backpointer to the struct socket. This involves taking
sk_callback_lock and doesn't work when there is no socket
because userspace has already called close().
Simplify this by adding a sk_uid field to struct sock whose value
matches the UID of the corresponding struct socket. The semantics
are as follows:
1. Whenever sk_socket is non-null: sk_uid is the same as the UID
in sk_socket, i.e., matches the return value of sock_i_uid.
Specifically, the UID is set when userspace calls socket(),
fchown(), or accept().
2. When sk_socket is NULL, sk_uid is defined as follows:
- For a socket that no longer has a sk_socket because
userspace has called close(): the previous UID.
- For a cloned socket (e.g., an incoming connection that is
established but on which userspace has not yet called
accept): the UID of the socket it was cloned from.
- For a socket that has never had an sk_socket: UID 0 inside
the user namespace corresponding to the network namespace
the socket belongs to.
Kernel sockets created by sock_create_kern are a special case
of #1 and sk_uid is the user that created them. For kernel
sockets created at network namespace creation time, such as the
per-processor ICMP and TCP sockets, this is the user that created
the network namespace.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some configurations (e.g. geneve interface with default
MTU of 1500 over an ethernet interface with 1500 MTU) result
in the transmission of packets that exceed the configured MTU.
While this should be considered to be a "bad" configuration,
it is still allowed and should not result in the sending
of packets that exceed the configured MTU.
Fix by dropping the assumption in ip_finish_output_gso() that
locally originated gso packets will never need fragmentation.
Basic testing using iperf (observing CPU usage and bandwidth)
have shown no measurable performance impact for traffic not
requiring fragmentation.
Fixes: c7ba65d7b6 ("net: ip: push gso skb forwarding handling down the stack")
Reported-by: Jan Tluka <jtluka@redhat.com>
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The IP stack records the largest fragment of a reassembled packet
in IPCB(skb)->frag_max_size. When reading a datagram or raw packet
that arrived fragmented, expose the value to allow applications to
estimate receive path MTU.
Tested:
Sent data over a veth pair of which the source has a small mtu.
Sent data using netcat, received using a dedicated process.
Verified that the cmsg IP_RECVFRAGSIZE is returned only when
data arrives fragmented, and in that cases matches the veth mtu.
ip link add veth0 type veth peer name veth1
ip netns add from
ip netns add to
ip link set dev veth1 netns to
ip netns exec to ip addr add dev veth1 192.168.10.1/24
ip netns exec to ip link set dev veth1 up
ip link set dev veth0 netns from
ip netns exec from ip addr add dev veth0 192.168.10.2/24
ip netns exec from ip link set dev veth0 up
ip netns exec from ip link set dev veth0 mtu 1300
ip netns exec from ethtool -K veth0 ufo off
dd if=/dev/zero bs=1 count=1400 2>/dev/null > payload
ip netns exec to ./recv_cmsg_recvfragsize -4 -u -p 6000 &
ip netns exec from nc -q 1 -u 192.168.10.1 6000 < payload
using github.com/wdebruij/kerneltools/blob/master/tests/recvfragsize.c
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This field is only useful for nf_queue, so store it in the
nf_queue_entry structure instead, away from the core path. Pass
hook_head to nf_hook_slow().
Since we always have a valid entry on the first iteration in
nf_iterate(), we can use 'do { ... } while (entry)' loop instead.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Don't copy relevant fields from hook state structure, instead use the
one that is already available in struct xt_action_param.
This patch also adds a set of new wrapper functions to fetch relevant
hook state structure fields.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Place pointer to hook state in xt_action_param structure instead of
copying the fields that we need. After this change xt_action_param fits
into one cacheline.
This patch also adds a set of new wrapper functions to fetch relevant
hook state structure fields.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
skb->cb may contain data from previous layers. In the observed scenario,
the garbage data were misinterpreted as IP6CB(skb)->frag_max_size, so
that small packets sent through the tunnel are mistakenly fragmented.
This patch unconditionally clears the control buffer in ip6tunnel_xmit(),
which affects ip6_tunnel, ip6_udp_tunnel and ip6_gre. Currently none of
these tunnels set IP6CB(skb)->flags, otherwise it needs to be done earlier.
Cc: stable@vger.kernel.org
Signed-off-by: Eli Cooper <elicooper@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next
tree. This includes better integration with the routing subsystem for
nf_tables, explicit notrack support and smaller updates. More
specifically, they are:
1) Add fib lookup expression for nf_tables, from Florian Westphal. This
new expression provides a native replacement for iptables addrtype
and rp_filter matches. This is more flexible though, since we can
populate the kernel flowi representation to inquire fib to
accomodate new usecases, such as RTBH through skb mark.
2) Introduce rt expression for nf_tables, from Anders K. Pedersen. This
new expression allow you to access skbuff route metadata, more
specifically nexthop and classid fields.
3) Add notrack support for nf_tables, to skip conntracking, requested by
many users already.
4) Add boilerplate code to allow to use nf_log infrastructure from
nf_tables ingress.
5) Allow to mangle pkttype from nf_tables prerouting chain, to emulate
the xtables cluster match, from Liping Zhang.
6) Move socket lookup code into generic nf_socket_* infrastructure so
we can provide a native replacement for the xtables socket match.
7) Make sure nfnetlink_queue data that is updated on every packets is
placed in a different cache from read-only data, from Florian Westphal.
8) Handle NF_STOLEN from nf_tables core, also from Florian Westphal.
9) Start round robin number generation in nft_numgen from zero,
instead of n-1, for consistency with xtables statistics match,
patch from Liping Zhang.
10) Set GFP_NOWARN flag in skbuff netlink allocations in nfnetlink_log,
given we retry with a smaller allocation on failure, from Calvin Owens.
11) Cleanup xt_multiport to use switch(), from Gao feng.
12) Remove superfluous check in nft_immediate and nft_cmp, from
Liping Zhang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Move layer 2 packet logging into nf_log_l2packet() that resides in
nf_log_common.c, so this can be shared by both bridge and netdev
families.
This patch adds the boiler plate code to register the netdev logging
family.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add FIB expression, supported for ipv4, ipv6 and inet family (the latter
just dispatches to ipv4 or ipv6 one based on nfproto).
Currently supports fetching output interface index/name and the
rtm_type associated with an address.
This can be used for adding path filtering. rtm_type is useful
to e.g. enforce a strong-end host model where packets
are only accepted if daddr is configured on the interface the
packet arrived on.
The fib expression is a native nftables alternative to the
xtables addrtype and rp_filter matches.
FIB result order for oif/oifname retrieval is as follows:
- if packet is local (skb has rtable, RTF_LOCAL set, this
will also catch looped-back multicast packets), set oif to
the loopback interface.
- if fib lookup returns an error, or result points to local,
store zero result. This means '--local' option of -m rpfilter
is not supported. It is possible to use 'fib type local' or add
explicit saddr/daddr matching rules to create exceptions if this
is really needed.
- store result in the destination register.
In case of multiple routes, search set for desired oif in case
strict matching is requested.
ipv4 and ipv6 behave fib expressions are supposed to behave the same.
[ I have collapsed Arnd Bergmann's ("netfilter: nf_tables: fib warnings")
http://patchwork.ozlabs.org/patch/688615/
to address fallout from this patch after rebasing nf-next, that was
posted to address compilation warnings. --pablo ]
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Systems with large pages (64KB pages for example) do not always have
huge quantity of memory.
A big SK_MEM_QUANTUM value leads to fewer interactions with the
global counters (like tcp_memory_allocated) but might trigger
memory pressure much faster, giving suboptimal TCP performance
since windows are lowered to ridiculous values.
Note that sysctl_mem units being in pages and in ABI, we also need
to change sk_prot_mem_limits() accordingly.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prior to this patch, in rx path, before calling lock_sock, it needed to
hold assoc when got it by __sctp_lookup_association, in case other place
would free/put assoc.
But in __sctp_lookup_association, it lookup and hold transport, then got
assoc by transport->assoc, then hold assoc and put transport. It means
it didn't hold transport, yet it was returned and later on directly
assigned to chunk->transport.
Without the protection of sock lock, the transport may be freed/put by
other places, which would cause a use-after-free issue.
This patch is to fix this issue by holding transport instead of assoc.
As holding transport can make sure to access assoc is also safe, and
actually it looks up assoc by searching transport rhashtable, to hold
transport here makes more sense.
Note that the function will be renamed later on on another patch.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mostly simple overlapping changes.
For example, David Ahern's adjacency list revamp in 'net-next'
conflicted with an adjacency list traversal bug fix in 'net'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
"Lots of fixes, mostly drivers as is usually the case.
1) Don't treat zero DMA address as invalid in vmxnet3, from Alexey
Khoroshilov.
2) Fix element timeouts in netfilter's nft_dynset, from Anders K.
Pedersen.
3) Don't put aead_req crypto struct on the stack in mac80211, from
Ard Biesheuvel.
4) Several uninitialized variable warning fixes from Arnd Bergmann.
5) Fix memory leak in cxgb4, from Colin Ian King.
6) Fix bpf handling of VLAN header push/pop, from Daniel Borkmann.
7) Several VRF semantic fixes from David Ahern.
8) Set skb->protocol properly in ip6_tnl_xmit(), from Eli Cooper.
9) Socket needs to be locked in udp_disconnect(), from Eric Dumazet.
10) Div-by-zero on 32-bit fix in mlx4 driver, from Eugenia Emantayev.
11) Fix stale link state during failover in NCSCI driver, from Gavin
Shan.
12) Fix netdev lower adjacency list traversal, from Ido Schimmel.
13) Propvide proper handle when emitting notifications of filter
deletes, from Jamal Hadi Salim.
14) Memory leaks and big-endian issues in rtl8xxxu, from Jes Sorensen.
15) Fix DESYNC_FACTOR handling in ipv6, from Jiri Bohac.
16) Several routing offload fixes in mlxsw driver, from Jiri Pirko.
17) Fix broadcast sync problem in TIPC, from Jon Paul Maloy.
18) Validate chunk len before using it in SCTP, from Marcelo Ricardo
Leitner.
19) Revert a netns locking change that causes regressions, from Paul
Moore.
20) Add recursion limit to GRO handling, from Sabrina Dubroca.
21) GFP_KERNEL in irq context fix in ibmvnic, from Thomas Falcon.
22) Avoid accessing stale vxlan/geneve socket in data path, from
Pravin Shelar"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (189 commits)
geneve: avoid using stale geneve socket.
vxlan: avoid using stale vxlan socket.
qede: Fix out-of-bound fastpath memory access
net: phy: dp83848: add dp83822 PHY support
enic: fix rq disable
tipc: fix broadcast link synchronization problem
ibmvnic: Fix missing brackets in init_sub_crq_irqs
ibmvnic: Fix releasing of sub-CRQ IRQs in interrupt context
Revert "ibmvnic: Fix releasing of sub-CRQ IRQs in interrupt context"
arch/powerpc: Update parameters for csum_tcpudp_magic & csum_tcpudp_nofold
net/mlx4_en: Save slave ethtool stats command
net/mlx4_en: Fix potential deadlock in port statistics flow
net/mlx4: Fix firmware command timeout during interrupt test
net/mlx4_core: Do not access comm channel if it has not yet been initialized
net/mlx4_en: Fix panic during reboot
net/mlx4_en: Process all completions in RX rings after port goes up
net/mlx4_en: Resolve dividing by zero in 32-bit system
net/mlx4_core: Change the default value of enable_qos
net/mlx4_core: Avoid setting ports to auto when only one port type is supported
net/mlx4_core: Fix the resource-type enum in res tracker to conform to FW spec
...
When vxlan device is closed vxlan socket is freed. This
operation can race with vxlan-xmit function which
dereferences vxlan socket. Following patch uses RCU
mechanism to avoid this situation.
Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
* client FILS authentication support in mac80211 (Jouni)
* AP/VLAN multicast improvements (Michael Braun)
* config/advertising support for differing beacon intervals on
multiple virtual interfaces (Purushottam Kushwaha, myself)
* deprecate the old WDS mode for cfg80211-based drivers, the
mode is hardly usable since it doesn't support any "modern"
features like WPA encryption (2003), HT (2009) or VHT (2014),
I'm not even sure WEP (introduced in 1997) could be done.
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJYEy/9AAoJEGt7eEactAAd/V0P/0FmHGS8HjlSjm+1p6sbWKbt
5v8bb3cuKHQiYiUM6euIXql2OYuOEHVAQEpNoPXN9CsfKFYgbIH6yW6d8HtKNedV
n9lmMy/U6yJX9nYt7yMIQ3kLkbEg+YU58B9Hf47waWXLLSNVumS8rfNBn43EoNQf
VKWYPWpetsCRIWJ1fnLuxvMHCtOOYCxH+491BUonof32+DKPEAsAbnszZ2ElufTR
7KNyA3K6leOtTd5Ml52dvLOGNc+h2C83VAMxiShq/6r8OnlX5tPifaubzd9n3m41
jiJJH/92ESrtF2AaWEm8slcgtcfHS/O7y/FSoV4r0PMSvPTBdjwQ9nqCsbONd831
vjj6c6YWNxgHPcISX0XcWz+FHnLJdUGaDUtHjAJYw4oH4gaRXwfSw0U+jvdlSMUf
2CBUArk5f0OEguzwa/5X4Jio3OPPIj4jY/lKplcpLOUu8K2FWTLDuIlww/FHXovs
rDzTLQeXZkx+MkszkTJN42qSEfOFly91J6OA2Wju+emBqrLIbkAGmvyLVg8U8BQd
gG7oltgmZ6Xg6fEnUQqpIDO7UJlQ+GXAU04SpNDMv1j/ueUJskxlr3hYM7E9ueQv
LJcZcVV0RAwNRw52cEsdcYCMLuSMYRrO1OHlkl0wd+x2hFrUCWnVzUgLEUhNBV+c
ICmNMr96nKhrZI217yzF
=SjfQ
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2016-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
Among various cleanups and improvements, we have the following:
* client FILS authentication support in mac80211 (Jouni)
* AP/VLAN multicast improvements (Michael Braun)
* config/advertising support for differing beacon intervals on
multiple virtual interfaces (Purushottam Kushwaha, myself)
* deprecate the old WDS mode for cfg80211-based drivers, the
mode is hardly usable since it doesn't support any "modern"
features like WPA encryption (2003), HT (2009) or VHT (2014),
I'm not even sure WEP (introduced in 1997) could be done.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
* a fix to process all events while suspending, so any
potential calls into the driver are done before it is
suspended
* small markup fixes for the sphinx documentation conversion
that's coming into the tree via the doc tree
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJYEbSGAAoJEGt7eEactAAdOo4P/A1RyOl2uTspwB4lZFMMA5Tp
IIwaB+WlNcjGq05C8kemdQJ30AKJUy6aKMTb5FIuFXWLRK6u5uIoQsYe6cRI8WBs
bgTpo5RMcNGx6dUgFS/XwM2bLXUNXAXOt+43c2kWSjnqQZpQuSXh8/4R8ybZxyZB
SNyKCzkx1LYPox4Qnufmi+pz/hI8hHRY7gMdyhUWU2uUk9u/nuY0Zxyvj4OFgyow
SOGasCPoDzT9556hUGyG9M8HmwBiRZUzfzo5mSR2V0a9Tij4ZOvoSkcAUDbj38fZ
0UPJ48xwUDMF88W+NEO+rrd1IQXtlsHJk6x/mAQpSkEvzymb5BEvraVeTV3+chgo
fn45ZDP/GbTwqJ3YacSmKgyGwaGPF9XN2+Iuxotp0UeuYxvxwnkWNr0JbRecDPau
ZL/P2qV7Y3gjy0+6nHafrYIYPuPFZjwPIJ7k6x7mVtO6zfJhtKF57YP/YoFYD78l
TtCEE5o7lAYkdWKa2nm4Fg16NM897PS/GTM9dQCFtnYLV7ynZPY2TGf4TelqJ3cz
FU1/vlM0t/jRkkRKIhlmy8re4ASjTEccf//dllz1CDyPdxHY1NmGftftz539s7qW
NpCvvO0IXDchBT1jeoSMwcMzlZSfQeIKtkp1P0z4lyPiMcmekdgUQ7sb84FhwSa+
SYRpiZ/Xv6xa5Djhonci
=toQY
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2016-10-27' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Just two fixes:
* a fix to process all events while suspending, so any
potential calls into the driver are done before it is
suspended
* small markup fixes for the sphinx documentation conversion
that's coming into the tree via the doc tree
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Per listen(fd, backlog) rules, there is really no point accepting a SYN,
sending a SYNACK, and dropping the following ACK packet if accept queue
is full, because application is not draining accept queue fast enough.
This behavior is fooling TCP clients that believe they established a
flow, while there is nothing at server side. They might then send about
10 MSS (if using IW10) that will be dropped anyway while server is under
stress.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wrap several common instances of:
kmemdup(nla_data(attr), nla_len(attr), GFP_KERNEL);
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to IPv4, do not consider link state when validating next hops.
Currently, if the link is down default routes can fail to insert:
$ ip -6 ro add vrf blue default via 2100:2::64 dev eth2
RTNETLINK answers: No route to host
With this patch the command succeeds.
Fixes: 8c14586fc3 ("net: ipv6: Use passed in table for nexthop lookups")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rt6_add_route_info and rt6_add_dflt_router were updated to pull the FIB
table from the device index, but the corresponding rt6_get_route_info
and rt6_get_dflt_router functions were not leading to the failure to
process RA's:
ICMPv6: RA: ndisc_router_discovery failed to add default route
Fix the 'get' functions by using the table id associated with the
device when applicable.
Also, now that default routes can be added to tables other than the
default table, rt6_purge_dflt_routers needs to be updated as well to
look at all tables. To handle that efficiently, add a flag to the table
denoting if it is has a default route via RA.
Fixes: ca254490c8 ("net: Add VRF support to IPv6 stack")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since generic netlink family IDs are small integers, allocated
densely, IDR is an ideal match for lookups. Replace the existing
hand-written hash-table with IDR for allocation and lookup.
This lets the families only be written to once, during register,
since the list_head can be removed and removal of a family won't
cause any writes.
It also slightly reduces the code size (by about 1.3k on x86-64).
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of providing macros/inline functions to initialize
the families, make all users initialize them statically and
get rid of the macros.
This reduces the kernel code size by about 1.6k on x86-64
(with allyesconfig).
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Static family IDs have never really been used, the only
use case was the workaround I introduced for those users
that assumed their family ID was also their multicast
group ID.
Additionally, because static family IDs would never be
reserved by the generic netlink code, using a relatively
low ID would only work for built-in families that can be
registered immediately after generic netlink is started,
which is basically only the control family (apart from
the workaround code, which I also had to add code for so
it would reserve those IDs)
Thus, anything other than GENL_ID_GENERATE is flawed and
luckily not used except in the cases I mentioned. Move
those workarounds into a few lines of code, and then get
rid of GENL_ID_GENERATE entirely, making it more robust.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This helper function allows family implementations to access
their family's attrbuf. This gets rid of the attrbuf usage
in families, and also adds locking validation, since it's not
valid to use the attrbuf with parallel_ops or outside of the
dumpit callback.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The user may want to use only some bits of the skb mark in
his skbedit rules because the remaining part might be used by
something else.
Introduce the "mask" parameter to the skbedit actor in order
to implement such functionality.
When the mask is specified, only those bits selected by the
latter are altered really changed by the actor, while the
rest is left untouched.
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is now a fixed-size extension, so we don't need to pass a variable
alloc size. This (harmless) error results in allocating 32 instead of
the needed 16 bytes for this extension as the size gets passed twice.
Fixes: 23014011ba ("netfilter: conntrack: support a fixed size of 128 distinct labels")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Commit 36b701fae1 ("netfilter: nf_tables: validate maximum value of
u32 netlink attributes") introduced nft_parse_u32_check with a return
value of "unsigned int", yet on error it returns "-ERANGE".
This patch corrects the mismatch by changing the return value to "int",
which happens to match the actual users of nft_parse_u32_check already.
Found by Coverity, CID 1373930.
Note that commit 21a9e0f156 ("netfilter: nft_exthdr: fix error
handling in nft_exthdr_init()) attempted to address the issue, but
did not address the return type of nft_parse_u32_check.
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Cc: Laura Garcia Liebana <nevola@gmail.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 36b701fae1 ("netfilter: nf_tables: validate maximum value...")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When nft_expr_clone failed, a series of problems will happen:
1. module refcnt will leak, we call __module_get at the beginning but
we forget to put it back if ops->clone returns fail
2. memory will be leaked, if clone fail, we just return NULL and forget
to free the alloced element
3. set->nelems will become incorrect when set->size is specified. If
clone fail, we should decrease the set->nelems
Now this patch fixes these problems. And fortunately, clone fail will
only happen on counter expression when memory is exhausted.
Fixes: 086f332167 ("netfilter: nf_tables: add clone interface to expression operations")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add functionality to update the connection parameters when in connected
state, so that driver/firmware uses the updated parameters for
subsequent roaming. This is for drivers that support internal BSS
selection and roaming. The new command does not change the current
association state, i.e., it can be used to update IE contents for future
(re)associations without causing an immediate disassociation or
reassociation with the current BSS.
This commit implements the required functionality for updating IEs for
(Re)Association Request frame only. Other parameters can be added in
future when required.
Signed-off-by: vamsi krishna <vamsin@qti.qualcomm.com>
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add the ability to configure if an AP (and associated VLANs) will
do multicast-to-unicast conversion for ARP, IPv4 and IPv6 frames
(possibly within 802.1Q). If enabled, such frames are to be sent
to each station separately, with the DA replaced by their own MAC
address rather than the group address.
Note that this may break certain expectations of the receiver,
such as the ability to drop unicast IP packets received within
multicast L2 frames, or the ability to not send ICMP destination
unreachable messages for packets received in L2 multicast (which
is required, but the receiver can't tell the difference if this
new option is enabled.)
This also doesn't implement the 802.11 DMS (directed multicast
service).
Signed-off-by: Michael Braun <michael-dev@fami-braun.de>
[fix disabling, add better documentation & commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The new nl80211 attributes can be used to provide KEK and nonces to
allow the driver to encrypt and decrypt FILS (Re)Association
Request/Response frames in station mode.
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Define the Element IDs and Element ID Extensions from IEEE
P802.11ai/D11.0. In addition, add a new cfg80211_find_ext_ie() wrapper
to make it easier to find information elements that used the Element ID
Extension field.
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This adds defines and nl80211 extensions to allow FILS Authentication to
be implemented similarly to SAE. FILS does not need the special rules
for the Authentication transaction number and Status code fields, but it
does need to add non-IE fields. The previously used
NL80211_ATTR_SAE_DATA can be reused for this to avoid having to
duplicate that implementation. Rename that attribute to more generic
NL80211_ATTR_AUTH_DATA (with backwards compatibility define for
NL80211_SAE_DATA).
Also document the special rules related to the Authentication
transaction number and Status code fiels.
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Remove the pointless checking against interface combinations in
the initial basic beacon interval validation, that currently isn't
taking into account radar detection or channels properly. Instead,
just validate the basic range there, and then delay real checking
to the interface combination validation that drivers must do.
This means that drivers wanting to use the beacon_int_min_gcd will
now have to pass the new_beacon_int when validating the AP/mesh
start.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add a helper using wdev to check if interface is running. This
deals with both non-netdev and netdev interfaces. In struct
wireless_dev replace 'p2p_started' and 'nan_started' by
'is_running' as those are mutually exclusive anyway, and unify
all the code to use wdev_running().
Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
First bug was added in commit ad6f939ab1 ("ip: Add offset parameter to
ip_cmsg_recv") : Tom missed that ipv4 udp messages could be received on
AF_INET6 socket. ip_cmsg_recv(msg, skb) should have been replaced by
ip_cmsg_recv_offset(msg, skb, sizeof(struct udphdr));
Then commit e6afc8ace6 ("udp: remove headers from UDP packets before
queueing") forgot to adjust the offsets now UDP headers are pulled
before skb are put in receive queue.
Fixes: ad6f939ab1 ("ip: Add offset parameter to ip_cmsg_recv")
Fixes: e6afc8ace6 ("udp: remove headers from UDP packets before queueing")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Sam Kumar <samanthakumar@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The skbuff and sock structure both had missing parameter annotation
values.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In criu we are actively using diag interface to collect sockets
present in the system when dumping applications. And while for
unix, tcp, udp[lite], packet, netlink it works as expected,
the raw sockets do not have. Thus add it.
v2:
- add missing sock_put calls in raw_diag_dump_one (by eric.dumazet@)
- implement @destroy for diag requests (by dsa@)
v3:
- add export of raw_abort for IPv6 (by dsa@)
- pass net-admin flag into inet_sk_diag_fill due to
changes in net-next branch (by dsa@)
v4:
- use @pad in struct inet_diag_req_v2 for raw socket
protocol specification: raw module carries sockets
which may have custom protocol passed from socket()
syscall and sole @sdiag_protocol is not enough to
match underlied ones
- start reporting protocol specifed in socket() call
when sockets are raw ones for the same reason: user
space tools like ss may parse this attribute and use
it for socket matching
v5 (by eric.dumazet@):
- use sock_hold in raw_sock_get instead of atomic_inc,
we're holding (raw_v4_hashinfo|raw_v6_hashinfo)->lock
when looking up so counter won't be zero here.
v6:
- use sdiag_raw_protocol() helper which will access @pad
structure used for raw sockets protocol specification:
we can't simply rename this member without breaking uapi
v7:
- sine sdiag_raw_protocol() helper is not suitable for
uapi lets rather make an alias structure with proper
names. __check_inet_diag_req_raw helper will catch
if any of structure unintentionally changed.
CC: David S. Miller <davem@davemloft.net>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Ahern <dsa@cumulusnetworks.com>
CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
CC: James Morris <jmorris@namei.org>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Patrick McHardy <kaber@trash.net>
CC: Andrey Vagin <avagin@openvz.org>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The field is initialized by ILA and MPLS but never used. Remove it.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Avoid using the generic helpers.
Use the receive queue spin lock to protect the memory
accounting operation, both on enqueue and on dequeue.
On dequeue perform partial memory reclaiming, trying to
leave a quantum of forward allocated memory.
On enqueue use a custom helper, to allow some optimizations:
- use a plain spin_lock() variant instead of the slightly
costly spin_lock_irqsave(),
- avoid dst_force check, since the calling code has already
dropped the skb dst
- avoid orphaning the skb, since skb_steal_sock() already did
the work for us
The above needs custom memory reclaiming on shutdown, provided
by the udp_destruct_sock().
v5 -> v6:
- don't orphan the skb on enqueue
v4 -> v5:
- replace the mem_lock with the receive queue spin lock
- ensure that the bh is always allowed to enqueue at least
a skb, even if sk_rcvbuf is exceeded
v3 -> v4:
- reworked memory accunting, simplifying the schema
- provide an helper for both memory scheduling and enqueuing
v1 -> v2:
- use a udp specific destrctor to perform memory reclaiming
- remove a couple of helpers, unneeded after the above cleanup
- do not reclaim memory on dequeue if not under memory
pressure
- reworked the fwd accounting schema to avoid potential
integer overflow
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Basic sock operations that udp code can use with its own
memory accounting schema. No functional change is introduced
in the existing APIs.
v4 -> v5:
- avoid whitespace changes
v2 -> v4:
- avoid exporting __sock_enqueue_skb
v1 -> v2:
- avoid export sock_rmem_free
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Baozeng reported this deadlock case:
CPU0 CPU1
---- ----
lock([ 165.136033] sk_lock-AF_INET6);
lock([ 165.136033] rtnl_mutex);
lock([ 165.136033] sk_lock-AF_INET6);
lock([ 165.136033] rtnl_mutex);
Similar to commit 87e9f03159
("ipv4: fix a potential deadlock in mcast getsockopt() path")
this is due to we still have a case, ipv6_sock_mc_close(),
where we acquire sk_lock before rtnl_lock. Close this deadlock
with the similar solution, that is always acquire rtnl lock first.
Fixes: baf606d9c9 ("ipv4,ipv6: grab rtnl before locking the socket")
Reported-by: Baozeng Ding <sploving1@gmail.com>
Tested-by: Baozeng Ding <sploving1@gmail.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Baozeng Ding reported KASAN traces showing uses after free in
udp_lib_get_port() and other related UDP functions.
A CONFIG_DEBUG_PAGEALLOC=y kernel would eventually crash.
I could write a reproducer with two threads doing :
static int sock_fd;
static void *thr1(void *arg)
{
for (;;) {
connect(sock_fd, (const struct sockaddr *)arg,
sizeof(struct sockaddr_in));
}
}
static void *thr2(void *arg)
{
struct sockaddr_in unspec;
for (;;) {
memset(&unspec, 0, sizeof(unspec));
connect(sock_fd, (const struct sockaddr *)&unspec,
sizeof(unspec));
}
}
Problem is that udp_disconnect() could run without holding socket lock,
and this was causing list corruptions.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On a disconnect request from userspace, cfg80211 currently calls
called rdev_disconnect() only in case that 'current_bss' was set,
i.e. connection had been established.
Change this to allow the userspace call to succeed and call the
driver's disconnect() method also while the connection attempt is
in progress, to be able to abort attempts.
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
[change commit subject/message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The uapsd_queue field is in QoS IE order and not in
IEEE80211_AC_*'s order.
This means that mac80211 would get confused between
BK and BE which is certainly not such a big deal but
needs to be fixed.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently mac80211 determines whether HW does fragmentation
by checking whether the set_frag_threshold callback is set
or not.
However, some drivers may want to set the HW fragmentation
capability depending on HW generation.
Allow this by checking a HW flag instead of checking the
callback.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
[added the flag to ath10k and wlcore]
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
iwlwifi will check internally that the tid maps to an AC
that is trigger enabled, but can't know what tid exactly.
Allow the driver to pass a generic tid and make mac80211
assume that a trigger frame was received.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The values don't match the radiotap spec, corrected that.
Reported-by: Oz Shalev <oz.shalev@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
from stack removal fix that crashes things when VMAP
stack is used in conjunction with software crypto.
Aside from that, we have:
* a fix for AP_VLAN usage with the nl80211 frame command
* two fixes (and two preparation patches) for A-MSDU, one
to discard group-addressed (multicast) and unexpected
4-address A-MSDUs, the other to validate A-MSDU inner
MAC addresses properly to prevent controlled port bypass
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJYBcgKAAoJEGt7eEactAAdhUkP/jMVQbLMZ1Jcc9+lsPVGUIga
I9GeQ4lcnD+4ASeJUhTtemC1IMNL4zMVqaIxbznDXKP7rZRrODVvCPk2TYIw9c5S
rzF/TRierMFttLu3xY757nAsYg6T7F03JdOQ3SKIb3xOD8pXCWQoVRN14ldroRno
4stOAtDrpD5wvK2JhlWv1EYlxGVLqLcakZt/BwgDX/cJGkAx49Q/s29FUnesB9Ep
sCH5chffeQskOL9CrSwboNmucgt4HGQORc4UL/KtPOEBtyfu/LCXEKSqAKVyQZtZ
OerouOHWqQE5lT2K6qD/KKFW4lV2t1h+xzqsvZk4ZR5o3s+PAGai6D/wf+JgY9Hk
uor9ju/e0htcI9m0aFdHDnltV0OOwIhR2bxWTuBBUkyFVtdQQY+1MRTTtuunWIB4
SDYv6LrNL/0HAIuTlPQH99rnsFNnRZCtTpdbT7GRckAMeWMvy19bF2ZB1FXuSn+h
5dxIo0qkw8nv4Y9wQ6QmgOcSzYyidUrCgLTO516qXVAKY0kl/u4q/zPr0Fmx/qfY
oxspelDv0qd2NMQwJ/AmwjAjkQBulv5DVLu+cDXdOMkc/EbhzWyvetcHiNukxjHn
mukCBxTlLoDLug2LFkAPIddEutj+VUEefkf/pD/js8uYuyd9ZnPjiIh6fG25il9a
cHbMYtANt2EnZjwI9Z74
=T6t1
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2016-10-18' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
This is relatively small, mostly to get the SG/crypto
from stack removal fix that crashes things when VMAP
stack is used in conjunction with software crypto.
Aside from that, we have:
* a fix for AP_VLAN usage with the nl80211 frame command
* two fixes (and two preparation patches) for A-MSDU, one
to discard group-addressed (multicast) and unexpected
4-address A-MSDUs, the other to validate A-MSDU inner
MAC addresses properly to prevent controlled port bypass
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, socket lookups for l3mdev (vrf) use cases can match a socket
that is bound to a port but not a device (ie., a global socket). If the
sysctl tcp_l3mdev_accept is not set this leads to ack packets going out
based on the main table even though the packet came in from an L3 domain.
The end result is that the connection does not establish creating
confusion for users since the service is running and a socket shows in
ss output. Fix by requiring an exact dif to sk_bound_dev_if match if the
skb came through an interface enslaved to an l3mdev device and the
tcp_l3mdev_accept is not set.
skb's through an l3mdev interface are marked by setting a flag in
inet{6}_skb_parm. The IPv6 variant is already set; this patch adds the
flag for IPv4. Using an skb flag avoids a device lookup on the dif. The
flag is set in the VRF driver using the IP{6}CB macros. For IPv4, the
inet_skb_parm struct is moved in the cb per commit 971f10eca1, so the
match function in the TCP stack needs to use TCP_SKB_CB. For IPv6, the
move is done after the socket lookup, so IP6CB is used.
The flags field in inet_skb_parm struct needs to be increased to add
another flag. There is currently a 1-byte hole following the flags,
so it can be expanded to u16 without increasing the size of the struct.
Fixes: 193125dbd8 ("net: Introduce VRF device driver")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to IEEE 802.11-2012 section 8.3.2 table 8-19, the outer SA/DA
of A-MSDU frames need to be changed depending on FromDS/ToDS values.
Signed-off-by: Michael Braun <michael-dev@fami-braun.de>
[use ether_addr_copy and add alignment annotations]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Users of lwt tunnels may set up some secondary state in build_state
function. Add a corresponding destroy_state function to allow users to
clean up state. This destroy state function is called from lwstate_free.
Also, we now free lwstate using kfree_rcu so user can assume structure
is not freed before rcu.
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only 32 more to go...
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYAOP4AAoJEI3ONVYwIuV6CIgQAKqtI3i99xOJcuVJfojHYo0p
LRLwIX0RxkQCb+nCPLJTjH+NLQ5Zw3BLFTmizcewJJuYnv8eBbBcAEsegvrkIl7B
0KmHEttdWFkujE+kISmfI6WsvxiFt+VbcjqgFMNM7D5Xw352x3v3X9VMPO7P/5lz
ztWCdYZxhH2qFmeDiNmKMnPqtUJjOppTR73jqMzPHUI4PQcFxzGaTRwntuCJQ/XA
fRwcTEQAX3r/xdCDb7+tIq00i+J8ZDTqwng9/8GqlWyjeDQZG8CmaGvDBwA1+n+X
zG6lmOHLPIBppOF8rUQ1Q1ZlZl5x0jPDoo19mGdlQ+IgZocdo4z43XTc0c+oLguA
zjiXKJXn1EJvl4iKLeF6nkxJxESJioCNg3eXqFPLFjYDSWzrK7umTkiJMLB9UbqN
ThqrxgVrMpKjSug9KKqItu47WZ4s+dczkkngyiqMUUo34RDnfCQXwCBO7JAdzyyH
XnwrCVj6hD8SIIv2REWNAiBTzIqEZxNmc9qvwj+Xy18hXKYhqqYRtf35QL3adp9R
Nigl9dtcTdCkNJiaVYgfcTz/9ZMLcrKcMFV27ExMYZiDce2T7YWnnE5/VpheXi0r
/EULZLxKJgu99SHACLmK1ZWD8YuoqlRZVQtk8LTNOBAiu8sasrwVYy7meQWlsL/8
Q/bmUmWXUD3I6CwIaS1R
=U9Pc
-----END PGP SIGNATURE-----
Merge tag 'docs-4.9-2' of git://git.lwn.net/linux
Pull one more documentation update from Jonathan Corbet:
"A single commit converting the mac80211 DocBook template over to
Sphinx. Only 32 more to go..."
* tag 'docs-4.9-2' of git://git.lwn.net/linux:
docs-rst: sphinxify 802.11 documentation
The IPv6 temporary address generation uses a variable called DESYNC_FACTOR
to prevent hosts updating the addresses at the same time. Quoting RFC 4941:
... The value DESYNC_FACTOR is a random value (different for each
client) that ensures that clients don't synchronize with each other and
generate new addresses at exactly the same time ...
DESYNC_FACTOR is defined as:
DESYNC_FACTOR -- A random value within the range 0 - MAX_DESYNC_FACTOR.
It is computed once at system start (rather than each time it is used)
and must never be greater than (TEMP_VALID_LIFETIME - REGEN_ADVANCE).
First, I believe the RFC has a typo in it and meant to say: "and must
never be greater than (TEMP_PREFERRED_LIFETIME - REGEN_ADVANCE)"
The reason is that at various places in the RFC, DESYNC_FACTOR is used in
a calculation like (TEMP_PREFERRED_LIFETIME - DESYNC_FACTOR) or
(TEMP_PREFERRED_LIFETIME - REGEN_ADVANCE - DESYNC_FACTOR). It needs to be
smaller than (TEMP_PREFERRED_LIFETIME - REGEN_ADVANCE) for the result of
these calculations to be larger than zero. It's never used in a
calculation together with TEMP_VALID_LIFETIME.
I already submitted an errata to the rfc-editor:
https://www.rfc-editor.org/errata_search.php?rfc=4941
The Linux implementation of DESYNC_FACTOR is very wrong:
max_desync_factor is used in places DESYNC_FACTOR should be used.
max_desync_factor is initialized to the RFC-recommended value for
MAX_DESYNC_FACTOR (600) but the whole point is to get a _random_ value.
And nothing ensures that the value used is not greater than
(TEMP_PREFERRED_LIFETIME - REGEN_ADVANCE), which leads to underflows. The
effect can easily be observed when setting the temp_prefered_lft sysctl
e.g. to 60. The preferred lifetime of the temporary addresses will be
bogus.
TEMP_PREFERRED_LIFETIME and REGEN_ADVANCE are not constants and can be
influenced by these three sysctls: regen_max_retry, dad_transmits and
temp_prefered_lft. Thus, the upper bound for desync_factor needs to be
re-calculated each time a new address is generated and if desync_factor is
larger than the new upper bound, a new random value needs to be
re-generated.
And since we already have max_desync_factor configurable per interface, we
also need to calculate and store desync_factor per interface.
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
The randomized interface identifier (rndid) was periodically updated from
the regen_timer timer. Simplify the code by updating the rndid only when
needed by ipv6_try_regen_rndid().
This makes the follow-up DESYNC_FACTOR fix much simpler. Also it fixes a
reference counting error in this error path, where an in6_dev_put was
missing:
err = addrconf_sysctl_register(ndev);
if (err) {
ipv6_mc_destroy_dev(ndev);
- del_timer(&ndev->regen_timer);
snmp6_unregister_dev(ndev);
goto err_release;
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
These accessors are used in various drivers that support tc offloading,
to detect properties of a given 'tc_action'.
'is_tcf_mirred_redirect' tests that the action is TCA_EGRESS_REDIR.
'is_tcf_mirred_mirror' tests that the action is TCA_EGRESS_MIRROR.
As a prep towards supporting INGRESS redir/mirror, rename these
predicates to reflect their true meaning:
s/is_tcf_mirred_redirect/is_tcf_mirred_egress_redirect/
s/is_tcf_mirred_mirror/is_tcf_mirred_egress_mirror/
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Cc: Hariprasad S <hariprasad@chelsio.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Ido Schimmel <idosch@mellanox.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
'tcfm_ok_push' specifies whether a mac_len sized push is needed upon
egress to the target device (if action is performed at ingress).
Rename it to 'tcfm_mac_header_xmit' as this is actually an attribute of
the target device (and use a bool instead of int).
This allows to decouple the attribute from the action to be taken.
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix various build warnings in tlan/qed/xen-netback drivers, from
Arnd Bergmann.
2) Propagate proper error code in strparser's strp_recv(), from Geert
Uytterhoeven.
3) Fix accidental broadcast of RTM_GETTFILTER responses, from Eric
Dumazret.
4) Need to use list_for_each_entry_safe() in qed driver, from Wei
Yongjun.
5) Openvswitch 802.1AD bug fixes from Jiri Benc.
6) Cure BUILD_BUG_ON() in mlx5 driver, from Tom Herbert.
7) Fix UDP ipv6 checksumming in netvsc driver, from Stephen Hemminger.
8) stmmac driver fixes from Giuseppe CAVALLARO.
9) Fix access to mangled IP6CB in tcp, from Eric Dumazet.
10) Fix info leaks in tipc and rtnetlink, from Dan Carpenter.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (27 commits)
net: bridge: add the multicast_flood flag attribute to brport_attrs
net: axienet: Remove unused parameter from __axienet_device_reset
liquidio: CN23XX: fix a loop timeout
net: rtnl: info leak in rtnl_fill_vfinfo()
tipc: info leak in __tipc_nl_add_udp_addr()
net: ipv4: Do not drop to make_route if oif is l3mdev
net: phy: Trigger state machine on state change and not polling.
ipv6: tcp: restore IP6CB for pktoptions skbs
netvsc: Remove mistaken udp.h inclusion.
xen-netback: fix type mismatch warning
stmmac: fix error check when init ptp
stmmac: fix ptp init for gmac4
qed: fix old-style function definition
netvsc: fix checksum on UDP IPV6
net_sched: reorder pernet ops and act ops registrations
xen-netback: fix guest Rx stall detection (after guest Rx refactor)
drivers/ptp: Fix kernel memory disclosure
net/mlx5: Add MLX5_ARRAY_SET64 to fix BUILD_BUG_ON
qmi_wwan: add support for Quectel EC21 and EC25
openvswitch: add NETIF_F_HW_VLAN_STAG_TX to internal dev
...
Commit e0d56fdd73 was a bit aggressive removing l3mdev calls in
the IPv4 stack. If the fib_lookup fails we do not want to drop to
make_route if the oif is an l3mdev device.
Also reverts 19664c6a00 ("net: l3mdev: Remove netif_index_is_l3_master")
which removed netif_index_is_l3_master.
Fixes: e0d56fdd73 ("net: l3mdev: remove redundant calls")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The prsctp polices include ttl expires policy already, we should remove
the old ttl expires codes, and just adjust the new polices' codes to be
compatible with the old one for users.
This patch is to remove all the old expires codes, and if prsctp polices
are not set, it will still set msg's expires_at and check the expires in
sctp_check_abandoned.
Note that asoc->prsctp_enable is set by default, so users can't feel any
difference even if they use the old expires api in userspace.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now sctp uses chunk->resent to record if a chunk is retransmitted, for
RTT measurements with retransmitted DATA chunks. chunk->sent_count was
introduced to record how many times one chunk has been sent for prsctp
RTX policy before. We actually can know if one chunk is retransmitted
by checking chunk->sent_count is greater than 1.
This patch is to remove resent from sctp_chunk and reuse sent_count
to avoid retransmitted chunks for RTT measurements.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit provides a mechanism for the host drivers to advertise the
support for different beacon intervals among the respective interface
combinations in a group, through NL80211_IFACE_COMB_BI_MIN_GCD (u32).
This value will be compared against GCD of all beaconing interfaces of
matching combinations.
If the driver doesn't advertise this value, the old behaviour where
all beacon intervals must be identical is retained.
If it is specified, then any beacon interval for an interface in the
interface combination as well as the GCD of all active beacon intervals
in the combination must be greater or equal to this value.
Signed-off-by: Purushottam Kushwaha <pkushwah@qti.qualcomm.com>
[change commit message, some variable names, small other things]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Move the growing parameter list to a structure for the interface
combination check and iteration functions in cfg80211 and mac80211
to make the code easier to understand.
Signed-off-by: Purushottam Kushwaha <pkushwah@qti.qualcomm.com>
[edit commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We should not accept arbitrary DA/SA inside A-MSDUs, it could be used
to circumvent protections, like allowing a station to send frames and
make them seem to come from somewhere else.
Add the necessary infrastructure in cfg80211 to allow such checks, in
further patches we'll start using them.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There's only a single case where has_80211_header is passed as true,
which is in mac80211. Given that there's only simple code that needs
to be done before calling it, export that function from cfg80211
instead and let mac80211 call it itself.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Pull uaccess.h prepwork from Al Viro:
"Preparations to tree-wide switch to use of linux/uaccess.h (which,
obviously, will allow to start unifying stuff for real). The last step
there, ie
PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
`git grep -l "$PATT"|grep -v ^include/linux/uaccess.h`
is not taken here - I would prefer to do it once just before or just
after -rc1. However, everything should be ready for it"
* 'work.uaccess2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
remove a stray reference to asm/uaccess.h in docs
sparc64: separate extable_64.h, switch elf_64.h to it
score: separate extable.h, switch module.h to it
mips: separate extable.h, switch module.h to it
x86: separate extable.h, switch sections.h to it
remove stray include of asm/uaccess.h from cacheflush.h
mn10300: remove a bogus processor.h->uaccess.h include
xtensa: split uaccess.h into C and asm sides
bonding: quit messing with IOCTL
kill __kernel_ds_p off
mn10300: finish verify_area() off
frv: move HAVE_ARCH_UNMAPPED_AREA to pgtable.h
exceptions: detritus removal
This is just a very basic conversion, I've split up the original
multi-book template, and also split up the multi-part mac80211
part in the original book; neither of those were handled by the
automatic pandoc conversion.
Fix errors that showed up, resulting in a much nicer rendering,
at least for the interface combinations documentation.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Pull namespace updates from Eric Biederman:
"This set of changes is a number of smaller things that have been
overlooked in other development cycles focused on more fundamental
change. The devpts changes are small things that were a distraction
until we managed to kill off DEVPTS_MULTPLE_INSTANCES. There is an
trivial regression fix to autofs for the unprivileged mount changes
that went in last cycle. A pair of ioctls has been added by Andrey
Vagin making it is possible to discover the relationships between
namespaces when referring to them through file descriptors.
The big user visible change is starting to add simple resource limits
to catch programs that misbehave. With namespaces in general and user
namespaces in particular allowing users to use more kinds of
resources, it has become important to have something to limit errant
programs. Because the purpose of these limits is to catch errant
programs the code needs to be inexpensive to use as it always on, and
the default limits need to be high enough that well behaved programs
on well behaved systems don't encounter them.
To this end, after some review I have implemented per user per user
namespace limits, and use them to limit the number of namespaces. The
limits being per user mean that one user can not exhause the limits of
another user. The limits being per user namespace allow contexts where
the limit is 0 and security conscious folks can remove from their
threat anlysis the code used to manage namespaces (as they have
historically done as it root only). At the same time the limits being
per user namespace allow other parts of the system to use namespaces.
Namespaces are increasingly being used in application sand boxing
scenarios so an all or nothing disable for the entire system for the
security conscious folks makes increasing use of these sandboxes
impossible.
There is also added a limit on the maximum number of mounts present in
a single mount namespace. It is nontrivial to guess what a reasonable
system wide limit on the number of mount structure in the kernel would
be, especially as it various based on how a system is using
containers. A limit on the number of mounts in a mount namespace
however is much easier to understand and set. In most cases in
practice only about 1000 mounts are used. Given that some autofs
scenarious have the potential to be 30,000 to 50,000 mounts I have set
the default limit for the number of mounts at 100,000 which is well
above every known set of users but low enough that the mount hash
tables don't degrade unreaonsably.
These limits are a start. I expect this estabilishes a pattern that
other limits for resources that namespaces use will follow. There has
been interest in making inotify event limits per user per user
namespace as well as interest expressed in making details about what
is going on in the kernel more visible"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (28 commits)
autofs: Fix automounts by using current_real_cred()->uid
mnt: Add a per mount namespace limit on the number of mounts
netns: move {inc,dec}_net_namespaces into #ifdef
nsfs: Simplify __ns_get_path
tools/testing: add a test to check nsfs ioctl-s
nsfs: add ioctl to get a parent namespace
nsfs: add ioctl to get an owning user namespace for ns file descriptor
kernel: add a helper to get an owning user namespace for a namespace
devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts
devpts: Remove sync_filesystems
devpts: Make devpts_kill_sb safe if fsi is NULL
devpts: Simplify devpts_mount by using mount_nodev
devpts: Move the creation of /dev/pts/ptmx into fill_super
devpts: Move parse_mount_options into fill_super
userns: When the per user per user namespace limit is reached return ENOSPC
userns; Document per user per user namespace limits.
mntns: Add a limit on the number of mount namespaces.
netns: Add a limit on the number of net namespaces
cgroupns: Add a limit on the number of cgroup namespaces
ipcns: Add a limit on the number of ipc namespaces
...
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Resolve the merge conflict between Felix's/my and Toke's patches
coming into the tree through net and mac80211-next respectively.
Most of Felix's changes go away due to Toke's new infrastructure
work, my patch changes to "goto begin" (the label wasn't there
before) instead of returning NULL so flow control towards drivers
is preserved better.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This introduces ncsi_stop_dev(), as counterpart to ncsi_start_dev(),
to stop the NCSI device so that it can be reenabled in future. This
API should be called when the network device driver is going to
shutdown the device. There are 3 things done in the function: Stop
the channel monitoring; Reset channels to inactive state; Report
NCSI link down.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_mpls_header is equivalent to mpls_hdr now. Use the existing helper
instead.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This will be also used by openvswitch.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The TXQ intermediate queues can cause packet reordering when more than
one flow is active to a single station. Since some of the wifi-specific
packet handling (notably sequence number and encryption handling) is
sensitive to re-ordering, things break if they are applied before the
TXQ.
This splits up the TX handlers and fast_xmit logic into two parts: An
early part and a late part. The former is applied before TXQ enqueue,
and the latter after dequeue. The non-TXQ path just applies both parts
at once.
Because fragments shouldn't be split up or reordered, the fragmentation
handler is run after dequeue. Any fragments are then kept in the TXQ and
on subsequent dequeues they take precedence over dequeueing from the FQ
structure.
This approach avoids having to scatter special cases all over the place
for when TXQ is enabled, at the cost of making the fast_xmit and TX
handler code slightly more complex.
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
[fix a few code-style nits, make ieee80211_xmit_fast_finish void,
remove a useless txq->sta check]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This allows the mesh sync (and debugfs) code to make incremental
TSF adjustments, avoiding any uncertainty introduced by delay in
programming absolute TSF.
Signed-off-by: Thomas Pedersen <twp@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The reusable fairness queueing implementation (fq.h) lacks the memory
usage limit that the fq_codel qdisc has. This means that small
devices (e.g. WiFi routers) can run out of memory when flooded with a
large number of packets. This ports the memory limit feature from
fq_codel to fq.h.
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Provide an API to report NAN function match. Mac80211 will lookup the
corresponding cookie and report the match to cfg80211.
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Implement add/rm_nan_func functions and handle NAN function
termination notifications. Handle instance_id allocation for
NAN functions and implement the reconfig flow.
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Implement nan_change_conf callback which allows to change current
NAN configuration (master preference and dual band operation).
Store the current NAN configuration in sdata, so it can be used
both to provide the driver the updated configuration with changes
and also it will be used in hw reconfig flows in next patches.
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Provide a function that reports NAN DE function termination. The function
may be terminated due to one of the following reasons: user request,
ttl expiration or failure.
If the NAN instance is tied to the owner, the notification will be
sent to the socket that started the NAN interface only
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Provide a function the driver can call to report a match.
This will send the event to the user space.
If the NAN instance is tied to the owner, the notifications will be
sent to the socket that started the NAN interface only.
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Some NAN configuration paramaters may change during the operation of
the NAN device. For example, a user may want to update master preference
value when the device gets plugged/unplugged to the power.
Add API that allows to do so.
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
A NAN function can be either publish, subscribe or follow
up. Make all the necessary verifications and just pass the
request to the driver.
Allow the user space application that starts NAN to
forbid any other socket to add or remove functions.
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Ayala Beker <ayala.beker@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This code doesn't do much besides allowing to start and
stop the vif.
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Ayala Beker <ayala.beker@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This allows user space to start/stop NAN interface.
A NAN interface is like P2P device in a few aspects: it
doesn't have a netdev associated to it.
Add the new interface type and prevent operations that
can't be executed on NAN interface like scan.
Define several attributes that may be configured by user space
when starting NAN functionality (master preference and dual
band operation)
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add support for drivers that implement static WEP internally, i.e.
expose connection keys to the driver in connect flow and don't
upload the keys after the connection.
Signed-off-by: David Spinadel <david.spinadel@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Now sctp uses chunk->prsctp_param to save the prsctp param for all the
prsctp polices, we didn't need to introduce prsctp_param to sctp_chunk.
We can just use chunk->sinfo.sinfo_timetolive for RTX and BUF polices,
and reuse msg->expires_at for TTL policy, as the prsctp polices and old
expires policy are mutual exclusive.
This patch is to remove prsctp_param from sctp_chunk, and reuse msg's
expires_at for TTL and chunk's sinfo.sinfo_timetolive for RTX and BUF
polices.
Note that sctp can't use chunk's sinfo.sinfo_timetolive for TTL policy,
as it needs a u64 variables to save the expires_at time.
This one also fixes the "netperf-Throughput_Mbps -37.2% regression"
issue.
Fixes: a6c2f79287 ("sctp: implement prsctp TTL policy")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now pahole sctp_chunk, it has 2 memory holes:
struct sctp_chunk {
struct list_head list;
atomic_t refcnt;
/* XXX 4 bytes hole, try to pack */
...
long unsigned int prsctp_param;
int sent_count;
/* XXX 4 bytes hole, try to pack */
This patch is to move up sent_count to fill the 1st one and eliminate
the 2nd one.
It's not just another struct compaction, it also fixes the "netperf-
Throughput_Mbps -37.2% regression" issue when overloading the CPU.
Fixes: a6c2f79287 ("sctp: implement prsctp TTL policy")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This implements:
https://tools.ietf.org/html/rfc7559
Backoff is performed according to RFC3315 section 14:
https://tools.ietf.org/html/rfc3315#section-14
We allow setting /proc/sys/net/ipv6/conf/*/router_solicitations
to a negative value meaning an unlimited number of retransmits,
and we make this the new default (inline with the RFC).
We also add a new setting:
/proc/sys/net/ipv6/conf/*/router_solicitation_max_interval
defaulting to 1 hour (per RFC recommendation).
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Acked-by: Erik Kline <ek@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is to introduce the generic interfaces for snmp_get_cpu_field{,64}.
It exchanges the two for-loops for collecting the percpu statistics data.
This can aggregate the data by going through all the items of each cpu
sequentially.
Signed-off-by: Jia He <hejianet@gmail.com>
Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the created tc actions list is reversed against the order
set by the user.
Change the actions list order to be the same as was set by the user.
This patch doesn't affect dump actions behavior.
For dumping, action->order parameter is used so the list order doesn't
matter.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since this is now taken care of by FIB notifier, remove the code, with
all unused dependencies.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These helpers are to be used in case someone offloads the FIB entry. The
result is that if the entry is offloaded to at least one device, the
offload flag is set.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows to pass information about added/deleted FIB entries/rules to
whoever is interested. This is done in a very similar way as devinet
notifies address additions/removals.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The only remaining users are issuing SIOCGMIIPHY and SIOCGMIIREG,
neither of which deals with userland pointers. Simply calling
->ndo_do_ioctl() is fine; no messing with set_fs() is needed.
It used to mess with SIOCETHTOOL, which would've needed set_fs(),
but that has been killed in "[NET] ethtool ops are the only way"
9 years ago...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The previous commit added support for specifying the beacon rate
for AP mode. Add features checks to this, and extend it to also
support the rate configuration for mesh networks. For IBSS it's
not as simple due to joining etc., so that's not yet supported.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This allows an option to configure a single beacon tx rate for an AP.
Signed-off-by: Purushottam Kushwaha <pkushwah@qti.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Conflicts:
net/netfilter/core.c
net/netfilter/nf_tables_netdev.c
Resolve two conflicts before pull request for David's net-next tree:
1) Between c73c248490 ("netfilter: nf_tables_netdev: remove redundant
ip_hdr assignment") from the net tree and commit ddc8b6027a
("netfilter: introduce nft_set_pktinfo_{ipv4, ipv6}_validate()").
2) Between e8bffe0cf9 ("net: Add _nf_(un)register_hooks symbols") and
Aaron Conole's patches to replace list_head with single linked list.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
NFTA_LOG_FLAGS attribute is already supported, but the related
NF_LOG_XXX flags are not exposed to the userspace. So we cannot
explicitly enable log flags to log uid, tcp sequence, ip options
and so on, i.e. such rule "nft add rule filter output log uid"
is not supported yet.
So move NF_LOG_XXX macro definitions to the uapi/../nf_log.h. In
order to keep consistent with other modules, change NF_LOG_MASK to
refer to all supported log flags. On the other hand, add a new
NF_LOG_DEFAULT_MASK to refer to the original default log flags.
Finally, if user specify the unsupported log flags or NFTA_LOG_GROUP
and NFTA_LOG_FLAGS are set at the same time, report EINVAL to the
userspace.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Inverse ranges != [a,b] are not currently possible because rules are
composites of && operations, and we need to express this:
data < a || data > b
This patch adds a new range expression. Positive ranges can be already
through two cmp expressions:
cmp(sreg, data, >=)
cmp(sreg, data, <=)
This new range expression provides an alternative way to express this.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The netfilter hook list never uses the prev pointer, and so can be trimmed to
be a simple singly-linked list.
In addition to having a more light weight structure for hook traversal,
struct net becomes 5568 bytes (down from 6400) and struct net_device becomes
2176 bytes (down from 2240).
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
A future patch will modify the hook drop and outfn functions. This will
cause the line lengths to take up too much space. This is simply a
readability change.
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This replaces the last uses of NF_HOOK_THRESH().
Followup patch will remove it and rename nf_hook_thresh.
The reason is that inet (non-bridge) netfilter no longer invokes the
hooks from hooks, so we do no longer need the thresh value to skip hooks
with a lower priority.
The bridge netfilter however may need to do this. br_nf_hook_thresh is a
wrapper that is supposed to do this, i.e. only call hooks with a
priority that exceeds NF_BR_PRI_BRNF.
It's used only in the recursion cases of br_netfilter. It invokes
nf_hook_slow while holding an rcu read-side critical section to make a
future cleanup simpler.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Today the DSA drivers are in charge of flushing the MAC addresses
associated to a port when its STP state changes from Learning or
Forwarding, to Disabled or Blocking or Listening.
This makes the drivers more complex and hides the generic switch logic.
Introduce a new optional port_fast_age operation to dsa_switch_ops, to
move this logic to the DSA layer and keep drivers simple.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Needed e.g for offloading drivers to pick the relevant attributes.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make it similar to time_before() macros:
- easier to understand
- make use of typecheck() to avoid working on unexpected variable types
(made the issue on previous patch visible)
- for _[lg]te versions, slighly faster, as the compiler used to generate
a sequence of cmp/je/cmp/js instructions and now it's sub/test/jle
(for _lte):
Before, for sctp_outq_sack:
if (primary->cacc.changeover_active) {
1f01: 80 b9 84 02 00 00 00 cmpb $0x0,0x284(%rcx)
1f08: 74 6e je 1f78 <sctp_outq_sack+0xe8>
u8 clear_cycling = 0;
if (TSN_lte(primary->cacc.next_tsn_at_change, sack_ctsn)) {
1f0a: 8b 81 80 02 00 00 mov 0x280(%rcx),%eax
return ((s) - (t)) & TSN_SIGN_BIT;
}
static inline int TSN_lte(__u32 s, __u32 t)
{
return ((s) == (t)) || (((s) - (t)) & TSN_SIGN_BIT);
1f10: 8b 7d bc mov -0x44(%rbp),%edi
1f13: 39 c7 cmp %eax,%edi
1f15: 74 25 je 1f3c <sctp_outq_sack+0xac>
1f17: 39 f8 cmp %edi,%eax
1f19: 78 21 js 1f3c <sctp_outq_sack+0xac>
primary->cacc.changeover_active = 0;
After:
if (primary->cacc.changeover_active) {
1ee7: 80 b9 84 02 00 00 00 cmpb $0x0,0x284(%rcx)
1eee: 74 73 je 1f63 <sctp_outq_sack+0xf3>
u8 clear_cycling = 0;
if (TSN_lte(primary->cacc.next_tsn_at_change, sack_ctsn)) {
1ef0: 8b 81 80 02 00 00 mov 0x280(%rcx),%eax
1ef6: 2b 45 b4 sub -0x4c(%rbp),%eax
1ef9: 85 c0 test %eax,%eax
1efb: 7e 26 jle 1f23 <sctp_outq_sack+0xb3>
primary->cacc.changeover_active = 0;
*_lt() generated pretty much the same code.
Tested with gcc (GCC) 6.1.1 20160621.
This patch also removes SSN_lte as it is not used and cleanups some
comments.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit ac28634456 ("netfilter: bridge: add nf_afinfo to enable
queuing to userspace"), we can queue packets to the user space in bridge
family. But when the user specify the queue range, packets will be only
delivered to the first queue num. Because in nfqueue_hash, we only support
ipv4 and ipv6 family. Now add support for bridge family too.
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fetch value and validate u32 netlink attribute. This validation is
usually required when the u32 netlink attributes are being stored in a
field whose size is smaller.
This patch revisits 4da449ae1d ("netfilter: nft_exthdr: Add size check
on u8 nft_exthdr attributes").
Fixes: 96518518cc ("netfilter: add nftables")
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
To something more meaningful these days, specially because this is
working on packet headers or lengths and which are not tied to any CPU
arch but to the protocol itself.
So, WORD_TRUNC becomes SCTP_TRUNC4 and WORD_ROUND becomes SCTP_PAD4.
Reported-by: David Laight <David.Laight@ACULAB.COM>
Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net): ipsec 2016-09-21
1) Propagate errors on security context allocation.
From Mathias Krause.
2) Fix inbound policy checks for inter address family tunnels.
From Thomas Zeitlhofer.
3) Fix an old memory leak on aead algorithm usage.
From Ilan Tayari.
4) A recent patch fixed a possible NULL pointer dereference
but broke the vti6 input path.
Fix from Nicolas Dichtel.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Call into offloaded filters to update stats.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add cls_bpf support for the TCA_CLS_FLAGS_SKIP_HW flag.
Unlike U32 and flower cls_bpf already has some netlink
flags defined. Create a new attribute to be able to use
the same flag values as the above.
Unlike U32 and flower reject unknown flags.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds hardware offload capability to cls_bpf classifier,
similar to what have been done with U32 and flower.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 1625f45299, vti6 is broken, all input packets are dropped
(LINUX_MIB_XFRMINNOSTATES is incremented).
XFRM_TUNNEL_SKB_CB(skb)->tunnel.ip6 is set by vti6_rcv() before calling
xfrm6_rcv()/xfrm6_rcv_spi(), thus we cannot set to NULL that value in
xfrm6_rcv_spi().
A new function xfrm6_rcv_tnl() that enables to pass a value to
xfrm6_rcv_spi() is added, so that xfrm6_rcv() is not touched (this function
is used in several handlers).
CC: Alexey Kodanev <alexey.kodanev@oracle.com>
Fixes: 1625f45299 ("net/xfrm_input: fix possible NULL deref of tunnel.ip6->parms.i_key")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This commit introduces an optional new "omnipotent" hook,
cong_control(), for congestion control modules. The cong_control()
function is called at the end of processing an ACK (i.e., after
updating sequence numbers, the SACK scoreboard, and loss
detection). At that moment we have precise delivery rate information
the congestion control module can use to control the sending behavior
(using cwnd, TSO skb size, and pacing rate) in any CA state.
This function can also be used by a congestion control that prefers
not to use the default cwnd reduction approach (i.e., the PRR
algorithm) during CA_Recovery to control the cwnd and sending rate
during loss recovery.
We take advantage of the fact that recent changes defer the
retransmission or transmission of new data (e.g. by F-RTO) in recovery
until the new tcp_cong_control() function is run.
With this commit, we only run tcp_update_pacing_rate() if the
congestion control is not using this new API. New congestion controls
which use the new API do not want the TCP stack to run the default
pacing rate calculation and overwrite whatever pacing rate they have
chosen at initialization time.
Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the TCP send buffer expands to twice cwnd, in order to allow
limited transmits in the CA_Recovery state. This assumes that cwnd
does not increase in the CA_Recovery.
For some congestion control algorithms, like the upcoming BBR module,
if the losses in recovery do not indicate congestion then we may
continue to raise cwnd multiplicatively in recovery. In such cases the
current multiplier will falsely limit the sending rate, much as if it
were limited by the application.
This commit adds an optional congestion control callback to use a
different multiplier to expand the TCP send buffer. For congestion
control modules that do not specificy this callback, TCP continues to
use the previous default of 2.
Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
To allow congestion control modules to use the default TSO auto-sizing
algorithm as one of the ingredients in their own decision about TSO sizing:
1) Export tcp_tso_autosize() so that CC modules can use it.
2) Change tcp_tso_autosize() to allow callers to specify a minimum
number of segments per TSO skb, in case the congestion control
module has a different notion of the best floor for TSO skbs for
the connection right now. For very low-rate paths or policed
connections it can be appropriate to use smaller TSO skbs.
Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the tso_segs_goal() function in tcp_congestion_ops to allow the
congestion control module to specify the number of segments that
should be in a TSO skb sent by tcp_write_xmit() and
tcp_xmit_retransmit_queue(). The congestion control module can either
request a particular number of segments in TSO skb that we transmit,
or return 0 if it doesn't care.
This allows the upcoming BBR congestion control module to select small
TSO skb sizes if the module detects that the bottleneck bandwidth is
very low, or that the connection is policed to a low rate.
Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit adds code to track whether the delivery rate represented
by each rate_sample was limited by the application.
Upon each transmit, we store in the is_app_limited field in the skb a
boolean bit indicating whether there is a known "bubble in the pipe":
a point in the rate sample interval where the sender was
application-limited, and did not transmit even though the cwnd and
pacing rate allowed it.
This logic marks the flow app-limited on a write if *all* of the
following are true:
1) There is less than 1 MSS of unsent data in the write queue
available to transmit.
2) There is no packet in the sender's queues (e.g. in fq or the NIC
tx queue).
3) The connection is not limited by cwnd.
4) There are no lost packets to retransmit.
The tcp_rate_check_app_limited() code in tcp_rate.c determines whether
the connection is application-limited at the moment. If the flow is
application-limited, it sets the tp->app_limited field. If the flow is
application-limited then that means there is effectively a "bubble" of
silence in the pipe now, and this silence will be reflected in a lower
bandwidth sample for any rate samples from now until we get an ACK
indicating this bubble has exited the pipe: specifically, until we get
an ACK for the next packet we transmit.
When we send every skb we record in scb->tx.is_app_limited whether the
resulting rate sample will be application-limited.
The code in tcp_rate_gen() checks to see when it is safe to mark all
known application-limited bubbles of silence as having exited the
pipe. It does this by checking to see when the delivered count moves
past the tp->app_limited marker. At this point it zeroes the
tp->app_limited marker, as all known bubbles are out of the pipe.
We make room for the tx.is_app_limited bit in the skb by borrowing a
bit from the in_flight field used by NV to record the number of bytes
in flight. The receive window in the TCP header is 16 bits, and the
max receive window scaling shift factor is 14 (RFC 1323). So the max
receive window offered by the TCP protocol is 2^(16+14) = 2^30. So we
only need 30 bits for the tx.in_flight used by NV.
Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch generates data delivery rate (throughput) samples on a
per-ACK basis. These rate samples can be used by congestion control
modules, and specifically will be used by TCP BBR in later patches in
this series.
Key state:
tp->delivered: Tracks the total number of data packets (original or not)
delivered so far. This is an already-existing field.
tp->delivered_mstamp: the last time tp->delivered was updated.
Algorithm:
A rate sample is calculated as (d1 - d0)/(t1 - t0) on a per-ACK basis:
d1: the current tp->delivered after processing the ACK
t1: the current time after processing the ACK
d0: the prior tp->delivered when the acked skb was transmitted
t0: the prior tp->delivered_mstamp when the acked skb was transmitted
When an skb is transmitted, we snapshot d0 and t0 in its control
block in tcp_rate_skb_sent().
When an ACK arrives, it may SACK and ACK some skbs. For each SACKed
or ACKed skb, tcp_rate_skb_delivered() updates the rate_sample struct
to reflect the latest (d0, t0).
Finally, tcp_rate_gen() generates a rate sample by storing
(d1 - d0) in rs->delivered and (t1 - t0) in rs->interval_us.
One caveat: if an skb was sent with no packets in flight, then
tp->delivered_mstamp may be either invalid (if the connection is
starting) or outdated (if the connection was idle). In that case,
we'll re-stamp tp->delivered_mstamp.
At first glance it seems t0 should always be the time when an skb was
transmitted, but actually this could over-estimate the rate due to
phase mismatch between transmit and ACK events. To track the delivery
rate, we ensure that if packets are in flight then t0 and and t1 are
times at which packets were marked delivered.
If the initial and final RTTs are different then one may be corrupted
by some sort of noise. The noise we see most often is sending gaps
caused by delayed, compressed, or stretched acks. This either affects
both RTTs equally or artificially reduces the final RTT. We approach
this by recording the info we need to compute the initial RTT
(duration of the "send phase" of the window) when we recorded the
associated inflight. Then, for a filter to avoid bandwidth
overestimates, we generalize the per-sample bandwidth computation
from:
bw = delivered / ack_phase_rtt
to the following:
bw = delivered / max(send_phase_rtt, ack_phase_rtt)
In large-scale experiments, this filtering approach incorporating
send_phase_rtt is effective at avoiding bandwidth overestimates due to
ACK compression or stretched ACKs.
Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor the TCP min_rtt code to reuse the new win_minmax library in
lib/win_minmax.c to simplify the TCP code.
This is a pure refactor: the functionality is exactly the same. We
just moved the windowed min code to make TCP easier to read and
maintain, and to allow other parts of the kernel to use the windowed
min/max filter code.
Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2016-09-19
Here's the main bluetooth-next pull request for the 4.9 kernel.
- Added new messages for monitor sockets for better mgmt tracing
- Added local name and appearance support in scan response
- Added new Qualcomm WCNSS SMD based HCI driver
- Minor fixes & cleanup to 802.15.4 code
- New USB ID to btusb driver
- Added Marvell support to HCI UART driver
- Add combined LED trigger for controller power
- Other minor fixes here and there
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch enables prepending appearance value to scan response data.
It also adds support for setting appearance value through mgmt command.
If currently advertised instance has apperance flag set it is expired
immediately.
Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl>
Signed-off-by: Szymon Janc <szymon.janc@codecoup.pl>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
While the subsystem version information are purely informational,
increase the minor number due to the addition of user channel and
management control monitoring suppport. It is helpful for debugging
purposes to see the version numbers change.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
This command is used to retrieve the current state and basic
information of a controller. It is typically used right after
getting the response to the Read Controller Index List command
or an Index Added event (or its extended counterparts).
When any of the values in the EIR_Data field changes, the event
Extended Controller Information Changed will be used to inform
clients about the updated information.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Michał Narajowski <michal.narajowski@codecoup.pl>
Instead of keeping a version string around, use version and revision
numbers and then stringify them for use as module parameter.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Instead of hiding everything behind a general managment events flag,
introduce indivdual flags that allow fine control over which events are
send to a given management channel.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
This adds support for tracing all management commands and events via the
monitor interface.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
This sends new notifications to the monitor support whenever a
management channel has been opened or closed. This allows tracing of
control channels really easily.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
The mgmt version information will be also needed for the control
changell tracing feature. This provides a helper to pack them.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
To further allow unique identification and tracking of control socket,
store cookie and comm information when binding the socket.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Commit 5177a83827 ("Bluetooth: Add debugfs fields for hardware and
firmware info") introduced hci_set_hw_info() and hci_set_fw_info().
These functions use kvasprintf_const() but are not marked with a
__printf attribute. Adding such an attribute helps detecting issues
related to printf-formatting at build time.
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch assigns the next free HCI device identifier to Bluetooth
devices based on the Qualcomm Shared Memory channels.
Signed-off-by: Bjorn Andersson <bjorn.andersson@sonymobile.com>
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The led_trigger field in hci_dev should be conditional based on if
CONFIG_BT_LEDS is set or not.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
This change replaces sk_buff_head struct in Qdiscs with new qdisc_skb_head.
Its similar to the skb_buff_head api, but does not use skb->prev pointers.
Qdiscs will commonly enqueue at the tail of a list and dequeue at head.
While skb_buff_head works fine for this, enqueue/dequeue needs to also
adjust the prev pointer of next element.
The ->prev pointer is not required for qdiscs so we can just leave
it undefined and avoid one cacheline write access for en/dequeue.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Moves qdisc stat accouting to qdisc_dequeue_head.
The only direct caller of the __qdisc_dequeue_head version open-codes
this now.
This allows us to later use __qdisc_dequeue_head as a replacement
of __skb_dequeue() (which operates on sk_buff_head list).
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make ip6_route_input_lookup available outside of ipv6 the module
similar to ip_route_input_noref in the IPv4 world.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* MU-MIMO sniffer support in mac80211
* a create_singlethread_workqueue() cleanup
* interface dump filtering that was documented but not implemented
* support for the new radiotap timestamp field
* send delBA in two unexpected conditions (as required by the spec)
* connect keys cleanups - allow only WEP with index 0-3
* per-station aggregation limit to work around broken APs
* debugfs improvement for the integrated codel algorithm
and various other small improvements and cleanups.
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJX2+umAAoJEGt7eEactAAdIMkP/jMmpbxkzD64L7nTkO4APGva
r6RmMM1SmgVD/CtVkjlBLuvo5YOTWv/vWvy6KoUESOINAx/e6T3T7bmmCOXzbsOL
e5/YYcS1AOqgn5SdhgIj1E5cpdYIhlUGRlNJ0qEjeLLrh4/TLUNbCcuPhOYybUMz
fUrdPKgDeWb7x9EHLENhPsVtCXWwKnkDIS4qclPZCWgRj46XM4pNB4OlvCUzGY6k
bOqGJfrtjYjgKFDmPFqfYA4JDA56980qqO41+eEKXeMvDKNs+pSiNco130Q+uU3E
o7tk9DMnAnCy2GihpV1ZYVkLr6O+7o9xVuenj3NRlhyd1mn2gXxLcO4AkHcrZBkf
Ei+2L+KgnWELyqiSOaGTJKlugsgS4DDoNnFEIVjSweQ9DIoBA/Gj/6+4uZeHXJ3M
bEjtHnCLi5CuI067uBoevwXFoMi1poWra2KnZKOZzFS5OL3xHv4//x/Wmnn2/5Jz
ffEwVyRmTY76sLWfnwXUDClrFWAYQrpNyTryc+k3cpYKzhnseiqt+z43cBuISm00
uh5B9PpPB8RhtUnXrL/SHRyf8YEluaidTsI2lc1LvwXOc0+Zbp73mTCgP+rzLs9p
K2qVRiozpIXanW6hKmmaDwjKlcAKKLP0xN2v90MqwQt4YdLIKlXnll1AH2BawzuP
OWB3n8D0I6y0PWH+Yo8o
=s1MY
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2016-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
This time we have various things - all across the board:
* MU-MIMO sniffer support in mac80211
* a create_singlethread_workqueue() cleanup
* interface dump filtering that was documented but not implemented
* support for the new radiotap timestamp field
* send delBA in two unexpected conditions (as required by the spec)
* connect keys cleanups - allow only WEP with index 0-3
* per-station aggregation limit to work around broken APs
* debugfs improvement for the integrated codel algorithm
and various other small improvements and cleanups.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
sctp_outq_flush return value is meaningless now, this patch is
to make sctp_outq_flush return void, as well as sctp_outq_fail
and sctp_outq_uncork.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Last patch "sctp: do not return the transmit err back to sctp_sendmsg"
made sctp_primitive_SEND return err only when asoc state is unavailable.
In this case, chunks are not enqueued, they have no chance to be freed if
we don't take care of them later.
This Patch is actually to revert commit 1cd4d5c432 ("sctp: remove the
unused sctp_datamsg_free()"), commit 69b5777f2e ("sctp: hold the chunks
only after the chunk is enqueued in outq") and commit 8b570dc9f7 ("sctp:
only drop the reference on the datamsg after sending a msg"), to use
sctp_datamsg_free to free the chunks of current msg.
Fixes: 8b570dc9f7 ("sctp: only drop the reference on the datamsg after sending a msg")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to gre, vxlan, geneve tunnels allow IPIP6 and IP6IP6 tunnels
to operate in 'collect metadata' mode.
Unlike ipv4 code here it's possible to reuse ip6_tnl_xmit() function
for both collect_md and traditional tunnels.
bpf_skb_[gs]et_tunnel_key() helpers and ovs (in the future) are the users.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to gre, vxlan, geneve tunnels allow IPIP tunnels to
operate in 'collect metadata' mode.
bpf_skb_[gs]et_tunnel_key() helpers can make use of it right away.
ovs can use it as well in the future (once appropriate ovs-vport
abstractions and user apis are added).
Note that just like in other tunnels we cannot cache the dst,
since tunnel_info metadata can be different for every packet.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
No longer used after e0d56fdd73 ("net: l3mdev: remove redundant calls")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This function actually operates on u32 yet its paramteres were declared
as u16, causing integer truncation upon calling.
Note in patch context that ADDIP_SERIAL_SIGN_BIT is already 32 bits.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A malicious TCP receiver, sending SACK, can force the sender to split
skbs in write queue and increase its memory usage.
Then, when socket is closed and its write queue purged, we might
overflow sk_forward_alloc (It becomes negative)
sk_mem_reclaim() does nothing in this case, and more than 2GB
are leaked from TCP perspective (tcp_memory_allocated is not changed)
Then warnings trigger from inet_sock_destruct() and
sk_stream_kill_queues() seeing a not zero sk_forward_alloc
All TCP stack can be stuck because TCP is under memory pressure.
A simple fix is to preemptively reclaim from sk_mem_uncharge().
This makes sure a socket wont have more than 2 MB forward allocated,
after burst and idle period.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are a few places where an IE that matches not only the EID, but
also other bytes inside the element, needs to be found. To simplify
that and reduce the amount of similar code, implement a new helper
function to match the EID and an extra array of bytes.
Additionally, simplify cfg80211_find_vendor_ie() by using the new
match function.
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add support for the 2-bytes Qualcomm tag that gigabit switches such as
the QCA8337/N might insert when receiving packets, or that we need
to insert while targeting specific switch ports. The tag is inserted
directly behind the ethernet header.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: John Crispin <john@phrozen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This action is intended to be an upgrade from a usability perspective
from pedit (as well as operational debugability).
Compare this:
sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action pedit munge offset -14 u8 set 0x02 \
munge offset -13 u8 set 0x15 \
munge offset -12 u8 set 0x15 \
munge offset -11 u8 set 0x15 \
munge offset -10 u16 set 0x1515 \
pipe
to:
sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action skbmod dmac 02:15:15:15:15:15
Also try to do a MAC address swap with pedit or worse
try to debug a policy with destination mac, source mac and
etherype. Then make few rules out of those and you'll get my point.
In the future common use cases on pedit can be migrated to this action
(as an example different fields in ip v4/6, transports like tcp/udp/sctp
etc). For this first cut, this allows modifying basic ethernet header.
The most important ethernet use case at the moment is when redirecting or
mirroring packets to a remote machine. The dst mac address needs a re-write
so that it doesnt get dropped or confuse an interconnecting (learning) switch
or dropped by a target machine (which looks at the dst mac). And at times
when flipping back the packet a swap of the MAC addresses is needed.
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based on consecutive msdu failures, mac80211 triggers CQM packet-loss
mechanism. Drivers like ath10k that have its own connection monitoring
algorithm, offloaded to firmware for triggering station kickout. In case
of station kickout, driver will report low ack status by mac80211 API
(ieee80211_report_low_ack).
This flag will enable the driver to completely rely on firmware events
for station kickout and bypass mac80211 packet loss mechanism.
Signed-off-by: Rajkumar Manoharan <rmanohar@qti.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
No drivers implement this, relying either on the recursive
directory removal to remove their debugfs, or not having any
to start with. Remove the dead driver callback.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree,
they are:
1) Endianess fix for the new nf_tables netlink trace infrastructure,
NFTA_TRACE_POLICY endianess was not correct, patch from Liping Zhang.
2) Fix broken re-route after userspace queueing in nf_tables route
chain. This patch is large but it is simple since it is just getting
this code in sync with iptable_mangle. Also from Liping.
3) NAT mangling via ctnetlink lies to userspace when nf_nat_setup_info()
fails to setup the NAT conntrack extension. This problem has been
there since the beginning, but it can now show up after rhashtable
conversion.
4) Fix possible NULL pointer dereference due to failures in allocating
the synproxy and seqadj conntrack extensions, from Gao feng.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When memory is exhausted, nfct_seqadj_ext_add may fail to add the
synproxy and seqadj extensions. The function nf_ct_seqadj_init doesn't
check if get valid seqadj pointer by the nfct_seqadj.
Now drop the packet directly when fail to add seqadj extension to
avoid dereference NULL pointer in nf_ct_seqadj_init from
init_conntrack().
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Conflicts:
drivers/net/ethernet/mediatek/mtk_eth_soc.c
drivers/net/ethernet/qlogic/qed/qed_dcbx.c
drivers/net/phy/Kconfig
All conflicts were cases of overlapping commits.
Signed-off-by: David S. Miller <davem@davemloft.net>
hash_v6 is used by both nftables and ip6tables, so depend on
IP6_NF_IPTABLES is not properly.
Actually, it only parses ipv6hdr and computes a hash value, so
even if IPV6 is disabled, there's no side effect too, remove it.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This is overly conservative and not flexible at all, so better let them
go through and let the filtering policy decide what to do with them. We
use skb_header_pointer() all over the place so we would just fail to
match when trying to access fields from malformed traffic.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Consolidate pktinfo setup and validation by using the new generic
functions so we converge to the netdev family codebase.
We only need a linear IPv4 and IPv6 header from the reject expression,
so move nft_bridge_iphdr_validate() and nft_bridge_ip6hdr_validate()
to net/bridge/netfilter/nft_reject_bridge.c.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
These functions are extracted from the netdev family, they initialize
the pktinfo structure and validate that the IPv4 and IPv6 headers are
well-formed given that these functions are called from a path where
layer 3 sanitization did not happen yet.
These functions are placed in include/net/netfilter/nf_tables_ipv{4,6}.h
so they can be reused by a follow up patch to use them from the bridge
family too.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Make sure the pktinfo protocol fields are initialized if this fails to
parse the transport header.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch introduces nft_set_pktinfo_unspec() that ensures proper
initialization all of pktinfo fields for non-IP traffic. This is used
by the bridge, netdev and arp families.
This new function relies on nft_set_pktinfo_proto_unspec() to set a new
tprot_set field that indicates if transport protocol information is
available. Remain fields are zeroed.
The meta expression has been also updated to check to tprot_set in first
place given that zero is a valid tprot value. Even a handcrafted packet
may come with the IPPROTO_RAW (255) protocol number so we can't rely on
this value as tprot unset.
Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use the existing device timestamp from the RX status information
to add support for the new radiotap timestamp field. Currently
only 32-bit counters are supported, but we also add the radiotap
mactime where applicable. This new field allows more flexibility
in where the timestamp is taken etc. The non-timestamp data in
the field is taken from a new field in the hw struct.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The ability to change the max_rx_aggregation frames is useful
in cases of IOP.
There exist some devices (latest mobile phones and some AP's)
that tend to not respect a BA sessions maximum size (in Kbps).
These devices won't respect the AMPDU size that was negotiated during
association (even though they do respect the maximal number of packets).
This violation is characterized by a valid number of packets in
a single AMPDU. Even so, the total size will exceed the size negotiated
during association.
Eventually, this will cause some undefined behavior, which in turn
causes the hw to drop packets, causing the throughput to plummet.
This patch will make the subframe limitation to be held by each station,
instead of being held only by hw.
Signed-off-by: Maxim Altshul <maxim.altshul@ti.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
cfg80211 expects the .disconnect() handler to call
cfg80211_disconnect() when done. Make this requirement
more explicit.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Flip the IPv6 output path to use the l3mdev tx out hook. The VRF dst
is not returned on the first FIB lookup. Instead, the dst on the
skb is switched at the beginning of the IPv6 output processing to
send the packet to the VRF driver on xmit.
Link scope addresses (linklocal and multicast) need special handling:
specifically the oif the flow struct can not be changed because we
want the lookup tied to the enslaved interface. ie., the source address
and the returned route MUST point to the interface scope passed in.
Convert the existing vrf_get_rt6_dst to handle only link scope addresses.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow an L3 master device to act as the loopback for that L3 domain.
For IPv4 the device can also have the address 127.0.0.1.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the infrastructure to the output path to pass an skb
to an l3mdev device if it has a hook registered. This is the Tx parallel
to l3mdev_ip{6}_rcv in the receive path and is the basis for removing
the existing hook that returns the vrf dst on the fib lookup.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add l3mdev hook to set FLOWI_FLAG_SKIP_NH_OIF flag and update oif/iif
in flow struct if its oif or iif points to a device enslaved to an L3
Master device. Only 1 needs to be converted to match the l3mdev FIB
rule. This moves the flow adjustment for l3mdev to a single point
catching all lookups. It is redundant for existing hooks (those are
removed in later patches) but is needed for missed lookups such as
PMTU updates.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This action could be used before redirecting packets to a shared tunnel
device, or when redirecting packets arriving from a such a device.
The action will release the metadata created by the tunnel device
(decap), or set the metadata with the specified values for encap
operation.
For example, the following flower filter will forward all ICMP packets
destined to 11.11.11.2 through the shared vxlan device 'vxlan0'. Before
redirecting, a metadata for the vxlan tunnel is created using the
tunnel_key action and it's arguments:
$ tc filter add dev net0 protocol ip parent ffff: \
flower \
ip_proto 1 \
dst_ip 11.11.11.2 \
action tunnel_key set \
src_ip 11.11.0.1 \
dst_ip 11.11.0.2 \
id 11 \
action mirred egress redirect dev vxlan0
Signed-off-by: Amir Vadai <amir@vadai.me>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extract __ip_tun_set_dst() and __ipv6_tun_set_dst() out of
ip_tun_rx_dst() and ipv6_tun_rx_dst(), to be used without supplying an
skb.
Signed-off-by: Amir Vadai <amir@vadai.me>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add utility functions to convert a 32 bits key into a 64 bits tunnel and
vice versa.
These functions will be used instead of cloning code in GRE and VXLAN,
and in tc act_iptunnel which will be introduced in a following patch in
this patchset.
Signed-off-by: Amir Vadai <amir@vadai.me>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIVAwUAV9FCuvSw1s6N8H32AQKo5w/8CySGsorFk67/QiQGdBt+URd8cxR2NuvF
i3P7Kbo30ycJO7Q4Uc4DvO3kTqWiNMbXWVgGLfA64HDFojjuuXfQdFwf98FZ2WtQ
OxQUV5fzSPFwlDktd5nWm5qTCdv7+lIvBCVsEPuX2pSkc7HesiYMsZt2ilOac9Ho
Meon2/S1oq3hctZv2DTiaI+Ae8YBMar7GSUfylRGa2TkXCgG8eYcjGyGigLJ2F03
e+/8w6+jtrW5hASCJPI9re+qiYgmnYa7UVpwrVjM1dVOYYZfmU02Jq6HgW9bSd24
MYk6neksMGVpQbVmAbj5/MmxUg98q8UpY9ygt2IWP4UvGNDYBGCiSbfyQoTnoWUP
02k3E6HnFfs8SPbxuNmA4uB2BHL2y87+G8u1g0IUZkT8i3zFwLd01UBwJqB23tYE
EIRAad1xWwGaSJGyFgsmry1RJsitSUAG9w/68Ni1IMQxsHsIROTz6TNBki1tMcOh
AAsbj4iJ0rJ2Ca/Xbk9kAdPzEr85ZA3Za5BwA9ZDwZjmt2X1RrzuK9gIaKB8hsWS
zVjRjpvSOaTyx97rtEVfkT310GMGYC5r9ba+kE4ukGeHWKRVkMk5tkADZw9RFKdf
ubXN/zyfv4YABHHUIfQn5UgHHmxl4GpN0CD+cY7hPtmB9J2wvsadckqrzBOFIQL+
dg7jZAb+fjc=
=GfEj
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-rewrite-20160908' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Rewrite data and ack handling
This patch set constitutes the main portion of the AF_RXRPC rewrite. It
consists of five fix/helper patches:
(1) Fix ASSERTCMP's and ASSERTIFCMP's handling of signed values.
(2) Update some protocol definitions slightly.
(3) Use of an hlist for RCU purposes.
(4) Removal of per-call sk_buff accounting (not really needed when skbs
aren't being queued on the main queue).
(5) Addition of a tracepoint to log incoming packets in the data_ready
callback and to log the end of the data_ready callback.
And then there are two patches that form the main part:
(6) Preallocation of resources for incoming calls so that in patch (7) the
data_ready handler can be made to fully instantiate an incoming call
and make it live. This extends through into AFS so that AFS can
preallocate its own incoming call resources.
The preallocation size is capped at the listen() backlog setting - and
that is capped at a sysctl limit which can be set between 4 and 32.
The preallocation is (re)charged either by accepting/rejecting pending
calls or, in the case of AFS, manually. If insufficient preallocation
resources exist, a BUSY packet will be transmitted.
The advantage of using this preallocation is that once a call is set
up in the data_ready handler, DATA packets can be queued on it
immediately rather than the DATA packets being queued for a background
work item to do all the allocation and then try and sort out the DATA
packets whilst other DATA packets may still be coming in and going
either to the background thread or the new call.
(7) Rewrite the handling of DATA, ACK and ABORT packets.
In the receive phase, DATA packets are now held in per-call circular
buffers with deduplication, out of sequence detection and suchlike
being done in data_ready. Since there is only one producer and only
once consumer, no locks need be used on the receive queue.
Received ACK and ABORT packets are now parsed and discarded in
data_ready to recycle resources as fast as possible.
sk_buffs are no longer pulled, trimmed or cloned, but rather the
offset and size of the content is tracked. This particularly affects
jumbo DATA packets which need insertion into the receive buffer in
multiple places. Annotations are kept to track which bit is which.
Packets are no longer queued on the socket receive queue; rather,
calls are queued. Dummy packets to convey events therefore no longer
need to be invented and metadata packets can be discarded as soon as
parsed rather then being pushed onto the socket receive queue to
indicate terminal events.
The preallocation facility added in (6) is now used to set up incoming
calls with very little locking required and no calls to the allocator
in data_ready.
Decryption and verification is now handled in recvmsg() rather than in
a background thread. This allows for the future possibility of
decrypting directly into the user buffer.
With this patch, the code is a lot simpler and most of the mass of
call event and state wangling code in call_event.c is gone.
With this, the majority of the AF_RXRPC rewrite is complete. However,
there are still things to be done, including:
(*) Limit the number of active service calls to prevent an attacker from
filling up a server's memory.
(*) Limit the number of calls on the rebuff-with-BUSY queue.
(*) Transmit delayed/deferred ACKs from recvmsg() if possible, rather than
punting to the background thread. Ideally, the background thread
shouldn't run at all, but data_ready can't call kernel_sendmsg() and
we can't rely on recvmsg() attending to the call in a timely fashion.
(*) Prevent the call at the front of the socket queue from hogging
recvmsg()'s attention if there's a sufficiently continuous supply of
data.
(*) Distribute ICMP errors by connection rather than by call. Possibly
parse the ICMP packet to try and pin down the exact connection and
call.
(*) Encrypt/decrypt directly between user buffers and socket buffers where
possible.
(*) IPv6.
(*) Service ID upgrade. This is a facility whereby a special flag bit is
set in the DATA packet header when making a call that tells the server
that it is allowed to change the service ID to an upgraded one and
reply with an equivalent call from the upgraded service.
This is used, for example, to override certain AFS calls so that IPv6
addresses can be returned.
(*) Allow userspace to preallocate call user IDs for incoming calls.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit adf0516845 ("netfilter: remove ip_conntrack* sysctl
compat code"), ctl_table_path member in struct nf_conntrack_l3proto{}
is not used anymore, remove it.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Over the years, TCP BDP has increased by several orders of magnitude,
and some people are considering to reach the 2 Gbytes limit.
Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
MSS.
In presence of packet losses (or reorders), TCP stores incoming packets
into an out of order queue, and number of skbs sitting there waiting for
the missing packets to be received can be in the 10^5 range.
Most packets are appended to the tail of this queue, and when
packets can finally be transferred to receive queue, we scan the queue
from its head.
However, in presence of heavy losses, we might have to find an arbitrary
point in this queue, involving a linear scan for every incoming packet,
throwing away cpu caches.
This patch converts it to a RB tree, to get bounded latencies.
Yaogong wrote a preliminary patch about 2 years ago.
Eric did the rebase, added ofo_last_skb cache, polishing and tests.
Tested with network dropping between 1 and 10 % packets, with good
success (about 30 % increase of throughput in stress tests)
Next step would be to also use an RB tree for the write queue at sender
side ;)
Signed-off-by: Yaogong Wang <wygivan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-By: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
ipsec-next 2016-09-08
1) Constify the xfrm_replay structures. From Julia Lawall
2) Protect xfrm state hash tables with rcu, lookups
can be done now without acquiring xfrm_state_lock.
From Florian Westphal.
3) Protect xfrm policy hash tables with rcu, lookups
can be done now without acquiring xfrm_policy_lock.
From Florian Westphal.
4) We don't need to have a garbage collector list per
namespace anymore, so use a global one instead.
From Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Rewrite the data and ack handling code such that:
(1) Parsing of received ACK and ABORT packets and the distribution and the
filing of DATA packets happens entirely within the data_ready context
called from the UDP socket. This allows us to process and discard ACK
and ABORT packets much more quickly (they're no longer stashed on a
queue for a background thread to process).
(2) We avoid calling skb_clone(), pskb_pull() and pskb_trim(). We instead
keep track of the offset and length of the content of each packet in
the sk_buff metadata. This means we don't do any allocation in the
receive path.
(3) Jumbo DATA packet parsing is now done in data_ready context. Rather
than cloning the packet once for each subpacket and pulling/trimming
it, we file the packet multiple times with an annotation for each
indicating which subpacket is there. From that we can directly
calculate the offset and length.
(4) A call's receive queue can be accessed without taking locks (memory
barriers do have to be used, though).
(5) Incoming calls are set up from preallocated resources and immediately
made live. They can than have packets queued upon them and ACKs
generated. If insufficient resources exist, DATA packet #1 is given a
BUSY reply and other DATA packets are discarded).
(6) sk_buffs no longer take a ref on their parent call.
To make this work, the following changes are made:
(1) Each call's receive buffer is now a circular buffer of sk_buff
pointers (rxtx_buffer) rather than a number of sk_buff_heads spread
between the call and the socket. This permits each sk_buff to be in
the buffer multiple times. The receive buffer is reused for the
transmit buffer.
(2) A circular buffer of annotations (rxtx_annotations) is kept parallel
to the data buffer. Transmission phase annotations indicate whether a
buffered packet has been ACK'd or not and whether it needs
retransmission.
Receive phase annotations indicate whether a slot holds a whole packet
or a jumbo subpacket and, if the latter, which subpacket. They also
note whether the packet has been decrypted in place.
(3) DATA packet window tracking is much simplified. Each phase has just
two numbers representing the window (rx_hard_ack/rx_top and
tx_hard_ack/tx_top).
The hard_ack number is the sequence number before base of the window,
representing the last packet the other side says it has consumed.
hard_ack starts from 0 and the first packet is sequence number 1.
The top number is the sequence number of the highest-numbered packet
residing in the buffer. Packets between hard_ack+1 and top are
soft-ACK'd to indicate they've been received, but not yet consumed.
Four macros, before(), before_eq(), after() and after_eq() are added
to compare sequence numbers within the window. This allows for the
top of the window to wrap when the hard-ack sequence number gets close
to the limit.
Two flags, RXRPC_CALL_RX_LAST and RXRPC_CALL_TX_LAST, are added also
to indicate when rx_top and tx_top point at the packets with the
LAST_PACKET bit set, indicating the end of the phase.
(4) Calls are queued on the socket 'receive queue' rather than packets.
This means that we don't need have to invent dummy packets to queue to
indicate abnormal/terminal states and we don't have to keep metadata
packets (such as ABORTs) around
(5) The offset and length of a (sub)packet's content are now passed to
the verify_packet security op. This is currently expected to decrypt
the packet in place and validate it.
However, there's now nowhere to store the revised offset and length of
the actual data within the decrypted blob (there may be a header and
padding to skip) because an sk_buff may represent multiple packets, so
a locate_data security op is added to retrieve these details from the
sk_buff content when needed.
(6) recvmsg() now has to handle jumbo subpackets, where each subpacket is
individually secured and needs to be individually decrypted. The code
to do this is broken out into rxrpc_recvmsg_data() and shared with the
kernel API. It now iterates over the call's receive buffer rather
than walking the socket receive queue.
Additional changes:
(1) The timers are condensed to a single timer that is set for the soonest
of three timeouts (delayed ACK generation, DATA retransmission and
call lifespan).
(2) Transmission of ACK and ABORT packets is effected immediately from
process-context socket ops/kernel API calls that cause them instead of
them being punted off to a background work item. The data_ready
handler still has to defer to the background, though.
(3) A shutdown op is added to the AF_RXRPC socket so that the AFS
filesystem can shut down the socket and flush its own work items
before closing the socket to deal with any in-progress service calls.
Future additional changes that will need to be considered:
(1) Make sure that a call doesn't hog the front of the queue by receiving
data from the network as fast as userspace is consuming it to the
exclusion of other calls.
(2) Transmit delayed ACKs from within recvmsg() when we've consumed
sufficiently more packets to avoid the background work item needing to
run.
Signed-off-by: David Howells <dhowells@redhat.com>
Make it possible for the data_ready handler called from the UDP transport
socket to completely instantiate an rxrpc_call structure and make it
immediately live by preallocating all the memory it might need. The idea
is to cut out the background thread usage as much as possible.
[Note that the preallocated structs are not actually used in this patch -
that will be done in a future patch.]
If insufficient resources are available in the preallocation buffers, it
will be possible to discard the DATA packet in the data_ready handler or
schedule a BUSY packet without the need to schedule an attempt at
allocation in a background thread.
To this end:
(1) Preallocate rxrpc_peer, rxrpc_connection and rxrpc_call structs to a
maximum number each of the listen backlog size. The backlog size is
limited to a maxmimum of 32. Only this many of each can be in the
preallocation buffer.
(2) For userspace sockets, the preallocation is charged initially by
listen() and will be recharged by accepting or rejecting pending
new incoming calls.
(3) For kernel services {,re,dis}charging of the preallocation buffers is
handled manually. Two notifier callbacks have to be provided before
kernel_listen() is invoked:
(a) An indication that a new call has been instantiated. This can be
used to trigger background recharging.
(b) An indication that a call is being discarded. This is used when
the socket is being released.
A function, rxrpc_kernel_charge_accept() is called by the kernel
service to preallocate a single call. It should be passed the user ID
to be used for that call and a callback to associate the rxrpc call
with the kernel service's side of the ID.
(4) Discard the preallocation when the socket is closed.
(5) Temporarily bump the refcount on the call allocated in
rxrpc_incoming_call() so that rxrpc_release_call() can ditch the
preallocation ref on service calls unconditionally. This will no
longer be necessary once the preallocation is used.
Note that this does not yet control the number of active service calls on a
client - that will come in a later patch.
A future development would be to provide a setsockopt() call that allows a
userspace server to manually charge the preallocation buffer. This would
allow user call IDs to be provided in advance and the awkward manual accept
stage to be bypassed.
Signed-off-by: David Howells <dhowells@redhat.com>
Add a tracepoint for working out where local aborts happen. Each
tracepoint call is labelled with a 3-letter code so that they can be
distinguished - and the DATA sequence number is added too where available.
rxrpc_kernel_abort_call() also takes a 3-letter code so that AFS can
indicate the circumstances when it aborts a call.
Signed-off-by: David Howells <dhowells@redhat.com>
When deleting an IP address from an interface, there is a clean-up of
routes which refer to this local address. However, there was no check to
see that the VRF matched. This meant that deletion wasn't confined to
the VRF it should have been.
To solve this, a new field has been added to fib_info to hold a table
id. When removing fib entries corresponding to a local ip address, this
table id is also used in the comparison.
The table id is populated when the fib_info is created. This was already
done in some places, but not in ip_rt_ioctl(). This has now been fixed.
Fixes: 021dd3b8a1 ("net: Add routes to the table associated with the device")
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Mark Tomlinson <mark.tomlinson@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next
tree. Most relevant updates are the removal of per-conntrack timers to
use a workqueue/garbage collection approach instead from Florian
Westphal, the hash and numgen expression for nf_tables from Laura
Garcia, updates on nf_tables hash set to honor the NLM_F_EXCL flag,
removal of ip_conntrack sysctl and many other incremental updates on our
Netfilter codebase.
More specifically, they are:
1) Retrieve only 4 bytes to fetch ports in case of non-linear skb
transport area in dccp, sctp, tcp, udp and udplite protocol
conntrackers, from Gao Feng.
2) Missing whitespace on error message in physdev match, from Hangbin Liu.
3) Skip redundant IPv4 checksum calculation in nf_dup_ipv4, from Liping Zhang.
4) Add nf_ct_expires() helper function and use it, from Florian Westphal.
5) Replace opencoded nf_ct_kill() call in IPVS conntrack support, also
from Florian.
6) Rename nf_tables set implementation to nft_set_{name}.c
7) Introduce the hash expression to allow arbitrary hashing of selector
concatenations, from Laura Garcia Liebana.
8) Remove ip_conntrack sysctl backward compatibility code, this code has
been around for long time already, and we have two interfaces to do
this already: nf_conntrack sysctl and ctnetlink.
9) Use nf_conntrack_get_ht() helper function whenever possible, instead
of opencoding fetch of hashtable pointer and size, patch from Liping Zhang.
10) Add quota expression for nf_tables.
11) Add number generator expression for nf_tables, this supports
incremental and random generators that can be combined with maps,
very useful for load balancing purpose, again from Laura Garcia Liebana.
12) Fix a typo in a debug message in FTP conntrack helper, from Colin Ian King.
13) Introduce a nft_chain_parse_hook() helper function to parse chain hook
configuration, this is used by a follow up patch to perform better chain
update validation.
14) Add rhashtable_lookup_get_insert_key() to rhashtable and use it from the
nft_set_hash implementation to honor the NLM_F_EXCL flag.
15) Missing nulls check in nf_conntrack from nf_conntrack_tuple_taken(),
patch from Florian Westphal.
16) Don't use the DYING bit to know if the conntrack event has been already
delivered, instead a state variable to track event re-delivery
states, also from Florian.
17) Remove the per-conntrack timer, use the workqueue approach that was
discussed during the NFWS, from Florian Westphal.
18) Use the netlink conntrack table dump path to kill stale entries,
again from Florian.
19) Add a garbage collector to get rid of stale conntracks, from
Florian.
20) Reschedule garbage collector if eviction rate is high.
21) Get rid of the __nf_ct_kill_acct() helper.
22) Use ARPHRD_ETHER instead of hardcoded 1 from ARP logger.
23) Make nf_log_set() interface assertive on unsupported families.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Right now we use the 'readlock' both for protecting some of the af_unix
IO path and for making the bind be single-threaded.
The two are independent, but using the same lock makes for a nasty
deadlock due to ordering with regards to filesystem locking. The bind
locking would want to nest outside the VSF pathname locking, but the IO
locking wants to nest inside some of those same locks.
We tried to fix this earlier with commit c845acb324 ("af_unix: Fix
splice-bind deadlock") which moved the readlock inside the vfs locks,
but that caused problems with overlayfs that will then call back into
filesystem routines that take the lock in the wrong order anyway.
Splitting the locks means that we can go back to having the bind lock be
the outermost lock, and we don't have any deadlocks with lock ordering.
Acked-by: Rainer Weikusat <rweikusat@cyberadapt.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the retun value of switchdev_port_fdb_dump() when
CONFIG_NET_SWITCHDEV is not set. This avoids getting "warning: return makes
integer from pointer without a cast [-Wint-conversion]" when building
when CONFIG_NET_SWITCHDEV is not set under several compiler versions.
This warning is due to commit d297653dd6
("rtnetlink: fdb dump: optimize by saving last interface markers").
Signed-off-by: Rami Rosen <rami.rosen@intel.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Access the priv member of the dsa_switch structure directly, instead of
having an unnecessary helper.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fdb dumps spanning multiple skb's currently restart from the first
interface again for every skb. This results in unnecessary
iterations on the already visited interfaces and their fdb
entries. In large scale setups, we have seen this to slow
down fdb dumps considerably. On a system with 30k macs we
see fdb dumps spanning across more than 300 skbs.
To fix the problem, this patch replaces the existing single fdb
marker with three markers: netdev hash entries, netdevs and fdb
index to continue where we left off instead of restarting from the
first netdev. This is consistent with link dumps.
In the process of fixing the performance issue, this patch also
re-implements fix done by
commit 472681d57a ("net: ndo_fdb_dump should report -EMSGSIZE to rtnl_fdb_dump")
(with an internal fix from Wilson Kok) in the following ways:
- change ndo_fdb_dump handlers to return error code instead
of the last fdb index
- use cb->args strictly for dump frag markers and not error codes.
This is consistent with other dump functions.
Below results were taken on a system with 1000 netdevs
and 35085 fdb entries:
before patch:
$time bridge fdb show | wc -l
15065
real 1m11.791s
user 0m0.070s
sys 1m8.395s
(existing code does not return all macs)
after patch:
$time bridge fdb show | wc -l
35085
real 0m2.017s
user 0m0.113s
sys 0m1.942s
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the const for the parameter of flow_keys_have_l4 for the readability.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't expose skbs to in-kernel users, such as the AFS filesystem, but
instead provide a notification hook the indicates that a call needs
attention and another that indicates that there's a new call to be
collected.
This makes the following possibilities more achievable:
(1) Call refcounting can be made simpler if skbs don't hold refs to calls.
(2) skbs referring to non-data events will be able to be freed much sooner
rather than being queued for AFS to pick up as rxrpc_kernel_recv_data
will be able to consult the call state.
(3) We can shortcut the receive phase when a call is remotely aborted
because we don't have to go through all the packets to get to the one
cancelling the operation.
(4) It makes it easier to do encryption/decryption directly between AFS's
buffers and sk_buffs.
(5) Encryption/decryption can more easily be done in the AFS's thread
contexts - usually that of the userspace process that issued a syscall
- rather than in one of rxrpc's background threads on a workqueue.
(6) AFS will be able to wait synchronously on a call inside AF_RXRPC.
To make this work, the following interface function has been added:
int rxrpc_kernel_recv_data(
struct socket *sock, struct rxrpc_call *call,
void *buffer, size_t bufsize, size_t *_offset,
bool want_more, u32 *_abort_code);
This is the recvmsg equivalent. It allows the caller to find out about the
state of a specific call and to transfer received data into a buffer
piecemeal.
afs_extract_data() and rxrpc_kernel_recv_data() now do all the extraction
logic between them. They don't wait synchronously yet because the socket
lock needs to be dealt with.
Five interface functions have been removed:
rxrpc_kernel_is_data_last()
rxrpc_kernel_get_abort_code()
rxrpc_kernel_get_error_number()
rxrpc_kernel_free_skb()
rxrpc_kernel_data_consumed()
As a temporary hack, sk_buffs going to an in-kernel call are queued on the
rxrpc_call struct (->knlrecv_queue) rather than being handed over to the
in-kernel user. To process the queue internally, a temporary function,
temp_deliver_data() has been added. This will be replaced with common code
between the rxrpc_recvmsg() path and the kernel_rxrpc_recv_data() path in a
future patch.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add SWITCHDEV_OBJ_ID_PORT_MDB support to the DSA layer.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Today mpls iptunnel lwtunnel_output redirect expects the tunnel
output function to handle fragmentation. This is ok but can be
avoided if we did not do the mpls output redirect too early.
ie we could wait until ip fragmentation is done and then call
mpls output for each ip fragment.
To make this work we will need,
1) the lwtunnel state to carry encap headroom
2) and do the redirect to the encap output handler on the ip fragment
(essentially do the output redirect after fragmentation)
This patch adds tunnel headroom in lwtstate to make sure we
account for tunnel data in mtu calculations during fragmentation
and adds new xmit redirect handler to redirect to lwtunnel xmit func
after ip fragmentation.
This includes IPV6 and some mtu fixes and testing from David Ahern.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree,
they are:
1) Allow nf_tables reject expression from input, forward and output hooks,
since only there the routing information is available, otherwise we crash.
2) Fix unsafe list iteration when flushing timeout and accouting objects.
3) Fix refcount leak on timeout policy parsing failure.
4) Unlink timeout object for unconfirmed conntracks too
5) Missing validation of pkttype mangling from bridge family.
6) Fix refcount leak on ebtables on second lookup for the specific
bridge match extension, this patch from Sabrina Dubroca.
7) Remove unnecessary ip_hdr() in nf_tables_netdev family.
Patches from 1-5 and 7 from Liping Zhang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
* revert a recent wext patch, which Ben Hutchings noticed was
wrong, and it turns out not to be necessary for any driver
* fix an infinite loop that can occur under certain conditions
in mac80211's TDLS code (depending on regulatory information)
* add a cfg80211_get_station() static inline when cfg80211 isn't
built, to allow other modules to not have to depend on it for it
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJXxSR4AAoJEGt7eEactAAdl3EP/AiUwqrYqbLnnFy6C7obFS3p
eBBMxQAZbT+q+fFlZvqRrt5tPdkYriPLhm/0sAzuapnyS+Q6seNJ/vPoo91uC1jU
ZI/j97v9NwUtRLfNCq+0Jwvs7ma0U1VEcPV9wDdV5JgnKk0Z1CUIcsErYr1+v0YQ
EpRwxczhzJNTULW36UP7RvVQpxwIGldPhxSZ0t1uHWaYTFliaTlnJUAk0ql44Lmm
WLvoMSjFgX99P11ToCe81MPEzF2IXILvxPwtNZmn5tldEN2xknKEoEmmbN65fYDf
OIJIJ3s1CijQvnkgXtU0RWWCMnyOoJjsLckgSDdy0euhbS5xRIfxBN2n+kqaI9WV
a/aIvWNNhvAy2vNdWUJk0FrVBnDjlTtG1afIEAgJyP7uxTQqepQfyaRENLtH+kKe
lWbOITUZztyagGIn8Bv1pDrrqwO+fSjiEsVEVAQMMmNKBpUWf8urhDQmabCLYGDB
Nxh2e3wjv5ZQ+55uJIGRDCcPIrddh86FVtQBqTID+86r4a1RwPaWfhzFZYVj84fg
504UzwtYlw1ITUhGbdMwribLVkwtBMVuEvpPrh6avwzS8wAH4upkhp4GGl/tfd/Z
De0LqpxCKbDiI+VmmDo8FD4nx4wu4nTYIaecLjNoUSXhbjbPyI6V8/hIXqKiQUA3
ObkKlGicZJmhMa4zna0q
=dFzv
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2016-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Three little fixes:
* revert a recent wext patch, which Ben Hutchings noticed was
wrong, and it turns out not to be necessary for any driver
* fix an infinite loop that can occur under certain conditions
in mac80211's TDLS code (depending on regulatory information)
* add a cfg80211_get_station() static inline when cfg80211 isn't
built, to allow other modules to not have to depend on it for it
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass struct socket * to more rxrpc kernel interface functions. They should
be starting from this rather than the socket pointer in the rxrpc_call
struct if they need to access the socket.
I have left:
rxrpc_kernel_is_data_last()
rxrpc_kernel_get_abort_code()
rxrpc_kernel_get_error_number()
rxrpc_kernel_free_skb()
rxrpc_kernel_data_consumed()
unmodified as they're all about to be removed (and, in any case, don't
touch the socket).
Signed-off-by: David Howells <dhowells@redhat.com>
Provide a function so that kernel users, such as AFS, can ask for the peer
address of a call:
void rxrpc_kernel_get_peer(struct rxrpc_call *call,
struct sockaddr_rxrpc *_srx);
In the future the kernel service won't get sk_buffs to look inside.
Further, this allows us to hide any canonicalisation inside AF_RXRPC for
when IPv6 support is added.
Also propagate this through to afs_find_server() and issue a warning if we
can't handle the address family yet.
Signed-off-by: David Howells <dhowells@redhat.com>
The nf_log_set is an interface function, so it should do the strict sanity
check of parameters. Convert the return value of nf_log_set as int instead
of void. When the pf is invalid, return -EOPNOTSUPP.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
After timer removal this just calls nf_ct_delete so remove the __ prefix
version and make nf_ct_kill a shorthand for nf_ct_delete.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
With stats enabled this eats 80 bytes on x86_64 per nf_conn entry, as
Eric Dumazet pointed out during netfilter workshop 2016.
Eric also says: "Another reason was the fact that Thomas was about to
change max timer range [..]" (500462a9de, 'timers: Switch to
a non-cascading wheel').
Remove the timer and use a 32bit jiffies value containing timestamp until
entry is valid.
During conntrack lookup, even before doing tuple comparision, check
the timeout value and evict the entry in case it is too old.
The dying bit is used as a synchronization point to avoid races where
multiple cpus try to evict the same entry.
Because lookup is always lockless, we need to bump the refcnt once
when we evict, else we could try to evict already-dead entry that
is being recycled.
This is the standard/expected way when conntrack entries are destroyed.
Followup patches will introduce garbage colliction via work queue
and further places where we can reap obsoleted entries (e.g. during
netlink dumps), this is needed to avoid expired conntracks from hanging
around for too long when lookup rate is low after a busy period.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The reliable event delivery mode currently (ab)uses the DYING bit to
detect which entries on the dying list have to be skipped when
re-delivering events from the eache worker in reliable event mode.
Currently when we delete the conntrack from main table we only set this
bit if we could also deliver the netlink destroy event to userspace.
If we fail we move it to the dying list, the ecache worker will
reattempt event delivery for all confirmed conntracks on the dying list
that do not have the DYING bit set.
Once timer is gone, we can no longer use if (del_timer()) to detect
when we 'stole' the reference count owned by the timer/hash entry, so
we need some other way to avoid racing with other cpu.
Pablo suggested to add a marker in the ecache extension that skips
entries that have been unhashed from main table but are still waiting
for the last reference count to be dropped (e.g. because one skb waiting
on nfqueue verdict still holds a reference).
We do this by adding a tristate.
If we fail to deliver the destroy event, make a note of this in the
eache extension. The worker can then skip all entries that are in
a different state. Either they never delivered a destroy event,
e.g. because the netlink backend was not loaded, or redelivery took
place already.
Once the conntrack timer is removed we will now be able to replace
del_timer() test with test_and_set_bit(DYING, &ct->status) to avoid
racing with other cpu that tries to evict the same conntrack.
Because DYING will then be set right before we report the destroy event
we can no longer skip event reporting when dying bit is set.
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This allows modules using this function (currently: batman-adv) to
compile even if cfg80211 is not built at all, thus relaxing
dependencies.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When TCP operates in lossy environments (between 1 and 10 % packet
losses), many SACK blocks can be exchanged, and I noticed we could
drop them on busy senders, if these SACK blocks have to be queued
into the socket backlog.
While the main cause is the poor performance of RACK/SACK processing,
we can try to avoid these drops of valuable information that can lead to
spurious timeouts and retransmits.
Cause of the drops is the skb->truesize overestimation caused by :
- drivers allocating ~2048 (or more) bytes as a fragment to hold an
Ethernet frame.
- various pskb_may_pull() calls bringing the headers into skb->head
might have pulled all the frame content, but skb->truesize could
not be lowered, as the stack has no idea of each fragment truesize.
The backlog drops are also more visible on bidirectional flows, since
their sk_rmem_alloc can be quite big.
Let's add some room for the backlog, as only the socket owner
can selectively take action to lower memory needs, like collapsing
receive queues or partial ofo pruning.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
kcm and strparser need to work with any type of stream socket not just
TCP. Eliminate references to TCP and call generic proto_ops functions of
read_sock and peek_len. Also in strp_init check if the socket support
the proto_ops read_sock and peek_len.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In inet_stream_ops we set read_sock to tcp_read_sock and peek_len to
tcp_peek_len (which is just a stub function that calls tcp_inq).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add new function in proto_ops structure. This includes moving the
typedef got sk_read_actor into net.h and removing the definition from
tcp.h.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
switchdev_port_fwd_mark_set() is used to set the 'offload_fwd_mark' of
port netdevs so that packets being flooded by the device won't be
flooded twice.
It works by assigning a unique identifier (the ifindex of the first
bridge port) to bridge ports sharing the same parent ID. This prevents
packets from being flooded twice by the same switch, but will flood
packets through bridge ports belonging to a different switch.
This method is problematic when stacked devices are taken into account,
such as VLANs. In such cases, a physical port netdev can have upper
devices being members in two different bridges, thus requiring two
different 'offload_fwd_mark's to be configured on the port netdev, which
is impossible.
The main problem is that packet and netdev marking is performed at the
physical netdev level, whereas flooding occurs between bridge ports,
which are not necessarily port netdevs.
Instead, packet and netdev marking should really be done in the bridge
driver with the switch driver only telling it which packets it already
forwarded. The bridge driver will mark such packets using the mark
assigned to the ingress bridge port and will prevent the packet from
being forwarded through any bridge port sharing the same mark (i.e.
having the same parent ID).
Remove the current switchdev 'offload_fwd_mark' implementation and
instead implement the proposed method. In addition, make rocker - the
sole user of the mark - use the proposed method.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove unused and useless priv_size member from struct devlink_ops.
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the NLM_F_EXCL flag is set, then new elements that clash with an
existing one return EEXIST. In case you try to add an element whose
data area differs from what we have, then this returns EBUSY. If no
flag is specified at all, then this returns success to userspace.
This patch also update the set insert operation so we can fetch the
existing element that clashes with the one you want to add, we need
this to make sure the element data doesn't differ.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
per_cpu_inc() is faster (at least on x86) than per_cpu_ptr(xxx)++;
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
"meta pkttype set" is only supported on prerouting chain with bridge
family and ingress chain with netdev family.
But the validate check is incomplete, and the user can add the nft
rules on input chain with bridge family, for example:
# nft add table bridge filter
# nft add chain bridge filter input {type filter hook input \
priority 0 \;}
# nft add chain bridge filter test
# nft add rule bridge filter test meta pkttype set unicast
# nft add rule bridge filter input jump test
This patch fixes the problem.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
After I add the nft rule "nft add rule filter prerouting reject
with tcp reset", kernel panic happened on my system:
NULL pointer dereference at ...
IP: [<ffffffff81b9db2f>] nf_send_reset+0xaf/0x400
Call Trace:
[<ffffffff81b9da80>] ? nf_reject_ip_tcphdr_get+0x160/0x160
[<ffffffffa0928061>] nft_reject_ipv4_eval+0x61/0xb0 [nft_reject_ipv4]
[<ffffffffa08e836a>] nft_do_chain+0x1fa/0x890 [nf_tables]
[<ffffffffa08e8170>] ? __nft_trace_packet+0x170/0x170 [nf_tables]
[<ffffffffa06e0900>] ? nf_ct_invert_tuple+0xb0/0xc0 [nf_conntrack]
[<ffffffffa07224d4>] ? nf_nat_setup_info+0x5d4/0x650 [nf_nat]
[...]
Because in the PREROUTING chain, routing information is not exist,
then we will dereference the NULL pointer and oops happen.
So we restrict reject expression to INPUT, FORWARD and OUTPUT chain.
This is consistent with iptables REJECT target.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Now that the dsa_switch_driver structure contains only function pointers
as it is supposed to, rename it to the more appropriate dsa_switch_ops,
uniformly to any other operations structure in the kernel.
No functional changes here, basically just the result of something like:
s/dsa_switch_driver *drv/dsa_switch_ops *ops/g
However keep the {un,}register_switch_driver functions and their
dsa_switch_drivers list as is, since they represent the -- likely to be
deprecated soon -- legacy DSA registration framework.
In the meantime, also fix the following checks from checkpatch.pl to
make it happy with this patch:
CHECK: Comparison to NULL could be written "!ops"
#403: FILE: net/dsa/dsa.c:470:
+ if (ops == NULL) {
CHECK: Comparison to NULL could be written "ds->ops->get_strings"
#773: FILE: net/dsa/slave.c:697:
+ if (ds->ops->get_strings != NULL)
CHECK: Comparison to NULL could be written "ds->ops->get_ethtool_stats"
#824: FILE: net/dsa/slave.c:785:
+ if (ds->ops->get_ethtool_stats != NULL)
CHECK: Comparison to NULL could be written "ds->ops->get_sset_count"
#835: FILE: net/dsa/slave.c:798:
+ if (ds->ops->get_sset_count != NULL)
total: 0 errors, 0 warnings, 4 checks, 784 lines checked
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 5b8ef3415a
("xfrm: Remove ancient sleeping when the SA is in acquire state")
gc does not need any per-netns data anymore.
As far as gc is concerned all state structs are the same, so we
can use a global work struct for it.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
We no longer use this handler, we can delete it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we no longer use SLAB_DESTROY_BY_RCU for UDP,
we do not need sk_prot_clear_portaddr_nulls() helper.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This implements SOCK_DESTROY for UDP sockets similar to what was done
for TCP with commit c1e64e298b ("net: diag: Support destroying TCP
sockets.") A process with a UDP socket targeted for destroy is awakened
and recvmsg fails with ECONNABORTED.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TFO_SERVER_WO_SOCKOPT2 was intended for debugging purposes during
Fast Open development. Remove this config option and also
update/clean-up the documentation of the Fast Open sysctl.
Reported-by: Piotr Jurkiewicz <piotr.jerzy.jurkiewicz@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the upper layer unpauses a stream parser connection we need to
queue rx_work to make sure no events are missed.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
DSA drivers may drive different families of switches which need
different tag protocol. Rather than hard code the tag protocol in the
driver structure, have a callback for the DSA core to call.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 22dc13c837 ("net_sched: convert tcf_exts from list to pointer array")
we do dynamic allocation in tcf_exts_init(), therefore we need
to handle the ENOMEM case properly.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for allowing switch drivers to implement system-wide
suspend/resume functions, export dsa_switch_suspend and
dsa_switch_resume() such that these are callable from the appropriate
driver specific suspend/resume functions.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As recently discussed during the task_under_cgroup_hierarchy() addition,
we should get rid of the ifdefs surrounding the bpf_skb_under_cgroup()
helper. If related functionality is not built-in, the helper cannot be
used anyway, which is also in line with what we do for all other helpers.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When tcp_sendmsg() allocates a fresh and empty skb, it puts it at the
tail of the write queue using tcp_add_write_queue_tail()
Then it attempts to copy user data into this fresh skb.
If the copy fails, we undo the work and remove the fresh skb.
Unfortunately, this undo lacks the change done to tp->highest_sack and
we can leave a dangling pointer (to a freed skb)
Later, tcp_xmit_retransmit_queue() can dereference this pointer and
access freed memory. For regular kernels where memory is not unmapped,
this might cause SACK bugs because tcp_highest_sack_seq() is buggy,
returning garbage instead of tp->snd_nxt, but with various debug
features like CONFIG_DEBUG_PAGEALLOC, this can crash the kernel.
This bug was found by Marco Grassi thanks to syzkaller.
Fixes: 6859d49475 ("[TCP]: Abstract tp->highest_sack accessing & point to next skb")
Reported-by: Marco Grassi <marco.gra@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current vlan push action supports only vid and protocol options.
Add priority option.
Example script that adds vlan push action with vid and
priority:
tc filter add dev veth0 protocol ip parent ffff: \
flower \
indev veth0 \
action vlan push id 100 priority 5
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add vlan priority check to the flow dissector by adding new flow
dissector struct, flow_dissector_key_vlan which includes vlan tag
fields.
vlan_id and flow_label fields were under the same struct
(flow_dissector_key_tags). It was a convenient setting since struct
flow_dissector_key_tags is used by struct flow_keys and by setting
vlan_id and flow_label under the same struct, we get precisely 24 or 48
bytes in flow_keys from flow_dissector_key_basic.
Now, when adding vlan priority support, the code will be cleaner if
flow_label and vlan tag won't be under the same struct anymore.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Minor overlapping changes for both merge conflicts.
Resolution work done by Stephen Rothwell was used
as a reference.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Buffers powersave frame test is reversed in cfg80211, fix from Felix
Fietkau.
2) Remove bogus WARN_ON in openvswitch, from Jarno Rajahalme.
3) Fix some tg3 ethtool logic bugs, and one that would cause no
interrupts to be generated when rx-coalescing is set to 0. From
Satish Baddipadige and Siva Reddy Kallam.
4) QLCNIC mailbox corruption and napi budget handling fix from Manish
Chopra.
5) Fix fib_trie logic when walking the trie during /proc/net/route
output than can access a stale node pointer. From David Forster.
6) Several sctp_diag fixes from Phil Sutter.
7) PAUSE frame handling fixes in mlxsw driver from Ido Schimmel.
8) Checksum fixup fixes in bpf from Daniel Borkmann.
9) Memork leaks in nfnetlink, from Liping Zhang.
10) Use after free in rxrpc, from David Howells.
11) Use after free in new skb_array code of macvtap driver, from Jason
Wang.
12) Calipso resource leak, from Colin Ian King.
13) mediatek bug fixes (missing stats sync init, etc.) from Sean Wang.
14) Fix bpf non-linear packet write helpers, from Daniel Borkmann.
15) Fix lockdep splats in macsec, from Sabrina Dubroca.
16) hv_netvsc bug fixes from Vitaly Kuznetsov, mostly to do with VF
handling.
17) Various tc-action bug fixes, from CONG Wang.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (116 commits)
net_sched: allow flushing tc police actions
net_sched: unify the init logic for act_police
net_sched: convert tcf_exts from list to pointer array
net_sched: move tc offload macros to pkt_cls.h
net_sched: fix a typo in tc_for_each_action()
net_sched: remove an unnecessary list_del()
net_sched: remove the leftover cleanup_a()
mlxsw: spectrum: Allow packets to be trapped from any PG
mlxsw: spectrum: Unmap 802.1Q FID before destroying it
mlxsw: spectrum: Add missing rollbacks in error path
mlxsw: reg: Fix missing op field fill-up
mlxsw: spectrum: Trap loop-backed packets
mlxsw: spectrum: Add missing packet traps
mlxsw: spectrum: Mark port as active before registering it
mlxsw: spectrum: Create PVID vPort before registering netdevice
mlxsw: spectrum: Remove redundant errors from the code
mlxsw: spectrum: Don't return upon error in removal path
i40e: check for and deal with non-contiguous TCs
ixgbe: Re-enable ability to toggle VLAN filtering
ixgbe: Force VLNCTRL.VFE to be set in all VMDq paths
...
Adapt KCM to use the stream parser. This mostly involves removing
the RX handling and setting up the strparser using the interface.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces a utility for parsing application layer protocol
messages in a TCP stream. This is a generalization of the mechanism
implemented of Kernel Connection Multiplexor.
The API includes a context structure, a set of callbacks, utility
functions, and a data ready function.
A stream parser instance is defined by a strparse structure that
is bound to a TCP socket. The function to initialize the structure
is:
int strp_init(struct strparser *strp, struct sock *csk,
struct strp_callbacks *cb);
csk is the TCP socket being bound to and cb are the parser callbacks.
The upper layer calls strp_tcp_data_ready when data is ready on the lower
socket for strparser to process. This should be called from a data_ready
callback that is set on the socket:
void strp_tcp_data_ready(struct strparser *strp);
A parser is bound to a TCP socket by setting data_ready function to
strp_tcp_data_ready so that all receive indications on the socket
go through the parser. This is assumes that sk_user_data is set to
the strparser structure.
There are four callbacks.
- parse_msg is called to parse the message (returns length or error).
- rcv_msg is called when a complete message has been received
- read_sock_done is called when data_ready function exits
- abort_parser is called to abort the parser
The input to parse_msg is an skbuff which contains next message under
construction. The backend processing of parse_msg will parse the
application layer protocol headers to determine the length of
the message in the stream. The possible return values are:
>0 : indicates length of successfully parsed message
0 : indicates more data must be received to parse the message
-ESTRPIPE : current message should not be processed by the
kernel, return control of the socket to userspace which
can proceed to read the messages itself
other < 0 : Error is parsing, give control back to userspace
assuming that synchronzation is lost and the stream
is unrecoverable (application expected to close TCP socket)
In the case of error return (< 0) strparse will stop the parser
and report and error to userspace. The application must deal
with the error. To handle the error the strparser is unbound
from the TCP socket. If the error indicates that the stream
TCP socket is at recoverable point (ESTRPIPE) then the application
can read the TCP socket to process the stream. Once the application
has dealt with the exceptions in the stream, it may again bind the
socket to a strparser to continue data operations.
Note that ENODATA may be returned to the application. In this case
parse_msg returned -ESTRPIPE, however strparser was unable to maintain
synchronization of the stream (i.e. some of the message in question
was already read by the parser).
strp_pause and strp_unpause are used to provide flow control. For
instance, if rcv_msg is called but the upper layer can't immediately
consume the message it can hold the message and pause strparser.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As pointed out by Jamal, an action could be shared by
multiple filters, so we can't use list to chain them
any more after we get rid of the original tc_action.
Instead, we could just save pointers to these actions
in tcf_exts, since they are refcount'ed, so convert
the list to an array of pointers.
The "ugly" part is the action API still accepts list
as a parameter, I just introduce a helper function to
convert the array of pointers to a list, instead of
relying on the C99 feature to iterate the array.
Fixes: a85a970af2 ("net_sched: move tc_action into tcf_common")
Reported-by: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
struct tcf_exts belongs to filters, should not be visible
to plain tc actions.
Cc: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is harmless because all users pass 'a' to this macro.
Fixes: 00175aec94 ("net/sched: Macro instead of CONFIG_NET_CLS_ACT ifdef")
Cc: Amir Vadai <amir@vadai.me>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 64b87639c9 ("netfilter: conntrack: fix race between
nf_conntrack proc read and hash resize") introduce the
nf_conntrack_get_ht, so there's no need to check nf_conntrack_generation
again and again to get the hash table and hash size. And convert
nf_conntrack_get_ht to inline function here.
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Ensure that the inner_protocol is set on transmit so that GSO segmentation,
which relies on that field, works correctly.
This is achieved by setting the inner_protocol in gre_build_header rather
than each caller of that function. It ensures that the inner_protocol is
set when gre_fb_xmit() is used to transmit GRE which was not previously the
case.
I have observed this is not the case when OvS transmits GRE using
lwtunnel metadata (which it always does).
Fixes: 3872035241 ("gre: Use inner_proto to obtain inner header protocol")
Cc: Pravin Shelar <pshelar@ovn.org>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
1. Use struct gre_base_hdr directly in pptp_gre_header instead of
duplicated members;
2. Use existing macros like GRE_KEY, GRE_SEQ, and so on instead of
duplicated macros defined by PPTP;
3. Add new macros like GRE_IS_ACK/SEQ and so on instead of
PPTP_GRE_IS_A/S and so on;
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Reviewed-by: Philip Prindeville <philipp@redfish-solutions.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass the correct type __wsum to csum_sub() and csum_add(). This doesn't
really change anything since __wsum really *is* __be32, but removes the
address space warnings from sparse.
Cc: Eric Dumazet <edumazet@google.com>
Fixes: 34ae6a1aa0 ("ipv6: update skb->csum when CE mark is propagated")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This backward compatibility has been around for more than ten years,
since Yasuyuki Kozakai introduced IPv6 in conntrack. These days, we have
alternate /proc/net/nf_conntrack* entries, the ctnetlink interface and
the conntrack utility got adopted by many people in the user community
according to what I observed on the netfilter user mailing list.
So let's get rid of this.
Note that nf_conntrack_htable_size and unsigned int nf_conntrack_max do
not need to be exported as symbol anymore.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
After earlier patches conversions all spots acquire the writer lock and
we can now convert this to a normal spinlock.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The PPTP is encapsulated by GRE header with that GRE_VERSION bits
must contain one. But current GRE RPS needs the GRE_VERSION must be
zero. So RPS does not work for PPTP traffic.
In my test environment, there are four MIPS cores, and all traffic
are passed through by PPTP. As a result, only one core is 100% busy
while other three cores are very idle. After this patch, the usage
of four cores are balanced well.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Reviewed-by: Philip Prindeville <philipp@redfish-solutions.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert the per-device linked list into a hashtable. The primary
motivation for this change is that currently, we're not tracking all the
qdiscs in hierarchy (e.g. excluding default qdiscs), as the lookup
performed over the linked list by qdisc_match_from_root() is rather
expensive.
The ultimate goal is to get rid of hidden qdiscs completely, which will
bring much more determinism in user experience.
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
push the lock down, after earlier patches we can rely on rcu to
make sure state struct won't go away.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The xfrm_replay structures are never modified, so declare them as const.
Done with the help of Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
After commit 0ddcf43d5d ("ipv4: FIB Local/MAIN table collapse")
fib_local is set but not used. Remove it.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* fix 80+80 bandwidth warning
* fix powersave with mac80211 TXQ implementation
* use correct way to free SKBs from multicast buffering
* mesh: fix operation ordering to work with all drivers
* mesh: end service period even when peer goes away
* mesh: correct HT opmode validity checks
* pass hw pointer from mac80211 to driver in TPT method,
fixing a bug (in a bit the wrong way, but that's what
we have right now)
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJXpIcJAAoJEGt7eEactAAdSFUP/0zeMBnYsxm0UYFPKOYf7+rF
P9s88XRpYNiTQqA5YgkaoiSaORMrdj9AeSTIDJ1MDOHVJSQ3jBbmmWUlM7h+VNQw
P6YQp4xw+yxQeB2Lobb0E/7lxpG5nRKFtbPMkDasSJv+0fzGTqm68Cpjs7IMjfOw
+I7ZjWZzClZdpTS4avyziEbpxAdSvJqf9SczLeDw7BjbufsSWKNT8yBPeTNa0Mfz
IVzKh84eEyHBWQqWhqNclA4QMqQPoTQQ1YYqG1lmc8Jiq7/9y5pImedlNyHkiwgY
t4vh7tFEL1HtWKiq9nbO7fSFkZqJHVyNSpdrQSxsx3FFYkcoOEZu0GbeWQhwXr/s
a1l91GgNoH4Sv9xn3YRVPT+1RygzzGR6MUuNiU9DTSdohg+BBscSSBXm7op39H+Z
z+X7z6a1mQAujfCbW1mNJ2Ajymr2RfEAXHRTUo/8/4Y86+wIbTe1vk0jqgkHOIV2
9Z1Nt/83iP12ON5s1Tnh1H619Pv+UXxujMV3plWPeaPTxG3F34Xnpsnw2AE1cAZ5
Mu0sMMfh9w2rPo5miPyMpU7dJo2mY95qC/+aosZlbeAMyEPqRtSE3sHLzEkUyuJI
5VskVEIBYukIahsRN9Qd9FldQNwcZuqFpo43qkDYkE67Q3/oNokAlMb9SWv/V6D4
FQmZbX1DcL+iYlAx8rN7
=4hWm
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2016-08-05' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
First set of fixes for the current cycle:
* fix 80+80 bandwidth warning
* fix powersave with mac80211 TXQ implementation
* use correct way to free SKBs from multicast buffering
* mesh: fix operation ordering to work with all drivers
* mesh: end service period even when peer goes away
* mesh: correct HT opmode validity checks
* pass hw pointer from mac80211 to driver in TPT method,
fixing a bug (in a bit the wrong way, but that's what
we have right now)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
- New vsock device support in host and guest
- Platform IOMMU support in host and guest,
including compatibility quirks for legacy systems.
- Misc fixes and cleanups.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJXofvbAAoJECgfDbjSjVRpUTIH/iEoK9h636tBayXy0PXkPby0
6fMaRFy6H1HgEttgDhJE8Pqg/ba3qaW9Em0fHyFq7Mp2waFHAZ8hAT8phC6TAK3c
CIBnfzyyuI8u3N9SnNOfelPVcwCBfuALuuTsXB/rwKbYQEVv+U5Rdt3Vyx9+lXkj
P005klz7PfqxFhQrrnj4Eh7VawtHwmMuLH8YoWpCZpM71dHPo6eL+3ftKwhH2boo
qK86uVprwba03Pewpm13vQnotemfVfUUkjXd4EJpG3dx7E0KZosuj0ZG9OV8mPGQ
Cl2gBdUhocdJgeUnAHmf6tumYi9KFlYfy6xLy44YMmN7FL3E9nQjaKZp25UKfiM=
=ztIm
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio/vhost updates from Michael Tsirkin:
- new vsock device support in host and guest
- platform IOMMU support in host and guest, including compatibility
quirks for legacy systems.
- misc fixes and cleanups.
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
VSOCK: Use kvfree()
vhost: split out vringh Kconfig
vhost: detect 32 bit integer wrap around
vhost: new device IOTLB API
vhost: drop vringh dependency
vhost: convert pre sorted vhost memory array to interval tree
vhost: introduce vhost memory accessors
VSOCK: Add Makefile and Kconfig
VSOCK: Introduce vhost_vsock.ko
VSOCK: Introduce virtio_transport.ko
VSOCK: Introduce virtio_vsock_common.ko
VSOCK: defer sock removal to transports
VSOCK: transport-specific vsock_transport functions
vhost: drop vringh dependency
vop: pull in vhost Kconfig
virtio: new feature to detect IOMMU device quirk
balloon: check the number of available pages in leak balloon
vhost: lockless enqueuing
vhost: simplify work flushing
Inside the kafs filesystem it is possible to occasionally have a call
processed and terminated before we've had a chance to check whether we need
to clean up the rx queue for that call because afs_send_simple_reply() ends
the call when it is done, but this is done in a workqueue item that might
happen to run to completion before afs_deliver_to_call() completes.
Further, it is possible for rxrpc_kernel_send_data() to be called to send a
reply before the last request-phase data skb is released. The rxrpc skb
destructor is where the ACK processing is done and the call state is
advanced upon release of the last skb. ACK generation is also deferred to
a work item because it's possible that the skb destructor is not called in
a context where kernel_sendmsg() can be invoked.
To this end, the following changes are made:
(1) kernel_rxrpc_data_consumed() is added. This should be called whenever
an skb is emptied so as to crank the ACK and call states. This does
not release the skb, however. kernel_rxrpc_free_skb() must now be
called to achieve that. These together replace
rxrpc_kernel_data_delivered().
(2) kernel_rxrpc_data_consumed() is wrapped by afs_data_consumed().
This makes afs_deliver_to_call() easier to work as the skb can simply
be discarded unconditionally here without trying to work out what the
return value of the ->deliver() function means.
The ->deliver() functions can, via afs_data_complete(),
afs_transfer_reply() and afs_extract_data() mark that an skb has been
consumed (thereby cranking the state) without the need to
conditionally free the skb to make sure the state is correct on an
incoming call for when the call processor tries to send the reply.
(3) rxrpc_recvmsg() now has to call kernel_rxrpc_data_consumed() when it
has finished with a packet and MSG_PEEK isn't set.
(4) rxrpc_packet_destructor() no longer calls rxrpc_hard_ACK_data().
Because of this, we no longer need to clear the destructor and put the
call before we free the skb in cases where we don't want the ACK/call
state to be cranked.
(5) The ->deliver() call-type callbacks are made to return -EAGAIN rather
than 0 if they expect more data (afs_extract_data() returns -EAGAIN to
the delivery function already), and the caller is now responsible for
producing an abort if that was the last packet.
(6) There are many bits of unmarshalling code where:
ret = afs_extract_data(call, skb, last, ...);
switch (ret) {
case 0: break;
case -EAGAIN: return 0;
default: return ret;
}
is to be found. As -EAGAIN can now be passed back to the caller, we
now just return if ret < 0:
ret = afs_extract_data(call, skb, last, ...);
if (ret < 0)
return ret;
(7) Checks for trailing data and empty final data packets has been
consolidated as afs_data_complete(). So:
if (skb->len > 0)
return -EBADMSG;
if (!last)
return 0;
becomes:
ret = afs_data_complete(call, skb, last);
if (ret < 0)
return ret;
(8) afs_transfer_reply() now checks the amount of data it has against the
amount of data desired and the amount of data in the skb and returns
an error to induce an abort if we don't get exactly what we want.
Without these changes, the following oops can occasionally be observed,
particularly if some printks are inserted into the delivery path:
general protection fault: 0000 [#1] SMP
Modules linked in: kafs(E) af_rxrpc(E) [last unloaded: af_rxrpc]
CPU: 0 PID: 1305 Comm: kworker/u8:3 Tainted: G E 4.7.0-fsdevel+ #1303
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
Workqueue: kafsd afs_async_workfn [kafs]
task: ffff88040be041c0 ti: ffff88040c070000 task.ti: ffff88040c070000
RIP: 0010:[<ffffffff8108fd3c>] [<ffffffff8108fd3c>] __lock_acquire+0xcf/0x15a1
RSP: 0018:ffff88040c073bc0 EFLAGS: 00010002
RAX: 6b6b6b6b6b6b6b6b RBX: 0000000000000000 RCX: ffff88040d29a710
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88040d29a710
RBP: ffff88040c073c70 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: ffff88040be041c0 R15: ffffffff814c928f
FS: 0000000000000000(0000) GS:ffff88041fa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fa4595f4750 CR3: 0000000001c14000 CR4: 00000000001406f0
Stack:
0000000000000006 000000000be04930 0000000000000000 ffff880400000000
ffff880400000000 ffffffff8108f847 ffff88040be041c0 ffffffff81050446
ffff8803fc08a920 ffff8803fc08a958 ffff88040be041c0 ffff88040c073c38
Call Trace:
[<ffffffff8108f847>] ? mark_held_locks+0x5e/0x74
[<ffffffff81050446>] ? __local_bh_enable_ip+0x9b/0xa1
[<ffffffff8108f9ca>] ? trace_hardirqs_on_caller+0x16d/0x189
[<ffffffff810915f4>] lock_acquire+0x122/0x1b6
[<ffffffff810915f4>] ? lock_acquire+0x122/0x1b6
[<ffffffff814c928f>] ? skb_dequeue+0x18/0x61
[<ffffffff81609dbf>] _raw_spin_lock_irqsave+0x35/0x49
[<ffffffff814c928f>] ? skb_dequeue+0x18/0x61
[<ffffffff814c928f>] skb_dequeue+0x18/0x61
[<ffffffffa009aa92>] afs_deliver_to_call+0x344/0x39d [kafs]
[<ffffffffa009ab37>] afs_process_async_call+0x4c/0xd5 [kafs]
[<ffffffffa0099e9c>] afs_async_workfn+0xe/0x10 [kafs]
[<ffffffff81063a3a>] process_one_work+0x29d/0x57c
[<ffffffff81064ac2>] worker_thread+0x24a/0x385
[<ffffffff81064878>] ? rescuer_thread+0x2d0/0x2d0
[<ffffffff810696f5>] kthread+0xf3/0xfb
[<ffffffff8160a6ff>] ret_from_fork+0x1f/0x40
[<ffffffff81069602>] ? kthread_create_on_node+0x1cf/0x1cf
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The variable is added to allow the driver an easy access to
it's own hw->priv when the op is invoked.
This fixes a crash in wlcore because it was relying on a
station pointer that wasn't initialized yet. It's the wrong
way to fix the crash, but it solves the problem for now and
it does make sense to have the hw pointer here.
Signed-off-by: Maxim Altshul <maxim.altshul@ti.com>
[rewrite commit message, fix indentation]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There was only one use of __initdata_refok and __exit_refok
__init_refok was used 46 times against 82 for __ref.
Those definitions are obsolete since commit 312b1485fb ("Introduce new
section reference annotations tags: __ref, __refdata, __refconst")
This patch removes the following compatibility definitions and replaces
them treewide.
/* compatibility defines */
#define __init_refok __ref
#define __initdata_refok __refdata
#define __exit_refok __ref
I can also provide separate patches if necessary.
(One patch per tree and check in 1 month or 2 to remove old definitions)
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1466796271-3043-1-git-send-email-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This module contains the common code and header files for the following
virtio_transporto and vhost_vsock kernel modules.
Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The virtio transport will implement graceful shutdown and the related
SO_LINGER socket option. This requires orphaning the sock but keeping
it in the table of connections after .release().
This patch adds the vsock_remove_sock() function and leaves it up to the
transport when to remove the sock.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
struct vsock_transport contains function pointers called by AF_VSOCK
core code. The transport may want its own transport-specific function
pointers and they can be added after struct vsock_transport.
Allow the transport to fetch vsock_transport. It can downcast it to
access transport-specific function pointers.
The virtio transport will use this.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Prior to this patch, sctp defined TCP_CLOSING as SCTP_SS_CLOSING.
TCP_CLOSING is such a special sk state in TCP that inet common codes
even exclude it.
For instance, inet_accept thinks the accept sk's state never be
TCP_CLOSING, or it will give a WARN_ON. TCP works well with that
while SCTP may trigger the call trace, as CLOSING state in SCTP
has different meaning from TCP.
This fix is to change to use TCP_CLOSE_WAIT as SCTP_SS_CLOSING,
instead of TCP_CLOSING. Some side-effects could be expected,
regardless of not being used before. inet_accept will accept it
now.
I did all the func_tests in lksctp-tools and ran sctp codnomicon
fuzzer tests against this patch, no regression or failure found.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull security subsystem updates from James Morris:
"Highlights:
- TPM core and driver updates/fixes
- IPv6 security labeling (CALIPSO)
- Lots of Apparmor fixes
- Seccomp: remove 2-phase API, close hole where ptrace can change
syscall #"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (156 commits)
apparmor: fix SECURITY_APPARMOR_HASH_DEFAULT parameter handling
tpm: Add TPM 2.0 support to the Nuvoton i2c driver (NPCT6xx family)
tpm: Factor out common startup code
tpm: use devm_add_action_or_reset
tpm2_i2c_nuvoton: add irq validity check
tpm: read burstcount from TPM_STS in one 32-bit transaction
tpm: fix byte-order for the value read by tpm2_get_tpm_pt
tpm_tis_core: convert max timeouts from msec to jiffies
apparmor: fix arg_size computation for when setprocattr is null terminated
apparmor: fix oops, validate buffer size in apparmor_setprocattr()
apparmor: do not expose kernel stack
apparmor: fix module parameters can be changed after policy is locked
apparmor: fix oops in profile_unpack() when policy_db is not present
apparmor: don't check for vmalloc_addr if kvzalloc() failed
apparmor: add missing id bounds check on dfa verification
apparmor: allow SYS_CAP_RESOURCE to be sufficient to prlimit another task
apparmor: use list_next_entry instead of list_entry_next
apparmor: fix refcount race when finding a child profile
apparmor: fix ref count leak when profile sha1 hash is read
apparmor: check that xindex is in trans_table bounds
...
After the previous patch, struct tc_action should be enough
to represent the generic tc action, tcf_common is not necessary
any more. This patch gets rid of it to make tc action code
more readable.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
struct tc_action is confusing, currently we use it for two purposes:
1) Pass in arguments and carry out results from helper functions
2) A generic representation for tc actions
The first one is error-prone, since we need to make sure we don't
miss anything. This patch aims to get rid of this use, by moving
tc_action into tcf_common, so that they are allocated together
in hashtable and can be cast'ed easily.
And together with the following patch, we could really make
tc_action a generic representation for all tc actions and each
type of action can inherit from it.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When CONFIG_NET_CLS_ACT isn't set 'struct tcf_exts' has no member named
'actions' and we therefore must not access it. Otherwise compilation
fails.
Fix this by introducing a new macro similar to tc_no_actions(), which
always returns 'false' if CONFIG_NET_CLS_ACT isn't set.
Fixes: 763b4b70af ("mlxsw: spectrum: Add support in matchall mirror TC offloading")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix clang build warning:
./include/net/gtp.h:1:9: warning: '_GTP_H_' is used as a header
guard here, followed by #define of a different macro [-Wheader-guard]
fix by defining _GTP_H_ and not _GTP_H
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The helper function is_tcf_mirred_mirror helps finding whether an action
struct is of type mirred and is configured to be of type mirror.
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following the work that have been done on offloading classifiers like u32
and flower, now the match-all classifier hw offloading is possible. if
the interface supports tc offloading.
To control the offloading, two tc flags have been introduced: skip_sw and
skip_hw. Typical usage:
tc filter add dev eth25 parent ffff: \
matchall skip_sw \
action mirred egress mirror \
dev eth27
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for net-next,
they are:
1) Count pre-established connections as active in "least connection"
schedulers such that pre-established connections to avoid overloading
backend servers on peak demands, from Michal Kubecek via Simon Horman.
2) Address a race condition when resizing the conntrack table by caching
the bucket size when fulling iterating over the hashtable in these
three possible scenarios: 1) dump via /proc/net/nf_conntrack,
2) unlinking userspace helper and 3) unlinking custom conntrack timeout.
From Liping Zhang.
3) Revisit early_drop() path to perform lockless traversal on conntrack
eviction under stress, use del_timer() as synchronization point to
avoid two CPUs evicting the same entry, from Florian Westphal.
4) Move NAT hlist_head to nf_conn object, this simplifies the existing
NAT extension and it doesn't increase size since recent patches to
align nf_conn, from Florian.
5) Use rhashtable for the by-source NAT hashtable, also from Florian.
6) Don't allow --physdev-is-out from OUTPUT chain, just like
--physdev-out is not either, from Hangbin Liu.
7) Automagically set on nf_conntrack counters if the user tries to
match ct bytes/packets from nftables, from Liping Zhang.
8) Remove possible_net_t fields in nf_tables set objects since we just
simply pass the net pointer to the backend set type implementations.
9) Fix possible off-by-one in h323, from Toby DiPasquale.
10) early_drop() may be called from ctnetlink patch, so we must hold
rcu read size lock from them too, this amends Florian's patch #3
coming in this batch, from Liping Zhang.
11) Use binary search to validate jump offset in x_tables, this
addresses the O(n!) validation that was introduced recently
resolve security issues with unpriviledge namespaces, from Florian.
12) Fix reference leak to connlabel in error path of nft_ct, from Zhang.
13) Three updates for nft_log: Fix log prefix leak in error path. Bail
out on loglevel larger than debug in nft_log and set on the new
NF_LOG_F_COPY_LEN flag when snaplen is specified. Again from Zhang.
14) Allow to filter rule dumps in nf_tables based on table and chain
names.
15) Simplify connlabel to always use 128 bits to store labels and
get rid of unused function in xt_connlabel, from Florian.
16) Replace set_expect_timeout() by mod_timer() from the h323 conntrack
helper, by Gao Feng.
17) Put back x_tables module reference in nft_compat on error, from
Liping Zhang.
18) Add a reference count to the x_tables extensions cache in
nft_compat, so we can remove them when unused and avoid a crash
if the extensions are rmmod, again from Zhang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The conntrack label extension is currently variable-sized, e.g. if
only 2 labels are used by iptables rules then the labels->bits[] array
will only contain one element.
We track size of each label storage area in the 'words' member.
But in nftables and openvswitch we always have to ask for worst-case
since we don't know what bit will be used at configuration time.
As most arches are 64bit we need to allocate 24 bytes in this case:
struct nf_conn_labels {
u8 words; /* 0 1 */
/* XXX 7 bytes hole, try to pack */
long unsigned bits[2]; /* 8 24 */
Make bits a fixed size and drop the words member, it simplifies
the code and only increases memory requirements on x86 when
less than 64bit labels are required.
We still only allocate the extension if its needed.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
so that the caller can update stats accordingly, if needed
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the first NFC pull request for 4.8. We have:
- A fairly large NFC digital stack patchset:
* RTOX fixes.
* Proper DEP RWT support.
* ACK and NACK PDUs handling fixes, in both initiator
and target modes.
* A few memory leak fixes.
- A conversion of the nfcsim driver to use the digital stack.
The driver supports the DEP protocol in both NFC-A and NFC-F.
- Error injection through debugfs for the nfcsim driver.
- Improvements to the port100 driver for the Sony USB chipset, in
particular to the command abort and cancellation code paths.
- A few minor fixes for the pn533, trf7970a and fdp drivers.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXjp/3AAoJEIqAPN1PVmxKjpEP/2bgptSwf2cCql+Jv+YaLLny
PjKr+qBXxqvogclucaGg5iIFnVjJ/pHd+csCeqHWpT1jAK7DtK1IJZjg6K9nD7vO
OQQ49+oxcFLTTy00rbfCFzxRaDDnkhD/qafXkTomEMiH7QVK0qssaxm2FVFVEblW
1NTmcsEUnbDbccQhXnxh+N7Xt2CAgsMIbbyHM+4yQuqGtSYjFd164c3jTL13W4a5
SQEJZkCtI7DIdFd6SiXkTGNjlN/fqXuUqXsf2EHxdFb7EE0c067uHpudp2hFdAem
WmAYjjmIuTRFwRFKPJMLUakSen3XbBKVUbtDnIMYykNWYnC4CmedrCrX3YRw4GQt
hZgkj6o5IweSSf6foIgihurE6m5jqd2mAcauwYC/K9wW5nHLaKg8fd9gAngoWY7P
MKBOCyjqIPWkNDC5tne6qftpsDhCrBcdrAtbkorx0lHF20OFto7Gjzxx1Ca+fnJg
N9/fMulQJu8rz3FYpvfvogQMJjkjeFUfyZDa3/ft/ySU6qohxDwXOFaZ82lieTAo
PztTq8tY7GDrdJdyvvHx78RpRVCJT8qHzBRIiiZRpt9MM/aPSepLcozwM97WrJDa
sPvz0jol4d12VIy02j2ArPjMon1MrQePed+Y1y2OtBt8rGiSUxC94t5LE3aMqPs/
a9tNLZYL/nixpLeXbeWa
=yrVP
-----END PGP SIGNATURE-----
Merge tag 'nfc-next-4.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next
Samuel Ortiz says:
====================
NFC 4.8 pull request
This is the first NFC pull request for 4.8. We have:
- A fairly large NFC digital stack patchset:
* RTOX fixes.
* Proper DEP RWT support.
* ACK and NACK PDUs handling fixes, in both initiator
and target modes.
* A few memory leak fixes.
- A conversion of the nfcsim driver to use the digital stack.
The driver supports the DEP protocol in both NFC-A and NFC-F.
- Error injection through debugfs for the nfcsim driver.
- Improvements to the port100 driver for the Sony USB chipset, in
particular to the command abort and cancellation code paths.
- A few minor fixes for the pn533, trf7970a and fdp drivers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add nf_ct_helper_init(), nf_conntrack_helpers_register() and
nf_conntrack_helpers_unregister() functions to avoid repetitive
opencoded initialization in helpers.
This patch keeps an id parameter for nf_ct_helper_init() not to break
helper matching by name that has been inconsistently exposed to
userspace through ports, eg. ftp-2121, and through an incremental id,
eg. tftp-1.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Johan Hedberg says:
====================
pull request: bluetooth-next 2016-07-19
Here's likely the last bluetooth-next pull request for the 4.8 kernel:
- Fix for L2CAP setsockopt
- Fix for is_suspending flag handling in btmrvl driver
- Addition of Bluetooth HW & FW info fields to debugfs
- Fix to use int instead of char for callback status.
The last one (from Geert Uytterhoeven) is actually not purely a
Bluetooth (or 802.15.4) patch, but it was agreed with other maintainers
that we take it through the bluetooth-next tree.
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This manages NCSI packages and channels:
* The available packages and channels are enumerated in the first
time of calling ncsi_start_dev(). The channels' capabilities are
probed in the meanwhile. The NCSI network topology won't change
until the NCSI device is destroyed.
* There in a queue in every NCSI device. The element in the queue,
channel, is waiting for configuration (bringup) or suspending
(teardown). The channel's state (inactive/active) indicates the
futher action (configuration or suspending) will be applied on the
channel. Another channel's state (invisible) means the requested
action is being applied.
* The hardware arbitration will be enabled if all available packages
and channels support it. All available channels try to provide
service when hardware arbitration is enabled. Otherwise, one channel
is selected as the active one at once.
* When channel is in active state, meaning it's providing service, a
timer started to retrieve the channe's link status. If the channel's
link status fails to be updated in the determined period, the channel
is going to be reconfigured. It's the error handling implementation
as defined in NCSI spec.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
NCSI spec (DSP0222) defines several objects: package, channel, mode,
filter, version and statistics etc. This introduces the data structs
to represent those objects and implement functions to manage them.
Also, this introduces CONFIG_NET_NCSI for the newly implemented NCSI
stack.
* The user (e.g. netdev driver) dereference NCSI device by
"struct ncsi_dev", which is embedded to "struct ncsi_dev_priv".
The later one is used by NCSI stack internally.
* Every NCSI device can have multiple packages simultaneously, up
to 8 packages. It's represented by "struct ncsi_package" and
identified by 3-bits ID.
* Every NCSI package can have multiple channels, up to 32. It's
represented by "struct ncsi_channel" and identified by 5-bits ID.
* Every NCSI channel has version, statistics, various modes and
filters. They are represented by "struct ncsi_channel_version",
"struct ncsi_channel_stats", "struct ncsi_channel_mode" and
"struct ncsi_channel_filter" separately.
* Apart from AEN (Asynchronous Event Notification), the NCSI stack
works in terms of command and response. This introduces "struct
ncsi_req" to represent a complete NCSI transaction made of NCSI
request and response.
link: https://www.dmtf.org/sites/default/files/standards/documents/DSP0222_1.1.0.pdf
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new function for DSA drivers to handle the switchdev
SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME attribute.
The ageing time is passed as milliseconds.
Also because we can have multiple logical bridges on top of a physical
switch and ageing time are switch-wide, call the driver function with
the fastest ageing time in use on the chip instead of the requested one.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The switchdev value for the SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME
attribute is a clock_t and requires to use helpers such as
clock_t_to_jiffies() to convert to milliseconds.
Change ageing_time type from u32 to clock_t to make it explicit.
Fixes: f55ac58ae6 ("switchdev: add bridge ageing_time attribute")
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This flag indicates whether fragmentation of segments is allowed.
Formerly this policy was hardcoded according to IPSKB_FORWARDED (set by
either ip_forward or ipmr_forward).
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some Bluetooth controllers allow for reading hardware and firmware
related vendor specific infos. If they are available, then they can be
exposed via debugfs now.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
The protoypes for hci_recv_frame and hci_recv_diag are in the wrong
location in the header file. Move them close to all the other hci_dev
related exported functions.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
This helper serves to know if two switchdev port netdevices belong to the
same HW ASIC, e.g to figure out if forwarding offload is possible between them.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Identifying address family operations during rx path is not something
expensive but it's ugly to the eye to have it done multiple times,
specially when we already validated it during initial rx processing.
This patch takes advantage of the now shared sctp_input_cb and make the
pointer to the operations readily available.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SCTP will try to access original IP headers on sctp_recvmsg in order to
copy the addresses used. There are also other places that do similar access
to IP or even SCTP headers. But after 90017accff ("sctp: Add GSO
support") they aren't always there because they are only present in the
header skb.
SCTP handles the queueing of incoming data by cloning the incoming skb
and limiting to only the relevant payload. This clone has its cb updated
to something different and it's then queued on socket rx queue. Thus we
need to fix this in two moments.
For rx path, not related to socket queue yet, this patch uses a
partially copied sctp_input_cb to such GSO frags. This restores the
ability to access the headers for this part of the code.
Regarding the socket rx queue, it removes iif member from sctp_event and
also add a chunk pointer on it.
With these changes we're always able to reach the headers again.
The biggest change here is that now the sctp_chunk struct and the
original skb are only freed after the application consumed the buffer.
Note however that the original payload was already like this due to the
skb cloning.
For iif, SCTP's IPv4 code doesn't use it, so no change is necessary.
IPv6 now can fetch it directly from original's IPv6 CB as the original
skb is still accessible.
In the future we probably can simplify sctp_v*_skb_iif() stuff, as
sctp_v4_skb_iif() was called but it's return value not used, and now
it's not even called, but such cleanup is out of scope for this change.
Fixes: 90017accff ("sctp: Add GSO support")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The next patch needs 8 bytes in there. sctp_ulpevent has a hole due to
bad alignment; msg_flags is using 4 bytes while it actually uses only 2, so
we shrink it, and iif member (4 bytes) which can be easily fetched from
another place once the next patch is there, so we remove it and thus
creating space for 8 bytes.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We process input path in other files too and having access to it is
nice, so move it to a header where it's shared.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2016-07-13
Here's our main bluetooth-next pull request for the 4.8 kernel:
- Fixes and cleanups in 802.15.4 and 6LoWPAN code
- Fix out of bounds issue in btmrvl driver
- Fixes to Bluetooth socket recvmsg return values
- Use crypto_cipher_encrypt_one() instead of crypto_skcipher
- Cleanup of Bluetooth connection sysfs interface
- New Authentication failure reson code for Disconnected mgmt event
- New USB IDs for Atheros, Qualcomm and Intel Bluetooth controllers
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Dccp verifies packet integrity, including length, at initial rcv in
dccp_invalid_packet, later pulls headers in dccp_enqueue_skb.
A call to sk_filter in-between can cause __skb_pull to wrap skb->len.
skb_copy_datagram_msg interprets this as a negative value, so
(correctly) fails with EFAULT. The negative length is reported in
ioctl SIOCINQ or possibly in a DCCP_WARN in dccp_close.
Introduce an sk_receive_skb variant that caps how small a filter
program can trim packets, and call this in dccp with the header
length. Excessively trimmed packets are now processed normally and
queued for reception as 0B payloads.
Fixes: 7c657876b6 ("[DCCP]: Initial implementation")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains Netfilter/IPVS fixes for your net tree.
they are:
1) Fix leak in the error path of nft_expr_init(), from Liping Zhang.
2) Tracing from nf_tables cannot be disabled, also from Zhang.
3) Fix an integer overflow on 32bit archs when setting the number of
hashtable buckets, from Florian Westphal.
4) Fix configuration of ipvs sync in backup mode with IPv6 address,
from Quentin Armitage via Simon Horman.
5) Fix incorrect timeout calculation in nft_ct NFT_CT_EXPIRATION,
from Florian Westphal.
6) Skip clash resolution in conntrack insertion races if NAT is in
place.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
prsctp PRIO policy is a policy to abandon lower priority chunks when
asoc doesn't have enough snd buffer, so that the current chunk with
higher priority can be queued successfully.
Similar to TTL/RTX policy, we will set the priority of the chunk to
prsctp_param with sinfo->sinfo_timetolive in sctp_set_prsctp_policy().
So if PRIO policy is enabled, msg->expire_at won't work.
asoc->sent_cnt_removable will record how many chunks can be checked to
remove. If priority policy is enabled, when the chunk is queued into
the out_queue, we will increase sent_cnt_removable. When the chunk is
moved to abandon_queue or dequeue and free, we will decrease
sent_cnt_removable.
In sctp_sendmsg, we will check if there is enough snd buffer for current
msg and if sent_cnt_removable is not 0. Then try to abandon chunks in
sctp_prune_prsctp when sendmsg from the retransmit/transmited queue, and
free chunks from out_queue in right order until the abandon+free size >
msg_len - sctp_wfree. For the abandon size, we have to wait until it
sends FORWARD TSN, receives the sack and the chunks are really freed.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
prsctp TTL policy is a policy to abandon chunks when they expire
at the specific time in local stack. It's similar with expires_at
in struct sctp_datamsg.
This patch uses sinfo->sinfo_timetolive to set the specific time for
TTL policy. sinfo->sinfo_timetolive is also used for msg->expires_at.
So if prsctp_enable or TTL policy is not enabled, msg->expires_at
still works as before.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds SCTP_PR_ASSOC_STATUS to sctp sockopt, which is used
to dump the prsctp statistics info from the asoc. The prsctp statistics
includes abandoned_sent/unsent from the asoc. abandoned_sent is the
count of the packets we drop packets from retransmit/transmited queue,
and abandoned_unsent is the count of the packets we drop from out_queue
according to the policy.
Note: another option for prsctp statistics dump described in rfc is
SCTP_PR_STREAM_STATUS, which is used to dump the prsctp statistics
info from each stream. But by now, linux doesn't yet have per stream
statistics info, it needs rfc6525 to be implemented. As the prsctp
statistics for each stream has to be based on per stream statistics,
we will delay it until rfc6525 is done in linux.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to section 4.5 of rfc7496, prsctp_enable should be per asoc.
We will add prsctp_enable to both asoc and ep, and replace the places
where it used net.sctp->prsctp_enable with asoc->prsctp_enable.
ep->prsctp_enable will be initialized with net.sctp->prsctp_enable, and
asoc->prsctp_enable will be initialized with ep->prsctp_enable. We can
also modify it's value through sockopt SCTP_PR_SUPPORTED.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can pass the netns pointer as parameter to the functions that need to
gain access to it. From basechains, I didn't find any client for this
field anymore so let's remove this too.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
It did use a fixed-size bucket list plus single lock to protect add/del.
Unlike the main conntrack table we only need to add and remove keys.
Convert it to rhashtable to get table autosizing and per-bucket locking.
The maximum number of entries is -- as before -- tied to the number of
conntracks so we do not need another upperlimit.
The change does not handle rhashtable_remove_fast error, only possible
"error" is -ENOENT, and that is something that can happen legitimetely,
e.g. because nat module was inserted at a later time and no src manip
took place yet.
Tested with http-client-benchmark + httpterm with DNAT and SNAT rules
in place.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The nat extension structure is 32bytes in size on x86_64:
struct nf_conn_nat {
struct hlist_node bysource; /* 0 16 */
struct nf_conn * ct; /* 16 8 */
union nf_conntrack_nat_help help; /* 24 4 */
int masq_index; /* 28 4 */
/* size: 32, cachelines: 1, members: 4 */
/* last cacheline: 32 bytes */
};
The hlist is needed to quickly check for possible tuple collisions
when installing a new nat binding. Storing this in the extension
area has two drawbacks:
1. We need ct backpointer to get the conntrack struct from the extension.
2. When reallocation of extension area occurs we need to fixup the bysource
hash head via hlist_replace_rcu.
We can avoid both by placing the hlist_head in nf_conn and place nf_conn in
the bysource hash rather than the extenstion.
We can also remove the ->move support; no other extension needs it.
Moving the entire nat extension into nf_conn would be possible as well but
then we have to add yet another callback for deletion from the bysource
hash table rather than just using nat extension ->destroy hook for this.
nf_conn size doesn't increase due to aligment, followup patch replaces
hlist_node with single pointer.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We don't need to acquire the bucket lock during early drop, we can
use lockless traveral just like ____nf_conntrack_find.
The timer deletion serves as synchronization point, if another cpu
attempts to evict same entry, only one will succeed with timer deletion.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When we do "cat /proc/net/nf_conntrack", and meanwhile resize the conntrack
hash table via /sys/module/nf_conntrack/parameters/hashsize, race will
happen, because reader can observe a newly allocated hash but the old size
(or vice versa). So oops will happen like follows:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000017
IP: [<ffffffffa0418e21>] seq_print_acct+0x11/0x50 [nf_conntrack]
Call Trace:
[<ffffffffa0412f4e>] ? ct_seq_show+0x14e/0x340 [nf_conntrack]
[<ffffffff81261a1c>] seq_read+0x2cc/0x390
[<ffffffff812a8d62>] proc_reg_read+0x42/0x70
[<ffffffff8123bee7>] __vfs_read+0x37/0x130
[<ffffffff81347980>] ? security_file_permission+0xa0/0xc0
[<ffffffff8123cf75>] vfs_read+0x95/0x140
[<ffffffff8123e475>] SyS_read+0x55/0xc0
[<ffffffff817c2572>] entry_SYSCALL_64_fastpath+0x1a/0xa4
It is very easy to reproduce this kernel crash.
1. open one shell and input the following cmds:
while : ; do
echo $RANDOM > /sys/module/nf_conntrack/parameters/hashsize
done
2. open more shells and input the following cmds:
while : ; do
cat /proc/net/nf_conntrack
done
3. just wait a monent, oops will happen soon.
The solution in this patch is based on Florian's Commit 5e3c61f981
("netfilter: conntrack: fix lookup race during hash resize"). And
add a wrapper function nf_conntrack_get_ht to get hash and hsize
suggested by Florian Westphal.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When sending an ATR_REQ, the initiator must wait for the ATR_RES at
least 'RWT(nfcdep,activation) + dRWT(nfcdep)' and no more than
'RWT(nfcdep,activation) + dRWT(nfcdep) + dT(nfcdep,initiator)'. This
gives a timeout value between 1237 ms and 1337 ms. This patch defines
DIGITAL_ATR_RES_RWT to 1337 used for the timeout value of ATR_REQ
command.
For other DEP PDUs, the initiator must wait between 'RWT + dRWT(nfcdep)'
and 'RWT + dRWT(nfcdep) + dT(nfcdep,initiator)' where RWT is given by
the following formula: '(256 * 16 / f(c)) * 2^wt' where wt is the value
of the TO field in the ATR_RES response and is in the range between 0
and 14. This patch declares a mapping table for wt values and gives RWT
max values between 100 ms and 5049 ms.
This patch also defines DIGITAL_ATR_RES_TO_WT, the maximum wt value in
target mode, to 8.
Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
This patch fixes the way an I-PDU is saved in case it needs to be sent
again. It is now copied using pskb_copy() and not simply referenced
using skb_get() since it could be modified by the driver.
digital_in_send_saved_skb() and digital_tg_send_saved_skb() still get a
reference on the saved skb which is re-sent but release it if the send
operation fails. That way the caller doesn't have to take care about skb
ref in case of error.
RTOX supervisor PDU must not be saved as this can override a previously
saved I-PDU that should be re-sent later on.
Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The HCI_BREDR naming is confusing since it actually stands for Primary
Bluetooth Controller. Which is a term that has been used in the latest
standard. However from a legacy point of view there only really have
been Basic Rate (BR) and Enhanced Data Rate (EDR). Recent versions of
Bluetooth introduced Low Energy (LE) and made this terminology a little
bit confused since Dual Mode Controllers include BR/EDR and LE. To
simplify this the name HCI_PRIMARY stands for the Primary Controller
which can be a single mode or dual mode controller.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
The routing table of every switch in a tree is currently initialized to
all zeros. This is an issue since 0 is a valid port number.
Add a DSA_RTABLE_NONE=-1 constant to initialize the signed values of the
routing table pointing to other switches.
This fixes the device mapping of the mv88e6xxx driver where the port
pointing to the switch itself and to non-existent switches was wrongly
configured to be 0. It is now set to the expected 0xf value.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to compute timeout.expires - jiffies, not the other way around.
Add a helper, another patch can then later change more places in
conntrack code where we currently open-code this.
Will allow us to only change one place later when we remove per-ct timer.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch cleanups the WARN_ON which occurs when the sk buffer has
insufficient buffer space by moving the WARN_ON into if condition.
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch fixes ieee802154_get_fc_from_skb function on big endian
machines. The function get_unaligned_le16 converts the byte order to
host byte order but we want to keep the byte order like in mac header.
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The RIOT-OS stack does send intra-pan frames but don't set the intra pan
flag inside the mac header. It seems this is valid frame addressing but
inefficient. Anyway this patch adds a new function for intra pan
addressing, doesn't matter if intra pan flag or source and destination
are the same. The newly introduction function will be used to check on
intra pan addressing for 6lowpan.
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch adds ieee802154_skb_src_pan function to get the pointer
address of the source pan id at skb mac pointer.
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch adds ieee802154_skb_dst_pan function to get the pointer
address of the destination pan id at skb mac pointer.
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch adds netns support for 802.15.4 subsystem. Most parts are
copy&pasted from wireless subsystem, it has the identically userspace
API.
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The PAD define should be above the experimental support. We don't care
about if we break userspace in experimental stuff but PAD is part of the
existing UAPI.
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Stefan Schmidt<stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
* beacon report (for radio measurement) support in cfg80211/mac80211
* hwsim: allow wmediumd in namespaces
* mac80211: extend 160MHz workaround to CSA IEs
* mesh: properly encrypt group-addressed privacy action frames
* mesh: allow setting peer AID
* first steps for MU-MIMO monitor mode
* along with various other cleanups and improvements
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJXfSCuAAoJEGt7eEactAAdaugP/ilrcELQRsIN5ZXCAKYZuwXV
T01JPwgOaWL9ILu7h1SfG/+j9kzMnyk4WmRpeoj2FGNcyfG2AvULWSLpQJ2abwgQ
8o/emuLinQwRENevaMUTRSOE0HkXoFPCbbq37+a2i6bAv1QSYY3A0xvWpcU5fZ4D
7CKYDYPBAdMXYwEwy1g4nYWfDAYqS4rthr3l3rS1Cy7Q2T1ZlMlD90GjD7oeQAEw
orKulhkkDSzvxfvZCYTzXmUoBQE8sNXGDD+OFsJyowyt+ugM/xan+2tmhCaHSnda
HpdCS2aRj779UBn9cOfELjffTNpS++PM6KFd8ZDaPcJSMginn/BAHTOeNfNUfL0Q
+Enu59I82qMDzbG2z1Qezzjv7OTzyEvyvYzNbLOqljTBSBklDa3rHwhyk+g1oVBH
+4xX1Vk5QBLde+Q0NS0gTkcqOQK8KT5+HEqiUfgLSNDETN0lSGsKbtvMfU/ikE1Y
aLRkTp7nzUd03qjIFLS6RMf7JjucWWzH1ZXTHvbpDFAG7riOhYRD3Sw+0e7madTd
+HXjH9dnOGnGPDL+FyDwtW6iclYwNjcIPQiNdOjwWfMA2Wmr7iq+aFUptRCwQTHB
WtgJ3f8OHax2JXcm4grYfxELZip5vbWJJHUC84Drvmzw3X7FRITf+OEWjdNOzsRD
Fc7w5ceThh9Id1BvcH2+
=R1qP
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2016-07-06' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
One more set of new features:
* beacon report (for radio measurement) support in cfg80211/mac80211
* hwsim: allow wmediumd in namespaces
* mac80211: extend 160MHz workaround to CSA IEs
* mesh: properly encrypt group-addressed privacy action frames
* mesh: allow setting peer AID
* first steps for MU-MIMO monitor mode
* along with various other cleanups and improvements
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/mellanox/mlx5/core/en.h
drivers/net/ethernet/mellanox/mlx5/core/en_main.c
drivers/net/usb/r8152.c
All three conflicts were overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next,
they are:
1) Don't use userspace datatypes in bridge netfilter code, from
Tobin Harding.
2) Iterate only once over the expectation table when removing the
helper module, instead of once per-netns, from Florian Westphal.
3) Extra sanitization in xt_hook_ops_alloc() to return error in case
we ever pass zero hooks, xt_hook_ops_alloc():
4) Handle NFPROTO_INET from the logging core infrastructure, from
Liping Zhang.
5) Autoload loggers when TRACE target is used from rules, this doesn't
change the behaviour in case the user already selected nfnetlink_log
as preferred way to print tracing logs, also from Liping Zhang.
6) Conntrack slabs with SLAB_HWCACHE_ALIGN to allow rearranging fields
by cache lines, increases the size of entries in 11% per entry.
From Florian Westphal.
7) Skip zone comparison if CONFIG_NF_CONNTRACK_ZONES=n, from Florian.
8) Remove useless defensive check in nf_logger_find_get() from Shivani
Bhardwaj.
9) Remove zone extension as place it in the conntrack object, this is
always include in the hashing and we expect more intensive use of
zones since containers are in place. Also from Florian Westphal.
10) Owner match now works from any namespace, from Eric Bierdeman.
11) Make sure we only reply with TCP reset to TCP traffic from
nf_reject_ipv4, patch from Liping Zhang.
12) Introduce --nflog-size to indicate amount of network packet bytes
that are copied to userspace via log message, from Vishwanath Pai.
This obsoletes --nflog-range that has never worked, it was designed
to achieve this but it has never worked.
13) Introduce generic macros for nf_tables object generation masks.
14) Use generation mask in table, chain and set objects in nf_tables.
This allows fixes interferences with ongoing preparation phase of
the commit protocol and object listings going on at the same time.
This update is introduced in three patches, one per object.
15) Check if the object is active in the next generation for element
deactivation in the rbtree implementation, given that deactivation
happens from the commit phase path we have to observe the future
status of the object.
16) Support for deletion of just added elements in the hash set type.
17) Allow to resize hashtable from /proc entry, not only from the
obscure /sys entry that maps to the module parameter, from Florian
Westphal.
18) Get rid of NFT_BASECHAIN_DISABLED, this code is not exercised
anymore since we tear down the ruleset whenever the netdevice
goes away.
19) Support for matching inverted set lookups, from Arturo Borrero.
20) Simplify the iptables_mangle_hook() by removing a superfluous
extra branch.
21) Introduce ether_addr_equal_masked() and use it from the netfilter
codebase, from Joe Perches.
22) Remove references to "Use netfilter MARK value as routing key"
from the Netfilter Kconfig description given that this toggle
doesn't exists already for 10 years, from Moritz Sichert.
23) Introduce generic NF_INVF() and use it from the xtables codebase,
from Joe Perches.
24) Setting logger to NONE via /proc was not working unless explicit
nul-termination was included in the string. This fixes seems to
leave the former behaviour there, so we don't break backward.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously, mesh power management functionality works only with kernel
MPM. Because user space MPM did not report mesh peer AID to kernel,
the kernel could not identify the bit in TIM element. So this patch
adds mesh peer AID setting API.
Signed-off-by: Masashi Honma <masashi.honma@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add the following to support beacon report radio measurement
with the measurement mode field set to passive or active:
1. Propagate the required scan duration to the device
2. Report the scan start time (in terms of TSF)
3. Report each BSS's detection time (also in terms of TSF)
TSF times refer to the BSS that the interface that requested the
scan is connected to.
Signed-off-by: Assaf Krauss <assaf.krauss@intel.com>
Signed-off-by: Avraham Stern <avraham.stern@intel.com>
[changed ath9k/10k, at76c59x-usb, iwlegacy, wl1251 and wlcore to match
the new API]
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Beacon report radio measurement requires reporting observed BSSs
on the channels specified in the beacon request. If the measurement
mode is set to passive or active, it requires actually performing a
scan (passive or active, accordingly), and reporting the time that
the scan was started and the time each beacon/probe was received
(both in terms of TSF of the BSS of the requesting AP). If the
request mode is table, this information is optional.
In addition, the radio measurement request specifies the channel
dwell time for the measurement.
In order to use scan for beacon report when the mode is active or
passive, add a parameter to scan request that specifies the
channel dwell time, and add scan start time and beacon received time
to scan results information.
Supporting beacon report is required for Multi Band Operation (MBO).
Signed-off-by: Assaf Krauss <assaf.krauss@intel.com>
Signed-off-by: David Spinadel <david.spinadel@intel.com>
Signed-off-by: Avraham Stern <avraham.stern@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
add API to support VHT MU-MIMO air sniffer.
in MU-MIMO there are parallel frames on the air while the HW
has only one RX.
add the capability to sniff one of the MU-MIMO parallel frames by
giving the sniffer additional information so it'll know which
of the parallel frames it shall follow.
Add attribute - NL80211_ATTR_MU_MIMO_GROUP_DATA - for getting
a MU-MIMO groupID in order to monitor packets from that group
using VHT MU-MIMO.
And add attribute -NL80211_ATTR_MU_MIMO_FOLLOW_ADDR - for passing
MAC address to monitor mode.
that option will be used by VHT MU-MIMO air sniffer to follow a
station according to it's MAC address using VHT MU-MIMO.
Signed-off-by: Aviya Erenfeld <aviya.erenfeld@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When the data plane is offloaded the traffic doesn't go through the
networking stack. Therefore, after first resolving a neighbour the NUD
state machine will transition it from REACHABLE to STALE until it's
finally deleted by the garbage collector.
To prevent such situations the offloading driver should notify the NUD
state machine on any neighbours that were recently used. The driver's
polling interval should be set so that the NUD state machine can
function as if the traffic wasn't offloaded.
Currently, there are no in-tree drivers that can report confirmation for
a neighbour, but only 'used' indication. Therefore, the polling interval
should be set according to DELAY_FIRST_PROBE_TIME, as a neighbour will
transition from REACHABLE state to DELAY (instead of STALE) if "a packet
was sent within the last DELAY_FIRST_PROBE_TIME seconds" (RFC 4861).
Send a netevent whenever the DELAY_FIRST_PROBE_TIME changes - either via
netlink or sysctl - so that offloading drivers can correctly set their
polling interval.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extremely useful for setting packet type to host so i dont
have to modify the dst mac address using pedit (which requires
that i know the mac address)
Example usage:
tc filter add dev eth0 parent ffff: protocol ip pref 9 u32 \
match ip src 5.5.5.5/32 \
flowid 1:5 action skbedit ptype host
This will tag all packets incoming from 5.5.5.5 with type
PACKET_HOST
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This replaces the polling work struct with a delayed work struct and add
a 10 ms delay between 2 poll cycles. This avoids to flood the device
with 'switch off'/'switch on' commands.
Signed-off-by: Thierry Escande <thierry.escande@collabora.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
It used to be EXPORTed, but then EXPORT usage was cleaned up
(in 2012), without noticing that the function has no users at all
(and curiously, never had any users).
Delete it.
While at it, remove non-static "inline" hints on nearby functions:
these hints don't work across compilation units anyway,
and these functions are not used in their .c file, thus they are
never inlined. IOW: "inline" here does not help in any way.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Samuel Ortiz <sameo@linux.intel.com>
CC: Christophe Ricard <christophe.ricard@gmail.com>
CC: linux-wireless@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Add the commands to set and show the mode of SRIOV E-Switch, two modes
are supported:
* legacy: operating in the "old" L2 based mode (DMAC --> VF vport)
* switchdev: the E-Switch is referred to as whitebox switch configured
using standard tools such as tc, bridge, openvswitch etc. To allow
working with the tools, for each VF, a VF representor netdevice is
created by the E-Switch manager vendor device driver instance (e.g PF).
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ether_addr_equal_64bits() requires some care about its arguments,
namely that 8 bytes might be read, even if last 2 byte values are not
used.
KASan detected a violation with null_mac_addr and lacpdu_mcast_addr
in bond_3ad.c
Same problem with mac_bcast[] and mac_v6_allmcast[] in bond_alb.c :
Although the 8-byte alignment was there, KASan would detect out
of bound accesses.
Fixes: 815117adaf ("bonding: use ether_addr_equal_unaligned for bond addr compare")
Fixes: bb54e58929 ("bonding: Verify RX LACPDU has proper dest mac-addr")
Fixes: 885a136c52 ("bonding: use compare_ether_addr_64bits() in ALB")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some arches have virtually mapped kernel stacks, or will soon have.
tcp_md5_hash_header() uses an automatic variable to copy tcp header
before mangling th->check and calling crypto function, which might
be problematic on such arches.
David says that using percpu storage is also problematic on non SMP
builds.
Just use kmalloc() to allocate scratch areas.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_skb_dst_mtu uses skb->sk, assuming it is an AF_INET socket (e.g. it
calls ip_sk_use_pmtu which casts sk as an inet_sk).
However, in the case of UDP tunneling, the skb->sk is not necessarily an
inet socket (could be AF_PACKET socket, or AF_UNSPEC if arriving from
tun/tap).
OTOH, the sk passed as an argument throughout IP stack's output path is
the one which is of PMTU interest:
- In case of local sockets, sk is same as skb->sk;
- In case of a udp tunnel, sk is the tunneling socket.
Fix, by passing ip_finish_output's sk to ip_skb_dst_mtu.
This augments 7026b1ddb6 'netfilter: Pass socket pointer down through okfn().'
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In previous commit 01f83d6984
the following comments were added:
"When peer uses tiny windows, there is no use in packetizing to sub-MSS
pieces for the sake of SWS or making sure there are enough packets in
the pipe for fast recovery."
The test should be > TCP_MSS_DEFAULT not >= 512. This allows low end
devices that send an MSS of 536 (TCP_MSS_DEFAULT) to see better network
performance by sending it 536 bytes of data at a time instead of bounding
to half window size (268). Other network stacks work this way, e.g. HP-UX.
Signed-off-by: Shane Seymour <shane.seymour@hpe.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for the IFLA_STATS_LINK_XSTATS_SLAVE attribute
which allows to export per-slave statistics if the master device supports
the linkxstats callback. The attribute is passed down to the linkxstats
callback and it is up to the callback user to use it (an example has been
added to the only current user - the bridge). This allows us to query only
specific slaves of master devices like bridge ports and export only what
we're interested in instead of having to dump all ports and searching only
for a single one. This will be used to export per-port IGMP/MLD stats and
also per-port vlan stats in the future, possibly other statistics as well.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several cases of overlapping changes, except the packet scheduler
conflicts which deal with the addition of the free list parameter
to qdisc_enqueue().
Signed-off-by: David S. Miller <davem@davemloft.net>
SMACK uses similar functions to control CIPSO, these are
the equivalent functions for CALIPSO and follow exactly
the same semantics.
int netlbl_cfg_calipso_add(struct calipso_doi *doi_def,
struct netlbl_audit *audit_info)
Adds a CALIPSO doi.
void netlbl_cfg_calipso_del(u32 doi, struct netlbl_audit *audit_info)
Removes a CALIPSO doi.
int netlbl_cfg_calipso_map_add(u32 doi, const char *domain,
const struct in6_addr *addr,
const struct in6_addr *mask,
struct netlbl_audit *audit_info)
Creates a mapping between a domain and a CALIPSO doi. If
addr and mask are non-NULL this creates an address-selector
type mapping.
This also extends netlbl_cfg_map_del() to remove IPv6 address-selector
mappings.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
This works in exactly the same way as the CIPSO label cache.
The idea is to allow the lsm to cache the result of a secattr
lookup so that it doesn't need to perform the lookup for
every skbuff.
It introduces two sysctl controls:
calipso_cache_enable - enables/disables the cache.
calipso_cache_bucket_size - sets the size of a cache bucket.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Lengths, checksum and the DOI are checked. Checking of the
level and categories are left for the socket layer.
CRC validation is performed in the calipso module to avoid
unconditionally linking crc_ccitt() into ipv6.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
This makes it possible to route the error to the appropriate
labelling engine. CALIPSO is far less verbose than CIPSO
when encountering a bogus packet, so there is no need for a
CALIPSO error handler.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
In some cases, the lsm needs to add the label to the skbuff directly.
A NF_INET_LOCAL_OUT IPv6 hook is added to selinux to match the IPv4
behaviour. This allows selinux to label the skbuffs that it requires.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Request sockets need to have a label that takes into account the
incoming connection as well as their parent's label. This is used
for the outgoing SYN-ACK and for their child full-socket.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
If set, these will take precedence over the parent's options during
both sending and child creation. If they're not set, the parent's
options (if any) will be used.
This is to allow the security_inet_conn_request() hook to modify the
IPv6 options in just the same way that it already may do for IPv4.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
CALIPSO is a hop-by-hop IPv6 option. A lot of this patch is based on
the equivalent CISPO code. The main difference is due to manipulating
the options in the hop-by-hop header.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
This is to allow the CALIPSO labelling engine to use these.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
The functionality is equivalent to ipv6_renew_options() except
that the newopt pointer is in kernel, not user, memory
The kernel memory implementation will be used by the CALIPSO network
labelling engine, which needs to be able to set IPv6 hop-by-hop
options.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Remove a specified DOI through the NLBL_CALIPSO_C_REMOVE command.
It requires the attribute:
NLBL_CALIPSO_A_DOI.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Enumerate the DOI list through the NLBL_CALIPSO_C_LISTALL command.
It takes no attributes.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Query a specified DOI through the NLBL_CALIPSO_C_LIST command.
It requires the attribute:
NLBL_CALIPSO_A_DOI.
The reply will contain:
NLBL_CALIPSO_A_MTYPE
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
CALIPSO is a packet labelling protocol for IPv6 which is very similar
to CIPSO. It is specified in RFC 5570. Much of the code is based on
the current CIPSO code.
This adds support for adding passthrough-type CALIPSO DOIs through the
NLBL_CALIPSO_C_ADD command. It requires attributes:
NLBL_CALIPSO_A_TYPE which must be CALIPSO_MAP_PASS.
NLBL_CALIPSO_A_DOI.
In passthrough mode the CALIPSO engine will map MLS secattr levels
and categories directly to the packet label.
At this stage, the major difference between this and the CIPSO
code is that IPv6 may be compiled as a module. To allow for
this the CALIPSO functions are registered at module init time.
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
When qdisc bulk dequeue was added in linux-3.18 (commit
5772e9a346 "qdisc: bulk dequeue support for qdiscs
with TCQ_F_ONETXQUEUE"), it was constrained to some
specific qdiscs.
With some extra care, we can extend this to all qdiscs,
so that typical traffic shaping solutions can benefit from
small batches (8 packets in this patch).
For example, HTB is often used on some multi queue device.
And bonding/team are multi queue devices...
Idea is to bulk-dequeue packets mapping to the same transmit queue.
This brings between 35 and 80 % performance increase in HTB setup
under pressure on a bonding setup :
1) NUMA node contention : 610,000 pps -> 1,110,000 pps
2) No node contention : 1,380,000 pps -> 1,930,000 pps
Now we should work to add batches on the enqueue() side ;)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Florian Westphal <fw@strlen.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now we defer skb drops, it makes sense to keep a copy
of skb->truesize in struct codel_skb_cb to avoid one
cache line miss per dropped skb in fq_codel_drop(),
to reduce latencies a bit further.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Qdisc performance suffers when packets are dropped at enqueue()
time because drops (kfree_skb()) are done while qdisc lock is held,
delaying a dequeue() draining the queue.
Nominal throughput can be reduced by 50 % when this happens,
at a time we would like the dequeue() to proceed as fast as possible.
Even FQ is vulnerable to this problem, while one of FQ goals was
to provide some flow isolation.
This patch adds a 'struct sk_buff **to_free' parameter to all
qdisc->enqueue(), and in qdisc_drop() helper.
I measured a performance increase of up to 12 %, but this patch
is a prereq so that future batches in enqueue() can fly.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This flag was introduced to restore rulesets from the new netdev
family, but since 5ebe0b0eec ("netfilter: nf_tables: destroy
basechain and rules on netdevice removal") the ruleset is released
once the netdev is gone.
This also removes nft_register_basechain() and
nft_unregister_basechain() since they have no clients anymore after
this rework.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
No need to restrict this to module parameter.
We export a copy of the real hash size -- when user alters the value we
allocate the new table, copy entries etc before we update the real size
to the requested one.
This is also needed because the real size is used by concurrent readers
and cannot be changed without synchronizing the conntrack generation
seqcnt.
We only allow changing this value from the initial net namespace.
Tested using http-client-benchmark vs. httpterm with concurrent
while true;do
echo $RANDOM > /proc/sys/net/netfilter/nf_conntrack_buckets
done
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch addresses two problems:
1) The netlink dump is inconsistent when interfering with an ongoing
transaction update for several reasons:
1.a) We don't honor the internal NFT_TABLE_INACTIVE flag, and we should
be skipping these inactive objects in the dump.
1.b) We perform speculative deletion during the preparation phase, that
may result in skipping active objects.
1.c) The listing order changes, which generates noise when tracking
incremental ruleset update via tools like git or our own
testsuite.
2) We don't allow to add and to update the object in the same batch,
eg. add table x; add table x { flags dormant\; }.
In order to resolve these problems:
1) If the user requests a deletion, the object becomes inactive in the
next generation. Then, ignore objects that scheduled to be deleted
from the lookup path, as they will be effectively removed in the
next generation.
2) From the get/dump path, if the object is not currently active, we
skip it.
3) Support 'add X -> update X' sequence from a transaction.
After this update, we obtain a consistent list as long as we stay
in the same generation. The userspace side can detect interferences
through the generation counter so it can restart the dumping.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Thus, we can reuse these to check the genmask of any object type, not
only rules. This is required now that tables, chain and sets will get a
generation mask field too in follow up patches.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
li->u.ulog.copy_len is currently ignored by the kernel, we should truncate
the packet to either li->u.ulog.copy_len (if set) or copy_range before
sending it to userspace. 0 is a valid input for copy_len, so add a new
flag to indicate whether this was option was specified by the user or not.
Add two flags to indicate whether nflog-size/copy_len was set or not.
XT_NFLOG_F_COPY_LEN is for XT_NFLOG and NFLOG_F_COPY_LEN for nfnetlink_log
On the userspace side, this was initially represented by the option
nflog-range, this will be replaced by --nflog-size now. --nflog-range would
still exist but does not do anything.
Reported-by: Joe Dollard <jdollard@akamai.com>
Reviewed-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Alexey reported that we have GFP_KERNEL allocation when
holding the spinlock tcf_lock. Actually we don't have
to take that spinlock for all the cases, especially
for the new one we just create. To modify the existing
actions, we still need this spinlock to make sure
the whole update is atomic.
For net-next, we can get rid of this spinlock because
we already hold the RTNL lock on slow path, and on fast
path we can use RCU to protect the metalist.
Joint work with Jamal.
Reported-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Curently we store zone information as a conntrack extension.
This has one drawback: for every lookup we need to fetch the zone data
from the extension area.
This change place the zone data directly into the main conntrack object
structure and then removes the zone conntrack extension.
The zone data is just 4 bytes, it fits into a padding hole before
the tuplehash info, so we do not even increase the nf_conn structure size.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Those comparisions are useless in case of ZONES=n; all conntracks
will reside in the same zone by definition.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
ipgre_err() can call ip6_err_gen_icmpv6_unreach() for proper
support of ipv4+gre+icmp+ipv6+... frames, used for example
by traceroute/mtr.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 version of 3f2fb9a834 ("net: l3mdev: address selection should only
consider devices in L3 domain") and the follow up commit, a17b693cdd876
("net: l3mdev: prefer VRF master for source address selection").
That is, if outbound device is given then the address preference order
is an address from that device, an address from the master device if it
is enslaved, and then an address from a device in the same L3 domain.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 source address selection needs to consider the real egress route.
Similar to IPv4 implement a get_saddr6 method which is called if
source address has not been set. The get_saddr6 method does a full
lookup which means pulling a route from the VRF FIB table and properly
considering linklocal/multicast destination addresses. Lookup failures
(eg., unreachable) then cause the source address selection to fail
which gets propagated back to the caller.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
VRF driver needs access to ip6_route_get_saddr code. Since it does
little beyond ipv6_dev_get_saddr and ipv6_dev_get_saddr is already
exported for modules move ip6_route_get_saddr to the header as an
inline.
Code move only; no functional change.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The fact is VXLAN with Generic Protocol Extensions cannot be supported by
the same hardware parsers that support VXLAN. The protocol extensions
allow for things like a Next Protocol field which in turn allows for things
other than Ethernet to be passed over the tunnel. Most existing parsers
will not know how to interpret this.
To resolve this I am giving VXLAN-GPE its own UDP encapsulation offload
type. This way hardware that does support GPE can simply add this type to
the switch statement for VXLAN, and if they don't support it then this will
fix any issues where headers might be interpreted incorrectly.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we have all the drivers using udp_tunnel_get_rx_ports,
ndo_add_udp_enc_rx_port, and ndo_del_udp_enc_rx_port we can drop the
function calls that were specific to VXLAN and GENEVE.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch merges the notifiers for VXLAN and GENEVE into a single UDP
tunnel notifier. The idea is that we will want to only have to make one
notifier call to receive the list of ports for VXLAN and GENEVE tunnels
that need to be offloaded.
In addition we add a new set of ndo functions named ndo_udp_tunnel_add and
ndo_udp_tunnel_del that are meant to allow us to track the tunnel meta-data
such as port and address family as tunnels are added and removed. The
tunnel meta-data is now transported in a structure named udp_tunnel_info
which for now carries the type, address family, and port number. In the
future this could be updated so that we can include a tuple of values
including things such as the destination IP address and other fields.
I also ended up going with a naming scheme that consisted of using the
prefix udp_tunnel on function names. I applied this to the notifier and
ndo ops as well so that it hopefully points to the fact that these are
primarily used in the udp_tunnel functions.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch merges the GENEVE and VXLAN code so that both functions pass
through a shared code path. This way we can start the effort of using a
single function on the network device drivers to handle both of these
tunnel types.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes it so that we add udp_tunnel.h to vxlan.h and geneve.h
header files. This is useful as I plan to move the generic handlers for
the port offloads into the udp_tunnel header file and leave the vxlan and
geneve headers to be a bit more protocol specific.
I also went through and cleaned out a number of redundant includes that
where in the .h and .c files for these drivers.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree,
they are rather small patches but fixing several outstanding bugs in
nf_conntrack and nf_tables, as well as minor problems with missing
SYNPROXY header uapi installation:
1) Oneliner not to leak conntrack kmemcache on module removal, this
problem was introduced in the previous merge window, patch from
Florian Westphal.
2) Two fixes for insufficient ruleset loop validation, one due to
incorrect flag check in nf_tables_bind_set() and another related to
silly wrong generation mask logic from the walk path, from Liping
Zhang.
3) Fix double-free of anonymous sets on error, this fix simplifies the
code to let the abort path take care of releasing the set object,
also from Liping Zhang.
4) The introduction of helper function for transactions broke the skip
inactive rules logic from the nft_do_chain(), again from Liping
Zhang.
5) Two patches to install uapi xt_SYNPROXY.h header and calm down
kbuild robot due to missing #include <linux/types.h>.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
1) gre_parse_header() can be called from gre_err()
At this point transport header points to ICMP header, not the inner
header.
2) We can not really change transport header as ipgre_err() will later
assume transport header still points to ICMP header (using icmp_hdr())
3) pskb_may_pull() logic in gre_parse_header() really works
if we are interested at zone pointed by skb->data
4) As Jiri explained in commit b7f8fe251e ("gre: do not pull header in
ICMP error processing") we should not pull headers in error handler.
So this fix :
A) changes gre_parse_header() to use skb->data instead of
skb_transport_header()
B) Adds a nhs parameter to gre_parse_header() so that we can skip the
not pulled IP header from error path.
This offset is 0 for normal receive path.
C) remove obsolete IPV6 includes
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the presence of firewalls which improperly block ICMP Unreachable
(including Fragmentation Required) messages, Path MTU Discovery is
prevented from working.
A workaround is to handle IPv4 payloads opaquely, ignoring the DF bit--as
is done for other payloads like AppleTalk--and doing transparent
fragmentation and reassembly.
Redux includes the enforcement of mutual exclusion between this feature
and Path MTU Discovery as suggested by Alexander Duyck.
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Philip Prindeville <philipp@redfish-solutions.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduce different 6lowpan handling for receive and transmit
NS/NA messages for the ipv6 neighbour discovery. The first use-case is
for supporting 802.15.4 short addresses inside the option fields and
handling for RFC6775 6CO option field as userspace option.
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch exports some neighbour discovery functions which can be used
by 6lowpan neighbour discovery ops functionality then.
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces neighbour discovery ops callback structure. The
idea is to separate the handling for 6LoWPAN into the 6lowpan module.
These callback offers 6lowpan different handling, such as 802.15.4 short
address handling or RFC6775 (Neighbor Discovery Optimization for IPv6
over 6LoWPANs).
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds __ndisc_opt_addr_data as low-level function for
ndisc_opt_addr_data which doesn't depend on net_device parameter.
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds __ndisc_opt_addr_space as low-level function for
ndisc_opt_addr_space which doesn't depend on net_device parameter.
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the autoconfiguration if a valid 802.15.4 short address
is available for 802.15.4 6LoWPAN interfaces.
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch will introduce a 6lowpan neighbour private data. Like the
interface private data we handle private data for generic 6lowpan and
for link-layer specific 6lowpan.
The current first use case if to save the short address for a 802.15.4
6lowpan neighbour.
Cc: David S. Miller <davem@davemloft.net>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
qdisc are changed under RTNL protection and often
while blocking BH and root qdisc spinlock.
When lots of skbs need to be dropped, we free
them under these locks causing TX/RX freezes,
and more generally latency spikes.
This commit adds rtnl_kfree_skbs(), used to queue
skbs for deferred freeing.
Actual freeing happens right after RTNL is released,
with appropriate scheduling points.
rtnl_qdisc_drop() can also be used in place
of disc_drop() when RTNL is held.
qdisc_reset_queue() and __qdisc_reset_queue() get
the new behavior, so standard qdiscs like pfifo, pfifo_fast...
have their ->reset() method automatically handled.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 multicast and link-local addresses require special handling by the
VRF driver:
1. Rather than using the VRF device index and full FIB lookups,
packets to/from these addresses should use direct FIB lookups based on
the VRF device table.
2. fail sends/receives on a VRF device to/from a multicast address
(e.g, make ping6 ff02::1%<vrf> fail)
3. move the setting of the flow oif to the first dst lookup and revert
the change in icmpv6_echo_reply made in ca254490c8 ("net: Add VRF
support to IPv6 stack"). Linklocal/mcast addresses require use of the
skb->dev.
With this change connections into and out of a VRF enslaved device work
for multicast and link-local addresses work (icmp, tcp, and udp)
e.g.,
1. packets into VM with VRF config:
ping6 -c3 fe80::e0:f9ff:fe1c:b974%br1
ping6 -c3 ff02::1%br1
ssh -6 fe80::e0:f9ff:fe1c:b974%br1
2. packets going out a VRF enslaved device:
ping6 -c3 fe80::18f8:83ff:fe4b:7a2e%eth1
ping6 -c3 ff02::1%eth1
ssh -6 root@fe80::18f8:83ff:fe4b:7a2e%eth1
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow drivers to pass flow arg to functions where the arg is not const
and allow the driver to make updates as needed (eg., setting oif).
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Liping Zhang says:
"Users may add such a wrong nft rules successfully, which will cause an
endless jump loop:
# nft add rule filter test tcp dport vmap {1: jump test}
This is because before we commit, the element in the current anonymous
set is inactive, so osp->walk will skip this element and miss the
validate check."
To resolve this problem, this patch passes the generation mask to the
walk function through the iter container structure depending on the code
path:
1) If we're dumping the elements, then we have to check if the element
is active in the current generation. Thus, we check for the current
bit in the genmask.
2) If we're checking for loops, then we have to check if the element is
active in the next generation, as we're in the middle of a
transaction. Thus, we check for the next bit in the genmask.
Based on original patch from Liping Zhang.
Reported-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tested-by: Liping Zhang <liping.zhang@spreadtrum.com>
__QDISC_STATE_THROTTLED bit manipulation is rather expensive
for HTB and few others.
I already removed it for sch_fq in commit f2600cf02b
("net: sched: avoid costly atomic operation in fq_dequeue()")
and so far nobody complained.
When one ore more packets are stuck in one or more throttled
HTB class, a htb dequeue() performs two atomic operations
to clear/set __QDISC_STATE_THROTTLED bit, while root qdisc
lock is held.
Removing this pair of atomic operations bring me a 8 % performance
increase on 200 TCP_RR tests, in presence of throttled classes.
This patch has no side effect, since nothing actually uses
disc_is_throttled() anymore.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* the biggest change is Michał's work on integrating FQ/codel
with the mac80211 internal software queues
* cfg80211 connect result gets clarified for the
"no connection at all" case
* advertisement of per-interface type capabilities, in case
they differ (which makes a lot of sense for some capabilities)
* most of the nl80211 & hwsim unprivileged namespace operation
changes
* human-readable VHT capabilities in debugfs
* some other cleanups, like spelling
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJXWWe/AAoJEGt7eEactAAdXVsQAKGkdXGmUU14tfiRCnZryMEH
GyiVDHQivfcadicbK9599LXadvYc6ATZHfYoSnROwB82NfhKB71N8dUJ4qePxLNa
lDe5uuuAgxtHG63hK1R02flPorkBAEcUcdHsFSuOw7JWag4/49sCqWH8X5K8E9aT
vrhaPeppJPydwnmNzNOBvMsLEJFJdZWaEZupZQ0kiJ/jB/howFdzF75GxGQ2jh+F
dPk22/PtV2igCtntqaty7h057AdZ+znQuiUdVB7eYIOle7veeGyMzFFf0xQ99LAZ
+xcA7GA74u6m8O6SUVw+6nhrUJ5XTsKGUtmKCTVOcUGa5z7XDD7NIxk2SgloCJkC
qPqVU8wyBCxKc+6JsyiVSLcB5MWvWxifvo0OyLsbCqhN50bTtJT96ymKOAx4NeSG
s+TYlV2HgVRNN6PzGvl/ZUo8Mm1UFGWlCBPcyMy9Fwc0jxdhnOAZzOtqV1yVwlCz
M43RKwxBX6MHAtlfwy6g3M5ievAviwY3Kt+yGeLRIVsJJfOUdQxYAa+m7GTeohD/
Tns4IHbtWJiLTKuaTvrs4ec3ycXrv7iibMINcarvkTgbbF9Qvarf/e2RoWO/freY
OUixDdG7HmsY0XIUhSepeMJpn5xIhQ7dkmckhzDEgZqBM6uPJEMB2+O/CGNDsxEp
VjfnwXJ+9PQrjheBvnD2
=FvhQ
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2016-06-09' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
For the next cycle, we have the following:
* the biggest change is Michał's work on integrating FQ/codel
with the mac80211 internal software queues
* cfg80211 connect result gets clarified for the
"no connection at all" case
* advertisement of per-interface type capabilities, in case
they differ (which makes a lot of sense for some capabilities)
* most of the nl80211 & hwsim unprivileged namespace operation
changes
* human-readable VHT capabilities in debugfs
* some other cleanups, like spelling
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add in_flight (bytes in flight when packet was sent) field
to tx component of tcp_skb_cb and make it available to
congestion modules' pkts_acked() function through the
ack_sample function argument.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/sched/act_police.c
net/sched/sch_drr.c
net/sched/sch_hfsc.c
net/sched/sch_prio.c
net/sched/sch_red.c
net/sched/sch_tbf.c
In net-next the drop methods of the packet schedulers got removed, so
the bug fixes to them in 'net' are irrelevant.
A packet action unload crash fix conflicts with the addition of the
new firstuse timestamp.
Signed-off-by: David S. Miller <davem@davemloft.net>
Socket option PACKET_FANOUT_DATA takes a struct sock_fprog as argument
if PACKET_FANOUT has mode PACKET_FANOUT_CBPF. This structure contains
a pointer into user memory. If userland is 32-bit and kernel is 64-bit
the two disagree about the layout of struct sock_fprog.
Add compat setsockopt support to convert a 32-bit compat_sock_fprog to
a 64-bit sock_fprog. This is analogous to compat_sock_fprog support for
SO_REUSEPORT added in commit 1957598840 ("soreuseport: add compat
case for setsockopt SO_ATTACH_REUSEPORT_CBPF").
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
1) qdisc_run_begin() is really using the equivalent of a trylock.
Instead of using write_seqcount_begin(), use a combination of
raw_write_seqcount_begin() and correct lockdep annotation.
2) sch_direct_xmit() should use regular spin_lock(root_lock)
Fixes: f9eb8aea2a ("net_sched: transform qdisc running bit into a seqcount")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no other limit other than a global
packet count limit when using software queuing.
This means a single flow queue can grow insanely
long. This is particularly bad for TCP congestion
algorithms which requires a little more
sophisticated frame dropping scheme than a mere
headdrop on limit overflow.
Hence apply (a slighly modified, to fit the knobs)
CoDel5 on flow queues. This improves TCP
convergence and stability when combined with
wireless driver which keeps its own tx queue/fifo
at a minimum fill level for given link conditions.
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Qdiscs are designed with no regard to 802.11
aggregation requirements and hand out
packet-by-packet with no guarantee they are
destined to the same tid. This does more bad than
good no matter how fairly a given qdisc may behave
on an ethernet interface.
Software queuing used per-AC netdev subqueue
congestion control whenever a global AC limit was
hit. This meant in practice a single station or
tid queue could starve others rather easily. This
could resonate with qdiscs in a bad way or could
just end up with poor aggregation performance.
Increasing the AC limit would increase induced
latency which is also bad.
Disabling qdiscs by default and performing
taildrop instead of netdev subqueue congestion
control on the other hand makes it possible for
tid queues to fill up "in the meantime" while
preventing stations starving each other.
This increases aggregation opportunities and
should allow software queuing based drivers
achieve better performance by utilizing airtime
more efficiently with big aggregates.
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Earlier commits removed two members from struct Qdisc which places
next_sched/gso_skb into a different cacheline than ->state.
This restores the struct layout to what it was before the removal.
Move the two members, then add an annotation so they all reside in the
same cacheline.
This adds a 16 byte hole after cpu_qstats.
The hole could be closed but as it doesn't decrease total struct size just
do it this way.
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
after removal of TCA_CBQ_OVL_STRATEGY from cbq scheduler, there are no
more callers of ->drop() outside of other ->drop functions, i.e.
nothing calls them.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the removal of TCA_CBQ_POLICE in cbq scheduler qdisc->reshape_fail
is always NULL, i.e. qdisc_rehape_fail is now the same as qdisc_drop.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
iproute2 doesn't implement any cbq option that results in this attribute
being sent to kernel.
To make use of it, user would have to
- patch iproute2
- add a class
- attach a qdisc to the class (default pfifo doesn't work as
q->handle is 0 and cbq_set_police() is a no-op in this case)
- re-'add' the same class (tc class change ...) again
- user must also specifiy a defmap (e.g. 'split 1:0 defmap 3f'), since
this 'police' feature relies on its presence
- the added qdisc must be one of bfifo, pfifo or netem
If all of these conditions are met and _some_ leaf qdiscs, namely
p/bfifo, netem, plug or tbf would drop a packet, kernel calls back into
cbq, which will attempt to re-queue the skb into a different class
as indicated by the parents' defmap entry for TC_PRIO_BESTEFFORT.
[ i.e. we behave as if tc_classify returned TC_ACT_RECLASSIFY ].
This feature, which isn't documented or implemented in iproute2,
and isn't implemented consistently (most qdiscs like sfq, codel, etc
drop right away instead of attempting this reclassification) is the
sole reason for the reshape_fail and __parent member in Qdisc struct.
So remove TCA_CBQ_POLICE support from the kernel, reject it via EOPNOTSUPP
so userspace knows we don't support it, and then remove no-longer needed
infrastructure in followup commit.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, VRFs require 1 oif and 1 iif rule per address family per
VRF. As the number of VRF devices increases it brings scalability
issues with the increasing rule list. All of the VRF rules have the
same format with the exception of the specific table id to direct the
lookup. Since the table id is available from the oif or iif in the
loopup, the VRF rules can be consolidated to a single rule that pulls
the table from the VRF device.
This patch introduces a new rule attribute l3mdev. The l3mdev rule
means the table id used for the lookup is pulled from the L3 master
device (e.g., VRF) rather than being statically defined. With the
l3mdev rule all of the basic VRF FIB rules are reduced to 1 l3mdev
rule per address family (IPv4 and IPv6).
If an admin wishes to insert higher priority rules for specific VRFs
those rules will co-exist with the l3mdev rule. This capability means
current VRF scripts will co-exist with this new simpler implementation.
Currently, the rules list for both ipv4 and ipv6 look like this:
$ ip ru ls
1000: from all oif vrf1 lookup 1001
1000: from all iif vrf1 lookup 1001
1000: from all oif vrf2 lookup 1002
1000: from all iif vrf2 lookup 1002
1000: from all oif vrf3 lookup 1003
1000: from all iif vrf3 lookup 1003
1000: from all oif vrf4 lookup 1004
1000: from all iif vrf4 lookup 1004
1000: from all oif vrf5 lookup 1005
1000: from all iif vrf5 lookup 1005
1000: from all oif vrf6 lookup 1006
1000: from all iif vrf6 lookup 1006
1000: from all oif vrf7 lookup 1007
1000: from all iif vrf7 lookup 1007
1000: from all oif vrf8 lookup 1008
1000: from all iif vrf8 lookup 1008
...
32765: from all lookup local
32766: from all lookup main
32767: from all lookup default
With the l3mdev rule the list is just the following regardless of the
number of VRFs:
$ ip ru ls
1000: from all lookup [l3mdev table]
32765: from all lookup local
32766: from all lookup main
32767: from all lookup default
(Note: the above pretty print of the rule is based on an iproute2
prototype. Actual verbage may change)
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we can properly support multiple distinct trees in the system,
using a global variable: dsa_cpu_port_ethtool_ops is getting clobbered
as soon as the second switch tree gets probed, and we don't want that.
We need to move this to be dynamically allocated, and since we can't
really be comparing addresses anymore to determine first time
initialization versus any other times, just move this to dsa.c and
dsa2.c where the remainder of the dst/ds initialization happens.
The operations teardown restores the master netdev's ethtool_ops to its
original ethtool_ops pointer (typically within the Ethernet driver)
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains two Netfilter/IPVS fixes for your net
tree, they are:
1) Fix missing alignment in next offset calculation for standard
targets, introduced in the previous merge window, patch from
Florian Westphal.
2) Fix to correct the handling of outgoing connections which use the
SIP-pe such that the binding of a real-server is updated when needed.
This was an omission from changes introduced by Marco Angaroni in
the previous merge window too, to allow handling of outgoing
connections by the SIP-pe. Patch and report came via Simon Horman.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When offloading classifiers such as u32 or flower to hardware, and the
qdisc is clsact (TC_H_CLSACT), then we need to differentiate its classes,
since not all of them handle ingress, therefore we must leave those in
software path. Add a .tcf_cl_offload() callback, so we can generically
handle them, tested on ixgbe.
Fixes: 10cbc68434 ("net/sched: cls_flower: Hardware offloaded filters statistics support")
Fixes: 5b33f48842 ("net/flower: Introduce hardware offload support")
Fixes: a1b7c5fd7f ("net: sched: add cls_u32 offload hooks for netdevs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Large tc dumps (tc -s {qdisc|class} sh dev ethX) done by Google BwE host
agent [1] are problematic at scale :
For each qdisc/class found in the dump, we currently lock the root qdisc
spinlock in order to get stats. Sampling stats every 5 seconds from
thousands of HTB classes is a challenge when the root qdisc spinlock is
under high pressure. Not only the dumps take time, they also slow
down the fast path (queue/dequeue packets) by 10 % to 20 % in some cases.
An audit of existing qdiscs showed that sch_fq_codel is the only qdisc
that might need the qdisc lock in fq_codel_dump_stats() and
fq_codel_dump_class_stats()
In v2 of this patch, I now use the Qdisc running seqcount to provide
consistent reads of packets/bytes counters, regardless of 32/64 bit arches.
I also changed rate estimators to use the same infrastructure
so that they no longer need to lock root qdisc lock.
[1]
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43838.pdf
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kevin Athey <kda@google.com>
Cc: Xiaotian Pei <xiaotian@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of using a single bit (__QDISC___STATE_RUNNING)
in sch->__state, use a seqcount.
This adds lockdep support, but more importantly it will allow us
to sample qdisc/class statistics without having to grab qdisc root lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Useful to know when the action was first used for accounting
(and debugging)
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For gso_skb we only update qlen, backlog should be updated too.
Note, it is correct to just update these stats at one layer,
because the gso_skb is cached there.
Reported-by: Stas Nichiporovich <stasn77@gmail.com>
Fixes: 2ccccf5fb4 ("net_sched: update hierarchical backlog too")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previous patch that introduced handling of outgoing packets in SIP
persistent-engine did not call ip_vs_check_template() in case packet was
matching a connection template. Assumption was that real-server was
healthy, since it was sending a packet just in that moment.
There are however real-server fault conditions requiring that association
between call-id and real-server (represented by connection template)
gets updated. Here is an example of the sequence of events:
1) RS1 is a back2back user agent that handled call-id1 and call-id2
2) RS1 is down and was marked as unavailable
3) new message from outside comes to IPVS with call-id1
4) IPVS reschedules the message to RS2, which becomes new call handler
5) RS2 forwards the message outside, translating call-id1 to call-id2
6) inside pe->conn_out() IPVS matches call-id2 with existing template
7) IPVS does not change association call-id2 <-> RS1
8) new message comes from client with call-id2
9) IPVS reschedules the message to a real-server potentially different
from RS2, which is now the correct destination
This patch introduces ip_vs_check_template() call in the handling of
outgoing packets for SIP-pe. And also introduces a second optional
argument for ip_vs_check_template() that allows to check if dest
associated to a connection template is the same dest that was identified
as the source of the packet. This is to change the real-server bound to a
particular call-id independently from its availability status: the idea
is that it's more reliable, for in->out direction (where internal
network can be considered trusted), to always associate a call-id with
the last real-server that used it in one of its messages. Think about
above sequence of events where, just after step 5, RS1 returns instead
to be available.
Comparison of dests is done by simply comparing pointers to struct
ip_vs_dest; there should be no cases where struct ip_vs_dest keeps its
memory address, but represent a different real-server in terms of
ip-address / port.
Fixes: 39b9722315 ("ipvs: handle connections started by real-servers")
Signed-off-by: Marco Angaroni <marcoangaroni@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
The existing DSA binding has a number of limitations and problems. The
main problem is that it cannot represent a switch as a linux device,
hanging off some bus. It is limited to one CPU port. The DSA platform
device is artificial, and does not really represent hardware.
Implement a new binding which can be embedded into any type of node on
a bus to represent one switch device, and its links to other switches.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace the two switch statements with an array lookup, and store the
result in the dsa tree structure. The drivers no longer need to know
the selected tag protocol, so remove it from the dsa switch structure.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The new binding will not have a chip data structure, it will place the
routing directly into the switch structure. To enable backwards
compatibility, copy the routing from the chip data into the switch
structure.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With a maximum of four switches, the size of the routing table is the
same as the pointer to it. Removing it makes the code simpler.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the port device node structure into the port structure, from the
chip data. This information is needed in the next step of implementing
the new binding.
The chip data structure is used while parsing the whole old binding,
before the individual switch structures exist. With the new bindings,
this is reversed, the switches exist first, and the interconnections
between the switches is derived from the individual switch
bindings. Thus this chip data structure becomes unneeded.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
eviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are going to be more per-port members added to the switch
structure. So add a port structure and move the netdev into it.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SCTP has this pecualiarity that its packets cannot be just segmented to
(P)MTU. Its chunks must be contained in IP segments, padding respected.
So we can't just generate a big skb, set gso_size to the fragmentation
point and deliver it to IP layer.
This patch takes a different approach. SCTP will now build a skb as it
would be if it was received using GRO. That is, there will be a cover
skb with protocol headers and children ones containing the actual
segments, already segmented to a way that respects SCTP RFCs.
With that, we can tell skb_segment() to just split based on frag_list,
trusting its sizes are already in accordance.
This way SCTP can benefit from GSO and instead of passing several
packets through the stack, it can pass a single large packet.
v2:
- Added support for receiving GSO frames, as requested by Dave Miller.
- Clear skb->cb if packet is GSO (otherwise it's not used by SCTP)
- Added heuristics similar to what we have in TCP for not generating
single GSO packets that fills cwnd.
v3:
- consider sctphdr size in skb_gso_transport_seglen()
- rebased due to 5c7cdf339a ("gso: Remove arbitrary checks for
unsupported GSO")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree,
they are:
1) Fix incorrect timestamp in nfnetlink_queue introduced when addressing
y2038 safe timestamp, from Florian Westphal.
2) Get rid of leftover conntrack definition from the previous merge
window, oneliner from Florian.
3) Make nf_queue handler pernet to resolve race on dereferencing the
hook state structure with netns removal, from Eric Biederman.
4) Ensure clean exit on unregistered helper ports, from Taehee Yoo.
5) Restore FLOWI_FLAG_KNOWN_NH in nf_dup_ipv6. This got lost while
generalizing xt_TEE to add packet duplication support in nf_tables,
from Paolo Abeni.
6) Insufficient netlink NFTA_SET_TABLE attribute check in
nf_tables_getset(), from Phil Turnbull.
7) Reject helper registration on duplicated ports via modparams.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit e6afc8ace6 ("udp: remove headers from UDP packets before
queueing"), udp_csum_pull_header() helper was added but missed fact
that CHECKSUM_UNNECESSARY packets were now converted to CHECKSUM_NONE
and skb->csum_valid was set to 1 for them.
Since csum_partial() is quite expensive, even for 8-byte area, it is
worth adding a test.
We also can use skb->data instead of udp_hdr() as we are pulling
UDP headers, as it is sightly faster.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The driver extended capabilities may differ for different
interface types which the userspace needs to know (for
example the fine timing measurement initiator and responder
bits might differ for a station and AP). Add a new nl80211
attribute to provide extended capabilities per interface type
to userspace.
Signed-off-by: Vidyullatha Kanchanapally <vkanchan@qti.qualcomm.com>
Reviewed-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Previously, the status parameter to cfg80211_connect_result() was
documented as using WLAN_STATUS_UNSPECIFIED_FAILURE (1) when the real
status code for the failure is not known. This value can be used by an
AP (and often is) and as such, user space cannot distinguish between
explicitly rejected authentication/association and not being able to
even try to associate or not receiving a response from the AP.
Add a new inline function, cfg80211_connect_timeout(), to be used when
the driver knows that the connection attempt failed due to a reason
where connection could not be attempt or no response was received from
the AP. The internal functions now allow a negative status value (-1) to
be used as an indication of this special case. This results in the
NL80211_ATTR_TIMED_OUT to be added to the NL80211_CMD_CONNECT event to
allow user space to determine this case was hit. For backwards
compatibility, NL80211_STATUS_CODE with the value
WLAN_STATUS_UNSPECIFIED_FAILURE is still indicated in the event in such
a case.
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
[johannes: fix cfg80211_connect_bss() prototype to use int for status,
add cfg80211_connect_timeout() to docbook, fix docbook]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
A recent cleanup moved MAX_IPTUN_ENCAP_OPS along with some other
definitions, but it is now invisible when CONFIG_INET is
not defined, but still referenced from ip6_tunnel.h:
In file included from net/xfrm/xfrm_input.c:17:0:
include/net/ip6_tunnel.h:67:17: error: 'MAX_IPTUN_ENCAP_OPS' undeclared here (not in a function)
ip6tun_encaps[MAX_IPTUN_ENCAP_OPS];
^~~~~~~~~~~~~~~~~~~
This hides the ip6_encap_hlen and ip6_tnl_encap functions inside
of CONFIG_INET so we don't run into the the problem.
Alternatively we could move the macro out of the #ifdef again to
restore the previous behavior
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 55c2bc1432 ("net: Cleanup encap items in ip_tunnels.h")
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Weber reported:
> Under full load (unshare() in loop -> OOM conditions) we can
> get kernel panic:
>
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
> IP: [<ffffffff81476c85>] nfqnl_nf_hook_drop+0x35/0x70
> [..]
> task: ffff88012dfa3840 ti: ffff88012dffc000 task.ti: ffff88012dffc000
> RIP: 0010:[<ffffffff81476c85>] [<ffffffff81476c85>] nfqnl_nf_hook_drop+0x35/0x70
> RSP: 0000:ffff88012dfffd80 EFLAGS: 00010206
> RAX: 0000000000000008 RBX: ffffffff81add0c0 RCX: ffff88013fd80000
> [..]
> Call Trace:
> [<ffffffff81474d98>] nf_queue_nf_hook_drop+0x18/0x20
> [<ffffffff814738eb>] nf_unregister_net_hook+0xdb/0x150
> [<ffffffff8147398f>] netfilter_net_exit+0x2f/0x60
> [<ffffffff8141b088>] ops_exit_list.isra.4+0x38/0x60
> [<ffffffff8141b652>] setup_net+0xc2/0x120
> [<ffffffff8141bd09>] copy_net_ns+0x79/0x120
> [<ffffffff8106965b>] create_new_namespaces+0x11b/0x1e0
> [<ffffffff810698a7>] unshare_nsproxy_namespaces+0x57/0xa0
> [<ffffffff8104baa2>] SyS_unshare+0x1b2/0x340
> [<ffffffff81608276>] entry_SYSCALL_64_fastpath+0x1e/0xa8
> Code: 65 00 48 89 e5 41 56 41 55 41 54 53 83 e8 01 48 8b 97 70 12 00 00 48 98 49 89 f4 4c 8b 74 c2 18 4d 8d 6e 08 49 81 c6 88 00 00 00 <49> 8b 5d 00 48 85 db 74 1a 48 89 df 4c 89 e2 48 c7 c6 90 68 47
>
The simple fix for this requires a new pernet variable for struct
nf_queue that indicates when it is safe to use the dynamically
allocated nf_queue state.
As we need a variable anyway make nf_register_queue_handler and
nf_unregister_queue_handler pernet. This allows the existing logic of
when it is safe to use the state from the nfnetlink_queue module to be
reused with no changes except for making it per net.
The syncrhonize_rcu from nf_unregister_queue_handler is moved to a new
function nfnl_queue_net_exit_batch so that the worst case of having a
syncrhonize_rcu in the pernet exit path is not experienced in batch
mode.
Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
I found a serious performance bug in packet schedulers using hrtimers.
sch_htb and sch_fq are definitely impacted by this problem.
We constantly rearm high resolution timers if some packets are throttled
in one (or more) class, and other packets are flying through qdisc on
another (non throttled) class.
hrtimer_start() does not have the mod_timer() trick of doing nothing if
expires value does not change :
if (timer_pending(timer) &&
timer->expires == expires)
return 1;
This issue is particularly visible when multiple cpus can queue/dequeue
packets on the same qdisc, as hrtimer code has to lock a remote base.
I used following fix :
1) Change htb to use qdisc_watchdog_schedule_ns() instead of open-coding
it.
2) Cache watchdog prior expiration. hrtimer might provide this, but I
prefer to not rely on some hrtimer internal.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
->sk_shutdown bits share one bitfield with some other bits in sock struct,
such as ->sk_no_check_[r,t]x, ->sk_userlocks ...
sock_setsockopt() may write to these bits, while holding the socket lock.
In case of AF_UNIX sockets, we change ->sk_shutdown bits while holding only
unix_state_lock(). So concurrent setsockopt() and shutdown() may lead
to corrupting these bits.
Fix this by moving ->sk_shutdown bits out of bitfield into a separate byte.
This will not change the 'struct sock' size since ->sk_shutdown moved into
previously unused 16-bit hole.
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch add a new fou6 module that provides encapsulation
operations for IPv6.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add encap_hlen and ip_tunnel_encap structure to ip6_tnl. Add functions
for getting encap hlen, setting up encap on a tunnel, performing
encapsulation operation.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create __fou_build_header and __gue_build_header. These implement the
protocol generic parts of building the fou and gue header.
fou_build_header and gue_build_header implement the IPv4 specific
functions and call the __*_build_header functions.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Consolidate all the ip_tunnel_encap definitions in one spot in the
header file. Also, move ip_encap_hlen and ip_tunnel_encap from
ip_tunnel.c to ip_tunnels.h so they call be called without a dependency
on ip_tunnel module. Similarly, move iptun_encaps to ip_tunnel_core.c.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When CONFIG_NET_CLS_ACT is disabled, we get a new warning in the mlx5
ethernet driver because the tc_for_each_action() loop never references
the iterator:
mellanox/mlx5/core/en_tc.c: In function 'mlx5e_stats_flower':
mellanox/mlx5/core/en_tc.c:431:20: error: unused variable 'a' [-Werror=unused-variable]
struct tc_action *a;
This changes the dummy tc_for_each_action() macro by adding a
cast to void, letting the compiler know that the variable is
intentionally declared but not used here. I could not come up
with a nicer workaround, but this seems to do the trick.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: aad7e08d39 ("net/mlx5e: Hardware offloaded flower filter statistics support")
Fixes: 00175aec94 ("net/sched: Macro instead of CONFIG_NET_CLS_ACT ifdef")
Acked-By: Amir Vadai <amir@vadai.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
The problem is that fib_info->nh is [0] so the struct fib_info
allocation size depends on number of nexthops. If we just copy fib_info,
we do not copy the nexthops info and driver accesses memory which is not
ours.
Given the fact that fib4 does not defer operations and therefore it does
not need copy, just pass the pointer down to drivers as it was done
before.
Fixes: 850d0cbc91 ("switchdev: remove pointers from switchdev objects")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This function is not used anymore. nla_put_u64_64bit() should be used
instead.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce a new command in ndo_setup_tc() for hardware offloaded
filters, to call the NIC driver, and make it update the statistics.
This will be done before dumping the filter and its statistics.
Signed-off-by: Amir Vadai <amirva@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce stats_update callback. netdev driver could call it for offloaded
actions to update the basic statistics (packets, bytes and last use).
Since bstats_update() and bstats_cpu_update() use skb as an argument to
get the counters, _bstats_update() and _bstats_cpu_update(), that get
bytes and packets as arguments, were added.
Signed-off-by: Amir Vadai <amirva@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On devices that support TC U32 offloads, this flag enables a filter to be
added only to HW. skip-sw and skip-hw are mutually exclusive flags. By
default without any flags, the filter is added to both HW and SW, but no
error checks are done in case of failure to add to HW. With skip-sw,
failure to add to HW is treated as an error.
Here is a sample script that adds 2 filters, one with skip-sw and the other
with skip-hw flag.
# add ingress qdisc
tc qdisc add dev p4p1 ingress
# enable hw tc offload.
ethtool -K p4p1 hw-tc-offload on
# add u32 filter with skip-sw flag.
tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
handle 800:0:1 u32 ht 800: flowid 800:1 \
skip-sw \
match ip src 192.168.1.0/24 \
action drop
# add u32 filter with skip-hw flag.
tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
handle 800:0:2 u32 ht 800: flowid 800:2 \
skip-hw \
match ip src 192.168.2.0/24 \
action drop
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When dealing with WCCP in gre6 tunnel, it sets the wrong tpi->protocol,
that is, ETH_P_IP instead of ETH_P_IPV6 for the encapuslated traffic.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* completion and fixups of nla_put_64_64bit() work
* remove a/b/g/n from wext nickname to avoid confusion
with 11ac (which wouldn't even fit fully there due to
string length restrictions)
along with some other minor changes/cleanups.
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJXNGMwAAoJEGt7eEactAAdrRMP/RGmFd1I8Z7gIsKc7pdknf20
4xLA3v8cqD7c9OtqlfVSfIyIosNl+pMatXK4eSaYzJICmdV9lU8VqgW6eH9lmDuM
q8Eis2/i7ZozzYhANVFQxMRvmSED8MeRm49LOzRofuPg4FHzyTYd8+TRfQ+QGRm9
xlrKTJ4BwH5lZgmH20/cphbffo5j8IyjzjKw0MC/tjikIePww6Am/r1jpCnd0Mwz
rWwRa4pQ9JBV7UzzzdcMpdn3KJkteFq/gJPCpZVDo3Zf+L21UavxP99NoUWiUvUU
bEQ9FzcuB7Zyt/lCAyu3ECBGqLvskqseFg3zmDnUoL7GmCZ1PZlfsnfOhQP0HgpI
t/dyw+TYIu6d4ZHbqGM6q6usjIJLltaxATwvREq3VdWFn5fYFu0Em3CyQ0on1XGn
4anNdkilGxaVEHCFBvfLvMTEfsTj7eycsNsDrZEGYV3NwX/wfz+6thqqu8gXFUsC
4MB1fS+YPg7u72ne+oJSPyJyFofYXDSjdNqnZ6HSQEZHinE+FnDCywZlJesxzDsH
q1ABkKLnmMNIwHxbfRBE0eO8sKXo6kUTJeFmWHtxB8/LcsLsV0/8juUy9H5PPZPU
7TJhn4SZrlJb92xq1DQ0t1xsRxTLonbRuu9PtVVZMUhZP69PpzK9toC6eCo9bovn
XFdSQ2c2xRwRAiSjUpQP
=QNxF
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2016-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
Some more work for 4.7, notably:
* completion and fixups of nla_put_64_64bit() work
* remove a/b/g/n from wext nickname to avoid confusion
with 11ac (which wouldn't even fit fully there due to
string length restrictions)
along with some other minor changes/cleanups.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When using RSS, frames might not be processed in the correct order,
and thus AP_LINK_PS must be used; most likely with firmware keeping
track of the powersave state, this is the case in iwlwifi now.
In this case, the driver can use ieee80211_sta_ps_transition() to
still have mac80211 manage powersave buffering. However, for U-APSD
and PS-Poll this isn't sufficient. If the device can't manage that
entirely on its own, mac80211's code should be used.
To allow this, export two functions: ieee80211_sta_uapsd_trigger()
and ieee80211_sta_pspoll().
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There's no harm in having drivers read the list, since they can
use RCU protection or RTNL locking; allow this to not require
each and every driver to also implement its own bookkeeping.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This allows finding vendor IE from a specific vendor.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Some hardware (iwlwifi an example) de-aggregate AMSDUs and copy the IV
as is to the generated MPDUs, so the same PN appears in multiple
packets without being a replay attack. Allow driver to explicitly
indicate that a frame is allowed to have the same PN as the previous
frame.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This is the first NFC pull request for 4.7. With this one we
mainly have:
- Support for NXP's pn532 NFC chipset. The pn532 is based on the same
microcontroller as the pn533, but it talks to the host through i2c
instead of USB. By separating the pn533 driver into core and PHY
parts, we can not add the i2c layer and support the pn532 chipset.
- Support for NCI's loopback mode. This is a testing mode where each
packet received by the NFCC is sent back to the DH, allowing the
host to test that the controller can receive and send data.
- A few ACPI related fixes for the STMicro drivers, in order to match
the device tree naming scheme.
- A bunch of cleanups for the st-nci and the st21nfca STMicro drivers.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXMRA8AAoJEIqAPN1PVmxKknQP/2WyaVYD+jeRkM4pymRQ/6s6
V2+be68JlatvC53D6fbzo9FF8/QO7w++vTF+KI5nV6g/RvkPvKnxg2DC39fob0DA
6x7KaRuUrTwxBx4tfsvtNfUL4idRi3i6/hrYRJyzaNMHKAukDbjJQiCchveUhZeB
NlfYSEcieNnRJkGpdVGny4tea0fFOY6v+Uz74h2HboKbSD3SGNK29v47AqxH/ssg
Y9XycyYWWF1w2uOUfFzJFsvBAYcPlBDF4WqQ8SkamxhaPAB047Ysfp3y9vWGc8By
pdXLNAALzlCIt2JeGsrXhLG+Zf3y3w3G7HlcVbkh5BcB6fN2FYpPZurY8Km8Z6SM
ufgP8MwGOvOJa9NKapSm3FF6Tl1B3gCK3+SWT5N/lSNEFnaoZ6xMNGqXe1RhTght
+oqfuHLNO0YulQG0/ixJJl08p4eb6hhUMO3LKan3AYub9kgk63I43XeKXhIlqmLJ
tEh3xAPzbPyu+nwkrJF0B2tMeUdrCKZaucru/upYcZi7LDbQOvMjHgTS8YmX/aBk
gCOjINs+U1Q9jN2B7fMJe1uP6DJOTVW+xqiaLBLwHawRBhQQPbg+pLK/sHgFWcaa
bdUWASp7n5Bu5pY/rPHL+xvA8O5z5ngJ7ZPjqB2CvHr9XyQDna3Fo9jA/w3hgr/q
DgvCHukjvkDlUochojQZ
=zM22
-----END PGP SIGNATURE-----
Merge tag 'nfc-next-4.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next
Samuel Ortiz says:
====================
NFC 4.7 pull request
This is the first NFC pull request for 4.7. With this one we
mainly have:
- Support for NXP's pn532 NFC chipset. The pn532 is based on the same
microcontroller as the pn533, but it talks to the host through i2c
instead of USB. By separating the pn533 driver into core and PHY
parts, we can not add the i2c layer and support the pn532 chipset.
- Support for NCI's loopback mode. This is a testing mode where each
packet received by the NFCC is sent back to the DH, allowing the
host to test that the controller can receive and send data.
- A few ACPI related fixes for the STMicro drivers, in order to match
the device tree naming scheme.
- A bunch of cleanups for the st-nci and the st21nfca STMicro drivers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The dsa_switch structure contains a dsa_chip_data member called pd.
However in the rest of the code, pd is used for dsa_platform_data.
This is confusing. Rename it cd, which is already often used in dsa.c
and slave.c for this data type.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The switch drivers only use the master_dev member for dev_info()
messages. Now that the device is passed to the old style probe, and
new style drivers are probed as true linux drivers, this is no longer
needed.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Resetting the switch is something the driver does, not the framework.
So move the parsing of this property into the driver.
There are no in kernel users of this property, so moving it does not
break anything. There is however a board which will make use of this
property making its way into the kernel.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Applications such as OSPF and BFD need the original ingress device not
the VRF device; the latter can be derived from the former. To that end
add the skb_iif to inet_skb_parm and set it in ipv4 code after clearing
the skb control buffer similar to IPv6. From there the pktinfo can just
pull it from cb with the PKTINFO_SKB_CB cast.
The previous patch moving the skb->dev change to L3 means nothing else
is needed for IPv6; it just works.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the VRF driver uses the rx_handler to switch the skb device
to the VRF device. Switching the dev prior to the ip / ipv6 layer
means the VRF driver has to duplicate IP/IPv6 processing which adds
overhead and makes features such as retaining the ingress device index
more complicated than necessary.
This patch moves the hook to the L3 layer just after the first NF_HOOK
for PRE_ROUTING. This location makes exposing the original ingress device
trivial (next patch) and allows adding other NF_HOOKs to the VRF driver
in the future.
dev_queue_xmit_nit is exported so that the VRF driver can cycle the skb
with the switched device through the packet taps to maintain current
behavior (tcpdump can be used on either the vrf device or the enslaved
devices).
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace 2 arguments (cnt and rtt) in the congestion control modules'
pkts_acked() function with a struct. This will allow adding more
information without having to modify existing congestion control
modules (tcp_nv in particular needs bytes in flight when packet
was sent).
As proposed by Neal Cardwell in his comments to the tcp_nv patch.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is an initial implementation of a netdev driver for GTP datapath
(GTP-U) v0 and v1, according to the GSM TS 09.60 and 3GPP TS 29.060
standards. This tunneling protocol is used to prevent subscribers from
accessing mobile carrier core network infrastructure.
This implementation requires a GGSN userspace daemon that implements the
signaling protocol (GTP-C), such as OpenGGSN [1]. This userspace daemon
updates the PDP context database that represents active subscriber
sessions through a genetlink interface.
For more context on this tunneling protocol, you can check the slides
that were presented during the NetDev 1.1 [2].
Only IPv4 is supported at this time.
[1] http://git.osmocom.org/openggsn/
[2] http://www.netdevconf.org/1.1/proceedings/slides/schultz-welte-osmocom-gtp.pdf
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move l3mdev_rt6_dst_by_oif and l3mdev_get_saddr to l3mdev.c. Collapse
l3mdev_get_rt6_dst into l3mdev_rt6_dst_by_oif since it is the only
user and keep the l3mdev_get_rt6_dst name for consistency with other
hooks.
A follow-on patch adds more code to these functions making them long
for inlined functions.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In netdevice.h we removed the structure in net-next that is being
changes in 'net'. In macsec.c and rtnetlink.c we have overlaps
between fixes in 'net' and the u64 attribute changes in 'net-next'.
The mlx5 conflicts have to do with vxlan support dependencies.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following large patchset contains Netfilter updates for your
net-next tree. My initial intention was to send you this in two goes but
when I looked back twice I already had this burden on top of me.
Several updates for IPVS from Marco Angaroni:
1) Allow SIP connections originating from real-servers to be load
balanced by the SIP persistence engine as is already implemented
in the other direction.
2) Release connections immediately for One-packet-scheduling (OPS)
in IPVS, instead of making it via timer and rcu callback.
3) Skip deleting conntracks for each one packet in OPS, and don't call
nf_conntrack_alter_reply() since no reply is expected.
4) Enable drop on exhaustion for OPS + SIP persistence.
Miscelaneous conntrack updates from Florian Westphal, including fix for
hash resize:
5) Move conntrack generation counter out of conntrack pernet structure
since this is only used by the init_ns to allow hash resizing.
6) Use get_random_once() from packet path to collect hash random seed
instead of our compound.
7) Don't disable BH from ____nf_conntrack_find() for statistics,
use NF_CT_STAT_INC_ATOMIC() instead.
8) Fix lookup race during conntrack hash resizing.
9) Introduce clash resolution on conntrack insertion for connectionless
protocol.
Then, Florian's netns rework to get rid of per-netns conntrack table,
thus we use one single table for them all. There was consensus on this
change during the NFWS 2015 and, on top of that, it has recently been
pointed as a source of multiple problems from unpriviledged netns:
11) Use a single conntrack hashtable for all namespaces. Include netns
in object comparisons and make it part of the hash calculation.
Adapt early_drop() to consider netns.
12) Use single expectation and NAT hashtable for all namespaces.
13) Use a single slab cache for all namespaces for conntrack objects.
14) Skip full table scanning from nf_ct_iterate_cleanup() if the pernet
conntrack counter tells us the table is empty (ie. equals zero).
Fixes for nf_tables interval set element handling, support to set
conntrack connlabels and allow set names up to 32 bytes.
15) Parse element flags from element deletion path and pass it up to the
backend set implementation.
16) Allow adjacent intervals in the rbtree set type for dynamic interval
updates.
17) Add support to set connlabel from nf_tables, from Florian Westphal.
18) Allow set names up to 32 bytes in nf_tables.
Several x_tables fixes and updates:
19) Fix incorrect use of IS_ERR_VALUE() in x_tables, original patch
from Andrzej Hajda.
And finally, miscelaneous netfilter updates such as:
20) Disable automatic helper assignment by default. Note this proc knob
was introduced by a900689264 ("netfilter: nf_ct_helper: allow to
disable automatic helper assignment") 4 years ago to start moving
towards explicit conntrack helper configuration via iptables CT
target.
21) Get rid of obsolete and inconsistent debugging instrumentation
in x_tables.
22) Remove unnecessary check for null after ip6_route_output().
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
An earlier patch changed lookup side to also net_eq() namespaces after
obtaining a reference on the conntrack, so a single kmemcache can be used.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We already include netns address in the hash, so we only need to use
net_eq in find_appropriate_src and can then put all entries into
same table.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Refactor tcp_skb_cb to create two overlaping areas to store
state for incoming or outgoing skbs based on comments by
Neal Cardwell to tcp_nv patch:
AFAICT this patch would not require an increase in the size of
sk_buff cb[] if it were to take advantage of the fact that the
tcp_skb_cb header.h4 and header.h6 fields are only used in the packet
reception code path, and this in_flight field is only used on the
transmit side.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The setting of the UDP tunnel GSO type is already performed by
udp[46]_gro_complete().
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We already include netns address in the hash and compare the netns pointers
during lookup, so even if namespaces have overlapping addresses entries
will be spread across the expectation table.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently, we support set names of up to 16 bytes, get this aligned
with the maximum length we can use in ipset to make it easier when
considering migration to nf_tables.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch introduces nf_ct_resolve_clash() to resolve race condition on
conntrack insertions.
This is particularly a problem for connection-less protocols such as
UDP, with no initial handshake. Two or more packets may race to insert
the entry resulting in packet drops.
Another problematic scenario are packets enqueued to userspace via
NFQUEUE after the raw table, that make it easier to trigger this
race.
To resolve this, the idea is to reset the conntrack entry to the one
that won race. Packet and bytes counters are also merged.
The 'insert_failed' stats still accounts for this situation, after
this patch, the drop counter is bumped whenever we drop packets, so we
can watch for unresolved clashes.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We already include netns address in the hash and compare the netns pointers
during lookup, so even if namespaces have overlapping addresses entries
will be spread across the table.
Assuming 64k bucket size, this change saves 0.5 mbyte per namespace on a
64bit system.
NAT bysrc and expectation hash is still per namespace, those will
changed too soon.
Future patch will also make conntrack object slab cache global again.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Jeff Kirsher says:
====================
10GbE Intel Wired LAN Driver Updates 2016-05-04
This series contains updates to ixgbe, ixgbevf and traffic class helpers.
Sridhar adds helper functions to the tc_mirred header to access tcf_mirred
information and then implements them for ixgbe to enable redirection to
a SRIOV VF or an offloaded MACVLAN device queue via tc 'mirred' action.
Amritha adds support to set filters with multiple header fields (L3,L4)
to match on.
KY Srinivasan from Microsoft add Hyper-V support into ixgbevf.
Emil adds 82599 sub-device IDs that were missing from the list of parts
that support WoL. Then simplified the logic we use to determine WoL
support by reading the EEPROM bits for MACs X540 and newer.
Preethi cleaned up duplicate and unused device IDs. Fixed our ethtool
stat reporting where we were ignoring higher 32 bits of stats registers,
so fill out 64 bit stat values into two 32 bit words.
Babu Moger from Oracle improves VF performance issues on SPARC.
Alex Duyck cleans up some of the Hyper-V implementation from KY so that
we can just use function pointers instead of having to identify if a
given VF is running on a Linux or Windows PF.
Usha makes sure that DCB and FCoE is disabled for X550EM_x/a MACs and
cleans up the DCB initialization in the process.
Tony cleans up the API for ixgbevf_update_xcast_mode() so we do not
have to pass in the netdev parameter, since it was never used in the
function.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_snd_una_update() and tcp_rcv_nxt_update() call
u64_stats_update_begin() either from process context or BH handler.
This triggers a lockdep splat on 32bit & SMP builds.
We could add u64_stats_update_begin_bh() variant but this would
slow down 32bit builds with useless local_disable_bh() and
local_enable_bh() pairs, since we own the socket lock at this point.
I add sock_owned_by_me() helper to have proper lockdep support
even on 64bit builds, and new u64_stats_update_begin_raw()
and u64_stats_update_end_raw methods.
Fixes: c10d9310ed ("tcp: do not assume TCP code is non preemptible")
Reported-by: Fabio Estevam <festevam@gmail.com>
Diagnosed-by: Francois Romieu <romieu@fr.zoreil.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Fabio Estevam <fabio.estevam@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net): ipsec 2016-05-04
1) The flowcache can hit an OOM condition if too
many entries are in the gc_list. Fix this by
counting the entries in the gc_list and refuse
new allocations if the value is too high.
2) The inner headers are invalid after a xfrm transformation,
so reset the skb encapsulation field to ensure nobody tries
access the inner headers. Otherwise tunnel devices stacked
on top of xfrm may build the outer headers based on wrong
informations.
3) Add pmtu handling to vti, we need it to report
pmtu informations for local generated packets.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
For ipgre interfaces in collect metadata mode, receive also traffic with
encapsulated Ethernet headers. The lwtunnel users are supposed to sort this
out correctly. This allows to have mixed Ethernet + L3-only traffic on the
same lwtunnel interface. This is the same way as VXLAN-GPE behaves.
To keep backwards compatibility and prevent any surprises, gretap interfaces
have priority in receiving packets with Ethernet headers.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's easier for gre_parse_header to return the header length instead of
filing it into a parameter. That way, the callers that don't care about the
header length can just check whether the returned value is lower than zero.
In gre_err, the tunnel header must not be pulled. See commit b7f8fe251e
("gre: do not pull header in ICMP error processing") for details.
This patch reduces the conflict between the mentioned commit and commit
95f5c64c3c ("gre: Move utility functions to common headers").
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/ipv4/ip_gre.c
Minor conflicts between tunnel bug fixes in net and
ipv6 tunnel cleanups in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
For test purpose, provide the generic nci loopback function.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
According to NCI specification, destination type and destination
specific parameters shall uniquely identify a single destination
for the Logical Connection.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
In the sendmsg function of UDP, raw, ICMP and l2tp sockets, we use local
variables like hlimits, tclass, opt and dontfrag and pass them to corresponding
functions like ip6_make_skb, ip6_append_data and xxx_push_pending_frames.
This is not a good practice and makes it hard to add new parameters.
This fix introduces a new struct ipcm6_cookie similar to ipcm_cookie in
ipv4 and include the above mentioned variables. And we only pass the
pointer to this structure to corresponding functions. This makes it easier
to add new parameters in the future and makes the function cleaner.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hosts sending lot of ACK packets exhibit high sock_wfree() cost
because of cache line miss to test SOCK_USE_WRITE_QUEUE
We could move this flag close to sk_wmem_alloc but it is better
to perform the atomic_sub_and_test() on a clean cache line,
as it avoid one extra bus transaction.
skb_orphan_partial() can also have a fast track for packets that either
are TCP acks, or already went through another skb_orphan_partial()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to perform an additional check on the inner headers to determine if
we can offload the checksum for them. Previously this check didn't occur
so we would generate an invalid frame in the case of an IPv6 header
encapsulated inside of an IPv4 tunnel. To fix this I added a secondary
check to vxlan_features_check so that we can verify that we can offload the
inner checksum.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add callbacks to calculate the size and fill link extended statistics
which can be split into multiple messages and are dumped via the new
rtnl stats API (RTM_GETSTATS) with the IFLA_STATS_LINK_XSTATS attribute.
Also add that attribute to the idx mask check since it is expected to
be able to save state and resume dumping (e.g. future bridge per-vlan
stats will be dumped via this attribute and callbacks).
Each link type should nest its private attributes under the per-link type
attribute. This allows to have any number of separated private attributes
and to avoid one call to get the dev link type.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A few generic changes to generalize tunnels in IPv6:
- Export ip6_tnl_change_mtu so that it can be called by ip6_gre
- Add tun_hlen to ip6_tnl structure.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create common functions for both IPv4 and IPv6 GRE in transmit. These
are put into gre.h.
Common functions are for:
- GRE checksum calculation. Move gre_checksum to gre.h.
- Building a GRE header. Move GRE build_header and rename
gre_build_header.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch renames ip6_tnl_xmit2 to ip6_tnl_xmit and exports it. Other
users like GRE will be able to call this. The original ip6_tnl_xmit
function is renamed to ip6_tnl_start_xmit (this is an ndo_start_xmit
function).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several of the GRE functions defined in net/ipv4/ip_gre.c are usable
for IPv6 GRE implementation (that is they are protocol agnostic).
These include:
- GRE flag handling functions are move to gre.h
- GRE build_header is moved to gre.h and renamed gre_build_header
- parse_gre_header is moved to gre_demux.c and renamed gre_parse_header
- iptunnel_pull_header is taken out of gre_parse_header. This is now
done by caller. The header length is returned from gre_parse_header
in an int* argument.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some basic changes to make IPv6 tunnel receive path look more like
IPv4 path:
- Make ip6_tnl_rcv non-static so that GREv6 and others can call it
- Make ip6_tnl_rcv look like ip_tunnel_rcv
- Switch to gro_cells_receive
- Make ip6_tnl_rcv non-static and export it
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Large sendmsg()/write() hold socket lock for the duration of the call,
unless sk->sk_sndbuf limit is hit. This is bad because incoming packets
are parked into socket backlog for a long time.
Critical decisions like fast retransmit might be delayed.
Receivers have to maintain a big out of order queue with additional cpu
overhead, and also possible stalls in TX once windows are full.
Bidirectional flows are particularly hurt since the backlog can become
quite big if the copy from user space triggers IO (page faults)
Some applications learnt to use sendmsg() (or sendmmsg()) with small
chunks to avoid this issue.
Kernel should know better, right ?
Add a generic sk_flush_backlog() helper and use it right
before a new skb is allocated. Typically we put 64KB of payload
per skb (unless MSG_EOR is requested) and checking socket backlog
every 64KB gives good results.
As a matter of fact, tests with TSO/GSO disabled give very nice
results, as we manage to keep a small write queue and smaller
perceived rtt.
Note that sk_flush_backlog() maintains socket ownership,
so is not equivalent to a {release_sock(sk); lock_sock(sk);},
to ensure implicit atomicity rules that sendmsg() was
giving to (possibly buggy) applications.
In this simple implementation, I chose to not call tcp_release_cb(),
but we might consider this later.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dave Miller pointed out that fb586f2530 ("sctp: delay calls to
sk_data_ready() as much as possible") may insert latency specially if
the receiving application is running on another CPU and that it would be
better if we signalled as early as possible.
This patch thus basically inverts the logic on fb586f2530 and signals
it as early as possible, similar to what we had before.
Fixes: fb586f2530 ("sctp: delay calls to sk_data_ready() as much as possible")
Reported-by: Dave Miller <davem@davemloft.net>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch overloads the DSA master netdev, aka CPU Ethernet MAC to also
include switch-side statistics, which is useful for debugging purposes,
when the switch is not properly connected to the Ethernet MAC (duplex
mismatch, (RG)MII electrical issues etc.).
We accomplish this by retaining the original copy of the master netdev's
ethtool_ops, and just overload the 3 operations we care about:
get_sset_count, get_strings and get_ethtool_stats so as to intercept
these calls and call into the original master_netdev ethtool_ops, plus
our own.
We take this approach as opposed to providing a set of DSA helper
functions that would retrive the CPU port's statistics, because the
entire purpose of DSA is to allow unmodified Ethernet MAC drivers to be
used as CPU conduit interfaces, therefore, statistics overlay in such
drivers would simply not scale.
The new ethtool -S <iface> output would therefore look like this now:
<iface> statistics
p<2 digits cpu port number>_<switch MIB counter names>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
mac80211 (which will be the first user of the
fq.h) recently started to support software A-MSDU
aggregation. It glues skbuffs together into a
single one so the backlog accounting needs to be
more fine-grained.
To avoid backlog sorting logic duplication split
it up for re-use.
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds an eor bit to the TCP_SKB_CB. When MSG_EOR
is passed to tcp_sendmsg, the eor bit will be set at the skb
containing the last byte of the userland's msg. The eor bit
will prevent data from appending to that skb in the future.
The change in do_tcp_sendpages is to honor the eor set
during the previous tcp_sendmsg(MSG_EOR) call.
This patch handles the tcp_sendmsg case. The followup patches
will handle other skb coalescing and fragment cases.
One potential use case is to use MSG_EOR with
SOF_TIMESTAMPING_TX_ACK to get a more accurate
TCP ack timestamping on application protocol with
multiple outgoing response messages (e.g. HTTP2).
Packetdrill script for testing:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
0.200 write(4, ..., 14600) = 14600
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 > . 1:7301(7300) ack 1
0.200 > P. 7301:14601(7300) ack 1
0.300 < . 1:1(0) ack 14601 win 257
0.300 > P. 14601:15331(730) ack 1
0.300 > P. 15331:16061(730) ack 1
0.400 < . 1:1(0) ack 16061 win 257
0.400 close(4) = 0
0.400 > F. 16061:16061(0) ack 1
0.400 < F. 1:1(0) ack 16062 win 257
0.400 > . 16062:16062(0) ack 2
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I accidentally replaced BH disabling by preemption disabling
in SNMP_ADD_STATS64() and SNMP_UPD_PO_STATS64() on 32bit builds.
For 64bit stats on 32bit arch, we really need to disable BH,
since the "struct u64_stats_sync syncp" might be manipulated
both from process and BH contexts.
Fixes: 6aef70a851 ("net: snmp: kill various STATS_USER() helpers")
Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Tested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SOCKWQ_ASYNC_WAITDATA is set/cleared in sk_wait_data()
and equivalent functions, so that sock_wake_async() can send
a SIGIO only when necessary.
Since these atomic operations are really not needed unless
socket expressed interest in FASYNC, we can omit them in most
cases.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SOCKWQ_ASYNC_NOSPACE is tested in sock_wake_async()
so that a SIGIO signal is sent when needed.
tcp_sendmsg() clears the bit.
tcp_poll() sets the bit when stream is not writeable.
We can avoid two atomic operations by first checking if socket
is actually interested in the FASYNC business (most sockets in
real applications do not use AIO, but select()/poll()/epoll())
This also removes one cache line miss to access sk->sk_wq->flags
in tcp_sendmsg()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is nothing related to BH in SNMP counters anymore,
since linux-3.0.
Rename helpers to use __ prefix instead of _BH prefix,
for contexts where preemption is disabled.
This more closely matches convention used to update
percpu variables.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename IP6_UPD_PO_STATS_BH() to __IP6_UPD_PO_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename IP6_INC_STATS_BH() to __IP6_INC_STATS()
and IP6_ADD_STATS_BH() to __IP6_ADD_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename NET_INC_STATS_BH() to __NET_INC_STATS()
and NET_ADD_STATS_BH() to __NET_ADD_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename IP_UPD_PO_STATS_BH() to __IP_UPD_PO_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename ICMP6_INC_STATS_BH() to __ICMP6_INC_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename IP_INC_STATS_BH() to __IP_INC_STATS(), to
better express this is used in non preemptible context.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename SCTP_INC_STATS_BH() to __SCTP_INC_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename UDP_INC_STATS_BH() to __UDP_INC_STATS(),
and UDP6_INC_STATS_BH() to __UDP6_INC_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename ICMP_INC_STATS_BH() to __ICMP_INC_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the old days (before linux-3.0), SNMP counters were duplicated,
one for user context, and one for BH context.
After commit 8f0ea0fe3a ("snmp: reduce percpu needs by 50%")
we have a single copy, and what really matters is preemption being
enabled or disabled, since we use this_cpu_inc() or __this_cpu_inc()
respectively.
We therefore kill SNMP_INC_STATS_USER(), SNMP_ADD_STATS_USER(),
NET_INC_STATS_USER(), NET_ADD_STATS_USER(), SCTP_INC_STATS_USER(),
SNMP_INC_STATS64_USER(), SNMP_ADD_STATS64_USER(), TCP_ADD_STATS_USER(),
UDP_INC_STATS_USER(), UDP6_INC_STATS_USER(), and XFRM_INC_STATS_USER()
Following patches will rename __BH helpers to make clear their
usage is not tied to BH being disabled.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Minor overlapping changes in the conflicts.
In the macsec case, the change of the default ID macro
name overlapped with the 64-bit netlink attribute alignment
fixes in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2016-04-26
Here's another set of Bluetooth & 802.15.4 patches for the 4.7 kernel:
- Cleanups & refactoring of ieee802154 & 6lowpan code
- Security related additions to ieee802154 and mrf24j40 driver
- Memory corruption fix to Bluetooth 6lowpan code
- Race condition fix in vhci driver
- Enhancements to the atusb 802.15.4 driver
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Since cfg80211 maintains separate BSS table entries for APs if the same
BSSID, SSID pair is seen on multiple channels, it is possible that it
can map the current_bss to a BSS entry on the wrong channel. This
current_bss will not get flushed unless disconnected and cfg80211
reports a wrong channel as the associated channel.
Fix this by introducing a new cfg80211_connect_bss() function which is
similar to cfg80211_connect_result(), but it includes an additional
parameter: the bss the STA is connected to. This allows drivers to
provide the exact bss entry that matches the BSS to which the connection
was completed.
Reviewed-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Vidyullatha Kanchanapally <vkanchan@qti.qualcomm.com>
Signed-off-by: Sunil Dutt <usdutt@qti.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add support for the a station statistics netlink attribute:
NL80211_STA_INFO_RX_DURATION.
If present, this attribute contains the aggregate PPDU duration (in
microseconds) for all the frames from the peer. This is useful to
help understand the total time spent transmitting to us by all of
the connected peers.
Signed-off-by: Mohammed Shafi Shajakhan <mohammed@qti.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This works on the same implementation principle as
codel*.h, i.e. there's a generic header with
structures and macros and a implementation header
carrying function definitions to include in given,
e.g. driver or module.
The fairness logic comes from
net/sched/sch_fq_codel.c but is generalized so it
is more flexible and easier to re-use.
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It was impossible to include codel.h for the
purpose of having access to codel_params or
codel_vars structure definitions and using them
for embedding in other more complex structures.
This splits allows codel.h itself to be treated
like any other header file while codel_qdisc.h and
codel_impl.h contain function definitions with
logic that was previously in codel.h.
This copies over copyrights and doesn't involve
code changes other than adding a few additional
include directives to net/sched/sch*codel.c.
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This strips out qdisc specific bits from the code
and makes it slightly more reusable. Codel will be
used by wireless/mac80211 in the future.
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 751a587ac9 ("route: fix breakage after moving lwtunnel state")
moved lwtstate to the end of dst_entry for 32bit archs. This makes it share
the cacheline with __refcnt which had an unkown effect on performance. For
this reason, the pointer was kept in place for 64bit archs.
However, later performance measurements showed this is of no concern. It
turns out that every performance sensitive path that accesses lwtstate
accesses also struct rtable or struct rt6_info which share the same cache
line.
Thus, to get rid of a few #ifdefs, move the field to the end of the struct
also for 64bit.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d894ba18d4 ("soreuseport: fix ordering for mixed v4/v6 sockets")
was merged as a bug fix to the net tree. Two conflicting changes
were committed to net-next before the above fix was merged back to
net-next:
ca065d0cf8 ("udp: no longer use SLAB_DESTROY_BY_RCU")
3b24d854cb ("tcp/dccp: do not touch listener sk_refcnt under synflood")
These changes switched the datastructure used for TCP and UDP sockets
from hlist_nulls to hlist. This patch applies the necessary parts
of the net tree fix to net-next which were not automatic as part of the
merge.
Fixes: 1602f49b58 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Valdis reported tons of stack dumps caused by WARN_ON() in
sock_owned_by_user()
This test needs to be relaxed if/when lockdep disables itself.
Note that other lockdep_sock_is_held() callers are all from
rcu_dereference_protected() sections which already are disabled
if/when lockdep has been disabled.
Fixes: fafc4e1ea1 ("sock: tigthen lockdep checks for sock_owned_by_user")
Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simon Horman says:
====================
IPVS Updates for v4.7
please consider these enhancements to the IPVS. They allow SIP connections
originating from real-servers to be load balanced by the SIP psersitence
engine as is already implemented in the other direction. And for better one
packet scheduling (OPS) performance.
====================
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
As earlier commit removed accessed to the hash from other files we can
also make it static.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We only allow rehash in init namespace, so we only use
init_ns.generation. And even if we would allow it, it makes no sense
as the conntrack locks are global; any ongoing rehash prevents insert/
delete.
So make this private to nf_conntrack_core instead.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Linux TCP stack painfully segments all TSO/GSO packets before retransmits.
This was fine back in the days when TSO/GSO were emerging, with their
bugs, but we believe the dark age is over.
Keeping big packets in write queues, but also in stack traversal
has a lot of benefits.
- Less memory overhead, because write queues have less skbs
- Less cpu overhead at ACK processing.
- Better SACK processing, as lot of studies mentioned how
awful linux was at this ;)
- Less cpu overhead to send the rtx packets
(IP stack traversal, netfilter traversal, drivers...)
- Better latencies in presence of losses.
- Smaller spikes in fq like packet schedulers, as retransmits
are not constrained by TCP Small Queues.
1 % packet losses are common today, and at 100Gbit speeds, this
translates to ~80,000 losses per second.
Losses are often correlated, and we see many retransmit events
leading to 1-MSS train of packets, at the time hosts are already
under stress.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using switchdev deferred operation (SWITCHDEV_F_DEFER), the operation
is executed in different context and the application doesn't have any way
to get the operation real status.
Adding a completion callback fixes that. This patch adds fields to
switchdev_attr and switchdev_obj "complete_priv" field which is used by
the "complete" callback.
Application can set a complete function which will be called once the
operation executed.
Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next
tree, mostly from Florian Westphal to sort out the lack of sufficient
validation in x_tables and connlabel preparation patches to add
nf_tables support. They are:
1) Ensure we don't go over the ruleset blob boundaries in
mark_source_chains().
2) Validate that target jumps land on an existing xt_entry. This extra
sanitization comes with a performance penalty when loading the ruleset.
3) Introduce xt_check_entry_offsets() and use it from {arp,ip,ip6}tables.
4) Get rid of the smallish check_entry() functions in {arp,ip,ip6}tables.
5) Make sure the minimal possible target size in x_tables.
6) Similar to #3, add xt_compat_check_entry_offsets() for compat code.
7) Check that standard target size is valid.
8) More sanitization to ensure that the target_offset field is correct.
9) Add xt_check_entry_match() to validate that matches are well-formed.
10-12) Three patch to reduce the number of parameters in
translate_compat_table() for {arp,ip,ip6}tables by using a container
structure.
13) No need to return value from xt_compat_match_from_user(), so make
it void.
14) Consolidate translate_table() so it can be used by compat code too.
15) Remove obsolete check for compat code, so we keep consistent with
what was already removed in the native layout code (back in 2007).
16) Get rid of target jump validation from mark_source_chains(),
obsoleted by #2.
17) Introduce xt_copy_counters_from_user() to consolidate counter
copying, and use it from {arp,ip,ip6}tables.
18,22) Get rid of unnecessary explicit inlining in ctnetlink for dump
functions.
19) Move nf_connlabel_match() to xt_connlabel.
20) Skip event notification if connlabel did not change.
21) Update of nf_connlabels_get() to make the upcoming nft connlabel
support easier.
23) Remove spinlock to read protocol state field in conntrack.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
With this function, nla_data() is aligned on a 64-bit area.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nla_data() is now aligned on a 64-bit area.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nla_data() is now aligned on a 64-bit area.
In fact, there is no user of this function.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nla_data() is now aligned on a 64-bit area.
The temporary function nla_put_be64_32bit() is removed in this patch.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nla_data() is now aligned on a 64-bit area.
A temporary version (nla_put_be64_32bit()) is added for nla_put_net64().
This function is removed in the next patch.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nla_data() is now aligned on a 64-bit area.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts were two cases of simple overlapping changes,
nothing serious.
In the UDP case, we need to add a hlist_add_tail_rcu()
to linux/rculist.h, because we've moved UDP socket handling
away from using nulls lists.
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend the set of possible HCI bus types with SPI and I2C.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Equivalent to "vxlan: break dependency with netdev drivers", don't
autoload geneve module in case the driver is loaded. Instead make the
coupling weaker by using netdevice notifiers as proxy.
Cc: Jesse Gross <jesse@kernel.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently all drivers depend and autoload the vxlan module because how
vxlan_get_rx_port is linked into them. Remove this dependency:
By using a new event type in the netdevice notifier call chain we proxy
the request from the drivers to flush and resetup the vxlan ports not
directly via function call but by the already existing netdevice
notifier call chain.
I added a separate new event type, NETDEV_OFFLOAD_PUSH_VXLAN, to do so.
We don't need to save those ids, as the event type field is an unsigned
long and using specialized event types for this purpose seemed to be a
more elegant way. This also comes in beneficial if in future we want to
add offloading knobs for vxlan.
Cc: Jesse Gross <jesse@kernel.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Having the tag protocol in dsa_switch_driver for setup time and in
dsa_switch_tree for runtime is enough. Remove dsa_switch's one.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Netlink messages are appended, one object at a time, to the end of
the SKB. Therefore we need to test skb_tail_pointer() not skb->data
for alignment.
Fixes: 35c5845957 ("net: Add helpers for 64-bit aligning netlink attributes.")
Signed-off-by: David S. Miller <davem@davemloft.net>
HAVE_EFFICIENT_UNALIGNED_ACCESS needs CONFIG_ prefix.
Also add a comment in nla_align_64bit() explaining we have
to add a padding if current skb->data is aligned, as it
certainly can be confusing.
Fixes: 35c5845957 ("net: Add helpers for 64-bit aligning netlink attributes.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using LVS-NAT and SIP persistence-egine over UDP, the following
limitations are present with current implementation:
1) To actually have load-balancing based on Call-ID header, you need to
use one-packet-scheduling mode. But with one-packet-scheduling the
connection is deleted just after packet is forwarded, so SIP responses
coming from real-servers do not match any connection and SNAT is
not applied.
2) If you do not use "-o" option, IPVS behaves as normal UDP load
balancer, so different SIP calls (each one identified by a different
Call-ID) coming from the same ip-address/port go to the same
real-server. So basically you don’t have load-balancing based on
Call-ID as intended.
3) Call-ID is not learned when a new SIP call is started by a real-server
(inside-to-outside direction), but only in the outside-to-inside
direction. This would be a general problem for all SIP servers acting
as Back2BackUserAgent.
This patch aims to solve problems 1) and 3) while keeping OPS mode
mandatory for SIP-UDP, so that 2) is not a problem anymore.
The basic mechanism implemented is to make packets, that do not match any
existent connection but come from real-servers, create new connections
instead of let them pass without any effect.
When such packets pass through ip_vs_out(), if their source ip address and
source port match a configured real-server, a new connection is
automatically created in the same way as it would have happened if the
packet had come from outside-to-inside direction. A new connection template
is created too if the virtual-service is persistent and there is no
matching connection template found. The new connection automatically
created, if the service had "-o" option, is an OPS connection that lasts
only the time to forward the packet, just like it happens on the
ingress side.
The main part of this mechanism is implemented inside a persistent-engine
specific callback (at the moment only SIP persistent engine exists) and
is triggered only for UDP packets, since connection oriented protocols, by
using different set of ports (typically ephemeral ports) to open new
outgoing connections, should not need this feature.
The following requisites are needed for automatic connection creation; if
any is missing the packet simply goes the same way as before.
a) virtual-service is not fwmark based (this is because fwmark services
do not store address and port of the virtual-service, required to
build the connection data).
b) virtual-service and real-servers must not have been configured with
omitted port (this is again to have all data to create the connection).
Signed-off-by: Marco Angaroni <marcoangaroni@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
skb->sk could point to timewait or request socket which has no sk_classid.
Detected as "BUG: KASAN: slab-out-of-bounds in cls_cgroup_classify".
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Suggested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nf_connlabel_set() takes the bit number that we would like to set.
nf_connlabels_get() however took the number of bits that we want to
support.
So e.g. nf_connlabels_get(32) support bits 0 to 31, but not 32.
This changes nf_connlabels_get() to take the highest bit that we want
to set.
Callers then don't have to cope with a potential integer wrap
when using nf_connlabels_get(bit + 1) anymore.
Current callers are fine, this change is only to make folloup
nft ct label set support simpler.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently labels can only be set either by iptables connlabel
match or via ctnetlink.
Before adding nftables set support, clean up the clabel core and move
helpers that nft will not need after all to the xtables module.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Change the dsa_switch_driver.probe function to return a const char *.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch updates the IP tunnel core function iptunnel_handle_offloads so
that we return an int and do not free the skb inside the function. This
actually allows us to clean up several paths in several tunnels so that we
can free the skb at one point in the path without having to have a
secondary path if we are supporting tunnel offloads.
In addition it should resolve some double-free issues I have found in the
tunnels paths as I believe it is possible for us to end up triggering such
an event in the case of fou or gue.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Due to the fact that the udp socket is destructed asynchronously in a
work queue, we have some nondeterministic behavior during shutdown of
vxlan tunnels and creating new ones. Fix this by keeping the destruction
process synchronous in regards to the user space process so IFF_UP can
be reliably set.
udp_tunnel_sock_release destroys vs->sock->sk if reference counter
indicates so. We expect to have the same lifetime of vxlan_sock and
vxlan_sock->sock->sk even in fast paths with only rcu locks held. So
only destruct the whole socket after we can be sure it cannot be found
by searching vxlan_net->sock_list.
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jiri Benc <jbenc@redhat.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
For some main variables in sctp.ko, we couldn't export it to other modules,
so we have to define some api to access them.
It will include sctp transport and endpoint's traversal.
There are some transport traversal functions for sctp_diag, we can also
use it for sctp_proc. cause they have the similar situation to traversal
transport.
v2->v3:
- rhashtable_walk_init need the parameter gfp, because of recent upstrem
update
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sctp_diag will dump some important details of sctp's assoc or ep, we use
sctp_info to describe them, sctp_get_sctp_info to get them, and export
it to sctp_diag.ko.
v2->v3:
- we will not use list_for_each_safe in sctp_get_sctp_info, cause
all the callers of it will use lock_sock.
- fix the holes in struct sctp_info with __reserved* field.
because sctp_diag is a new feature, and sctp_info is just for now,
it may be changed in the future.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SCTP already serializes access to rcvbuf through its sock lock:
sctp_recvmsg takes it right in the start and release at the end, while
rx path will also take the lock before doing any socket processing. On
sctp_rcv() it will check if there is an user using the socket and, if
there is, it will queue incoming packets to the backlog. The backlog
processing will do the same. Even timers will do such check and
re-schedule if an user is using the socket.
Simplifying this will allow us to remove sctp_skb_list_tail and get ride
of some expensive lockings. The lists that it is used on are also
mangled with functions like __skb_queue_tail and __skb_unlink in the
same context, like on sctp_ulpq_tail_event() and sctp_clear_pd().
sctp_close() will also purge those while using only the sock lock.
Therefore the lockings performed by sctp_skb_list_tail() are not
necessary. This patch removes this function and replaces its calls with
just skb_queue_splice_tail_init() instead.
The biggest gain is at sctp_ulpq_tail_event(), because the events always
contain a list, even if it's queueing a single skb and this was
triggering expensive calls to spin_lock_irqsave/_irqrestore for every
data chunk received.
As SCTP will deliver each data chunk on a corresponding recvmsg, the
more effective the change will be.
Before this patch, with chunks with 30 bytes:
netperf -t SCTP_STREAM -H 192.168.1.2 -cC -l 60 -- -m 30 -S 400000
400000 -s 400000 400000
on a 10Gbit link with 1500 MTU:
SCTP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.1 () port 0 AF_INET
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
425984 425984 30 60.00 137.45 7.34 7.36 52.504 52.608
With it:
SCTP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.1 () port 0 AF_INET
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
425984 425984 30 60.00 179.10 7.97 6.70 43.740 36.788
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When removing sk_refcnt manipulation on synflood, I missed that
using skb_set_owner_w() was racy, if sk->sk_wmem_alloc had already
transitioned to 0.
We should hold sk_refcnt instead, but this is a big deal under attack.
(Doing so increase performance from 3.2 Mpps to 3.8 Mpps only)
In this patch, I chose to not attach a socket to syncookies skb.
Performance is now 5 Mpps instead of 3.2 Mpps.
Following patch will remove last known false sharing in
tcp_rcv_state_process()
Fixes: 3b24d854cb ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the SO_REUSEPORT socket option, it is possible to create sockets
in the AF_INET and AF_INET6 domains which are bound to the same IPv4 address.
This is only possible with SO_REUSEPORT and when not using IPV6_V6ONLY on
the AF_INET6 sockets.
Prior to the commits referenced below, an incoming IPv4 packet would
always be routed to a socket of type AF_INET when this mixed-mode was used.
After those changes, the same packet would be routed to the most recently
bound socket (if this happened to be an AF_INET6 socket, it would
have an IPv4 mapped IPv6 address).
The change in behavior occurred because the recent SO_REUSEPORT optimizations
short-circuit the socket scoring logic as soon as they find a match. They
did not take into account the scoring logic that favors AF_INET sockets
over AF_INET6 sockets in the event of a tie.
To fix this problem, this patch changes the insertion order of AF_INET
and AF_INET6 addresses in the TCP and UDP socket lists when the sockets
have SO_REUSEPORT set. AF_INET sockets will be inserted at the head of the
list and AF_INET6 sockets with SO_REUSEPORT set will always be inserted at
the tail of the list. This will force AF_INET sockets to always be
considered first.
Fixes: e32ea7e747 ("soreuseport: fast reuseport UDP socket selection")
Fixes: 125e80b88687 ("soreuseport: fast reuseport TCP socket selection")
Reported-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a release_cb for UDPv6. It does a route lookup
and updates sk->sk_dst_cache if it is needed. It picks up the
left-over job from ip6_sk_update_pmtu() if the sk was owned
by user during the pmtu update.
It takes a rcu_read_lock to protect the __sk_dst_get() operations
because another thread may do ip6_dst_store() without taking the
sk lock (e.g. sendmsg).
Fixes: 45e4fd2668 ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reported-by: Wei Wang <weiwan@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a case in connected UDP socket such that
getsockopt(IPV6_MTU) will return a stale MTU value. The reproducible
sequence could be the following:
1. Create a connected UDP socket
2. Send some datagrams out
3. Receive a ICMPV6_PKT_TOOBIG
4. No new outgoing datagrams to trigger the sk_dst_check()
logic to update the sk->sk_dst_cache.
5. getsockopt(IPV6_MTU) returns the mtu from the invalid
sk->sk_dst_cache instead of the newly created RTF_CACHE clone.
This patch updates the sk->sk_dst_cache for a connected datagram sk
during pmtu-update code path.
Note that the sk->sk_v6_daddr is used to do the route lookup
instead of skb->data (i.e. iph). It is because a UDP socket can become
connected after sending out some datagrams in un-connected state. or
It can be connected multiple times to different destinations. Hence,
iph may not be related to where sk is currently connected to.
It is done under '!sock_owned_by_user(sk)' condition because
the user may make another ip6_datagram_connect() (i.e changing
the sk->sk_v6_daddr) while dst lookup is happening in the pmtu-update
code path.
For the sock_owned_by_user(sk) == true case, the next patch will
introduce a release_cb() which will update the sk->sk_dst_cache.
Test:
Server (Connected UDP Socket):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Route Details:
[root@arch-fb-vm1 ~]# ip -6 r show | egrep '2fac'
2fac::/64 dev eth0 proto kernel metric 256 pref medium
2fac:face::/64 via 2fac::face dev eth0 metric 1024 pref medium
A simple python code to create a connected UDP socket:
import socket
import errno
HOST = '2fac::1'
PORT = 8080
s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
s.bind((HOST, PORT))
s.connect(('2fac:face::face', 53))
print("connected")
while True:
try:
data = s.recv(1024)
except socket.error as se:
if se.errno == errno.EMSGSIZE:
pmtu = s.getsockopt(41, 24)
print("PMTU:%d" % pmtu)
break
s.close()
Python program output after getting a ICMPV6_PKT_TOOBIG:
[root@arch-fb-vm1 ~]# python2 ~/devshare/kernel/tasks/fib6/udp-connect-53-8080.py
connected
PMTU:1300
Cache routes after recieving TOOBIG:
[root@arch-fb-vm1 ~]# ip -6 r show table cache
2fac:face::face via 2fac::face dev eth0 metric 0
cache expires 463sec mtu 1300 pref medium
Client (Send the ICMPV6_PKT_TOOBIG):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
scapy is used to generate the TOOBIG message. Here is the scapy script I have
used:
>>> p=Ether(src='da:75:4d:36:ac:32', dst='52:54:00:12:34:66', type=0x86dd)/IPv6(src='2fac::face', dst='2fac::1')/ICMPv6PacketTooBig(mtu=1300)/IPv6(src='2fac::
1',dst='2fac:face::face', nh='UDP')/UDP(sport=8080,dport=53)
>>> sendp(p, iface='qemubr0')
Fixes: 45e4fd2668 ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reported-by: Wei Wang <weiwan@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
User needs to monitor shared buffer occupancy. For that, he issues a
snapshot command in order to instruct hardware to catch current and
maximal occupancy values, and clear command in order to clear the
historical maximal values.
Also port-pool and tc-pool-bind command response messages are extended to
carry occupancy values.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define userspace API and drivers API for configuration of shared
buffers. Four basic objects are defined:
shared buffer - attributes are size, number of pools and TCs
pool - chunk of sharedbuffer definition, it has some size and either
static or dynamic threshold
port pool threshold - to set per-port threshold for each pool
port tc threshold bind - to bind port and TC to specified pool
with threshold.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The structure can be packed denser by doing minor rearrangement
of existing elements.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently processing of multiple chunks in a single SCTP packet leads to
multiple calls to sk_data_ready, causing multiple wake up signals which
are costy and doesn't make it wake up any faster.
With this patch it will note that the wake up is pending and will do it
before leaving the state machine interpreter, latest place possible to
do it realiably and cleanly.
Note that sk_data_ready events are not dependent on asocs, unlike waking
up writers.
v2: series re-checked
v3: use local vars to cleanup the code, suggested by Jakub Sitnicki
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It wastes space and gets worse as we add new flags, so convert bit-wide
flags to a bitfield.
Currently it already saves 4 bytes in sctp_sock, which are left as holes
in it for now. The whole struct needs packing, which should be done in
another patch.
Note that do_auto_asconf cannot be merged, as explained in the comment
before it.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sock_owned_by_user should not be used without socket lock held. It seems
to be a common practice to check .owned before lock reclassification, so
provide a little help to abstract this check away.
Cc: linux-cifs@vger.kernel.org
Cc: linux-bluetooth@vger.kernel.org
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The phys in phys_port_mask suggests this mask is about PHYs. In fact,
it means physical ports. Rename to enabled_port_mask, indicating
external enabled ports of the switch, which is hopefully less
confusing.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The drivers now allocate their own memory for private usage. Remove
the allocation from the core code.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now the switch devices have a dev pointer, make use of it for allocating
the drivers private data structures using a devm_kzalloc().
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
By passing a device structure to the switch devices, it allows them
to use devm_* methods for resource management.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
all drivers - removing the duplicated enum ieee80211_band and
replacing it by enum nl80211_band. On top of that, just a small
documentation update.
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJXDp6hAAoJEGt7eEactAAd/xUQAJtKNwp9CLsx+QFx6lMoXX4x
r0XA8DFgLp1BflS9P05/g1m0NiQxm3YuRtpze/FdPglb6AAVjLcqksf+vTkU+Lng
p7rIkb/fQv5s5aoYPxNrD5zgwALVv9y5fI7rV7scj355iesCC0PmAP34own2Dihi
eBVSammsh5ZNTQKLBk8vXECb0UKWsDBMgp4uQc35Bpw8XSx5Nrtl5JI/hMcckte0
a/FQyQKjmjl3O/nRLn3kzGPv1OnRiJOMb5fMWB+Xm2cLtmKPHIErgVk2l/CMaiYj
sRJR8KaZQpQsyWiQU59UNpywlejy7Z1RsSWmuPhm0xTGzIF1wVIgHJSsRI/gNGD2
8Ey1P+RXkM8NVxrQr/0fis9XWyWfE8ne4tFsPiPOD3VmBiStIB9fAukJHLrvTmKU
JrkXCePUkfNY/PqJqlP/RONBcysI253/snVF49oZ7LMBZiGDPhdRcEEcCaS0tmMM
Qa+a78XvaH5xaKuMIDZ4qMdnMMcdv4g8G1DQeA1mb0EIGL1Gtu9BJsu9q8PqmjQU
1ZAf4MlWJWdYk+CtTNT4slSIQVKAN78s6j1HSB/bNcpWk9y93wBhJW0FdP7FtJ1I
pjJGIVcLU98FKdqi2jqPEezbDXXzOz0gNQDbqfJyM9/R7ijnJcaPllviaWjEg/O7
8jMBOg87Hn7kq7JJGpKA
=2xfe
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2016-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
To synchronize with Kalle, here's just a big change that affects
all drivers - removing the duplicated enum ieee80211_band and
replacing it by enum nl80211_band. On top of that, just a small
documentation update.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In case of link-layer specific handling for 802.15.4 we need to cast to
802.15.4 sepcific structures. Simple add this header when include the
6lowpan header.
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Reviewed-by: Stefan Schmidt<stefan@osg.samsung.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This function will be use in later functionality in other branches than
generic 6lowpan, so we move it to the global 6lowpan header.
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Reviewed-by: Stefan Schmidt<stefan@osg.samsung.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch moves the 802.15.4 link layer specific structures to generic
6lowpan. This is necessary for special 802.15.4 6lowpan handling in
6lowpan generic layer.
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch changes the naming for interface private data for lowpan
intefaces. The current private data scheme is:
-------------------------------------------------
| 6LoWPAN Generic | LinkLayer 6LoWPAN |
-------------------------------------------------
the current naming schemes are:
- 6LoWPAN Generic:
- lowpan_priv
- LinkLayer 6LoWPAN:
- BTLE
- lowpan_dev
- 802.15.4:
- lowpan_dev_info
the new naming scheme with this patch will be:
- 6LoWPAN Generic:
- lowpan_dev
- LinkLayer 6LoWPAN:
- BTLE
- lowpan_btle_dev
- 802.15.4:
- lowpan_802154_dev
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Reviewed-by: Stefan Schmidt<stefan@osg.samsung.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch introduce some short address handling functionality into
ieee802154 headers.
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains the first batch of Netfilter updates for
your net-next tree.
1) Define pr_fmt() in nf_conntrack, from Weongyo Jeong.
2) Define and register netfilter's afinfo for the bridge family,
this comes in preparation for native nfqueue's bridge for nft,
from Stephane Bryant.
3) Add new attributes to store layer 2 and VLAN headers to nfqueue,
also from Stephane Bryant.
4) Parse new NFQA_VLAN and NFQA_L2HDR nfqueue netlink attributes
coming from userspace, from Stephane Bryant.
5) Use net->ipv6.devconf_all->hop_limit instead of hardcoded hop_limit
in IPv6 SYNPROXY, from Liping Zhang.
6) Remove unnecessary check for dst == NULL in nf_reject_ipv6,
from Haishuang Yan.
7) Deinline ctnetlink event report functions, from Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Not performance critical, it is only invoked when an expectation is
added/destroyed.
While at it, kill unused nf_ct_expect_event() wrapper.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Way too large; move it to nf_conntrack_ecache.c.
Reduces total object size by 1216 byte on my machine.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This enum is already perfectly aliased to enum nl80211_band, and
the only reason for it is that we get IEEE80211_NUM_BANDS out of
it. There's no really good reason to not declare the number of
bands in nl80211 though, so do that and remove the cfg80211 one.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The roaming cases for the Connect command were not fully covered and
neither Connect nor Associate command uses of the prev_bssid parameter
were very clear. Add details to describe how the prev_bssid argument is
supposed to be used and when the driver should use association or
reassociation.
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Vivek reported a kernel exception deleting a VRF with an active
connection through it. The root cause is that the socket has a cached
reference to a dst that is destroyed. Converting the dst_destroy to
dst_release and letting proper reference counting kick in does not
work as the dst has a reference to the device which needs to be released
as well.
I talked to Hannes about this at netdev and he pointed out the ipv4 and
ipv6 dst handling has dst_ifdown for just this scenario. Rather than
continuing with the reinvented dst wheel in VRF just remove it and
leverage the ipv4 and ipv6 versions.
Fixes: 193125dbd8 ("net: Introduce VRF device driver")
Fixes: 35402e3136 ("net: Add IPv6 support to VRF device")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the rxrpc_connection and rxrpc_call structs, there's one field to hold
the abort code, no matter whether that value was generated locally to be
sent or was received from the peer via an abort packet.
Split the abort code fields in two for cleanliness sake and add an error
field to hold the Linux error number to the rxrpc_call struct too
(sometimes this is generated in a context where we can't return it to
userspace directly).
Furthermore, add a skb mark to indicate a packet that caused a local abort
to be generated so that recvmsg() can pick up the correct abort code. A
future addition will need to be to indicate to userspace the difference
between aborts via a control message.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move some miscellaneous bits out into their own file to make it easier to
split the call handling.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Multipath route lookups should consider knowledge about next hops and not
select a hop that is known to be failed.
Example:
[h2] [h3] 15.0.0.5
| |
3| 3|
[SP1] [SP2]--+
1 2 1 2
| | /-------------+ |
| \ / |
| X |
| / \ |
| / \---------------\ |
1 2 1 2
12.0.0.2 [TOR1] 3-----------------3 [TOR2] 12.0.0.3
4 4
\ /
\ /
\ /
-------| |-----/
1 2
[TOR3]
3|
|
[h1] 12.0.0.1
host h1 with IP 12.0.0.1 has 2 paths to host h3 at 15.0.0.5:
root@h1:~# ip ro ls
...
12.0.0.0/24 dev swp1 proto kernel scope link src 12.0.0.1
15.0.0.0/16
nexthop via 12.0.0.2 dev swp1 weight 1
nexthop via 12.0.0.3 dev swp1 weight 1
...
If the link between tor3 and tor1 is down and the link between tor1
and tor2 then tor1 is effectively cut-off from h1. Yet the route lookups
in h1 are alternating between the 2 routes: ping 15.0.0.5 gets one and
ssh 15.0.0.5 gets the other. Connections that attempt to use the
12.0.0.2 nexthop fail since that neighbor is not reachable:
root@h1:~# ip neigh show
...
12.0.0.3 dev swp1 lladdr 00:02:00:00:00:1b REACHABLE
12.0.0.2 dev swp1 FAILED
...
The failed path can be avoided by considering known neighbor information
when selecting next hops. If the neighbor lookup fails we have no
knowledge about the nexthop, so give it a shot. If there is an entry
then only select the nexthop if the state is sane. This is similar to
what fib_detect_death does.
To maintain backward compatibility use of the neighbor information is
based on a new sysctl, fib_multipath_use_neigh.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently on high rate SCTP streams the heartbeat timer refresh can
consume quite a lot of resources as timer updates are costly and it
contains a random factor, which a) is also costly and b) invalidates
mod_timer() optimization for not editing a timer to the same value.
It may even cause the timer to be slightly advanced, for no good reason.
As suggested by David Laight this patch now removes this timer update
from hot path by leaving the timer on and re-evaluating upon its
expiration if the heartbeat is still needed or not, similarly to what is
done for TCP. If it's not needed anymore the timer is re-scheduled to
the new timeout, considering the time already elapsed.
For this, we now record the last tx timestamp per transport, updated in
the same spots as hb timer was restarted on tx. Also split up
sctp_transport_reset_timers into sctp_transport_reset_t3_rtx and
sctp_transport_reset_hb_timer, so we can re-arm T3 without re-arming the
heartbeat one.
On loopback with MTU of 65535 and data chunks with 1636, so that we
have a considerable amount of chunks without stressing system calls,
netperf -t SCTP_STREAM -l 30, perf looked like this before:
Samples: 103K of event 'cpu-clock', Event count (approx.): 25833000000
Overhead Command Shared Object Symbol
+ 6,15% netperf [kernel.vmlinux] [k] copy_user_enhanced_fast_string
- 5,43% netperf [kernel.vmlinux] [k] _raw_write_unlock_irqrestore
- _raw_write_unlock_irqrestore
- 96,54% _raw_spin_unlock_irqrestore
- 36,14% mod_timer
+ 97,24% sctp_transport_reset_timers
+ 2,76% sctp_do_sm
+ 33,65% __wake_up_sync_key
+ 28,77% sctp_ulpq_tail_event
+ 1,40% del_timer
- 1,84% mod_timer
+ 99,03% sctp_transport_reset_timers
+ 0,97% sctp_do_sm
+ 1,50% sctp_ulpq_tail_event
And after this patch, now with netperf -l 60:
Samples: 230K of event 'cpu-clock', Event count (approx.): 57707250000
Overhead Command Shared Object Symbol
+ 5,65% netperf [kernel.vmlinux] [k] memcpy_erms
+ 5,59% netperf [kernel.vmlinux] [k] copy_user_enhanced_fast_string
- 5,05% netperf [kernel.vmlinux] [k] _raw_spin_unlock_irqrestore
- _raw_spin_unlock_irqrestore
+ 49,89% __wake_up_sync_key
+ 45,68% sctp_ulpq_tail_event
- 2,85% mod_timer
+ 76,51% sctp_transport_reset_t3_rtx
+ 23,49% sctp_do_sm
+ 1,55% del_timer
+ 2,50% netperf [sctp] [k] sctp_datamsg_from_user
+ 2,26% netperf [sctp] [k] sctp_sendmsg
Throughput-wise, from 6800mbps without the patch to 7050mbps with it,
~3.7%.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The switchdev design implies that a software error should not happen in
the commit phase since it must have been previously reported in the
prepare phase. If an hardware error occurs during the commit phase,
there is nothing switchdev can do about it.
The DSA layer separates port_vlan_prepare and port_vlan_add for
simplicity and convenience. If an hardware error occurs during the
commit phase, there is no need to report it outside the driver itself.
Make the DSA port_vlan_add routine return void for explicitness.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The switchdev design implies that a software error should not happen in
the commit phase since it must have been previously reported in the
prepare phase. If an hardware error occurs during the commit phase,
there is nothing switchdev can do about it.
The DSA layer separates port_fdb_prepare and port_fdb_add for simplicity
and convenience. If an hardware error occurs during the commit phase,
there is no need to report it outside the DSA driver itself.
Make the DSA port_fdb_add routine return void for explicitness.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The DSA layer doesn't care about the return code of the port_stp_update
routine, so make it void in the layer and the DSA drivers.
Replace the useless dsa_slave_stp_update function with a
dsa_slave_stp_state function used to reply to the switchdev
SWITCHDEV_ATTR_ID_PORT_STP_STATE attribute.
In the meantime, rename port_stp_update to port_stp_state_set to
explicit the state change.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* Bob's mesh mode rhashtable conversion, this includes
the rhashtable API change for allocation flags
* BSSID scan, connect() command reassoc support (Jouni)
* fast (optimised data only) and support for RSS in mac80211 (myself)
* various smaller changes
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJXBQ4GAAoJEGt7eEactAAdWiMP/ibaP3I79NDc0s7wCDA+KRkm
hx0Qx4a0wwm7lDFlnGBjY6yKr+XFDliCvdGX7XGpLSsTioNg7eXPpwx5FQoj6RiV
8+5RKE9fTguN9ofUzqAwHd9sVOaxvdlXbKfb/N93Gzjpw/meYk58wXdF7Almkroa
ukgJeMzIlIh+6D96zFEA+Ofzp5chwh+x2Dn0wXutEe9P9fOERA859veAvx65b+Ql
IRGTqyuY5B/wcbkr4o+DWQwgrdt7Vop9nYVPNWtMHm2JTzfuCSaQ2cD9TnVAK/bg
/vtqC46KKNLyBRGexAPqdftY9PWcfipgE+n7k+Et4iGSmNm7Z3dEyewgXmqli7XJ
X8Uiaq+N6Fpe06DVSU7aSRt8NLV64A44jXSfKRI9U2POUqKMn/PMdm8bhPW8qCdM
ra6myWpQGHWK9e0TQQdShq0NQKGxCZAiSRiiIrbbvXl1CwXxkPCG39wAC3Sh1tEN
ou4lGraeywGnTjaq+mwLEtHLoug8Y2x+Fz+Ze4Cu2enXxna9lp4lr+rFlc+2+0Er
o9oPxkTk8krZGIj9M6PNc5W+InMwchaFX3076n67hnFHzFRlOQzkfffbPYlhKJDQ
f8c9JiNZIoX/fD1TAKsrdO1+EKm/xo7w7pLgbMwQal8Jr88SkITDg0i3oXc56vNQ
ZK2gUzwvrD/jh0AUyDfN
=sj7y
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2016-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
For the 4.7 cycle, we have a number of changes:
* Bob's mesh mode rhashtable conversion, this includes
the rhashtable API change for allocation flags
* BSSID scan, connect() command reassoc support (Jouni)
* fast (optimised data only) and support for RSS in mac80211 (myself)
* various smaller changes
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
* TDLS fixes from Arik and Ilan
* rhashtable fixes from Ben and myself
* documentation fixes from Luis
* U-APSD fixes from Emmanuel
* a TXQ fix from Felix
* and a compiler warning suppression from Jeff
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJXBQyAAAoJEGt7eEactAAdymsP/i2zU+VQFpkB8RG+nn/AYogY
x2RXgKjCWOs6FwUcu+VHMx0whKuMADjqbMABdlsGGK62Xa5aYYcObvY+CgCUAI+m
unV7kYDIBUHudiTpxXgZYUvylhIvW37VYjc6BDoaq4Jc1rz/L69zSrNHmoNiQv+Y
113T0Ft5EEmEO1LP4s2GLMZTPqwgi2FnaP6UYFdTr2/ZfaRHlj2xRRG62WiT/q1x
DMT2KWHvETCftpK3GwwkMSr0Au8CVV1soiQOoioTPRaevYbBFVi60GVXQeDlvFEV
PqVCOEfsSvsw84phfHrW1bOxBeVsNYHbY/T4eVlC0zssUzz6KNH5jAfpyla1p0lP
WniSqAaWxMcUYWCEBiOLa5LV2XVpXOuTpI84xcgc/BmprgzNyLgDAiCDtehpxALf
Qmhc/rPR5BbLhTNY8z5qANG6mhQCHCo+52ypvBLMhZoajkPjgyBabwoqaRGje2ub
vgzbAfqEguJmCAszw04KZ2UHInBcCDAZ5aOKiinawWvpftkDN0IJmO/HW5vnh4xV
kawpo1eh1JzDcEyYfSjySHHFmR/5qnaDzaPL2cJthcOY/fGywibJHCQmoDPyH5jz
Bkk0F0rMEcdQWs9pJLIMMzkA7BAlxYLYip0J9QImHL77sWK0QwUDCoryrgD6lL1D
v2V31g1TZwPF2Noe9Rk7
=r3Hr
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2016-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
For the current RC series, we have the following fixes:
* TDLS fixes from Arik and Ilan
* rhashtable fixes from Ben and myself
* documentation fixes from Luis
* U-APSD fixes from Emmanuel
* a TXQ fix from Felix
* and a compiler warning suppression from Jeff
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
I forgot to add inline to lockdep_sock_is_held, so it generated all
kinds of build warnings if not build with lockdep support.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that the UDP encapsulation GRO functions have been moved to the UDP
socket we not longer need the udp_offload insfrastructure so removing it.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adapt vxlan_gro_receive, vxlan_gro_complete to take a socket argument.
Set these functions in tunnel_config. Don't set udp_offloads any more.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add gro_receive and gro_complete to struct udp_tunnel_sock_cfg.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds GRO functions (gro_receive and gro_complete) to UDP
sockets. udp_gro_receive is changed to perform socket lookup on a
packet. If a socket is found the related GRO functions are called.
This features obsoletes using UDP offload infrastructure for GRO
(udp_offload). This has the advantage of not being limited to provide
offload on a per port basis, GRO is now applied to whatever individual
UDP sockets are bound to. This also allows the possbility of
"application defined GRO"-- that is we can attach something like
a BPF program to a UDP socket to perfrom GRO on an application
layer protocol.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add externally visible functions to lookup a UDP socket by skb. This
will be used for GRO in UDP sockets. These functions also check
if skb->dst is set, and if it is not skb->dev is used to get dev_net.
This allows calling lookup functions before dst has been set on the
skbuff.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In inet_iif check if skb_rtable is NULL for the skb and return
skb->skb_iif if it is.
This change allows inet_iif to be called before the dst
information has been set in the skb (e.g. when doing socket based
UDP GRO).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The socket is either locked if we hold the slock spin_lock for
lock_sock_fast and unlock_sock_fast or we own the lock (sk_lock.owned
!= 0). Check for this and at the same time improve that the current
thread/cpu is really holding the lock.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
During release_sock we use callbacks to finish the processing
of outstanding skbs on the socket. We actually are still locked,
sk_locked.owned == 1, but we already told lockdep that the mutex
is released. This could lead to false positives in lockdep for
lockdep_sock_is_held (we don't hold the slock spinlock during processing
the outstanding skbs).
I took over this patch from Eric Dumazet and tested it.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement VXLAN-GPE. Only COLLECT_METADATA is supported for now (it is
possible to support static configuration, too, if there is demand for it).
The GPE header parsing has to be moved before iptunnel_pull_header, as we
need to know the protocol.
v2: Removed what was called "L2 mode" in v1 of the patchset. Only "L3 mode"
(now called "raw mode") is added by this patch. This mode does not allow
Ethernet header to be encapsulated in VXLAN-GPE when using ip route to
specify the encapsulation, IP header is encapsulated instead. The patch
does support Ethernet to be encapsulated, though, using ETH_P_TEB in
skb->protocol. This will be utilized by other COLLECT_METADATA users
(openvswitch in particular).
If there is ever demand for Ethernet encapsulation with VXLAN-GPE using
ip route, it's easy to add a new flag switching the interface to
"Ethernet mode" (called "L2 mode" in v1 of this patchset). For now,
leave this out, it seems we don't need it.
Disallowed more flag combinations, especially RCO with GPE.
Added comment explaining that GBP and GPE cannot be set together.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow calling of iptunnel_pull_header without special casing ETH_P_TEB inner
protocol.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: ddf97ccdd7 ("net_sched: add network namespace support for tc actions")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This extends NL80211_CMD_CONNECT to allow the NL80211_ATTR_PREV_BSSID
attribute to be used similarly to way this was already allowed with
NL80211_CMD_ASSOCIATE. This allows user space to request reassociation
(instead of association) when already connected to an AP. This provides
an option to reassociate within an ESS without having to disconnect and
associate with the AP.
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Requires software tx queueing and fast-xmit support. For good
performance, drivers need frag_list support as well. This avoids the
need for copying data of aggregated frames. Running without it is only
supported for debugging purposes.
To avoid performance and packet size issues, the rate control module or
driver needs to limit the maximum A-MSDU size by setting
max_rc_amsdu_len in struct ieee80211_sta.
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
[fix locking issue]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If the driver advertises the new HW flag USE_RSS, make the
station statistics on the fast-rx path per-CPU. This will
enable calling the RX in parallel, only hitting locking or
shared cachelines when the fast-RX path isn't available.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Sometimes drivers already looked up, or know out-of-band
from their device, which station transmitted a given RX
frame. Allow them to pass the station pointer to mac80211
to save the extra lookup.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Enable peeking at UDP datagrams at the offset specified with socket
option SOL_SOCKET/SO_PEEK_OFF. Peek at any datagram in the queue, up
to the end of the given datagram.
Implement the SO_PEEK_OFF semantics introduced in commit ef64a54f6e
("sock: Introduce the SO_PEEK_OFF sock option"). Increase the offset
on peek, decrease it on regular reads.
When peeking, always checksum the packet immediately, to avoid
recomputation on subsequent peeks and final read.
The socket lock is not held for the duration of udp_recvmsg, so
peek and read operations can run concurrently. Only the last store
to sk_peek_off is preserved.
Signed-off-by: Sam Kumar <samanthakumar@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove UDP transport headers before queueing packets for reception.
This change simplifies a follow-up patch to add MSG_PEEK support.
Signed-off-by: Sam Kumar <samanthakumar@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make the peek offset interface safe to use in lockless environments.
Use READ_ONCE and WRITE_ONCE to avoid race conditions between testing
and updating the peek offset.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use list_* helpers in sctp_list_dequeue, more readable.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Legacy clients don't support P2P power save mechanism, and thus if a P2P GO
has a legacy client connected to it, it should disable P2P PS mechanisms.
Let the driver know about this with a new bss_conf parameter.
Signed-off-by: Ayala Beker <ayala.beker@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Legacy clients don't support P2P power save mechanisms, and thus
if a P2P GO has a legacy client connected to it, it has to make
some changes in the PS behavior.
To handle this, add an attribute to specify whether a station supports
P2P PS or not. If the attribute was not specified cfg80211 will assume
that station supports it for P2P GO interface, and does NOT support it
for AP interface, matching the current assumptions in the code.
Signed-off-by: Ayala Beker <ayala.beker@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This patch fix a structure name mismatch in cfg80211.h.
Signed-off-by: Moroo Akira <retrage01@gmail.com>
Reviewed-by: Julian Calaby <julian.calaby@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add documentation for the flag for duplication check.
Fixes the following warning when running make htmldocs:
warning: Enum value 'RX_FLAG_DUP_VALIDATED' not described in enum 'mac80211_rx_flags'
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
[fix description]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Introducing a new feature that the driver can use to
indicate the driver/firmware supports configuration of BSS
selection criteria upon CONNECT command. This can be useful
when multiple BSS-es are found belonging to the same ESS,
ie. Infra-BSS with same SSID. The criteria can then be used to
offload selection of a preferred BSS.
Reviewed-by: Hante Meuleman <meuleman@broadcom.com>
Reviewed-by: Franky (Zhenhui) Lin <frankyl@broadcom.com>
Reviewed-by: Pieter-Paul Giesberts <pieterpg@broadcom.com>
Reviewed-by: Lei Zhang <leizh@broadcom.com>
Signed-off-by: Arend van Spriel <arend@broadcom.com>
[move wiphy support check into parse_bss_select()]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Some devices, like iwlwifi, have RSS queues. This may cause a
situation where a disassociation is handled in control path and
results in station removal while there are prior RX frames
that were still not processed in other queues. When they will
be processed the station will be gone, and the frames will be
dropped.
Add a synchronization interface to avoid that. When driver returns
from the synchronization mac80211 may remove the station.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This allows scans for a specific BSSID to be optimized by the user space
application by requesting the driver to set the Probe Request frame
BSSID field (Address 3) to the specified BSSID instead of the wildcard
BSSID. This prevents other APs from replying which reduces airtime need
and latency in getting the response from the target AP through.
This is an optimization and as such, it is acceptable for some of the
drivers not to support the mechanism. If not supported, the wildcard
BSSID will be used and more responses may be received.
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When HW crypto is used, there's no need for the CCMP/GCMP MIC to
be available to mac80211, and the hardware might have removed it
already after checking. The MIC is also useless to have when the
frame is already decrypted, so allow indicating that it's not
present.
Since we are running out of bits in mac80211_rx_flags, make
the flags field a u64.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This was requested by Android, and the appropriate cfg80211 API
had been added by Dmitry. Support it in mac80211, allowing drivers
to provide the timestamp.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Goal: packets dropped by a listener are accounted for.
This adds tcp_listendrop() helper, and clears sk_drops in sk_clone_lock()
so that children do not inherit their parent drop count.
Note that we no longer increment LINUX_MIB_LISTENDROPS counter when
sending a SYNCOOKIE, since the SYN packet generated a SYNACK.
We already have a separate LINUX_MIB_SYNCOOKIESSENT
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now ss can report sk_drops, we can instruct TCP to increment
this per socket counter when it drops an incoming frame, to refine
monitoring and debugging.
Following patch takes care of listeners drops.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a SYNFLOOD targets a non SO_REUSEPORT listener, multiple
cpus contend on sk->sk_refcnt and sk->sk_wmem_alloc changes.
By letting listeners use SOCK_RCU_FREE infrastructure,
we can relax TCP_LISTEN lookup rules and avoid touching sk_refcnt
Note that we still use SLAB_DESTROY_BY_RCU rules for other sockets,
only listeners are impacted by this change.
Peak performance under SYNFLOOD is increased by ~33% :
On my test machine, I could process 3.2 Mpps instead of 2.4 Mpps
Most consuming functions are now skb_set_owner_w() and sock_wfree()
contending on sk->sk_wmem_alloc when cooking SYNACK and freeing them.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We'll soon no longer take a refcount on listeners,
so reqsk_alloc() can not assume a listener refcount is not
zero. We need to use atomic_inc_not_zero()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since linux 2.6.29, lookups only use rcu locking.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert would like not touching UDP socket refcnt for encapsulated
traffic. For this to happen, we need to use normal RCU rules, with a grace
period before freeing a socket. UDP sockets are not short lived in the
high usage case, so the added cost of call_rcu() should not be a concern.
This actually removes a lot of complexity in UDP stack.
Multicast receives no longer need to hold a bucket spinlock.
Note that ip early demux still needs to take a reference on the socket.
Same remark for functions used by xt_socket and xt_PROXY netfilter modules,
but this might be changed later.
Performance for a single UDP socket receiving flood traffic from
many RX queues/cpus.
Simple udp_rx using simple recvfrom() loop :
438 kpps instead of 374 kpps : 17 % increase of the peak rate.
v2: Addressed Willem de Bruijn feedback in multicast handling
- keep early demux break in __udp4_lib_demux_lookup()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Cc: Willem de Bruijn <willemb@google.com>
Tested-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We want a generic way to insert an RCU grace period before socket
freeing for cases where RCU_SLAB_DESTROY_BY_RCU is adding too
much overhead.
SLAB_DESTROY_BY_RCU strict rules force us to take a reference
on the socket sk_refcnt, and it is a performance problem for UDP
encapsulation, or TCP synflood behavior, as many CPUs might
attempt the atomic operations on a shared sk_refcnt
UDP sockets and TCP listeners can set SOCK_RCU_FREE so that their
lookup can use traditional RCU rules, without refcount changes.
They can set the flag only once hashed and visible by other cpus.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Tested-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, SOL_TIMESTAMPING can only be enabled using setsockopt.
This is very costly when users want to sample writes to gather
tx timestamps.
Add support for enabling SO_TIMESTAMPING via control messages by
using tsflags added in `struct sockcm_cookie` (added in the previous
patches in this series) to set the tx_flags of the last skb created in
a sendmsg. With this patch, the timestamp recording bits in tx_flags
of the skbuff is overridden if SO_TIMESTAMPING is passed in a cmsg.
Please note that this is only effective for overriding the recording
timestamps flags. Users should enable timestamp reporting (e.g.,
SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using
socket options and then should ask for SOF_TIMESTAMPING_TX_*
using control messages per sendmsg to sample timestamps for each
write.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Process socket-level control messages by invoking
__sock_cmsg_send in ip6_datagram_send_ctl for control messages on
the SOL_SOCKET layer.
This makes sure whenever ip6_datagram_send_ctl is called for
udp and raw, we also process socket-level control messages.
This is a bit uglier than IPv4, since IPv6 does not have
something like ipcm_cookie. Perhaps we can later create
a control message cookie for IPv6?
Note that this commit interprets new control messages that
were ignored before. As such, this commit does not change
the behavior of IPv6 control messages.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Process socket-level control messages by invoking
__sock_cmsg_send in ip_cmsg_send for control messages on
the SOL_SOCKET layer.
This makes sure whenever ip_cmsg_send is called in udp, icmp,
and raw, we also process socket-level control messages.
Note that this commit interprets new control messages that
were ignored before. As such, this commit does not change
the behavior of IPv4 control messages.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Accept SO_TIMESTAMPING in control messages of the SOL_SOCKET level
as a basis to accept timestamping requests per write.
This implementation only accepts TX recording flags (i.e.,
SOF_TIMESTAMPING_TX_HARDWARE, SOF_TIMESTAMPING_TX_SOFTWARE,
SOF_TIMESTAMPING_TX_SCHED, and SOF_TIMESTAMPING_TX_ACK) in
control messages. Users need to set reporting flags (e.g.,
SOF_TIMESTAMPING_OPT_ID) per socket via socket options.
This commit adds a tsflags field in sockcm_cookie which is
set in __sock_cmsg_send. It only override the SOF_TIMESTAMPING_TX_*
bits in sockcm_cookie.tsflags allowing the control message
to override the recording behavior per write, yet maintaining
the value of other flags.
This patch implements validating the control message and setting
tsflags in struct sockcm_cookie. Next commits in this series will
actually implement timestamping per write for different protocols.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, to avoid a cache line miss for accessing skb_shinfo,
tcp_ack_tstamp skips socket that do not have
SOF_TIMESTAMPING_TX_ACK bit set in sk_tsflags. This is
implemented based on an implicit assumption that the
SOF_TIMESTAMPING_TX_ACK is set via socket options for the
duration that ACK timestamps are needed.
To implement per-write timestamps, this check should be
removed and replaced with a per-packet alternative that
quickly skips packets missing ACK timestamps marks without
a cache-line miss.
To enable per-packet marking without a cache line miss, use
one bit in TCP_SKB_CB to mark a whether a SKB might need a
ack tx timestamp or not. Further checks in tcp_ack_tstamp are not
modified and work as before.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To process cmsg's of the SOL_SOCKET level in addition to
cmsgs of another level, protocols can call sock_cmsg_send().
This causes a double walk on the cmsghdr list, one for SOL_SOCKET
and one for the other level.
Extract the inner demultiplex logic from the loop that walks the list,
to allow having this called directly from a walker in the protocol
specific code.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For non-SACK connections, cwnd is lowered to inflight plus 3 packets
when the recovery ends. This is an optional feature in the NewReno
RFC 2582 to reduce the potential burst when cwnd is "re-opened"
after recovery and inflight is low.
This feature is questionably effective because of PRR: when
the recovery ends (i.e., snd_una == high_seq) NewReno holds the
CA_Recovery state for another round trip to prevent false fast
retransmits. But if the inflight is low, PRR will overwrite the
moderated cwnd in tcp_cwnd_reduction() later regardlessly. So if a
receiver responds bogus ACKs (i.e., acking future data) to speed up
transfer after recovery, it can only induce a burst up to a window
worth of data packets by acking up to SND.NXT. A restart from (short)
idle or receiving streched ACKs can both cause such bursts as well.
On the other hand, if the recovery ends because the sender
detects the losses were spurious (e.g., reordering). This feature
unconditionally lowers a reverted cwnd even though nothing
was lost.
By principle loss recovery module should not update cwnd. Further
pacing is much more effective to reduce burst. Hence this patch
removes the cwnd moderation feature.
v2 changes: revised commit message on bogus ACKs and burst, and
missing signature
Signed-off-by: Matt Mathis <mattmathis@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As ping_v6_sendmsg is used only in this file,
making it static
The body of "pingv6_prot" and "pingv6_protosw" were
moved at the middle of the file, to avoid having to
declare some static prototypes.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
struct in6_addr isn't used anymore in inet6_connection_sock.h, removing
the forward declaration.
Fixes: 1b33bc3e9e ("ipv6: remove obsolete inet6 functions")
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sparse reports false positives for the header manipulation inlines. Annotate
them correctly.
Tested by sparse on a little endian and big endian machine.
Fixes: 54bfd872bf ("vxlan: keep flags and vni in network byte order")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a packet is either locally encapsulated or processed through GRO
it is marked with the offloads that it requires. However, when it is
decapsulated these tunnel offload indications are not removed. This
means that if we receive an encapsulated TCP packet, aggregate it with
GRO, decapsulate, and retransmit the resulting frame on a NIC that does
not support encapsulation, we won't be able to take advantage of hardware
offloads even though it is just a simple TCP packet at this point.
This fixes the problem by stripping off encapsulation offload indications
when packets are decapsulated.
The performance impacts of this bug are significant. In a test where a
Geneve encapsulated TCP stream is sent to a hypervisor, GRO'ed, decapsulated,
and bridged to a VM performance is improved by 60% (5Gbps->8Gbps) as a
result of avoiding unnecessary segmentation at the VM tap interface.
Reported-by: Ramu Ramamurthy <sramamur@linux.vnet.ibm.com>
Fixes: 68c33163 ("v4 GRE: Add TCP segmentation offload for GRE")
Signed-off-by: Jesse Gross <jesse@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the user supply a different fragmentation point or if there is a
network header that cause it to not be aligned, force it to be aligned.
Fragmentation point at a value that is not aligned is not optimal. It
causes extra padding to be used and has just no pros.
v2:
- Make use of the new WORD_TRUNC macro
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SCTP is a protocol that is aligned to a word (4 bytes). Thus using bare
MTU can sometimes return values that are not aligned, like for loopback,
which is 65536 but ipv4_mtu() limits that to 65535. This mis-alignment
will cause the last non-aligned bytes to never be used and can cause
issues with congestion control.
So it's better to just consider a lower MTU and keep congestion control
calcs saner as they are based on PMTU.
Same applies to icmp frag needed messages, which is also fixed by this
patch.
One other effect of this is the inability to send MTU-sized packet
without queueing or fragmentation and without hitting Nagle. As the
check performed at sctp_packet_can_append_data():
if (chunk->skb->len + q->out_qlen >= transport->pathmtu - packet->overhead)
/* Enough data queued to fill a packet */
return SCTP_XMIT_OK;
with the above example of MTU, if there are no other messages queued,
one cannot send a packet that just fits one packet (65532 bytes) and
without causing DATA chunk fragmentation or a delay.
v2:
- Added WORD_TRUNC macro
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
flowi6_tos of struct flowi6 is unused in IPv6, therefore dumping tos on
that tracepoint will also give incorrect information wrt traffic class.
If we want to fix it, we need to extract it via ip6_tclass(flp->flowlabel).
While for the same test case I get a count of 0 non-zero tos values before
the change, they now start to show up after the change:
# ./perf record -e fib6:fib6_table_lookup -a sleep 10
# ./perf script | grep -v "tos 0" | wc -l
60
Since there's no user in the kernel tree anymore of flowi6_tos, remove the
define to avoid any future confusion on this.
Fixes: b811580d91 ("net: IPv6 fib lookup tracepoint")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri mentioned that flowi6_tos of struct flowi6 is never used/read
anywhere. In fact, rest of the kernel uses the flowi6's flowlabel,
where the traffic class _and_ the flowlabel (aka flowinfo) is encoded.
For example, for policy routing, fib6_rule_match() uses ip6_tclass()
that is applied on the flowlabel member for matching on tclass. Similar
fix is needed for geneve, where flowi6_tos is set as well. Installing
a v6 blackhole rule that f.e. matches on tos is now working with vxlan.
Fixes: 1400615d64 ("vxlan: allow setting ipv6 traffic class")
Reported-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
"Highlights:
1) Support more Realtek wireless chips, from Jes Sorenson.
2) New BPF types for per-cpu hash and arrap maps, from Alexei
Starovoitov.
3) Make several TCP sysctls per-namespace, from Nikolay Borisov.
4) Allow the use of SO_REUSEPORT in order to do per-thread processing
of incoming TCP/UDP connections. The muxing can be done using a
BPF program which hashes the incoming packet. From Craig Gallek.
5) Add a multiplexer for TCP streams, to provide a messaged based
interface. BPF programs can be used to determine the message
boundaries. From Tom Herbert.
6) Add 802.1AE MACSEC support, from Sabrina Dubroca.
7) Avoid factorial complexity when taking down an inetdev interface
with lots of configured addresses. We were doing things like
traversing the entire address less for each address removed, and
flushing the entire netfilter conntrack table for every address as
well.
8) Add and use SKB bulk free infrastructure, from Jesper Brouer.
9) Allow offloading u32 classifiers to hardware, and implement for
ixgbe, from John Fastabend.
10) Allow configuring IRQ coalescing parameters on a per-queue basis,
from Kan Liang.
11) Extend ethtool so that larger link mode masks can be supported.
From David Decotigny.
12) Introduce devlink, which can be used to configure port link types
(ethernet vs Infiniband, etc.), port splitting, and switch device
level attributes as a whole. From Jiri Pirko.
13) Hardware offload support for flower classifiers, from Amir Vadai.
14) Add "Local Checksum Offload". Basically, for a tunneled packet
the checksum of the outer header is 'constant' (because with the
checksum field filled into the inner protocol header, the payload
of the outer frame checksums to 'zero'), and we can take advantage
of that in various ways. From Edward Cree"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1548 commits)
bonding: fix bond_get_stats()
net: bcmgenet: fix dma api length mismatch
net/mlx4_core: Fix backward compatibility on VFs
phy: mdio-thunder: Fix some Kconfig typos
lan78xx: add ndo_get_stats64
lan78xx: handle statistics counter rollover
RDS: TCP: Remove unused constant
RDS: TCP: Add sysctl tunables for sndbuf/rcvbuf on rds-tcp socket
net: smc911x: convert pxa dma to dmaengine
team: remove duplicate set of flag IFF_MULTICAST
bonding: remove duplicate set of flag IFF_MULTICAST
net: fix a comment typo
ethernet: micrel: fix some error codes
ip_tunnels, bpf: define IP_TUNNEL_OPTS_MAX and use it
bpf, dst: add and use dst_tclassid helper
bpf: make skb->tc_classid also readable
net: mvneta: bm: clarify dependencies
cls_bpf: reset class and reuse major in da
ldmvsw: Checkpatch sunvnet.c and sunvnet_common.c
ldmvsw: Add ldmvsw.c driver code
...
bond_get_stats() can be called from rtnetlink (with RTNL held)
or from /proc/net/dev seq handler (with RCU held)
The logic added in commit 5f0c5f73e5 ("bonding: make global bonding
stats more reliable") kind of assumed only one cpu could run there.
If multiple threads are reading /proc/net/dev, stats can be really
messed up after a while.
A second problem is that some fields are 32bit, so we need to properly
handle the wrap around problem.
Given that RTNL is not always held, we need to use
bond_for_each_slave_rcu().
Fixes: 5f0c5f73e5 ("bonding: make global bonding stats more reliable")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Andy Gospodarek <gospo@cumulusnetworks.com>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
eBPF defines this as BPF_TUNLEN_MAX and OVS just uses the hard-coded
value inside struct sw_flow_key. Thus, add and use IP_TUNNEL_OPTS_MAX
for this, which makes the code a bit more generic and allows to remove
BPF_TUNLEN_MAX from eBPF code.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can just add a small helper dst_tclassid() for retrieving the
dst->tclassid value. It makes the code a bit better in that we can
get rid of the ifdef from filter.c by moving this into the header.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull crypto update from Herbert Xu:
"Here is the crypto update for 4.6:
API:
- Convert remaining crypto_hash users to shash or ahash, also convert
blkcipher/ablkcipher users to skcipher.
- Remove crypto_hash interface.
- Remove crypto_pcomp interface.
- Add crypto engine for async cipher drivers.
- Add akcipher documentation.
- Add skcipher documentation.
Algorithms:
- Rename crypto/crc32 to avoid name clash with lib/crc32.
- Fix bug in keywrap where we zero the wrong pointer.
Drivers:
- Support T5/M5, T7/M7 SPARC CPUs in n2 hwrng driver.
- Add PIC32 hwrng driver.
- Support BCM6368 in bcm63xx hwrng driver.
- Pack structs for 32-bit compat users in qat.
- Use crypto engine in omap-aes.
- Add support for sama5d2x SoCs in atmel-sha.
- Make atmel-sha available again.
- Make sahara hashing available again.
- Make ccp hashing available again.
- Make sha1-mb available again.
- Add support for multiple devices in ccp.
- Improve DMA performance in caam.
- Add hashing support to rockchip"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (116 commits)
crypto: qat - remove redundant arbiter configuration
crypto: ux500 - fix checks of error code returned by devm_ioremap_resource()
crypto: atmel - fix checks of error code returned by devm_ioremap_resource()
crypto: qat - Change the definition of icp_qat_uof_regtype
hwrng: exynos - use __maybe_unused to hide pm functions
crypto: ccp - Add abstraction for device-specific calls
crypto: ccp - CCP versioning support
crypto: ccp - Support for multiple CCPs
crypto: ccp - Remove check for x86 family and model
crypto: ccp - memset request context to zero during import
lib/mpi: use "static inline" instead of "extern inline"
lib/mpi: avoid assembler warning
hwrng: bcm63xx - fix non device tree compatibility
crypto: testmgr - allow rfc3686 aes-ctr variants in fips mode.
crypto: qat - The AE id should be less than the maximal AE number
lib/mpi: Endianness fix
crypto: rockchip - add hash support for crypto engine in rk3288
crypto: xts - fix compile errors
crypto: doc - add skcipher API documentation
crypto: doc - update AEAD AD handling
...
We can hit an OOM condition if we are under presure because
we can not free the entries in gc_list fast enough. So add
a counter for the not yet freed entries in the gc_list and
refuse new allocations if the value is too high.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS/OVS updates for net-next
The following patchset contains Netfilter/IPVS fixes and OVS NAT
support, more specifically this batch is composed of:
1) Fix a crash in ipset when performing a parallel flush/dump with
set:list type, from Jozsef Kadlecsik.
2) Make sure NFACCT_FILTER_* netlink attributes are in place before
accessing them, from Phil Turnbull.
3) Check return error code from ip_vs_fill_iph_skb_off() in IPVS SIP
helper, from Arnd Bergmann.
4) Add workaround to IPVS to reschedule existing connections to new
destination server by dropping the packet and wait for retransmission
of TCP syn packet, from Julian Anastasov.
5) Allow connection rescheduling in IPVS when in CLOSE state, also
from Julian.
6) Fix wrong offset of SIP Call-ID in IPVS helper, from Marco Angaroni.
7) Validate IPSET_ATTR_ETHER netlink attribute length, from Jozsef.
8) Check match/targetinfo netlink attribute size in nft_compat,
patch from Florian Westphal.
9) Check for integer overflow on 32-bit systems in x_tables, from
Florian Westphal.
Several patches from Jarno Rajahalme to prepare the introduction of
NAT support to OVS based on the Netfilter infrastructure:
10) Schedule IP_CT_NEW_REPLY definition for removal in
nf_conntrack_common.h.
11) Simplify checksumming recalculation in nf_nat.
12) Add comments to the openvswitch conntrack code, from Jarno.
13) Update the CT state key only after successful nf_conntrack_in()
invocation.
14) Find existing conntrack entry after upcall.
15) Handle NF_REPEAT case due to templates in nf_conntrack_in().
16) Call the conntrack helper functions once the conntrack has been
confirmed.
17) And finally, add the NAT interface to OVS.
The batch closes with:
18) Cleanup to use spin_unlock_wait() instead of
spin_lock()/spin_unlock(), from Nicholas Mc Guire.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
netdev_upper_dev_unlink() which notifies NETDEV_CHANGEUPPER, returns
void, as well as del_nbp(). So there's no advantage to catch an eventual
error from the port_bridge_leave routine at the DSA level.
Make this routine void for the DSA layer and its existing drivers.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename DSA port_join_bridge and port_leave_bridge routines to
respectively port_bridge_join and port_bridge_leave in order to respect
an implicit Port::Bridge namespace.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2016-03-12
Here's the last bluetooth-next pull request for the 4.6 kernel.
- New USB ID for AR3012 in btusb
- New BCM2E55 ACPI ID
- Buffer overflow fix for the Add Advertising command
- Support for a new Bluetooth LE limited privacy mode
- Fix for firmware activation in btmrvl_sdio
- Cleanups to mac802154 & 6lowpan code
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Per RFC4898, they count segments sent/received
containing a positive length data segment (that includes
retransmission segments carrying data). Unlike
tcpi_segs_out/in, tcpi_data_segs_out/in excludes segments
carrying no data (e.g. pure ack).
The patch also updates the segs_in in tcp_fastopen_add_skb()
so that segs_in >= data_segs_in property is kept.
Together with retransmission data, tcpi_data_segs_out
gives a better signal on the rxmit rate.
v6: Rebase on the latest net-next
v5: Eric pointed out that checking skb->len is still needed in
tcp_fastopen_add_skb() because skb can carry a FIN without data.
Hence, instead of open coding segs_in and data_segs_in, tcp_segs_in()
helper is used. Comment is added to the fastopen case to explain why
segs_in has to be reset and tcp_segs_in() has to be called before
__skb_pull().
v4: Add comment to the changes in tcp_fastopen_add_skb()
and also add remark on this case in the commit message.
v3: Add const modifier to the skb parameter in tcp_segs_in()
v2: Rework based on recent fix by Eric:
commit a9d99ce28e ("tcp: fix tcpi_segs_in after connection establishment")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Chris Rapier <rapier@psc.edu>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This basic implementation allows to share code between driver using
hardware buffer management. As the code is hardware agnostic, there is
few helpers, most of the optimization brought by the an HW BM has to be
done at driver level.
Tested-by: Sebastian Careba <nitroshift@yahoo.com>
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch updates csum_ipv6_magic so that it correctly recognizes that
protocol is a unsigned 8 bit value.
This will allow us to better understand what limitations may or may not be
present in how we handle the data. For example there are a number of
places that call htonl on the protocol value. This is likely not necessary
and can be replaced with a multiplication by ntohl(1) which will be
converted to a shift by the compiler.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently sctp_sendmsg() triggers some calls that will allocate memory
with GFP_ATOMIC even when not necessary. In the case of
sctp_packet_transmit it will allocate a linear skb that will be used to
construct the packet and this may cause sends to fail due to ENOMEM more
often than anticipated specially with big MTUs.
This patch thus allows it to inherit gfp flags from upper calls so that
it can use GFP_KERNEL if it was triggered by a sctp_sendmsg call or
similar. All others, like retransmits or flushes started from BH, are
still allocated using GFP_ATOMIC.
In netperf tests this didn't result in any performance drawbacks when
memory is not too fragmented and made it trigger ENOMEM way less often.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The code for csum_block_add was doing a funky byteswap to swap the even and
odd bytes of the checksum if the offset was odd. Instead of doing this we
can save ourselves some trouble and just shift by 8 as this should have the
same effect in terms of the final checksum value and only requires one
instruction.
In addition we can update csum_block_sub to just use csum_block_add with a
inverse value for csum2. This way we follow the same code path as
csum_block_add without having to duplicate it.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This work adds support for setting the IPv6 flow label for vxlan per
device and through collect metadata (ip_tunnel_key) frontends. The
vxlan dst cache does not need any special considerations here, for
the cases where caches can be used, the label is static per cache.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch extends udp_tunnel6_xmit_skb() to pass in the IPv6 flow label
from call sites. Currently, there's no such option and it's always set to
zero when writing ip6_flow_hdr(). Add a label member to ip_tunnel_key, so
that flow-based tunnels via collect metadata frontends can make use of it.
vxlan and geneve will be converted to add flow label support separately.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cast pointer to unsigned long instead of u64, to fix compilation warning
on 32 bit arch, spotted by 0day build.
Fixes: 5b33f48 ("net/flower: Introduce hardware offload support")
Signed-off-by: Amir Vadai <amir@vadai.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simon Horman says:
====================
please consider these IPVS fixes for v4.5 or
if it is too late please consider them for v4.6.
* Arnd Bergman has corrected an error whereby the SIP persistence engine
may incorrectly access protocol fields
* Julian Anastasov has corrected a problem reported by Jiri Bohac with the
connection rescheduling mechanism added in 3.10 when new SYNs in
connection to dead real server can be redirected to another real server.
* Marco Angaroni resolved a problem in the SIP persistence engine
whereby the Call-ID could not be found if it was at the beginning of a
SIP message.
====================
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Enable device drivers to query the action, if and only if is a mark
action and what value to use for marking.
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Amir Vadai <amir@vadai.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce the macros tc_no_actions and tc_for_each_action to make code
clearer.
Extracted struct tc_action out of the ifdef to make calls to
is_tcf_gact_shot() and similar functions valid, even when it is a nop.
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Suggested-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Amir Vadai <amir@vadai.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
Will be used in a following patch to query if a key is being used, and
what it's value in the target object.
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Amir Vadai <amir@vadai.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is based on a patch made by John Fastabend.
It adds support for offloading cls_flower.
when NETIF_F_HW_TC is on:
flags = 0 => Rule will be processed twice - by hardware, and if
still relevant, by software.
flags = SKIP_HW => Rull will be processed by software only
If hardware fail/not capabale to apply the rule, operation will NOT
fail. Filter will be processed by SW only.
Acked-by: Jiri Pirko <jiri@mellanox.com>
Suggested-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Amir Vadai <amir@vadai.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
The stub helper functions for the newly added kcm_proc_init/exit interfaces
are defined as 'static' in a header file, which leads to build warnings for
each file that includes them without calling them:
include/net/kcm.h:183:12: error: 'kcm_proc_init' defined but not used [-Werror=unused-function]
include/net/kcm.h:184:13: error: 'kcm_proc_exit' defined but not used [-Werror=unused-function]
This marks the two functions as 'static inline' instead, which avoids the
warnings and is obviously what was meant here.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: cd6e111bf5 ("kcm: Add statistics and proc interfaces")
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce a limited privacy mode indicated by value 0x02 to the mgmt
Set Privacy command.
With value 0x02 the kernel will use privacy mode with a resolvable
private address. In case the controller is bondable and discoverable
the identity address will be used.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch removes the swap pointer and memmove functionality. Instead
we use the well known put/get unaligned access with specific byte order
handling.
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Suggested-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch adds receive timeout for message assembly on the attached TCP
sockets. The timeout is set when a new messages is started and the whole
message has not been received by TCP (not in the receive queue). If the
completely message is subsequently received the timer is cancelled, if the
timer expires the RX side is aborted.
The timeout value is taken from the socket timeout (SO_RCVTIMEO) that is
set on a TCP socket (i.e. set by get sockopt before attaching a TCP socket
to KCM.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Message assembly is performed on the TCP socket. This is logically
equivalent of an application that performs a peek on the socket to find
out how much memory is needed for a receive buffer. The receive socket
buffer also provides the maximum message size which is checked.
The receive algorithm is something like:
1) Receive the first skbuf for a message (or skbufs if multiple are
needed to determine message length).
2) Check the message length against the number of bytes in the TCP
receive queue (tcp_inq()).
- If all the bytes of the message are in the queue (incluing the
skbuf received), then proceed with message assembly (it should
complete with the tcp_read_sock)
- Else, mark the psock with the number of bytes needed to
complete the message.
3) In TCP data ready function, if the psock indicates that we are
waiting for the rest of the bytes of a messages, check the number
of queued bytes against that.
- If there are still not enough bytes for the message, just
return
- Else, clear the waiting bytes and proceed to receive the
skbufs. The message should now be received in one
tcp_read_sock
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds various counters for KCM. These include counters for
messages and bytes received or sent, as well as counters for number of
attached/unattached TCP sockets and other error or edge events.
The statistics are exposed via a proc interface. /proc/net/kcm provides
statistics per KCM socket and per psock (attached TCP sockets).
/proc/net/kcm_stats provides aggregate statistics.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This module implements the Kernel Connection Multiplexor.
Kernel Connection Multiplexor (KCM) is a facility that provides a
message based interface over TCP for generic application protocols.
With KCM an application can efficiently send and receive application
protocol messages over TCP using datagram sockets.
For more information see the included Documentation/networking/kcm.txt
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create a common kernel function to get the number of bytes available
on a TCP socket. This is based on code in INQ getsockopt and we now call
the function for that getsockopt.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Helpers like ip_tunnel_info_opts_{get,set}() are only available if
CONFIG_INET is set, thus add an empty definition into the header for
the !CONFIG_INET case, where already other empty inline helpers are
defined.
This avoids ifdef kludge inside filter.c, but also vxlan and geneve
themself where this facility can only be used with, depend on INET
being set. For the !INET case TUNNEL_OPTIONS_PRESENT would never be
set in flags.
Fixes: 14ca0751c9 ("bpf: support for access to tunnel options")
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
One of our customers observed issues with FIB6 garbage collectors
running in different network namespaces blocking each other, resulting
in soft lockups (fib6_run_gc() initiated from timer runs always in
forced mode).
Now that FIB6 walkers are separated per namespace, there is no more need
for instances of fib6_run_gc() in different namespaces blocking each
other. There is still a call to icmp6_dst_gc() which operates on shared
data but this function is protected by its own shared lock.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The IPv6 FIB data structures are separated per network namespace but
there is still only one global walkers list and one global walker list
lock. This means changes in one namespace unnecessarily interfere with
walkers in other namespaces.
Replace the global list with per-netns lists (and give each its own
lock).
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry reported that sctp_add_bind_addr may read more bytes than
expected in case the parameter is a IPv4 addr supplied by the user
through calls such as sctp_bindx_add(), because it always copies
sizeof(union sctp_addr) while the buffer may be just a struct
sockaddr_in, which is smaller.
This patch then fixes it by limiting the memcpy to the min between the
union size and a (new parameter) provided addr size. Where possible this
parameter still is the size of that union, except for reading from
user-provided buffers, which then it accounts for protocol type.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter updates for your net-next tree,
they are:
1) Remove useless debug message when deleting IPVS service, from
Yannick Brosseau.
2) Get rid of compilation warning when CONFIG_PROC_FS is unset in
several spots of the IPVS code, from Arnd Bergmann.
3) Add prandom_u32 support to nft_meta, from Florian Westphal.
4) Remove unused variable in xt_osf, from Sudip Mukherjee.
5) Don't calculate IP checksum twice from netfilter ipv4 defrag hook
since fixing af_packet defragmentation issues, from Joe Stringer.
6) On-demand hook registration for iptables from netns. Instead of
registering the hooks for every available netns whenever we need
one of the support tables, we register this on the specific netns
that needs it, patchset from Florian Westphal.
7) Add missing port range selection to nf_tables masquerading support.
BTW, just for the record, there is a typo in the description of
5f6c253ebe ("netfilter: bridge: register hooks only when bridge
interface is added") that refers to the cluster match as deprecated, but
it is actually the CLUSTERIP target (which registers hooks
inconditionally) the one that is scheduled for removal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The assumptions from commit 0c1d70af92 ("net: use dst_cache for vxlan
device"), 468dfffcd7 ("geneve: add dst caching support") and 3c1cb4d260
("net/ipv4: add dst cache support for gre lwtunnels") on dst_cache usage
when ip_tunnel_info is used is unfortunately not always valid as assumed.
While it seems correct for ip_tunnel_info front-ends such as OVS, eBPF
however can fill in ip_tunnel_info for consumers like vxlan, geneve or gre
with different remote dsts, tos, etc, therefore they cannot be assumed as
packet independent.
Right now vxlan, geneve, gre would cache the dst for eBPF and every packet
would reuse the same entry that was first created on the initial route
lookup. eBPF doesn't store/cache the ip_tunnel_info, so each skb may have
a different one.
Fix it by adding a flag that checks the ip_tunnel_info. Also the !tos test
in vxlan needs to be handeled differently in this context as it is currently
inferred from ip_tunnel_info as well if present. ip_tunnel_dst_cache_usable()
helper is added for the three tunnel cases, which checks if we can use dst
cache.
Fixes: 0c1d70af92 ("net: use dst_cache for vxlan device")
Fixes: 468dfffcd7 ("geneve: add dst caching support")
Fixes: 3c1cb4d260 ("net/ipv4: add dst cache support for gre lwtunnels")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 7d672345ed ("bpf: add generic bpf_csum_diff helper") added a
generic checksum diff helper that can feed bpf_l4_csum_replace() with
a target __wsum diff that is to be applied to the L4 checksum. This
facility is very flexible, can be cascaded, allows for adding, removing,
or diffing data, or for calculating the pseudo header checksum from
scratch, but it can also be reused for working with the IPv4 header
checksum.
Thus, analogous to bpf_l4_csum_replace(), add a case for header field
value of 0 to change the checksum at a given offset through a new helper
csum_replace_by_diff(). Also, in addition to that, this provides an
easy to use interface for feeding precalculated diffs f.e. coming from
a map. It nicely complements bpf_l3_csum_replace() that currently allows
only for csum updates of 2 and 4 byte diffs.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several cases of overlapping changes, as well as one instance
(vxlan) of a bug fix in 'net' overlapping with code movement
in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Bohac is reporting for a problem where the attempt
to reschedule existing connection to another real server
needs proper redirect for the conntrack used by the IPVS
connection. For example, when IPVS connection is created
to NAT-ed real server we alter the reply direction of
conntrack. If we later decide to select different real
server we can not alter again the conntrack. And if we
expire the old connection, the new connection is left
without conntrack.
So, the only way to redirect both the IPVS connection and
the Netfilter's conntrack is to drop the SYN packet that
hits existing connection, to wait for the next jiffie
to expire the old connection and its conntrack and to rely
on client's retransmission to create new connection as
usually.
Jiri Bohac provided a fix that drops all SYNs on rescheduling,
I extended his patch to do such drops only for connections
that use conntrack. Here is the original report from Jiri Bohac:
Since commit dc7b3eb900 ("ipvs: Fix reuse connection if real server
is dead"), new connections to dead servers are redistributed
immediately to new servers. The old connection is expired using
ip_vs_conn_expire_now() which sets the connection timer to expire
immediately.
However, before the timer callback, ip_vs_conn_expire(), is run
to clean the connection's conntrack entry, the new redistributed
connection may already be established and its conntrack removed
instead.
Fix this by dropping the first packet of the new connection
instead, like we do when the destination server is not available.
The timer will have deleted the old conntrack entry long before
the first packet of the new connection is retransmitted.
Fixes: dc7b3eb900 ("ipvs: Fix reuse connection if real server is dead")
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Some devices declare a high number of TX queues, then set a much
lower real_num_tx_queues
This cause setups using fq_codel, sfq or fq as the default qdisc to consume
more memory than really needed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2016-03-01
Here's our main set of Bluetooth & 802.15.4 patches for the 4.6 kernel.
- New Bluetooth HCI driver for Intel/AG6xx controllers
- New Broadcom ACPI IDs
- LED trigger support for indicating Bluetooth powered state
- Various fixes in mac802154, 6lowpan and related drivers
- New USB IDs for AR3012 Bluetooth controllers
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
ICMP timestamp messages and IP source route options require
timestamps to be in milliseconds modulo 24 hours from
midnight UT format.
Add inet_current_timestamp() function to support this. The function
returns the required timestamp in network byte order.
Timestamp calculation is also changed to call ktime_get_real_ts64()
which uses struct timespec64. struct timespec64 is y2038 safe.
Previously it called getnstimeofday() which uses struct timespec.
struct timespec is not y2038 safe.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: James Morris <jmorris@namei.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This action allows for a sending side to encapsulate arbitrary metadata
which is decapsulated by the receiving end.
The sender runs in encoding mode and the receiver in decode mode.
Both sender and receiver must specify the same ethertype.
At some point we hope to have a registered ethertype and we'll
then provide a default so the user doesnt have to specify it.
For now we enforce the user specify it.
Lets show example usage where we encode icmp from a sender towards
a receiver with an skbmark of 17; both sender and receiver use
ethertype of 0xdead to interop.
YYYY: Lets start with Receiver-side policy config:
xxx: add an ingress qdisc
sudo tc qdisc add dev $ETH ingress
xxx: any packets with ethertype 0xdead will be subjected to ife decoding
xxx: we then restart the classification so we can match on icmp at prio 3
sudo $TC filter add dev $ETH parent ffff: prio 2 protocol 0xdead \
u32 match u32 0 0 flowid 1:1 \
action ife decode reclassify
xxx: on restarting the classification from above if it was an icmp
xxx: packet, then match it here and continue to the next rule at prio 4
xxx: which will match based on skb mark of 17
sudo tc filter add dev $ETH parent ffff: prio 3 protocol ip \
u32 match ip protocol 1 0xff flowid 1:1 \
action continue
xxx: match on skbmark of 0x11 (decimal 17) and accept
sudo tc filter add dev $ETH parent ffff: prio 4 protocol ip \
handle 0x11 fw flowid 1:1 \
action ok
xxx: Lets show the decoding policy
sudo tc -s filter ls dev $ETH parent ffff: protocol 0xdead
xxx:
filter pref 2 u32
filter pref 2 u32 fh 800: ht divisor 1
filter pref 2 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1 (rule hit 0 success 0)
match 00000000/00000000 at 0 (success 0 )
action order 1: ife decode action reclassify
index 1 ref 1 bind 1 installed 14 sec used 14 sec
type: 0x0
Metadata: allow mark allow hash allow prio allow qmap
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
xxx:
Observe that above lists all metadatum it can decode. Typically these
submodules will already be compiled into a monolithic kernel or
loaded as modules
YYYY: Lets show the sender side now ..
xxx: Add an egress qdisc on the sender netdev
sudo tc qdisc add dev $ETH root handle 1: prio
xxx:
xxx: Match all icmp packets to 192.168.122.237/24, then
xxx: tag the packet with skb mark of decimal 17, then
xxx: Encode it with:
xxx: ethertype 0xdead
xxx: add skb->mark to whitelist of metadatum to send
xxx: rewrite target dst MAC address to 02:15:15:15:15:15
xxx:
sudo $TC filter add dev $ETH parent 1: protocol ip prio 10 u32 \
match ip dst 192.168.122.237/24 \
match ip protocol 1 0xff \
flowid 1:2 \
action skbedit mark 17 \
action ife encode \
type 0xDEAD \
allow mark \
dst 02:15:15:15:15:15
xxx: Lets show the encoding policy
sudo tc -s filter ls dev $ETH parent 1: protocol ip
xxx:
filter pref 10 u32
filter pref 10 u32 fh 800: ht divisor 1
filter pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:2 (rule hit 0 success 0)
match c0a87aed/ffffffff at 16 (success 0 )
match 00010000/00ff0000 at 8 (success 0 )
action order 1: skbedit mark 17
index 6 ref 1 bind 1
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
action order 2: ife encode action pipe
index 3 ref 1 bind 1
dst MAC: 02:15:15:15:15:15 type: 0xDEAD
Metadata: allow mark
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
xxx:
test by sending ping from sender to destination
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a user explicitly requests VLAN filtering with something like:
# echo 1 > /sys/class/net/<bridge>/bridge/vlan_filtering
Switchdev propagates a SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING port
attribute.
Add support for it in the DSA layer with a new port_vlan_filtering
function to let drivers toggle 802.1Q filtering on user demand.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce devlink infrastructure for drivers to register and expose to
userspace via generic Netlink interface.
There are two basic objects defined:
devlink - one instance for every "parent device", for example switch ASIC
devlink port - one instance for every physical port of the device.
This initial portion implements basic get/dump of objects to userspace.
Also, port splitter and port type setting is implemented.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the initial implementation the only way to stop a rule from being
inserted into the hardware table was via the device feature flag.
However this doesn't work well when working on an end host system
where packets are expect to hit both the hardware and software
datapaths.
For example we can imagine a rule that will match an IP address and
increment a field. If we install this rule in both hardware and
software we may increment the field twice. To date we have only
added support for the drop action so we have been able to ignore
these cases. But as we extend the action support we will hit this
example plus more such cases. Arguably these are not even corner
cases in many working systems these cases will be common.
To avoid forcing the driver to always abort (i.e. the above example)
this patch adds a flag to add a rule in software only. A careful
user can use this flag to build software and hardware datapaths
that work together. One example we have found particularly useful
is to use hardware resources to set the skb->mark on the skb when
the match may be expensive to run in software but a mark lookup
in a hash table is cheap. The idea here is hardware can do in one
lookup what the u32 classifier may need to traverse multiple lists
and hash tables to compute. The flag is only passed down on inserts.
On deletion to avoid stale references in hardware we always try
to remove a rule if it exists.
The flags field is part of the classifier specific options. Although
it is tempting to lift this into the generic structure doing this
proves difficult do to how the tc netlink attributes are implemented
along with how the dump/change routines are called. There is also
precedence for putting seemingly generic pieces in the specific
classifier options such as TCA_U32_POLICE, TCA_U32_ACT, etc. So
although not ideal I've left FLAGS in the u32 options as well as it
simplifies the code greatly and user space has already learned how
to manage these bits ala 'tc' tool.
Another thing if trying to update a rule we require the flags to
be unchanged. This is to force user space, software u32 and
the hardware u32 to keep in sync. Thanks to Simon Horman for
catching this case.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the original series drivers would get offload requests for cls_u32
rules even if the feature bit is disabled. This meant the driver had
to do a boiler plate check on the feature bit before adding/deleting
the rule.
This patch lifts the check into the core code and removes it from the
driver specific case.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The offload decision was originally very basic and tied to if the dev
implemented the appropriate ndo op hook. The next step is to allow
the user to more flexibly define if any paticular rule should be
offloaded or not. In order to have this logic in one function lift
the current check into a helper routine tc_should_offload().
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the bottom qdisc decides to, for example, drop some packet,
it calls qdisc_tree_decrease_qlen() to update the queue length
for all its ancestors, we need to update the backlog too to
keep the stats on root qdisc accurate.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove nearly duplicated code and prepare for the following patch.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Lamparter noted a use case where the source address selection fails
to pick an address from a VRF interface - unnumbered interfaces.
Relevant commands from his script:
ip addr add 9.9.9.9/32 dev lo
ip link set lo up
ip link add name vrf0 type vrf table 101
ip rule add oif vrf0 table 101
ip rule add iif vrf0 table 101
ip link set vrf0 up
ip addr add 10.0.0.3/32 dev vrf0
ip link add name dummy2 type dummy
ip link set dummy2 master vrf0 up
--> note dummy2 has no address - unnumbered device
ip route add 10.2.2.2/32 dev dummy2 table 101
ip neigh add 10.2.2.2 dev dummy2 lladdr 02:00:00:00:00:02
tcpdump -ni dummy2 &
And using ping instead of his socat example:
$ ping -I vrf0 -c1 10.2.2.2
ping: Warning: source address might be selected on device other than vrf0.
PING 10.2.2.2 (10.2.2.2) from 9.9.9.9 vrf0: 56(84) bytes of data.
>From tcpdump:
12:57:29.449128 IP 9.9.9.9 > 10.2.2.2: ICMP echo request, id 2491, seq 1, length 64
Note the source address is from lo and is not a VRF local address. With
this patch:
$ ping -I vrf0 -c1 10.2.2.2
PING 10.2.2.2 (10.2.2.2) from 10.0.0.3 vrf0: 56(84) bytes of data.
>From tcpdump:
12:59:25.096426 IP 10.0.0.3 > 10.2.2.2: ICMP echo request, id 2113, seq 1, length 64
Now the source address comes from vrf0.
The ipv4 function for selecting source address takes a const argument.
Removing the const requires touching a lot of places, so instead
l3mdev_master_ifindex_rcu is changed to take a const argument and then
do the typecast to non-const as required by netdev_master_upper_dev_get_rcu.
This is similar to what l3mdev_fib_table_rcu does.
IPv6 for unnumbered interfaces appears to be selecting the addresses
properly.
Cc: David Lamparter <david@opensourcerouting.org>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The VLAN GetNext operation is specific to some switches, and thus can be
complicated to implement for some drivers.
Remove the support for the vlan_getnext/port_pvid_get approach in favor
of the generic and simpler port_vlan_dump function.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to port_fdb_dump, add a port_vlan_dump function to DSA drivers
which gets passed the switchdev VLAN object and callback.
This function, if implemented, takes precedence over the soon legacy
vlan_getnext/port_pvid_get approach.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently tc actions are stored in a per-module hashtable,
therefore are visible to all network namespaces. This is
probably the last part of the tc subsystem which is not
aware of netns now. This patch makes them per-netns,
several tc action API's need to be adjusted for this.
The tc action API code is ugly due to historical reasons,
we need to refactor that code in the future.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We only release the memory of the hashtable itself, not its
entries inside. This is not a problem yet since we only call
it in module release path, and module is refcount'ed by
actions. This would be a problem after we move the per module
hinfo into per netns in the latter patch.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* stop critical protocol session on disconnect to avoid
it getting stuck
* wext: fix two RTNL message ordering issues
* fix an uninitialized value (found by KASAN)
* fix an out-of-bounds access (also found by KASAN)
* clear connection keys when freeing them in all cases
(IBSS, all other places already did so)
* fix expected throughput unit to get consistent values
* set default TX aggregation timeout to 0 in minstrel
to avoid (really just hide) issues and perform better
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJWzCthAAoJEGt7eEactAAdLjEP/38ujypJkVlwZzm9L8MbSVPk
qPMmRBshVX89xOXtwJLJHOGCRY6/pfe1LKThodOp+hj+NH7OLLuAGFDScjr7eEjG
bAQrXZ8dDIo54uri4EF4jBfkqUsePs+d4VpLgdJ2Gdh2UgIdDlUcggiEyOSicc27
ZP3gRQzad8tP0O+MuqiAjwJE4C3AUzzj+8TXJJXjrT3RvDpvZfA3MClfiQVhRM+4
3qtFhqFFGfIjTSHCoGr+0RYhLJwfjalIyoENR/kYizjHXU4YtZcu7ihdixKmG3Qb
SUIlNl60WSyoh2+PXx3wp0gfTc0BrOVIpB+TQPouRu5D+tLRiPwlieVrlfJyIUJ5
ccU1RWoTZn+3OVSjNF+9IGbpL43cjaJpZUMU/mQES7+z+uEwlzdcnwDScG7sMMC+
KIyKZ4v+2sBPKYYKAzGxHb2lfTMyp9sCs25U30yqLX3RSNtuxJbDwuKrwbKa1CMx
WEjImI+iSApjTch0THAxV8/sO/ZneuRZP76PXmSDm7Mzf9qob8fnpCjFsHTypDpx
eWy8AHPhIhb3ni6Cm3TfsGLjGE4KGwD/lEwmAles2Cj48N4b2UwRjMrJ5GRxuV00
9kh2MvFEmSYBU4dw6LQmemjxNNIa792+JiS4AKJXBqkPK1H1YFvq7i8Hl9a1tQ0P
SDgbH5x6AYvmTRcD97Ph
=01n9
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2016-02-23' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Another small set of fixes:
* stop critical protocol session on disconnect to avoid
it getting stuck
* wext: fix two RTNL message ordering issues
* fix an uninitialized value (found by KASAN)
* fix an out-of-bounds access (also found by KASAN)
* clear connection keys when freeing them in all cases
(IBSS, all other places already did so)
* fix expected throughput unit to get consistent values
* set default TX aggregation timeout to 0 in minstrel
to avoid (really just hide) issues and perform better
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Drivers may need to track which vif is using VHT MU-MIMO.
Move the flag indicationg the ownership of MU_MIMO to
ieee80211_vif.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Provide an interface to the lower level driver to set the VHT
MU-MIMO data. This is needed for example when there is an update
of the group data during low power state, where the management
frame will not be passed to the host at all.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Since the PNs of all the tx keys are now tracked in the public
part of the key struct (with atomic counter), we no longer
need these functions.
dvm and vt665{5,6} are currently the only users of these functions,
so update them accordingly.
Signed-off-by: Eliad Peller <eliadx.peller@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Some drivers/devices might want to set the IVs by
themselves (and still let mac80211 generate MMIC).
Specifically, this is needed when the device does
offloading at certain times, and the driver has
to make sure that the IVs of new tx frames (from
the host) are synchronized with IVs that were
potentially used during the offloading.
Similarly to CCMP, move the TX IVs of TKIP keys to the
public part of the key struct, and export a function
to add the IV right into the crypto header.
The public tx_pn field is defined as atomic64, so define
TKIP_PN_TO_IV16/32 helper macros to convert it to iv16/32
when needed.
Since the iv32 used for the p1k cache is taken
directly from the frame, we can safely remove
iv16/32 from being protected by tkip.txlock.
Signed-off-by: Eliad Peller <eliadx.peller@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
PBSS (Personal Basic Service Set) is a new BSS type for DMG
networks. It is similar to infrastructure BSS, having an AP-like
entity called PCP (PBSS Control Point), but it has few differences.
PBSS support is mandatory for 11ad devices.
Add support for PBSS by introducing a new PBSS flag attribute.
The PBSS flag is used in the START_AP command to request starting
a PCP instead of an AP, and in the CONNECT command to request
connecting to a PCP instead of an AP.
Signed-off-by: Lior David <liord@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If any frames are dropped that are part of a BA session, the reorder
buffer will "indefinitely" (until the timeout) wait for them to come
in (or a BAR moving the window) and won't release frames after them.
This means it isn't possible to filter frames within a BA session in
firmware.
Introduce an API function that allows such filtering. Calling this
function will move the BA window forward to the new SSN, and allows
marking frames after the SSN as having been filtered, so any future
reordering activity will release frames while skipping the holes.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This will allow drivers to make more educated
decisions whether to defer transmission or not.
Relying on wake_tx_queue() call count implicitly
was not possible because it could be called
without queued frame count actually changing on
software tx aggregation start/stop code paths.
It was also not possible to know how long
byte-wise queue was without dequeueing.
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Drivers/devices without their own rate control algorithm can get the
information what rates they should use from either the radiotap header of
injected frames or from the rate control algorithm. But the parsing of the
legacy rate information from the radiotap header was removed in commit
e6a9854b05 ("mac80211/drivers: rewrite the rate control API").
The removal of this feature heavily reduced the usefulness of frame
injection when wanting to simulate specific transmission behavior. Having
rate parsing together with MCS rates and retry support allows a fine
grained selection of the tx behavior of injected frames for these kind of
tests.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Cc: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The timestamp given by iwlwifi is at the beginning of the
frame over the air, at (or during) the SYNC field. Allow
such timestamps to be given to mac80211, at least (for now)
for frames with non-HT/VHT preambles.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Make the addr parameter const in SET_IEEE80211_PERM_ADDR() to save
clients from having to cast away a const qualifier.
Signed-off-by: Bjorn Andersson <bjorn.andersson@sonymobile.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In VHT, the specification allows to limit the number of
MSDUs in an A-MSDU in the Extended Capabilities IE. There
is also a limitation on the byte size in the VHT IE.
In HT, the only limitation is on the byte size.
Parse the capabilities from the peer and make them
available to the driver.
In HT, there is another limitation when a BA agreement
is active: the byte size can't be greater than 4095.
This is not enforced here.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Some drivers offload some frames internally (e.g.
AddBa). Reporting such frames to mac80211 would
only confuse MLME. However it would be useful to
be able to pass such frames to monitor interfaces
for sniffing purposes, e.g. when running AP +
monitor.
To do that allow drivers to tell mac80211 whether
a given frame should be:
- processed but not delivered to any monitor vif
- not processed but delievered to monitor vifs
only
Signed-off-by: Grzegorz Bajorski <grzegorz.bajorski@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Enable driver to manage the reordering logic itself.
This is needed for example for the iwlwifi driver that
will support hardware assisted reordering.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Some DSA drivers may or may not support multiple software bridges on top
of an hardware switch.
It is more convenient for them to access the bridge's net_device for
finer configuration.
Removing the need to craft and access a bitmask also simplifies the
code.
This patch changes the signature of bridge related functions, update DSA
drivers, and removes dsa_slave_br_port_mask.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduce support for IPHC stateful address compression. It
will offer the context table via one debugfs entry.
This debugfs has and directory for each cid entry for the context table.
Inside each cid directory there exists the following files:
- "active": If the entry is added or deleted. The context table is
original a list implementation, this flag will indicate if the
context is part of list or not.
- "prefix": The ipv6 prefix.
- "prefix_length": The prefix length for the prefix.
- "compression": The compression flag according RFC6775.
This part should be moved into sysfs after some testing time.
Also the debugfs entry contains a "show" file which is a pretty-printout
for the current context table information.
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
I got report about that sometimes the WARN_ON occurs there which should
never happen. I came to the conclusion that the mac header is there but
inside the headroom of skb. The skb->len information doesn't contain the
information about the headroom length and skb->len is lesser than two.
We check now if the skb_mac_header pointer is set and the room between
mac header pointer and tail pointer.
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Add support for LED triggers to the Bluetooth subsystem and add kernel
config symbol BT_LEDS for it.
For now one trigger for indicating "HCI is powered up" is supported.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Conflicts:
drivers/net/phy/bcm7xxx.c
drivers/net/phy/marvell.c
drivers/net/vxlan.c
All three conflicts were cases of simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
This change makes it so that if UDP CSUM is not specified we will default
to enabling it. The main motivation behind this is the fact that with the
use of outer checksum we can greatly improve the performance for VXLAN
tunnels on devices that don't know how to parse tunnel headers.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The lwt implementations using net devices can autoload using the
existing mechanism using IFLA_INFO_KIND. However, there's no mechanism
that lwt modules not using net devices can use.
Therefore, add the ability to autoload modules registering lwt
operations for lwt implementations not using a net device so that
users don't have to manually load the modules.
Only users with the CAP_NET_ADMIN capability can cause modules to be
loaded, which is ensured by rtnetlink_rcv_msg rejecting non-RTM_GETxxx
messages for users without this capability, and by
lwtunnel_build_state not being called in response to RTM_GETxxx
messages.
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
follows up commit 45f6fad84c ("ipv6: add complete rcu protection around
np->opt") which added mixed rcu/refcount protection to np->opt.
Given the current implementation of rcu_pointer_handoff(), this has no
effect at runtime.
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Part of skb_scrub_packet was open coded in iptunnel_pull_header. Let it call
skb_scrub_packet directly instead.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The tun_id field in struct ip_tunnel_key is __be64, not __be32. We need to
convert the vni to tun_id correctly.
Fixes: 54bfd872bf ("vxlan: keep flags and vni in network byte order")
Reported-by: Paolo Abeni <pabeni@redhat.com>
Tested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit bb9b18fb55 ("genl: Add genlmsg_new_unicast() for
unicast message allocation")'.
Nothing wrong with it; its no longer needed since this was only for
mmapped netlink support.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ilya reported following lockdep splat:
kernel: =========================
kernel: [ BUG: held lock freed! ]
kernel: 4.5.0-rc1-ceph-00026-g5e0a311 #1 Not tainted
kernel: -------------------------
kernel: swapper/5/0 is freeing memory
ffff880035c9d200-ffff880035c9dbff, with a lock still held there!
kernel: (&(&queue->rskq_lock)->rlock){+.-...}, at:
[<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0
kernel: 4 locks held by swapper/5/0:
kernel: #0: (rcu_read_lock){......}, at: [<ffffffff8169ef6b>]
netif_receive_skb_internal+0x4b/0x1f0
kernel: #1: (rcu_read_lock){......}, at: [<ffffffff816e977f>]
ip_local_deliver_finish+0x3f/0x380
kernel: #2: (slock-AF_INET){+.-...}, at: [<ffffffff81685ffb>]
sk_clone_lock+0x19b/0x440
kernel: #3: (&(&queue->rskq_lock)->rlock){+.-...}, at:
[<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0
To properly fix this issue, inet_csk_reqsk_queue_add() needs
to return to its callers if the child as been queued
into accept queue.
We also need to make sure listener is still there before
calling sk->sk_data_ready(), by holding a reference on it,
since the reference carried by the child can disappear as
soon as the child is put on accept queue.
Reported-by: Ilya Dryomov <idryomov@gmail.com>
Fixes: ebb516af60 ("tcp/dccp: fix race at listener dismantle phase")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the gc of ipv4 route was removed, the route cached would has
no chance to be removed, and even it has been timeout, it still could
be used, cause no code to check it's expires.
Fix this issue by checking and removing route cache when we get route.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prevent repeated conversions from and to network order in the fast path.
To achieve this, define all flag constants in big endian order and store VNI
as __be32. To prevent confusion between the actual VNI value and the VNI
field from the header (which contains additional reserved byte), strictly
distinguish between "vni" and "vni_field".
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, pointer to the vxlan header is kept in a local variable. It has
to be reloaded whenever the pskb pull operations are performed which usually
happens somewhere deep in called functions.
Create a vxlan_hdr function and use it to reference the vxlan header
instead.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 8b570dc9f7 ("sctp: only drop the reference on the datamsg
after sending a msg") used sctp_datamsg_put in sctp_sendmsg, instead of
sctp_datamsg_free, this function has no use in sctp.
So we will remove it.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a helper function drivers can use to learn if the
action type is a drop action.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows netdev drivers to consume cls_u32 offloads via
the ndo_setup_tc ndo op.
This works aligns with how network drivers have been doing qdisc
offloads for mqprio.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case of UDP traffic with datagram length
below MTU this give about 2% performance increase
when tunneling over ipv4 and about 60% when tunneling
over ipv6
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Suggested-and-acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case of UDP traffic with datagram length
below MTU this give about 3% performance increase
when tunneling over ipv4 and about 70% when
tunneling over ipv6.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Suggested-and-acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current ip_tunnel cache implementation is prone to a race
that will cause the wrong dst to be cached on cuncurrent dst cache
miss and ip tunnel update via netlink.
Replacing with the generic implementation fix the issue.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Suggested-and-acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This also fix a potential race into the existing tunnel code, which
could lead to the wrong dst to be permanenty cached:
CPU1: CPU2:
<xmit on ip6_tunnel>
<cache lookup fails>
dst = ip6_route_output(...)
<tunnel params are changed via nl>
dst_cache_reset() // no effect,
// the cache is empty
dst_cache_set() // the wrong dst
// is permanenty stored
// into the cache
With the new dst implementation the above race is not possible
since the first cache lookup after dst_cache_reset will fail due
to the timestamp check
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Suggested-and-acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch add a generic, lockless dst cache implementation.
The need for lock is avoided updating the dst cache fields
only in per cpu scope, and requiring that the cache manipulation
functions are invoked with the local bh disabled.
The refresh_ts and reset_ts fields are used to ensure the cache
consistency in case of cuncurrent cache update (dst_cache_set*) and
reset operation (dst_cache_reset).
Consider the following scenario:
CPU1: CPU2:
<cache lookup with emtpy cache: it fails>
<get dst via uncached route lookup>
<related configuration changes>
dst_cache_reset()
dst_cache_set()
The dst entry set passed to dst_cache_set() should not be used
for later dst cache lookup, because it's obtained using old
configuration values.
Since the refresh_ts is updated only on dst_cache lookup, the
cached value in the above scenario will be discarded on the next
lookup.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Suggested-and-acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
All users now pass false, so we can remove it, and remove the code that
was conditional upon it.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The only protocol affected at present is Geneve.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This was initially introduced in df2cf4a78e ("IGMP: Inhibit
reports for local multicast groups") by defining the sysctl in the
ipv4_net_table array, however it was never implemented to be
namespace aware. Fix this by changing the code accordingly.
Signed-off-by: David S. Miller <davem@davemloft.net>
This change extends the fast SO_REUSEPORT socket lookup implemented
for UDP to TCP. Listener sockets with SO_REUSEPORT and the same
receive address are additionally added to an array for faster
random access. This means that only a single socket from the group
must be found in the listener list before any socket in the group can
be used to receive a packet. Previously, every socket in the group
needed to be considered before handing off the incoming packet.
This feature also exposes the ability to use a BPF program when
selecting a socket from a reuseport group.
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a preliminary step to allow fast socket lookup of SO_REUSEPORT
groups. Doing so with a BPF filter will require access to the
skb in question. This change plumbs the skb (and offset to payload
data) through the call stack to the listening socket lookup
implementations where it will be used in a following patch.
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to support fast lookups for TCP sockets with SO_REUSEPORT,
the function that adds sockets to the listening hash set needs
to be able to check receive address equality. Since this equality
check is different for IPv4 and IPv6, we will need two different
socket hashing functions.
This patch adds inet6_hash identical to the existing inet_hash function
and updates the appropriate references. A following patch will
differentiate the two by passing different comparison functions to
__inet_hash.
Additionally, in order to use the IPv6 address equality function from
inet6_hashtables (which is compiled as a built-in object when IPv6 is
enabled) it also needs to be in a built-in object file as well. This
moves ipv6_rcv_saddr_equal into inet_hashtables to accomplish this.
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to support fast reuseport lookups in TCP, the hash function
defined in struct proto must be capable of returning an error code.
This patch changes the function signature of all related hash functions
to return an integer and handles or propagates this return value at
all call sites.
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prior to 4.3, openvswitch tunnel vports (vxlan, gre and geneve) could
transmit vxlan packets of any size, constrained only by the ability to
send out the resulting packets. 4.3 introduced netdevs corresponding
to tunnel vports. These netdevs have an MTU, which limits the size of
a packet that can be successfully encapsulated. The default MTU
values are low (1500 or less), which is awkwardly small in the context
of physical networks supporting jumbo frames, and leads to a
conspicuous change in behaviour for userspace.
Instead, set the MTU on openvswitch-created netdevs to be the relevant
maximum (i.e. the maximum IP packet size minus any relevant overhead),
effectively restoring the behaviour prior to 4.3.
Signed-off-by: David Wragg <david@weave.works>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the bonding allows to set ad_actor_system and prio while the
bond device is down, but these are actually applied only if there aren't
any slaves yet (applied to bond device when first slave shows up, and to
slaves at 3ad bind time). After this patch changes are applied immediately
and the new values can be used/seen after the bond's upped so it's not
necessary anymore to release all and enslave again to see the changes.
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Novopashenniy reported that ICMP redirects on SYN_RECV sockets
were leading to RST.
This is of course incorrect.
A specific list of ICMP messages should be able to drop a SYN_RECV.
For instance, a REDIRECT on SYN_RECV shall be ignored, as we do
not hold a dst per SYN_RECV pseudo request.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=111751
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Reported-by: Petr Novopashenniy <pety@rusnet.ru>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit referenced in the Fixes tag incorrectly accounted the number
of in-flight fds over a unix domain socket to the original opener
of the file-descriptor. This allows another process to arbitrary
deplete the original file-openers resource limit for the maximum of
open files. Instead the sending processes and its struct cred should
be credited.
To do so, we add a reference counted struct user_struct pointer to the
scm_fp_list and use it to account for the number of inflight unix fds.
Fixes: 712f4aad40 ("unix: properly account for FDs passed over unix sockets")
Reported-by: David Herrmann <dh.herrmann@gmail.com>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
RCO and GBP are VXLAN extensions, not specified in RFC 7348. Because of
that, they need to be explicitly enabled when creating vxlan interface. By
default, those extensions are not used and plain VXLAN header is sent and
received.
Reflect this in vxlan.h: first, the plain VXLAN header is defined. Following
it, RCO is documented and defined, and likewise for GBP.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
VNI_HASH_BITS and VNI_HASH_SIZE are defined twice. Remove the extra
definitions.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
include/net/vxlan.h is a kernel header, no need to prefix fixed size types
with double underscore.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we acknowledge a FIN, it is not enough to ack the sequence number
and queue the skb into receive queue. We also have to call tcp_fin()
to properly update socket state and send proper poll() notifications.
It seems we also had the problem if we received a SYN packet with the
FIN flag set, but it does not seem an urgent issue, as no known
implementation can do that.
Fixes: 61d2bcae99 ("tcp: fastopen: accept data/FIN present in SYNACK message")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC 7413 (TCP Fast Open) 4.2.2 states that the SYNACK message
MAY include data and/or FIN
This patch adds support for the client side :
If we receive a SYNACK with payload or FIN, queue the skb instead
of ignoring it.
Since we already support the same for SYN, we refactor the existing
code and reuse it. Note we need to clone the skb, so this operation
might fail under memory pressure.
Sara Dickinson pointed out FreeBSD server Fast Open implementation
was planned to generate such SYNACK in the future.
The server side might be implemented on linux later.
Reported-by: Sara Dickinson <sara@sinodun.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
"This looks like a lot but it's a mixture of regression fixes as well
as fixes for longer standing issues.
1) Fix on-channel cancellation in mac80211, from Johannes Berg.
2) Handle CHECKSUM_COMPLETE properly in xt_TCPMSS netfilter xtables
module, from Eric Dumazet.
3) Avoid infinite loop in UDP SO_REUSEPORT logic, also from Eric
Dumazet.
4) Avoid a NULL deref if we try to set SO_REUSEPORT after a socket is
bound, from Craig Gallek.
5) GRO key comparisons don't take lightweight tunnels into account,
from Jesse Gross.
6) Fix struct pid leak via SCM credentials in AF_UNIX, from Eric
Dumazet.
7) We need to set the rtnl_link_ops of ipv6 SIT tunnels before we
register them, otherwise the NEWLINK netlink message is missing
the proper attributes. From Thadeu Lima de Souza Cascardo.
8) Several Spectrum chip bug fixes for mlxsw switch driver, from Ido
Schimmel
9) Handle fragments properly in ipv4 easly socket demux, from Eric
Dumazet.
10) Don't ignore the ifindex key specifier on ipv6 output route
lookups, from Paolo Abeni"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (128 commits)
tcp: avoid cwnd undo after receiving ECN
irda: fix a potential use-after-free in ircomm_param_request
net: tg3: avoid uninitialized variable warning
net: nb8800: avoid uninitialized variable warning
net: vxge: avoid unused function warnings
net: bgmac: clarify CONFIG_BCMA dependency
net: hp100: remove unnecessary #ifdefs
net: davinci_cpdma: use dma_addr_t for DMA address
ipv6/udp: use sticky pktinfo egress ifindex on connect()
ipv6: enforce flowi6_oif usage in ip6_dst_lookup_tail()
netlink: not trim skb for mmaped socket when dump
vxlan: fix a out of bounds access in __vxlan_find_mac
net: dsa: mv88e6xxx: fix port VLAN maps
fib_trie: Fix shift by 32 in fib_table_lookup
net: moxart: use correct accessors for DMA memory
ipv4: ipconfig: avoid unused ic_proto_used symbol
bnxt_en: Fix crash in bnxt_free_tx_skbs() during tx timeout.
bnxt_en: Exclude rx_drop_pkts hw counter from the stack's rx_dropped counter.
bnxt_en: Ring free response from close path should use completion ring
net_sched: drr: check for NULL pointer in drr_dequeue
...
Johan Hedberg says:
====================
pull request: bluetooth 2016-01-30
Here's a set of important Bluetooth fixes for the 4.5 kernel:
- Two fixes to 6LoWPAN code (one fixing a potential crash)
- Fix LE pairing with devices using both public and random addresses
- Fix allocation of dynamic LE PSM values
- Fix missing COMPATIBLE_IOCTL for UART line discipline
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The current implementation of ip6_dst_lookup_tail basically
ignore the egress ifindex match: if the saddr is set,
ip6_route_output() purposefully ignores flowi6_oif, due
to the commit d46a9d678e ("net: ipv6: Dont add RT6_LOOKUP_F_IFACE
flag if saddr set"), if the saddr is 'any' the first route lookup
in ip6_dst_lookup_tail fails, but upon failure a second lookup will
be performed with saddr set, thus ignoring the ifindex constraint.
This commit adds an output route lookup function variant, which
allows the caller to specify lookup flags, and modify
ip6_dst_lookup_tail() to enforce the ifindex match on the second
lookup via said helper.
ip6_route_output() becames now a static inline function build on
top of ip6_route_output_flags(); as a side effect, out-of-tree
modules need now a GPL license to access the output route lookup
functionality.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since cfg80211 frequently takes actions from its netdev notifier
call, wireless extensions messages could still be ordered badly
since the wext netdev notifier, since wext is built into the
kernel, runs before the cfg80211 netdev notifier. For example,
the following can happen:
5: wlan1: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default
link/ether 02:00:00:00:01:00 brd ff:ff:ff:ff:ff:ff
5: wlan1: <BROADCAST,MULTICAST,UP>
link/ether
when setting the interface down causes the wext message.
To also fix this, export the wireless_nlevent_flush() function
and also call it from the cfg80211 notifier.
Cc: stable@vger.kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Having proper defines makes the code a bit readable, it also avoids
duplicating hard-coded values since these are also needed when
auto-allocating PSM values (in a subsequent patch).
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
After we use refcnt to check if transport is alive, the dead can be
removed from sctp_transport.
The traversal of transport_addr_list in procfs dump is using
list_for_each_entry_rcu, no need to check if it has been freed.
sctp_generate_t3_rtx_event and sctp_generate_heartbeat_event is
protected by sock lock, it's not necessary to check dead, either.
also, the timers are cancelled when sctp_transport_free() is
called, that it doesn't wait for refcnt to reach 0 to cancel them.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now when __sctp_lookup_association is running in BH, it will try to
check if t->dead is set, but meanwhile other CPUs may be freeing this
transport and this assoc and if it happens that
__sctp_lookup_association checked t->dead a bit too early, it may think
that the association is still good while it was already freed.
So we fix this race by using atomic_add_unless in sctp_transport_hold.
After we get one transport from hashtable, we will hold it only when
this transport's refcnt is not 0, so that we can make sure t->asoc
cannot be freed before we hold the asoc again.
Note that sctp association is not freed using RCU so we can't use
atomic_add_unless() with it as it may just be too late for that either.
Fixes: 4f00878126 ("sctp: apply rhashtable api to send/recv path")
Reported-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch replaces uses of the long obsolete hash interface with
ahash.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: David S. Miller <davem@davemloft.net>
This patch replaces uses of the long obsolete hash interface with
shash.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: David S. Miller <davem@davemloft.net>
The cgroup methods are no longer used after baac50bbc3 ("net:
tcp_memcontrol: simplify linkage between socket and page counter").
The hunk to delete them was included in the original patch but must
have gotten lost during conflict resolution on the way upstream.
Fixes: baac50bbc3 ("net: tcp_memcontrol: simplify linkage between socket and page counter")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
GRO is currently not aware of tunnel metadata generated by lightweight
tunnels and stored in the dst. This leads to two possible problems:
* Incorrectly merging two frames that have different metadata.
* Leaking of allocated metadata from merged frames.
This avoids those problems by comparing the tunnel information before
merging, similar to how we handle other metadata (such as vlan tags),
and releasing any state when we are done.
Reported-by: John <john.phillips5@hpe.com>
Fixes: 2e15ea39 ("ip_gre: Add support to collect tunnel metadata.")
Signed-off-by: Jesse Gross <jesse@kernel.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_memcontrol.c only contains legacy memory.tcp.kmem.* file definitions
and mem_cgroup->tcp_mem init/destroy stuff. This doesn't belong to
network subsys. Let's move it to memcontrol.c. This also allows us to
reuse generic code for handling legacy memcg files.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This series adds accounting of the historical "kmem" memory consumers to
the cgroup2 memory controller.
These consumers include the dentry cache, the inode cache, kernel stack
pages, and a few others that are pointed out in patch 7/8. The
footprint of these consumers is directly tied to userspace activity in
common workloads, and so they have to be part of the minimally viable
configuration in order to present a complete feature to our users.
The cgroup2 interface of the memory controller is far from complete, but
this series, along with the socket memory accounting series, provides
the final semantic changes for the existing memory knobs in the cgroup2
interface, which is scheduled for initial release in the next merge
window.
This patch (of 8):
Remove unused css argument frmo memcg_init_kmem()
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we need to lock all buckets in the connection hashtable we'd attempt to
lock 1024 spinlocks, which is way more preemption levels than supported by
the kernel. Furthermore, this behavior was hidden by checking if lockdep is
enabled, and if it was - use only 8 buckets(!).
Fix this by using a global lock and synchronize all buckets on it when we
need to lock them all. This is pretty heavyweight, but is only done when we
need to resize the hashtable, and that doesn't happen often enough (or at all).
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Marc Dionne discovered a NULL pointer dereference when setting
SO_REUSEPORT on a socket after it is bound.
This patch removes the assumption that at least one socket in the
reuseport group is bound with the SO_REUSEPORT option before other
bind calls occur.
Fixes: e32ea7e747 ("soreuseport: fast reuseport UDP socket selection")
Reported-by: Marc Dionne <marc.c.dionne@gmail.com>
Signed-off-by: Craig Gallek <kraig@google.com>
Tested-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In file included from net/ipv4/tcp_ipv4.c:77 (and many more):
include/net/tcp_memcontrol.h:5: warning: ‘struct cgroup_subsys’ declared inside parameter list
include/net/tcp_memcontrol.h:5: warning: its scope is only this definition or declaration, which is probably not what you want
Add forward declarations for all used structures to fix this.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
"A quick set of bug fixes after there initial networking merge:
1) Netlink multicast group storage allocator only was tested with
nr_groups equal to 1, make it work for other values too. From
Matti Vaittinen.
2) Check build_skb() return value in macb and hip04_eth drivers, from
Weidong Wang.
3) Don't leak x25_asy on x25_asy_open() failure.
4) More DMA map/unmap fixes in 3c59x from Neil Horman.
5) Don't clobber IP skb control block during GSO segmentation, from
Konstantin Khlebnikov.
6) ECN helpers for ipv6 don't fixup the checksum, from Eric Dumazet.
7) Fix SKB segment utilization estimation in xen-netback, from David
Vrabel.
8) Fix lockdep splat in bridge addrlist handling, from Nikolay
Aleksandrov"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits)
bgmac: Fix reversed test of build_skb() return value.
bridge: fix lockdep addr_list_lock false positive splat
net: smsc: Add support h8300
xen-netback: free queues after freeing the net device
xen-netback: delete NAPI instance when queue fails to initialize
xen-netback: use skb to determine number of required guest Rx requests
net: sctp: Move sequence start handling into sctp_transport_get_idx()
ipv6: update skb->csum when CE mark is propagated
net: phy: turn carrier off on phy attach
net: macb: clear interrupts when disabling them
sctp: support to lookup with ep+paddr in transport rhashtable
net: hns: fixes no syscon error when init mdio
dts: hisi: fixes no syscon fault when init mdio
net: preserve IP control block during GSO segmentation
fsl/fman: Delete one function call "put_device" in dtsec_config()
hip04_eth: fix missing error handle for build_skb failed
3c59x: fix another page map/single unmap imbalance
3c59x: balance page maps and unmaps
x25_asy: Free x25_asy on x25_asy_open() failure.
mlxsw: fix SWITCHDEV_OBJ_ID_PORT_MDB
...
When a tunnel decapsulates the outer header, it has to comply
with RFC 6080 and eventually propagate CE mark into inner header.
It turns out IP6_ECN_set_ce() does not correctly update skb->csum
for CHECKSUM_COMPLETE packets, triggering infamous "hw csum failure"
messages and stack traces.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The unified hierarchy memory controller is going to use this jump label
as well to control the networking callbacks. Move it to the memory
controller code and give it a more generic name.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There won't be any separate counters for socket memory consumed by
protocols other than TCP in the future. Remove the indirection and link
sockets directly to their owning memory cgroup.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There won't be a tcp control soft limit, so integrating the memcg code
into the global skmem limiting scheme complicates things unnecessarily.
Replace this with simple and clear charge and uncharge calls--hidden
behind a jump label--to account skb memory.
Note that this is not purely aesthetic: as a result of shoehorning the
per-memcg code into the same memory accounting functions that handle the
global level, the old code would compare the per-memcg consumption
against the smaller of the per-memcg limit and the global limit. This
allowed the total consumption of multiple sockets to exceed the global
limit, as long as the individual sockets stayed within bounds. After
this change, the code will always compare the per-memcg consumption to
the per-memcg limit, and the global consumption to the global limit, and
thus close this loophole.
Without a soft limit, the per-memcg memory pressure state in sockets is
generally questionable. However, we did it until now, so we continue to
enter it when the hard limit is hit, and packets are dropped, to let
other sockets in the cgroup know that they shouldn't grow their transmit
windows, either. However, keep it simple in the new callback model and
leave memory pressure lazily when the next packet is accepted (as
opposed to doing it synchroneously when packets are processed). When
packets are dropped, network performance will already be in the toilet,
so that should be a reasonable trade-off.
As described above, consumption is now checked on the per-memcg level
and the global level separately. Likewise, memory pressure states are
maintained on both the per-memcg level and the global level, and a
socket is considered under pressure when either level asserts as much.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
tcp_memcontrol replicates the global sysctl_mem limit array per cgroup,
but it only ever sets these entries to the value of the memory_allocated
page_counter limit. Use the latter directly.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The number of allocated sockets is used for calculations in the soft
limit phase, where packets are accepted but the socket is under memory
pressure.
Since there is no soft limit phase in tcp_memcontrol, and memory
pressure is only entered when packets are already dropped, this is
actually dead code. Remove it.
As this is the last user of parent_cg_proto(), remove that too.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a cgroup currently breaches its socket memory limit, it enters
memory pressure mode for itself and its *ancestors*. This throttles
transmission in unrelated sibling and cousin subtrees that have nothing
to do with the breached limit.
On the contrary, breaching a limit should make that group and its
*children* enter memory pressure mode. But this happens already, albeit
lazily: if an ancestor limit is breached, siblings will enter memory
pressure on their own once the next packet arrives for them.
So no additional hierarchy code is needed. Remove the bogus stuff.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When charging socket memory, the code currently checks only the local
page counter for excess to determine whether the memcg is under socket
pressure. But even if the local counter is fine, one of the ancestors
could have breached its limit, which should also force this child to
enter socket pressure. This currently doesn't happen.
Fix this by using page_counter_try_charge() first. If that fails, it
means that either the local counter or one of the ancestors are in
excess of their limit, and the child should enter socket pressure.
Fixes: 3e32cb2e0a ("mm: memcontrol: lockless page counters")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently mac80211 does not inform the driver of the session
block ack timeout when starting a rx aggregation session.
Drivers that manage the reorder buffer need to know this
parameter.
Seeing that there are now too many arguments for the
drv_ampdu_action() function, wrap them inside a structure.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
It's not always necessary to set the status.freq field, for example
when this would be an expensive calculation. It must be set for all
management frames (as they might be reported to userspace), but for
data frames it's not really required. Document this.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently mac80211 does not inform the driver of the window
size when starting an RX aggregation session.
To enable managing the reorder buffer in the driver or hardware
the window size is needed.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add an option for driver to check for packet duplication
by itself.
This is needed for example by the iwlwifi driver which
parallelizes the RX path and does the duplication check
per queue.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The Group ID Management frame is an Action frame of
category VHT. It is transmitted by the AP to assign
or change the user position of a STA for one or more
group IDs.
Process and save the group membership data. Notify
underlying driver of changes.
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Conflicts:
drivers/net/bonding/bond_main.c
drivers/net/ethernet/mellanox/mlxsw/spectrum.h
drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
The bond_main.c and mellanox switch conflicts were cases of
overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a skb_at_tc_ingress() as this will be needed elsewhere as well and
can hide the ugly ifdef.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the final part required to namespaceify the tcp
keep alive mechanism.
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is required to have full tcp keepalive mechanism namespace
support.
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Different net namespaces might have different requirements as to
the keepalive time of tcp sockets. This might be required in cases
where different firewall rules are in place which require tcp
timeout sockets to be increased/decreased independently of the host.
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
udp tunnel offloads tend to aggregate datagrams based on inner
headers. gro engine gets notified by tunnel implementations about
possible offloads. The match is solely based on the port number.
Imagine a tunnel bound to port 53, the offloading will look into all
DNS packets and tries to aggregate them based on the inner data found
within. This could lead to data corruption and malformed DNS packets.
While this patch minimizes the problem and helps an administrator to find
the issue by querying ip tunnel/fou, a better way would be to match on
the specific destination ip address so if a user space socket is bound
to the same address it will conflict.
Cc: Tom Herbert <tom@herbertland.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define HW multicast entry: MAC and VID.
Using a MAC address simplifies support for both IPV4 and IPv6.
Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fib_multipath_hash() computes a hash using __be32 values, force
cast these to u32 to pacify sparse.
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next, they are:
1) Release nf_tables objects on netns destructions via
nft_release_afinfo().
2) Destroy basechain and rules on netdevice removal in the new netdev
family.
3) Get rid of defensive check against removal of inactive objects in
nf_tables.
4) Pass down netns pointer to our existing nfnetlink callbacks, as well
as commit() and abort() nfnetlink callbacks.
5) Allow to invert limit expression in nf_tables, so we can throttle
overlimit traffic.
6) Add packet duplication for the netdev family.
7) Add forward expression for the netdev family.
8) Define pr_fmt() in conntrack helpers.
9) Don't leave nfqueue configuration on inconsistent state in case of
errors, from Ken-ichirou MATSUZAWA, follow up patches are also from
him.
10) Skip queue option handling after unbind.
11) Return error on unknown both in nfqueue and nflog command.
12) Autoload ctnetlink when NFQA_CFG_F_CONNTRACK is set.
13) Add new NFTA_SET_USERDATA attribute to store user data in sets,
from Carlos Falgueras.
14) Add support for 64 bit byteordering changes nf_tables, from Florian
Westphal.
15) Add conntrack byte/packet counter matching support to nf_tables,
also from Florian.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2016-01-08
Here's one more bluetooth-next pull request for the 4.5 kernel:
- Support for CRC check and promiscuous mode for CC2520
- Fixes to btmrvl driver
- New ACPI IDs for hci_bcm driver
- Limited Discovery support for the Bluetooth mgmt interface
- Minor other cleanups here and there
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
User data is stored at after 'nft_set_ops' private data into 'data[]'
flexible array. The field 'udata' points to user data and 'udlen' stores
its length.
Add new flag NFTA_SET_USERDATA.
Signed-off-by: Carlos Falgueras García <carlosfg@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Adding vlan_filtering attribute to allow hardware vendor to support
vlan-aware bridges. Vlan_filtering is a per-bridge attribute.
Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The only user was removed in commit
029f7f3b87 ("netfilter: ipv6: nf_defrag: avoid/free clone operations").
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
transport hashtable will replace the association hashtable,
so association hashtable is not used in sctp any more, so
drop the codes about that.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tranport hashtbale will replace the association hashtable to do the
lookup for transport, and then get association by t->assoc, rhashtable
apis will be used because of it's resizable, scalable and using rcu.
lport + rport + paddr will be the base hashkey to locate the chain,
with net to protect one netns from another, then plus the laddr to
compare to get the target.
this patch will provider the lookup functions:
- sctp_epaddr_lookup_transport
- sctp_addrs_lookup_transport
hash/unhash functions:
- sctp_hash_transport
- sctp_unhash_transport
init/destroy functions:
- sctp_transport_hashtable_init
- sctp_transport_hashtable_destroy
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements the mgmt Start Limited Discovery command. Most
of existing Start Discovery code is reused since the only difference
is the presence of a 'limited' flag as part of the discovery state.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
To make the EIR parsing helper more general purpose, make it return
the found data and its length rather than just saying whether the data
was present or not.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Commands run in a vrf context are not failing as expected on a route lookup:
root@kenny:~# ip ro ls table vrf-red
unreachable default
root@kenny:~# ping -I vrf-red -c1 -w1 10.100.1.254
ping: Warning: source address might be selected on device other than vrf-red.
PING 10.100.1.254 (10.100.1.254) from 0.0.0.0 vrf-red: 56(84) bytes of data.
--- 10.100.1.254 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms
Since the vrf table does not have a route for 10.100.1.254 the ping
should have failed. The saddr lookup causes a full VRF table lookup.
Propogating a lookup failure to the user allows the command to fail as
expected:
root@kenny:~# ping -I vrf-red -c1 -w1 10.100.1.254
connect: No route to host
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Expose socket options for setting a classic or extended BPF program
for use when selecting sockets in an SO_REUSEPORT group. These options
can be used on the first socket to belong to a group before bind or
on any socket in the group after bind.
This change includes refactoring of the existing sk_filter code to
allow reuse of the existing BPF filter validation checks.
Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Include a struct sock_reuseport instance when a UDP socket binds to
a specific address for the first time with the reuseport flag set.
When selecting a socket for an incoming UDP packet, use the information
available in sock_reuseport if present.
This required adding an additional field to the UDP source address
equality function to differentiate between exact and wildcard matches.
The original use case allowed wildcard matches when checking for
existing port uses during bind. The new use case of adding a socket
to a reuseport group requires exact address matching.
Performance test (using a machine with 2 CPU sockets and a total of
48 cores): Create reuseport groups of varying size. Use one socket
from this group per user thread (pinning each thread to a different
core) calling recvmmsg in a tight loop. Record number of messages
received per second while saturating a 10G link.
10 sockets: 18% increase (~2.8M -> 3.3M pkts/s)
20 sockets: 14% increase (~2.9M -> 3.3M pkts/s)
40 sockets: 13% increase (~3.0M -> 3.4M pkts/s)
This work is based off a similar implementation written by
Ying Cai <ycai@google.com> for implementing policy-based reuseport
selection.
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
struct sock_reuseport is an optional shared structure referenced by each
socket belonging to a reuseport group. When a socket is bound to an
address/port not yet in use and the reuseport flag has been set, the
structure will be allocated and attached to the newly bound socket.
When subsequent calls to bind are made for the same address/port, the
shared structure will be updated to include the new socket and the
newly bound socket will reference the group structure.
Usually, when an incoming packet was destined for a reuseport group,
all sockets in the same group needed to be considered before a
dispatching decision was made. With this structure, an appropriate
socket can be found after looking up just one socket in the group.
This shared structure will also allow for more complicated decisions to
be made when selecting a socket (eg a BPF filter).
This work is based off a similar implementation written by
Ying Cai <ycai@google.com> for implementing policy-based reuseport
selection.
Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Moving the caller of iptunnel_xmit_stats causes a build error in
randconfig builds that disable CONFIG_INET:
In file included from ../net/xfrm/xfrm_input.c:17:0:
../include/net/ip6_tunnel.h: In function 'ip6tunnel_xmit':
../include/net/ip6_tunnel.h:93:2: error: implicit declaration of function 'iptunnel_xmit_stats' [-Werror=implicit-function-declaration]
iptunnel_xmit_stats(dev, pkt_len);
The reason is that the iptunnel_xmit_stats definition is hidden
inside #ifdef CONFIG_INET but the caller is not. We can change
one or the other to fix it, and this patch adds a second #ifdef
around ip6tunnel_xmit() to avoid seeing the invalid call.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 039f50629b ("ip_tunnel: Move stats update to iptunnel_xmit()")
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the first NFC pull request for 4.5 and it brings:
- A new driver for the STMicroelectronics ST95HF NFC chipset.
The ST95HF is an NFC digital transceiver with an embedded analog
front-end and as such relies on the Linux NFC digital
implementation. This is the 3rd user of the NFC digital stack.
- ACPI support for the ST st-nci and st21nfca drivers.
- A small improvement for the nfcsim driver, as we can now tune
the Rx delay through sysfs.
- A bunch of minor cleanups and small fixes from Christophe Ricard,
for a few drivers and the NFC core code.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWgsv0AAoJEIqAPN1PVmxKvukP/3eJwA+chAUF89/fqwqFTaJN
fffLoOxx2OBIbTXD2VV36yw4bAo9tbKDAYBZiot3Ig7Kg0SeCJ5oPA9xCVnWPrEY
hxFAldvl+lWQs8nrUOgUZItiFBeUPdfW9YX/yKhUZVc+602nUG/e/+6x8B5MhIce
SAfgCyd0c16DApltP7sw1muyZMvsO6Ow6dyNzDUVYZuabvEhe3SLSj9KFJi7Thsp
h41Iv+bwPLhwF4RXGA6rei/gdEDSMRohprdj3uTDiTarGW+OpcAO0zWACS5m4eR0
zF19+HjGPUk/LpFRaU31xX0ZQQjOTmmfsOt4FBb3P7oJx47egycsadLYPexeG7nj
ruyS6ezlRX1I/tZsnyLNJK92mK5TXYLz2uJ8r2ii/BgPNE+AErB3zKCC+EjXzWhh
AvClGu5b88WJLxoq3I3l5evPwGhebGZ8N/1uiFsHOxvzKVLgxwOmNLRGN4XXxB2i
UbIHgBb6smsu/l+3q9R83kfoMaoWnr+OUIi2QPQVDt/K7t1LfsCuIhzcGSgo1VuW
fGlA1iu+CNDknofeCl4JDo2UXAETO4gdKWw87GXeUcbbraLUczZeO7FFLZqxbMYc
OCaPYshmVFeZRypYdRWDHw67ivj0/h+9iq4PP1XOROkRFH746dD/p4yamJwVi20B
samZ8VPwzgH3/ohQJyX3
=VFUH
-----END PGP SIGNATURE-----
Merge tag 'nfc-next-4.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next
Samuel Ortiz says:
====================
NFC 4.5 pull request
This is the first NFC pull request for 4.5 and it brings:
- A new driver for the STMicroelectronics ST95HF NFC chipset.
The ST95HF is an NFC digital transceiver with an embedded analog
front-end and as such relies on the Linux NFC digital
implementation. This is the 3rd user of the NFC digital stack.
- ACPI support for the ST st-nci and st21nfca drivers.
- A small improvement for the nfcsim driver, as we can now tune
the Rx delay through sysfs.
- A bunch of minor cleanups and small fixes from Christophe Ricard,
for a few drivers and the NFC core code.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The ieee802154_llsec_ops structure is never modified, so declare it as
const.
Done with the help of Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Acked-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
You can use this to duplicate packets and inject them at the egress path
of the specified interface. This duplication allows you to inspect
traffic from the dummy or any other interface dedicated to this purpose.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add support for missing HCI event EVT_CONNECTIVITY and forward
it to userspace.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
If the netdevice is destroyed, the resources that are attached should
be released too as they belong to the device that is now gone.
Suggested-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We have to release the existing objects on netns removal otherwise we
leak them. Chains are unregistered in first place to make sure no
packets are walking on our rules and sets anymore.
The object release happens by when we unregister the family via
nft_release_afinfo() which is called from nft_unregister_afinfo() from
the corresponding __net_exit path in every family.
Reported-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
By moving stats update into iptunnel_xmit(), we can simplify
iptunnel_xmit() usage. With this change there is no need to
call another function (iptunnel_xmit_stats()) to update stats
in tunnel xmit code path.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Avoids cluttering tcp_v4_send_reset when followup patch extends
it to deal with timewait sockets.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains the first batch of Netfilter updates for
the upcoming 4.5 kernel. This batch contains userspace netfilter header
compilation fixes, support for packet mangling in nf_tables, the new
tracing infrastructure for nf_tables and cgroup2 support for iptables.
More specifically, they are:
1) Two patches to include dependencies in our netfilter userspace
headers to resolve compilation problems, from Mikko Rapeli.
2) Four comestic cleanup patches for the ebtables codebase, from Ian Morris.
3) Remove duplicate include in the netfilter reject infrastructure,
from Stephen Hemminger.
4) Two patches to simplify the netfilter defragmentation code for IPv6,
patch from Florian Westphal.
5) Fix root ownership of /proc/net netfilter for unpriviledged net
namespaces, from Philip Whineray.
6) Get rid of unused fields in struct nft_pktinfo, from Florian Westphal.
7) Add mangling support to our nf_tables payload expression, from
Patrick McHardy.
8) Introduce a new netlink-based tracing infrastructure for nf_tables,
from Florian Westphal.
9) Change setter functions in nfnetlink_log to be void, from
Rami Rosen.
10) Add netns support to the cttimeout infrastructure.
11) Add cgroup2 support to iptables, from Tejun Heo.
12) Introduce nfnl_dereference_protected() in nfnetlink, from Florian.
13) Add support for mangling pkttype in the nf_tables meta expression,
also from Florian.
BTW, I need that you pull net into net-next, I have another batch that
requires changes that I don't yet see in net.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow accepted sockets to derive their sk_bound_dev_if setting from the
l3mdev domain in which the packets originated. A sysctl setting is added
to control the behavior which is similar to sk_mark and
sysctl_tcp_fwmark_accept.
This effectively allow a process to have a "VRF-global" listen socket,
with child sockets bound to the VRF device in which the packet originated.
A similar behavior can be achieved using sk_mark, but a solution using marks
is incomplete as it does not handle duplicate addresses in different L3
domains/VRFs. Allowing sockets to inherit the sk_bound_dev_if from l3mdev
domain provides a complete solution.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add helper to lookup l3mdev master index given a device index.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/geneve.c
Here we had an overlapping change, where in 'net' the extraneous stats
bump was being removed whilst in 'net-next' the final argument to
udp_tunnel6_xmit_skb() was being changed.
Signed-off-by: David S. Miller <davem@davemloft.net>
Docbook does not like the definition of macros inside a field declaration
and adds a warning. Move the definition out.
Fixes: 79462ad02e ("net: add validation for the socket syscall protocol argument")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds an op that the drivers can call into to get existing
geneve ports.
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As we all know, the value of pf_retrans >= max_retrans_path can
disable pf state. The variables of pf_retrans and max_retrans_path
can be changed by the userspace application.
Sometimes the user expects to disable pf state while the 2
variables are changed to enable pf state. So it is necessary to
introduce a new variable to disable pf state.
According to the suggestions from Vlad Yasevich, extra1 and extra2
are removed. The initialization of pf_enable is added.
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern added a vif field in the a4 part of inetpeer_addr struct.
This broke IPv4 TCP fast open client side and more generally tcp metrics
cache, because inetpeer_addr_cmp() is now comparing two u32 instead of
one.
inetpeer_set_addr_v4() needs to properly init vif field, otherwise
the comparison result depends on uninitialized data.
Fixes: 192132b9a0 ("net: Add support for VRFs to inetpeer cache")
Reported-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This implements SOCK_DESTROY for TCP sockets. It causes all
blocking calls on the socket to fail fast with ECONNABORTED and
causes a protocol close of the socket. It informs the other end
of the connection by sending a RST, i.e., initiating a TCP ABORT
as per RFC 793. ECONNABORTED was chosen for consistency with
FreeBSD.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a SOCK_DESTROY operation, a destroy function
pointer to sock_diag_handler, and a diag_destroy function
pointer. It does not include any implementation code.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements an ILA tanslation table. This table can be
configured with identifier to locator mappings, and can be be queried
to resolve a mapping. Queries can be parameterized based on interface,
direction (incoming or outoing), and matching locator. The table is
implemented using rhashtable and is configured via netlink (through
"ip ila .." in iproute).
The table may be used as alternative means to do do ILA tanslations
other than the lw tunnels
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The start callback allows the caller to set up a context for the
dump callbacks. Presumably, the context can then be destroyed in
the done callback.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In tcp_send_sendpage and tcp_sendmsg we check the route capabilities to
determine if checksum offload can be performed. This check currently
does not take the IP protocol into account for devices that advertise
only one of NETIF_F_IPV6_CSUM or NETIF_F_IP_CSUM. This patch adds a
function to check capabilities for checksum offload with a socket
called sk_check_csum_caps. This function checks for specific IPv4 or
IPv6 offload support based on the family of the socket.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The name NETIF_F_ALL_CSUM is a misnomer. This does not correspond to the
set of features for offloading all checksums. This is a mask of the
checksum offload related features bits. It is incorrect to set both
NETIF_F_HW_CSUM and NETIF_F_IP_CSUM or NETIF_F_IPV6 at the same time for
features of a device.
This patch:
- Changes instances of NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK (where
NETIF_F_ALL_CSUM is being used as a mask).
- Changes bonding, sfc/efx, ipvlan, macvlan, vlan, and team drivers to
use NEITF_F_HW_CSUM in features list instead of NETIF_F_ALL_CSUM.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
switchdev drivers need to know the netdev on which the switchdev op was
invoked. For example, the STP state of a VLAN interface configured on top
of a port can change while being member in a bridge. In this case, the
underlying driver should only change the STP state of that particular
VLAN and not of all the VLANs configured on the port.
However, current switchdev infrastructure only passes the port netdev down
to the driver. Solve that by passing the original device down to the
driver as part of the required switchdev object / attribute.
This doesn't entail any change in current switchdev drivers. It simply
enables those supporting stacked devices to know the originating device
and act accordingly.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Wilder reported crashes caused by dst reuse.
<quote David>
I am seeing a crash on a distro V4.2.3 kernel caused by a double
release of a dst_entry. In ipv4_dst_destroy() the call to
list_empty() finds a poisoned next pointer, indicating the dst_entry
has already been removed from the list and freed. The crash occurs
18 to 24 hours into a run of a network stress exerciser.
</quote>
Thanks to his detailed report and analysis, we were able to understand
the core issue.
IP early demux can associate a dst to skb, after a lookup in TCP/UDP
sockets.
When socket cache is not properly set, we want to store into
sk->sk_dst_cache the dst for future IP early demux lookups,
by acquiring a stable refcount on the dst.
Problem is this acquisition is simply using an atomic_inc(),
which works well, unless the dst was queued for destruction from
dst_release() noticing dst refcount went to zero, if DST_NOCACHE
was set on dst.
We need to make sure current refcount is not zero before incrementing
it, or risk double free as David reported.
This patch, being a stable candidate, adds two new helpers, and use
them only from IP early demux problematic paths.
It might be possible to merge in net-next skb_dst_force() and
skb_dst_force_safe(), but I prefer having the smallest patch for stable
kernels : Maybe some skb_dst_force() callers do not expect skb->dst
can suddenly be cleared.
Can probably be backported back to linux-3.6 kernels
Reported-by: David J. Wilder <dwilder@us.ibm.com>
Tested-by: David J. Wilder <dwilder@us.ibm.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2015-12-11
Here's another set of Bluetooth & 802.15.4 patches for the 4.5 kernel:
- 6LoWPAN debugfs support
- New 802.15.4 driver for ADF7242 MAC IEEE802154
- Initial code for 6LoWPAN Generic Header Compression (GHC) support
- Refactor Bluetooth LE scan & advertising behind dedicated workqueue
- Cleanups to Bluetooth H:5 HCI driver
- Support for Toshiba Broadcom based Bluetooth controllers
- Use continuous scanning when establishing Bluetooth LE connections
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
郭永刚 reported that one could simply crash the kernel as root by
using a simple program:
int socket_fd;
struct sockaddr_in addr;
addr.sin_port = 0;
addr.sin_addr.s_addr = INADDR_ANY;
addr.sin_family = 10;
socket_fd = socket(10,3,0x40000000);
connect(socket_fd , &addr,16);
AF_INET, AF_INET6 sockets actually only support 8-bit protocol
identifiers. inet_sock's skc_protocol field thus is sized accordingly,
thus larger protocol identifiers simply cut off the higher bits and
store a zero in the protocol fields.
This could lead to e.g. NULL function pointer because as a result of
the cut off inet_num is zero and we call down to inet_autobind, which
is NULL for raw sockets.
kernel: Call Trace:
kernel: [<ffffffff816db90e>] ? inet_autobind+0x2e/0x70
kernel: [<ffffffff816db9a4>] inet_dgram_connect+0x54/0x80
kernel: [<ffffffff81645069>] SYSC_connect+0xd9/0x110
kernel: [<ffffffff810ac51b>] ? ptrace_notify+0x5b/0x80
kernel: [<ffffffff810236d8>] ? syscall_trace_enter_phase2+0x108/0x200
kernel: [<ffffffff81645e0e>] SyS_connect+0xe/0x10
kernel: [<ffffffff81779515>] tracesys_phase2+0x84/0x89
I found no particular commit which introduced this problem.
CVE: CVE-2015-8543
Cc: Cong Wang <cwang@twopensource.com>
Reported-by: 郭永刚 <guoyonggang@360.cn>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Resolve conflict between commit 264640fc2c ("ipv6: distinguish frag
queues by device for multicast and link-local packets") from the net
tree and commit 029f7f3b87 ("netfilter: ipv6: nf_defrag: avoid/free
clone operations") from the nf-next tree.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Conflicts:
net/ipv6/netfilter/nf_conntrack_reasm.c
XFRM can deal with SYNACK messages, sent while listener socket
is not locked. We add proper rcu protection to __xfrm_sk_clone_policy()
and xfrm_sk_policy_lookup()
This might serve as the first step to remove xfrm.xfrm_policy_lock
use in fast path.
Fixes: fa76ce7328 ("inet: get rid of central tcp/dccp listener timer")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We will soon switch sk->sk_policy[] to RCU protection,
as SYNACK packets are sent while listener socket is not locked.
This patch simply adds RCU grace period before struct xfrm_policy
freeing, and the corresponding rcu_head in struct xfrm_policy.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a static inline function ipv6_addr_prefix_copy which
copies a ipv6 address prefix(argument pfx) into the ipv6 address prefix.
The prefix len is given by plen as bits. This function mainly based on
ipv6_addr_prefix which copies one address prefix from address into a new
ipv6 address destination and zero all other address bits.
The difference is that ipv6_addr_prefix_copy don't get a prefix from an
ipv6 address, it sets a prefix to an ipv6 address with keeping other
address bits. The use case is for context based address compression
inside 6LoWPAN IPHC header which keeping ipv6 prefixes inside a context
table to lookup address-bits without sending them.
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: Łukasz Duda <lukasz.duda@nordicsemi.no>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch will introduce a 6lowpan entry into the debugfs if enabled.
Inside this 6lowpan directory we create a subdirectories of all 6lowpan
interfaces to offer a per interface debugfs support.
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch introduces register and unregister functionality for lowpan
interfaces. While register a lowpan interface there are several things
which need to be initialize by the 6lowpan subsystem. Upcoming
functionality need to register/unregister per interface components e.g.
debugfs entry.
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This flag just tells us whether hdev->adv_instances is empty or not.
We can equally well use the list_empty() function to get this
information.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The request to update HCI during power on is always coming either from
hdev->req_workqueue or through an ioctl, so it's safe to use
hci_req_sync for it. This way we also eliminate potential races with
incoming mgmt commands or other actions while powering on.
Part of this refactoring is the splitting of mgmt_powered() into
mgmt_power_on() and __mgmt_power_off() functions. The main reason is
the different requirements as far as hdev locking is concerned, as
highlighted with the __ prefix of the power off API.
Since the power on in the case of clearing the AUTO_OFF flag cannot be
done synchronously in the set_powered mgmt handler, the hci_power_on
work callback is extended to cover this (which also simplifies the
set_powered helper a lot).
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Since the other discoverable changes are behind req_workqueue now it
only makes sense to move the discoverable timeout there as well.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The discoverable mode is intrinsically linked with the connectable
mode e.g. through sharing the same HCI command (Write Scan Enable) for
BR/EDR. It makes therefore sense to move it to hci_request.c and run
the changes through the same hdev->req_workqueue.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This way the connectable changes are synchronized against each other,
which helps avoid potential races. The connectable mode is also linked
together with LE advertising which makes is more convenient to have it
behind the same workqueue.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This paves the way for eventually performing advertising changes
through the hdev->req_workqueue. Some new APIs need to be exposed from
mgmt.c to hci_request.c and vice-versa, but many of them will go away
once hdev->req_workqueue gets used.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Since Add/Remove Device perform the page scan updates independently
from the HCI command completion we've introduced a potential race when
multiple mgmt commands are queued. Doing the page scan updates through
the req_workqueue ensures that the state changes are performed in a
race-free manner.
At the same time, to make the request helper more widely usable,
extend it to also cover Inquiry Scan changes since those are behind
the same HCI command. This is also reflected in the new name of the
API as well as the work struct name.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Only needed when meta nftrace rule(s) were added.
The assumption is that no such rules are active, so the call to
nft_trace_init is "never" needed.
When nftrace rules are active, we always call the nft_trace_* functions,
but will only send netlink messages when all of the following are true:
- traceinfo structure was initialised
- skb->nf_trace == 1
- at least one subscriber to trace group.
Adding an extra conditional
(static_branch ... && skb->nf_trace)
nft_trace_init( ..)
Is possible but results in a larger nft_do_chain footprint.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nft monitor mode can then decode and display this trace data.
Parts of LL/Network/Transport headers are provided as separate
attributes.
Otherwise, printing IP address data becomes virtually impossible
for userspace since in the case of the netdev family we really don't
want userspace to have to know all the possible link layer types
and/or sizes just to display/print an ip address.
We also don't want userspace to have to follow ipv6 header chains
to get the s/dport info, the kernel already did this work for us.
To avoid bloating nft_do_chain all data required for tracing is
encapsulated in nft_traceinfo.
The structure is initialized unconditionally(!) for each nft_do_chain
invocation.
This unconditionall call will be moved under a static key in a
followup patch.
With lots of help from Patrick McHardy and Pablo Neira.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Introduce sock->sk_cgrp_data which is a struct sock_cgroup_data.
->sk_cgroup_prioidx and ->sk_classid are moved into it. The struct
and its accessors are defined in cgroup-defs.h. This is to prepare
for overloading the fields with a cgroup pointer.
This patch mostly performs equivalent conversions but the followings
are noteworthy.
* Equality test before updating classid is removed from
sock_update_classid(). This shouldn't make any noticeable
difference and a similar test will be implemented on the helper side
later.
* sock_update_netprioidx() now takes struct sock_cgroup_data and can
be moved to netprio_cgroup.h without causing include dependency
loop. Moved.
* The dummy version of sock_update_netprioidx() converted to a static
inline function while at it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
netprio builds per-netdev contiguous priomap array which is indexed by
css->id. The array is allocated using kzalloc() effectively limiting
the maximum ID supported to some thousand range. This patch caps the
maximum supported css->id to USHRT_MAX which should be way above what
is actually useable.
This allows reducing sock->sk_cgrp_prioidx to u16 from u32. The freed
up part will be used to overload the cgroup related fields.
sock->sk_cgrp_prioidx's position is swapped with sk_mark so that the
two cgroup related fields are adjacent.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
CC: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 0d76d6e8b2 and merge
commit c402293bd7, reversing changes made
to c89359a42e.
The virtio-vsock device specification is not finalized yet. Michael
Tsirkin voiced concerned about merging this code when the hardware
interface (and possibly the userspace interface) could still change.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP SYNACK messages might now be attached to request sockets.
XFRM needs to get back to a listener socket.
Adds new helpers that might be used elsewhere :
sk_to_full_sk() and sk_const_to_full_sk()
Note: We also need to add RCU protection for xfrm lookups,
now TCP/DCCP have lockless listener processing. This will
be addressed in separate patches.
Fixes: ca6fb06518 ("tcp: attach SYNACK messages to request sockets instead of listener")
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since no more DSA driver uses the polling callback, and since
the phylib handles the link detection, remove the link polling
work and timer code.
Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
needing to reshuffle and fix some bugs. I merged mac80211
to get the right base for some of these changes.
* new mac80211 API for upcoming driver changes: EOSP handling,
key iteration
* scan abort changes allowing to cancel an ongoing scan
* VHT IBSS 80+80 MHz support
* re-enable full AP client state tracking after fixes
* various small fixes (that weren't relevant for mac80211)
* various cleanups
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJWZVw7AAoJEGt7eEactAAdQcgP/1bOBBKgCHWZ8xhqmhLIPPUP
AgkkyBcjCbSOWyE1axm5WQZM+fQvyGAcYsnhsK7h0Wy5Jvv6goNYhxkoD3L5lAKC
LkiiqokTpLx1Em6Iugn1sdgag8q7EquYYQN+hOEOWtp32pTsx3/pDglCtGu0SX1N
eystHEAu6mzPezat99M4s80fRlfBop3yaUuL5XopQFGtU37zfUgoXJB3BoXgxNjK
XyD22jtPDreDMndZ9ugfvMaiq3iKRBhKXqgGb3SqMaStIyRK8zAkHb5jg3CllMeq
bEsz4Rb4r+vtm2AVsUMWjfd/upQKwPwuvdvCvv4AQCO+aR9Rm+tR/wnnD4Gtnek5
zPQ6XWt/0V4CKGl+W9shnDSA1DZ3hTijJlaGsK+RUqEtdq903lEP7fc2GsSvlund
jXHfOExieuZOToKWTKpmNGsCw6fjJaGXNd/iLWo5VGAZS2X+JLmFZ94g43a6zOGZ
s1Gz4F3tz4u4Bd26NAK2Z6CQRvDS4OOyLIjl9vpB9Fk/9nQx3f7WD8aBTRuCVAtG
U2sFEUscz3rkdct30Gvkjm3ovmgc4pomTDvOpmNIsSCi2ygzGWHbEvSrrHdIjzVy
KDcvRs6bRtCL/WxaaEIk46M6+6aKlSnZytPLl7vkNnvxXuEF7GYdnNVSUbSH9Nte
XzT4+rZRiqyPZEGhBekw
=+5dd
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2015-12-07' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
This pull request got a bit bigger than I wanted, due to
needing to reshuffle and fix some bugs. I merged mac80211
to get the right base for some of these changes.
* new mac80211 API for upcoming driver changes: EOSP handling,
key iteration
* scan abort changes allowing to cancel an ongoing scan
* VHT IBSS 80+80 MHz support
* re-enable full AP client state tracking after fixes
* various small fixes (that weren't relevant for mac80211)
* various cleanups
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
when A sends a data to B, then A close() and enter into SHUTDOWN_PENDING
state, if B neither claim his rwnd is 0 nor send SACK for this data, A
will keep retransmitting this data until t5 timeout, Max.Retrans times
can't work anymore, which is bad.
if B's rwnd is not 0, it should send abort after Max.Retrans times, only
when B's rwnd == 0 and A's retransmitting beyonds Max.Retrans times, A
will start t5 timer, which is also commit f8d9605243 ("sctp: Enforce
retransmission limit during shutdown") means, but it lacks the condition
peer rwnd == 0.
so fix it by adding a bit (zero_window_announced) in peer to record if
the last rwnd is 0. If it was, zero_window_announced will be set. and use
this bit to decide if start t5 timer when local.state is SHUTDOWN_PENDING.
Fixes: commit f8d9605243 ("sctp: Enforce retransmission limit during shutdown")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry Vyukov reported that SCTP was triggering a WARN on socket destroy
related to disabling sock timestamp.
When SCTP accepts an association or peel one off, it copies sock flags
but forgot to call net_enable_timestamp() if a packet timestamping flag
was copied, leading to extra calls to net_disable_timestamp() whenever
such clones were closed.
The fix is to call net_enable_timestamp() whenever we copy a sock with
that flag on, like tcp does.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 3511494ce2 ("vxlan: Group Policy extension") changed definition of
VXLAN_HF_RCO from 0x00200000 to BIT(24). This is obviously incorrect. It's
also in violation with the RFC draft.
Fixes: 3511494ce2 ("vxlan: Group Policy extension")
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement new functionality for aborting an ongoing scan.
Add NL80211_CMD_ABORT_SCAN to the nl80211 interface. After
aborting the scan, driver shall provide the scan status by
calling cfg80211_scan_done().
Reviewed-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Vidyullatha Kanchanapally <vkanchan@qti.qualcomm.com>
Signed-off-by: Sunil Dutt <usdutt@qti.qualcomm.com>
[change command to take wdev instead of netdev so that it
can be used on p2p-device scans]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add new VIF flag, that will allow get NOA update
notification when driver will request this, even
this is not pure P2P vif (eg. STA vif).
Signed-off-by: Janusz Dziedzic <janusz.dziedzic@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
add ieee80211_iter_keys_rcu() to iterate over uploaded
keys in atomic context (when rcu is locked)
The station removal code removes the keys only after
calling synchronize_net(), so it's not safe to iterate
the keys at this point (and postponing the actual key
deletion with call_rcu() might result in some
badly-ordered ops calls).
Add a flag to indicate a station is being removed,
and skip the configured keys if it's set.
Signed-off-by: Eliad Peller <eliadx.peller@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This can happen when the driver needs to send less frames
than expected and then needs to close the SP.
Mac80211 still needs to set the more_data properly based
on its buffer state (ps_tx_buffer and buffered frames on
other TIDs).
To that end, refactor the code that delivers frames upon
uAPSD trigger frames to be able to get only the more_data
bit without actually delivering those frames in case the
driver is just asking to set a NDP with EOSP and MORE_DATA
bit properly set.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The function is a very simple wrapper around another one,
just adds a few default parameters, so replace it with a
static inline instead of using EXPORT_SYMBOL, reducing
the module size slightly.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Some devices or drivers cannot deal with having the same station
address for different virtual interfaces, say as a client to two
virtual AP interfaces. Rather than requiring each driver with a
limitation like that to enforce it, add a hardware flag for it.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
I want to get the full off-channel bugfix since later code depends on
it, as well as the AP client state change so I can revert it correctly.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Conflicts:
drivers/net/ethernet/renesas/ravb_main.c
kernel/bpf/syscall.c
net/ipv4/ipmr.c
All three conflicts were cases of overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
* fix scanning in mac80211 to not actively scan radar
channels (from Antonio)
* fix uninitialized variable in remain-on-channel that
could lead to treating frame TX as remain-on-channel
and not sending the frame at all
* remove NL80211_FEATURE_FULL_AP_CLIENT_STATE again, it
was broken and needs more work, we'll enable it later
* fix call_rcu() induced use-after-reset/free in mesh
(that was suddenly causing issues in certain tests)
* always request block-ack window size 64 as we found
some APs will otherwise crash (really ...)
* fix P2P-Device teardown sequence to avoid restarting
with uninitialized data
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJWX2JpAAoJEGt7eEactAAd+9cQAJmn3zt0orj/sASv7BeF0h5d
sRfAhkBVOTZur8MgVj1c7fNzT3h1HYNei6c4SA2+rphy6Vbifoli1nLNloC+1Ld4
2WXllEqVe473GqVofCxZHsYZPr2Inmhj7uMiDqvoKUiSRz7phmkY9m+Vju6WZG/W
F6FrTLqFS7UDHIYNYH1DNVSScd/89Gu6pHZvEpoHkrsvt5rEEZiPAQ7sDB4MAMSm
amETtuBqgX83gHR2G4UT2Z9r8TVdzhO+s7vvdVjj0qbP6C6BaS9IUXDjmm3gOvHy
G7j9MJuUC8w/2fZ5A5/l94OuN5rF/ZFMNkn2e6OIzg0HjEZh74CeLl21CnuxdNpB
ECmDVbKoI3OVoFbhEl7P5fBokzZsqhAXpZOmbYEeFRyO6lF2Mv9uzttsF6EOmCX0
BjIoEXOWA2o6IUD/8M6NjW+/B58SDDVi9Mg6D+7Dn7rUFlQ4pddjb0m94bI8GQQU
wl7gROMvYR3tIhiMs1bLF9jJgA831WGWu9eiq8mT2kHPaEV2bFO7OK+SUxyZu1M7
UhN4eoLpU84v9QNJ34N8RCiYxEZ1e6HQxBwQn/fDIWOjOHryZoArhicFY9aOEja4
9xBI9OJhBWOL4N4AFdmTExBdYudSgCTpX+/gQ4tSfedz3lqF79y8+PILwv6E1Q6D
8pScH/4pVo4v5omGaMpA
=vui7
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2015-12-02' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
A small set of fixes for 4.4:
* fix scanning in mac80211 to not actively scan radar
channels (from Antonio)
* fix uninitialized variable in remain-on-channel that
could lead to treating frame TX as remain-on-channel
and not sending the frame at all
* remove NL80211_FEATURE_FULL_AP_CLIENT_STATE again, it
was broken and needs more work, we'll enable it later
* fix call_rcu() induced use-after-reset/free in mesh
(that was suddenly causing issues in certain tests)
* always request block-ack window size 64 as we found
some APs will otherwise crash (really ...)
* fix P2P-Device teardown sequence to avoid restarting
with uninitialized data
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
qdisc_tree_decrease_qlen() suffers from two problems on multiqueue
devices.
One problem is that it updates sch->q.qlen and sch->qstats.drops
on the mq/mqprio root qdisc, while it should not : Daniele
reported underflows errors :
[ 681.774821] PAX: sch->q.qlen: 0 n: 1
[ 681.774825] PAX: size overflow detected in function qdisc_tree_decrease_qlen net/sched/sch_api.c:769 cicus.693_49 min, count: 72, decl: qlen; num: 0; context: sk_buff_head;
[ 681.774954] CPU: 2 PID: 19 Comm: ksoftirqd/2 Tainted: G O 4.2.6.201511282239-1-grsec #1
[ 681.774955] Hardware name: ASUSTeK COMPUTER INC. X302LJ/X302LJ, BIOS X302LJ.202 03/05/2015
[ 681.774956] ffffffffa9a04863 0000000000000000 0000000000000000 ffffffffa990ff7c
[ 681.774959] ffffc90000d3bc38 ffffffffa95d2810 0000000000000007 ffffffffa991002b
[ 681.774960] ffffc90000d3bc68 ffffffffa91a44f4 0000000000000001 0000000000000001
[ 681.774962] Call Trace:
[ 681.774967] [<ffffffffa95d2810>] dump_stack+0x4c/0x7f
[ 681.774970] [<ffffffffa91a44f4>] report_size_overflow+0x34/0x50
[ 681.774972] [<ffffffffa94d17e2>] qdisc_tree_decrease_qlen+0x152/0x160
[ 681.774976] [<ffffffffc02694b1>] fq_codel_dequeue+0x7b1/0x820 [sch_fq_codel]
[ 681.774978] [<ffffffffc02680a0>] ? qdisc_peek_dequeued+0xa0/0xa0 [sch_fq_codel]
[ 681.774980] [<ffffffffa94cd92d>] __qdisc_run+0x4d/0x1d0
[ 681.774983] [<ffffffffa949b2b2>] net_tx_action+0xc2/0x160
[ 681.774985] [<ffffffffa90664c1>] __do_softirq+0xf1/0x200
[ 681.774987] [<ffffffffa90665ee>] run_ksoftirqd+0x1e/0x30
[ 681.774989] [<ffffffffa90896b0>] smpboot_thread_fn+0x150/0x260
[ 681.774991] [<ffffffffa9089560>] ? sort_range+0x40/0x40
[ 681.774992] [<ffffffffa9085fe4>] kthread+0xe4/0x100
[ 681.774994] [<ffffffffa9085f00>] ? kthread_worker_fn+0x170/0x170
[ 681.774995] [<ffffffffa95d8d1e>] ret_from_fork+0x3e/0x70
mq/mqprio have their own ways to report qlen/drops by folding stats on
all their queues, with appropriate locking.
A second problem is that qdisc_tree_decrease_qlen() calls qdisc_lookup()
without proper locking : concurrent qdisc updates could corrupt the list
that qdisc_match_from_root() parses to find a qdisc given its handle.
Fix first problem adding a TCQ_F_NOPARENT qdisc flag that
qdisc_tree_decrease_qlen() can use to abort its tree traversal,
as soon as it meets a mq/mqprio qdisc children.
Second problem can be fixed by RCU protection.
Qdisc are already freed after RCU grace period, so qdisc_list_add() and
qdisc_list_del() simply have to use appropriate rcu list variants.
A future patch will add a per struct netdev_queue list anchor, so that
qdisc_tree_decrease_qlen() can have more efficient lookups.
Reported-by: Daniele Fucini <dfucini@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cong Wang <cwang@twopensource.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Let netdev notifier listeners know about link and slave state change.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to state notifications.
We allow caller to indicate if the notification should happen now or later,
depending on if he holds rtnl mutex or not. Introduce bond_slave_link_notify
function (similar to bond_slave_state_notify) which is later on called
with rtnl mutex and goes over slaves and executes delayed notification.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While testing the np->opt RCU conversion, I found that UDP/IPv6 was
using a mixture of xchg() and sk_dst_lock to protect concurrent changes
to sk->sk_dst_cache, leading to possible corruptions and crashes.
ip6_sk_dst_lookup_flow() uses sk_dst_check() anyway, so the simplest
way to fix the mess is to remove sk_dst_lock completely, as we did for
IPv4.
__ip6_dst_store() and ip6_dst_store() share same implementation.
sk_setup_caps() being called with socket lock being held or not,
we have to use sk_dst_set() instead of __sk_dst_set()
Note that I had to move the "np->dst_cookie = rt6_get_cookie(rt);"
in ip6_dst_store() before the sk_setup_caps(sk, dst) call.
This is because ip6_dst_store() can be called from process context,
without any lock held.
As soon as the dst is installed in sk->sk_dst_cache, dst can be freed
from another cpu doing a concurrent ip6_dst_store()
Doing the dst dereference before doing the install is needed to make
sure no use after free would trigger.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If tcp_send_ack() can not allocate skb, we properly handle this
and setup a timer to try later.
Use __GFP_NOWARN to avoid polluting syslog in the case host is
under memory pressure, so that pertinent messages are not lost under
a flood of useless information.
sk_gfp_atomic() can use its gfp_mask argument (all callers currently
were using GFP_ATOMIC before this patch)
We rename sk_gfp_atomic() to sk_gfp_mask() to clearly express this
function now takes into account its second argument (gfp_mask)
Note that when tcp_transmit_skb() is called with clone_it set to false,
we do not attempt memory allocations, so can pass a 0 gfp_mask, which
most compilers can emit faster than a non zero or constant value.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
They don't need to be any bigger than that and with this we start a new
bitfield for tracking association runtime stuff, like zero window
situation.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch addresses multiple problems :
UDP/RAW sendmsg() need to get a stable struct ipv6_txoptions
while socket is not locked : Other threads can change np->opt
concurrently. Dmitry posted a syzkaller
(http://github.com/google/syzkaller) program desmonstrating
use-after-free.
Starting with TCP/DCCP lockless listeners, tcp_v6_syn_recv_sock()
and dccp_v6_request_recv_sock() also need to use RCU protection
to dereference np->opt once (before calling ipv6_dup_options())
This patch adds full RCU protection to np->opt
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry provided a syzkaller (http://github.com/google/syzkaller)
triggering a fault in sock_wake_async() when async IO is requested.
Said program stressed af_unix sockets, but the issue is generic
and should be addressed in core networking stack.
The problem is that by the time sock_wake_async() is called,
we should not access the @flags field of 'struct socket',
as the inode containing this socket might be freed without
further notice, and without RCU grace period.
We already maintain an RCU protected structure, "struct socket_wq"
so moving SOCKWQ_ASYNC_NOSPACE & SOCKWQ_ASYNC_WAITDATA into it
is the safe route.
It also reduces number of cache lines needing dirtying, so might
provide a performance improvement anyway.
In followup patches, we might move remaining flags (SOCK_NOSPACE,
SOCK_PASSCRED, SOCK_PASSSEC) to save 8 bytes and let 'struct socket'
being mostly read and let it being shared between cpus.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is a cleanup to make following patch easier to
review.
Goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
from (struct socket)->flags to a (struct socket_wq)->flags
to benefit from RCU protection in sock_wake_async()
To ease backports, we rename both constants.
Two new helpers, sk_set_bit(int nr, struct sock *sk)
and sk_clear_bit(int net, struct sock *sk) are added so that
following patch can change their implementation.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit ab450605b3.
In IPv6, we cannot inherit the dst of the original dst. ndisc packets
are IPv6 packets and may take another route than the original packet.
This patch breaks the following scenario: a packet comes from eth0 and
is forwarded through vxlan1. The encapsulated packet triggers an NS
which cannot be sent because of the wrong route.
CC: Jiri Benc <jbenc@redhat.com>
CC: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The memory barrier in the helper wq_has_sleeper is needed by just
about every user of waitqueue_active. This patch generalises it
by making it take a wait_queue_head_t directly. The existing
helper is renamed to skwq_has_sleeper.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for mangling packet payload. Checksum for the specified base
header is updated automatically if requested, however no updates for any
kind of pseudo headers are supported, meaning no stateless NAT is supported.
For checksum updates different checksumming methods can be specified. The
currently supported methods are NONE for no checksum updates, and INET for
internet type checksums.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If a fragmented multicast packet is received on an ethernet device which
has an active macvlan on top of it, each fragment is duplicated and
received both on the underlying device and the macvlan. If some
fragments for macvlan are processed before the whole packet for the
underlying device is reassembled, the "overlapping fragments" test in
ip6_frag_queue() discards the whole fragment queue.
To resolve this, add device ifindex to the search key and require it to
match reassembling multicast packets and packets to link-local
addresses.
Note: similar patch has been already submitted by Yoshifuji Hideaki in
http://patchwork.ozlabs.org/patch/220979/
but got lost and forgotten for some reason.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2015-11-23
Here's the first bluetooth-next pull request for the 4.5 kernel.
- Add new Get Advertising Size Information management command
- Add support for new system note message type on monitor channel
- Refactor LE scan changes behind separate workqueue to avoid races
- Fix issue with privacy feature when powering on adapter
- Various minor fixes & cleanups here and there
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Rainer Weikusat <rweikusat@mobileactivedefense.com> writes:
An AF_UNIX datagram socket being the client in an n:1 association with
some server socket is only allowed to send messages to the server if the
receive queue of this socket contains at most sk_max_ack_backlog
datagrams. This implies that prospective writers might be forced to go
to sleep despite none of the message presently enqueued on the server
receive queue were sent by them. In order to ensure that these will be
woken up once space becomes again available, the present unix_dgram_poll
routine does a second sock_poll_wait call with the peer_wait wait queue
of the server socket as queue argument (unix_dgram_recvmsg does a wake
up on this queue after a datagram was received). This is inherently
problematic because the server socket is only guaranteed to remain alive
for as long as the client still holds a reference to it. In case the
connection is dissolved via connect or by the dead peer detection logic
in unix_dgram_sendmsg, the server socket may be freed despite "the
polling mechanism" (in particular, epoll) still has a pointer to the
corresponding peer_wait queue. There's no way to forcibly deregister a
wait queue with epoll.
Based on an idea by Jason Baron, the patch below changes the code such
that a wait_queue_t belonging to the client socket is enqueued on the
peer_wait queue of the server whenever the peer receive queue full
condition is detected by either a sendmsg or a poll. A wake up on the
peer queue is then relayed to the ordinary wait queue of the client
socket via wake function. The connection to the peer wait queue is again
dissolved if either a wake up is about to be relayed or the client
socket reconnects or a dead peer is detected or the client socket is
itself closed. This enables removing the second sock_poll_wait from
unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
that no blocked writer sleeps forever.
Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Fixes: ec0d215f94 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
Reviewed-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The previous patch changed nf_ct_frag6_gather() to morph reassembled skb
with the previous one.
This means that the return value is always NULL or the skb argument.
So change it to an err value.
Instead of invoking NF_HOOK recursively with threshold to skip already-called hooks
we can now just return NF_ACCEPT to move on to the next hook except for
-EINPROGRESS (which means skb has been queued for reassembly), in which case we
return NF_STOLEN.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
commit 6aafeef03b
("netfilter: push reasm skb through instead of original frag skbs")
changed ipv6 defrag to not use the original skbs anymore.
So rather than keeping the original skbs around just to discard them
afterwards just use the original skbs directly for the fraglist of
the newly assembled skb and remove the extra clone/free operations.
The skb that completes the fragment queue is morphed into a the
reassembled one instead, just like ipv4 defrag.
openvswitch doesn't need any additional skb_morph magic anymore to deal
with this situation so just remove that.
A followup patch can then also remove the NF_HOOK (re)invocation in
the ipv6 netfilter defrag hook.
Cc: Joe Stringer <joestringer@nicira.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Some boards have a gpio line tied to the switch reset pin. Allow this
gpio to be retrieved from the device tree, and take the switch out of
reset before performing the probe.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Get Advertising Size Information command allows to retrieve size
information for advertising data and scan response data fields depending
on the selected flags. This is useful if applications want to know the
available size ahead of time.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Advertising reordering window in ADDBA less than 64 can crash some APs,
an example is LinkSys WRT120N (with FW v1.0.07 build 002 Jun 18 2012).
On the other hand, a driver may need to limit Tx A-MPDU size for its own
reasons, like specific HW limitations.
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The hci_connect_le_scan() is (as the name implies) a master/central
role API, so it makes no sense in passing a role parameter to it. At
the same time this patch also fixes the direct advertising support for
LE L2CAP sockets where we now call the more appropriate hci_le_connect()
API if slave/peripheral role is desired.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Since discovery also deals with LE scanning it makes sense to move it
behind the same req_workqueue as other LE scanning changes. This also
simplifies the logic since we do many of the actions in a synchronous
manner.
Part of this refactoring is moving hci_req_stop_discovery() to
hci_request.c. At the same time the function receives support for
properly handling the STOPPING state since that's the state we'll be
in when stopping through the req_workqueue.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Since discovery also deals with LE scanning it makes sense to move it
behind the same req_workqueue as other LE scanning changes. This also
simplifies the logic since we do many of the actions in a synchronous
manner.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
To avoid any risks of races, place also these LE scan modification
work callbacks behind the same work queue as the other LE scan
changes.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
In some cases it may be important to get the exact HCI status rather
than the converted HCI-to-errno value. Add an optional return
parameter to the hci_req_sync() API to allow for this. Since there are
no good HCI translation candidates for cancelation and timeout, use
the "unknown" status code for those cases.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Instead of firing off a simple async request queue all background scan
updates through req_workqueue and use hci_req_sync() there to ensure
that no two updates overlap with each other.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The hci_conn_params_clear_all() function is only called from
hci_unregister_dev() at which point it's completely futile to try to
do any LE scanning updates. Simply remove this unnecessary function
call. At the same time we can make the function static since it's only
accessed from within the same c-file.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
To enable controller specific logging, the userspace daemon has to have
the ability to log per controller. To facilitate this support, provide
a dedicated logging channel. Messages in this channel will be included
in the monitor queue and with that also forwarded to monitoring tools
along with the actual hardware traces.
All messages from the logging channel are timestamped and with that
allow an easy correlation between userspace messages and hardware
events. This will increase the ability to debug problems faster.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
The monitor channel can be used to send generic system notes as text
strings for debugging purposes. This adds the system note monitor code
and uses it for including kernel and subsystem version into traces.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
We can reduce the size of the hci_ctrl struct by converting
'bool req_start' to 'u8 req_flags' and making the two function
pointers a union (since only one is ever set at a time).
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The socket allocation functions will always memset skb->cb to zero so
there's no need to make other initializations needing the same thing.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
For all the HCI driver related variables accesssed via bt_cb(skb),
provide helper wrappers.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
There is really little gain from inlining this big function.
We'll soon make it even bigger in following patches.
This means we no longer need to export napi_by_id()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rtnl_fdb_dump always expects an index to be returned by the ndo_fdb_dump op,
but when CONFIG_NET_SWITCHDEV is off, it returns an error.
Fix that by returning the given unmodified idx.
A similar fix was 0890cf6cb6 ("switchdev: fix return value of
switchdev_port_fdb_dump in case of error") but for the CONFIG_NET_SWITCHDEV=y
case.
Fixes: 45d4122ca7 ("switchdev: add support for fdb add/del/dump via switchdev_port_obj ops.")
Signed-off-by: Dragos Tatulea <dragos@endocode.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drivers like vxlan use the recently introduced
udp_tunnel_xmit_skb/udp_tunnel6_xmit_skb APIs. udp_tunnel6_xmit_skb
makes use of ip6tunnel_xmit, and ip6tunnel_xmit, after sending the
packet, updates the struct stats using the usual
u64_stats_update_begin/end calls on this_cpu_ptr(dev->tstats).
udp_tunnel_xmit_skb makes use of iptunnel_xmit, which doesn't touch
tstats, so drivers like vxlan, immediately after, call
iptunnel_xmit_stats, which does the same thing - calls
u64_stats_update_begin/end on this_cpu_ptr(dev->tstats).
While vxlan is probably fine (I don't know?), calling a similar function
from, say, an unbound workqueue, on a fully preemptable kernel causes
real issues:
[ 188.434537] BUG: using smp_processor_id() in preemptible [00000000] code: kworker/u8:0/6
[ 188.435579] caller is debug_smp_processor_id+0x17/0x20
[ 188.435583] CPU: 0 PID: 6 Comm: kworker/u8:0 Not tainted 4.2.6 #2
[ 188.435607] Call Trace:
[ 188.435611] [<ffffffff8234e936>] dump_stack+0x4f/0x7b
[ 188.435615] [<ffffffff81915f3d>] check_preemption_disabled+0x19d/0x1c0
[ 188.435619] [<ffffffff81915f77>] debug_smp_processor_id+0x17/0x20
The solution would be to protect the whole
this_cpu_ptr(dev->tstats)/u64_stats_update_begin/end blocks with
disabling preemption and then reenabling it.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some functions access TCP sockets without holding a lock and
might output non consistent data, depending on compiler and or
architecture.
tcp_diag_get_info(), tcp_get_info(), tcp_poll(), get_tcp4_sock() ...
Introduce sk_state_load() and sk_state_store() to fix the issues,
and more clearly document where this lack of locking is happening.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All DST_NOCACHE rt6_info used to have rt->dst.from set to
its parent.
After commit 8e3d5be736 ("ipv6: Avoid double dst_free"),
DST_NOCACHE is also set to rt6_info which does not have
a parent (i.e. rt->dst.from is NULL).
This patch catches the rt->dst.from == NULL case.
Fixes: 8e3d5be736 ("ipv6: Avoid double dst_free")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree. This
large batch that includes fixes for ipset, netfilter ingress, nf_tables
dynamic set instantiation and a longstanding Kconfig dependency problem.
More specifically, they are:
1) Add missing check for empty hook list at the ingress hook, from
Florian Westphal.
2) Input and output interface are swapped at the ingress hook,
reported by Patrick McHardy.
3) Resolve ipset extension alignment issues on ARM, patch from Jozsef
Kadlecsik.
4) Fix bit check on bitmap in ipset hash type, also from Jozsef.
5) Release buckets when all entries have expired in ipset hash type,
again from Jozsef.
6) Oneliner to initialize conntrack tuple object in the PPTP helper,
otherwise the conntrack lookup may fail due to random bits in the
structure holes, patch from Anthony Lineham.
7) Silence a bogus gcc warning in nfnetlink_log, from Arnd Bergmann.
8) Fix Kconfig dependency problems with TPROXY, socket and dup, also
from Arnd.
9) Add __netdev_alloc_pcpu_stats() to allow creating percpu counters
from atomic context, this is required by the follow up fix for
nf_tables.
10) Fix crash from the dynamic set expression, we have to add new clone
operation that should be defined when a simple memcpy is not enough.
This resolves a crash when using per-cpu counters with new Patrick
McHardy's flow table nft support.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix null deref in xt_TEE netfilter module, from Eric Dumazet.
2) Several spots need to get to the original listner for SYN-ACK
packets, most spots got this ok but some were not. Whilst covering
the remaining cases, create a helper to do this. From Eric Dumazet.
3) Missiing check of return value from alloc_netdev() in CAIF SPI code,
from Rasmus Villemoes.
4) Don't sleep while != TASK_RUNNING in macvtap, from Vlad Yasevich.
5) Use after free in mvneta driver, from Justin Maggard.
6) Fix race on dst->flags access in dst_release(), from Eric Dumazet.
7) Add missing ZLIB_INFLATE dependency for new qed driver. From Arnd
Bergmann.
8) Fix multicast getsockopt deadlock, from WANG Cong.
9) Fix deadlock in btusb, from Kuba Pawlak.
10) Some ipv6_add_dev() failure paths were not cleaning up the SNMP6
counter state. From Sabrina Dubroca.
11) Fix packet_bind() race, which can cause lost notifications, from
Francesco Ruggeri.
12) Fix MAC restoration in qlcnic driver during bonding mode changes,
from Jarod Wilson.
13) Revert bridging forward delay change which broke libvirt and other
userspace things, from Vlad Yasevich.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (65 commits)
Revert "bridge: Allow forward delay to be cfgd when STP enabled"
bpf_trace: Make dependent on PERF_EVENTS
qed: select ZLIB_INFLATE
net: fix a race in dst_release()
net: mvneta: Fix memory use after free.
net: Documentation: Fix default value tcp_limit_output_bytes
macvtap: Resolve possible __might_sleep warning in macvtap_do_read()
mvneta: add FIXED_PHY dependency
net: caif: check return value of alloc_netdev
net: hisilicon: NET_VENDOR_HISILICON should depend on HAS_DMA
drivers: net: xgene: fix RGMII 10/100Mb mode
netfilter: nft_meta: use skb_to_full_sk() helper
net_sched: em_meta: use skb_to_full_sk() helper
sched: cls_flow: use skb_to_full_sk() helper
netfilter: xt_owner: use skb_to_full_sk() helper
smack: use skb_to_full_sk() helper
net: add skb_to_full_sk() helper and use it in selinux_netlbl_skbuff_setsid()
bpf: doc: correct arch list for supported eBPF JIT
dwc_eth_qos: Delete an unnecessary check before the function call "of_node_put"
bonding: fix panic on non-ARPHRD_ETHER enslave failure
...
With the conversion of the counter expressions to make it percpu, we
need to clone the percpu memory area, otherwise we crash when using
counters from flow tables.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Generalize selinux_skb_sk() added in commit 212cd08953
("selinux: fix random read in selinux_ip_postroute_compat()")
so that we can use it other contexts.
Use it right away in selinux_netlbl_skbuff_setsid()
Fixes: ca6fb06518 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts. They are expected to be high priority and
have access one of two watermarks lower than "min" which can be referred
to as the "atomic reserve". __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".
Over time, callers had a requirement to not block when fallback options
were available. Some have abused __GFP_WAIT leading to a situation where
an optimisitic allocation with a fallback option can access atomic
reserves.
This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
cannot sleep and have no alternative. High priority users continue to use
__GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify
callers that want to wake kswapd for background reclaim. __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.
This patch then converts a number of sites
o __GFP_ATOMIC is used by callers that are high priority and have memory
pools for those requests. GFP_ATOMIC uses this flag.
o Callers that have a limited mempool to guarantee forward progress clear
__GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
into this category where kswapd will still be woken but atomic reserves
are not used as there is a one-entry mempool to guarantee progress.
o Callers that are checking if they are non-blocking should use the
helper gfpflags_allow_blocking() where possible. This is because
checking for __GFP_WAIT as was done historically now can trigger false
positives. Some exceptions like dm-crypt.c exist where the code intent
is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
flag manipulations.
o Callers that built their own GFP flags instead of starting with GFP_KERNEL
and friends now also need to specify __GFP_KSWAPD_RECLAIM.
The first key hazard to watch out for is callers that removed __GFP_WAIT
and was depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.
The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL. They may
now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johan Hedberg says:
====================
pull request: bluetooth 2015-11-05
The following set of Bluetooth patches would be good to get into 4.4-rc1
if possible:
- Fix for missing LE CoC parameter validity checks
- Fix for potential deadlock in btusb
- Fix for issuing unsupported commands during HCI init
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The core spec defines specific response codes for situations when the
received CID is incorrect. Add the defines for these and return them
as appropriate from the LE Connect Request handler function.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
In tun_dst_unclone() the return value of skb_metadata_dst() is checked
for being NULL after it is dereferenced. Fix this by moving the
dereference after the NULL check.
Found by the Coverity scanner (CID 1338068).
Fixes: fc4099f172 ("openvswitch: Fix egress tunnel info.")
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Minor overlapping changes in net/ipv4/ipmr.c, in 'net' we were
fixing the "BH-ness" of the counter bumps whilst in 'net-next'
the functions were modified to take an explicit 'net' parameter.
Signed-off-by: David S. Miller <davem@davemloft.net>
* remove a warning on a check that can trigger without any
errors having happened (Andrei)
* correctly handle deauth request while in the process of
associating (Andrei)
* fix TDLS HT operation (Arik)
* allow changing AID/listen interval during client setup (Ayala)
* be more forgiving with WMM parameters to get HT/VHT in case of
broken APs with bad WMM settings (Emmanuel, myself)
* a number of other fixes (some in documentation)
-----BEGIN PGP SIGNATURE-----
iQIcBAABCAAGBQJWOKsAAAoJEDBSmw7B7bqrNKMQAKFH81CscgJGOQb/zgGdmuF3
kNrWrnH+3XqoqM2rHpIekQLxVkeUhM+hHCyaPCK7rVnCuu53pJ0u7P0rq922XAW4
olFBGVdE1yG/69ndR9MYDjLWP+ikMmAiMLbM5qzPuDJ5XyBVACC1D82+qSRvByCK
Z8PYJ+OsLk05eKa4ER7i8BVExRGM4vrce4Uh3K07yNKbfU81ztsltwflleRFGn3f
OCydpuSId+C4TuSmkgBJF1718B9GazvAbZDw5t4jorIrbiZzQMZAtoi+YxwXVhev
lvPCO8p1+lhWYUOK5LnO8mbdUFfe+kc3rrZjKWuXuDLp6mvPyP9FOaIFjFfnlJxT
8QadG0QDzTlLHUj29gvrnww8aob9c7iHueXP9OlcBMp9uTyklgBJ+fMyvPfXpWXB
Diy9n0VJfWzg8d74wWLLQy/N1qY6gwhXXwgW8TM/49O5BpbyvVsI6jFAR+8ZT9b9
GLGEkN68RBuY03mejkf4PmhqgMVErA2JtabRI0Efm2Do85t9ZxgObF6INsrZ+o2M
ffl7jhyHsFB+d38Ilwlb4cyWhxpIGrhTtt2h5zIsgNx3wmrXrarwMM3P4NGOOEbP
Euqkk/LoMZdjjB/78JSi6hdQSYoQFaW85tHBzXhMXk0nYXHLWdVEJsLuAtATl8gM
vzNkny8pcaLnRg/kXqgl
=/d5+
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2015-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Another set of fixes:
* remove a warning on a check that can trigger without any
errors having happened (Andrei)
* correctly handle deauth request while in the process of
associating (Andrei)
* fix TDLS HT operation (Arik)
* allow changing AID/listen interval during client setup (Ayala)
* be more forgiving with WMM parameters to get HT/VHT in case of
broken APs with bad WMM settings (Emmanuel, myself)
* a number of other fixes (some in documentation)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Channel context driver operations can sleep, so add might_sleep()
and document this.
Signed-off-by: Chaitanya T K <chaitanya.mgit@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The previous patch changed mac80211 to always report an event
after a CQM RSSI reconfiguration. Document that as expected
behaviour in both the cfg80211 and mac80211 API.
Currently, iwlmvm already implements that behaviour; the other
drivers implementing CQM RSSI events may have to be changed.
This behaviour lets userspace know what the current state is
without relying on querying the data which is racy.
Reviewed-by: Sharon, Sara <sara.sharon@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Old logic of updating state-machine is not required since
ad_update_actor_keys() does it implicitly. The only loss is
the notification differentiation between speed vs. duplex
change. Now only one unified notification is printed.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes following problems :
1) percpu_counter_init() can return an error, therefore
init_frag_mem_limit() must propagate this error so that
inet_frags_init_net() can do the same up to its callers.
2) If ip[46]_frags_ns_ctl_register() fail, we must unwind
properly and free the percpu_counter.
Without this fix, we leave freed object in percpu_counters
global list (if CONFIG_HOTPLUG_CPU) leading to crashes.
This bug was detected by KASAN and syzkaller tool
(http://github.com/google/syzkaller)
Fixes: 6d7b857d54 ("net: use lib/percpu_counter API for fragmentation mem accounting")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Under low memory conditions, tcp_sk_init() and icmp_sk_init()
can both iterate on all possible cpus and call inet_ctl_sock_destroy(),
with eventual NULL pointer.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_set_owner_w() is called from various places that assume
skb->sk always point to a full blown socket (as it changes
sk->sk_wmem_alloc)
We'd like to attach skb to request sockets, and in the future
to timewait sockets as well. For these kind of pseudo sockets,
we need to take a traditional refcount and use sock_edemux()
as the destructor.
It is now time to un-inline skb_set_owner_w(), being too big.
Fixes: ca6fb06518 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Bisected-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When fib_netdev_event calls fib_disable_ip on NETDEV_DOWN event
we should not delete the local routes if the local address
is still present. The confusion comes from the fact that both
fib_netdev_event and fib_inetaddr_event use the NETDEV_DOWN
constant. Fix it by returning back the variable 'force'.
Steps to reproduce:
modprobe dummy
ifconfig dummy0 192.168.168.1 up
ifconfig dummy0 down
ip route list table local | grep dummy | grep host
local 192.168.168.1 dev dummy0 proto kernel scope host src 192.168.168.1
Fixes: 8a3d03166f ("net: track link-status of ipv4 nexthops")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simplify DSA by pushing the switchdev objects for VLAN add and delete
operations down to its drivers. Currently only mv88e6xxx is affected.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SS_LISTEN socket state is defined by both af_vsock.c and
vmci_transport.c. This is risky since the value could be changed in one
file and the other would be out of sync.
Rename from SS_LISTEN to VSOCK_SS_LISTEN since the constant is not part
of enum socket_state (SS_CONNECTED, ...). This way it is clear that the
constant is vsock-specific.
The big text reflow in af_vsock.c was necessary to keep to the maximum
line length. Text is unchanged except for s/SS_LISTEN/VSOCK_SS_LISTEN/.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the NFC pull request for 4.4.
It's a bit bigger than usual, the 3 main culprits being:
- A new driver for Intel's Fields Peak NCI chipset. In order to
support this chipset we had to export a few NCI routines and
extend the driver NCI ops to not only support proprietary
commands but also core ones.
- Support for vendor commands for both STM drivers, st-nci
and st21nfca. Those vendor commands allow to run factory tests
through the NFC netlink interface.
- New i2c and SPI support for the Marvell driver, together with
firmware download support for this driver's core.
Besides that we also have:
- A few file renames in the STM drivers, to keep the naming
consistent between drivers.
- Some improvements and fixes on the NCI HCI layer, mostly to
properly reach a secure element over a legacy HCI link.
- A few fixes for the s3fwrn5 and trf7970a drivers.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWMGIaAAoJEIqAPN1PVmxKHjYP/3Q3Y4Vhvw3kTfDP3IlnAuH3
XjBMGKPLu72MmtSk9jFOr5VuC76YtJzwf+4nKGJybu619NPKfxXN7r83bpsZV1Bk
+0cS1RpjIQh92a0ElvX1muCFhgH7ax7zeqQ+29OSpA33e67/DlUcwxiqzF15cwWC
Bk0pUv1FxMoNi5ZkG1JrRqrhx/Yqo1dw2HrnMbKVgwLtLzODBuzoGVKfydTo0b1j
hkl30DPF3AYMxnwIml3tM8zT96b1LtD0Xgs1yF8IdrIJ+6YLn/6tnw1rUxnE8Ovo
JFtvtS0OKjGTNFr1NhueG0i5td8TR4MAJKHh0Lz9ISIFWtNtsVjFfGuAeWZcQ/n/
rQS7xOvzBCndNbe8PS9wBiNQAqLAH5/dzvKRwNboRttkpwIrNOgaBYj2LpuRzpfO
p+ArwBryAorfxQVOIWl4knc59UsiPUKOK61uMTZ1sU7jCEvUNVChIm8EGRlMnpMQ
ZFlBa2lqNdgz7ubKLofbnWLiCNY6r0E13MSLHZlJZX61IMjs13ojDeKMvitFBe+b
1hDwbSWxIRB8xKbcsIA9bPUnEc16Syywz/Q4iAsE8Gy6l5J41MhA/q2QaO9WSrPE
Leah53l5EwQRd55WjJkCkIKZwvCjIerkESfAS0oprELIYXaxs/1PbVl6C7VYZA4K
A5tYLw2vS+tTK4Mgi/ym
=aw/h
-----END PGP SIGNATURE-----
Merge tag 'nfc-next-4.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next
Samuel Ortiz says:
====================
NFC 4.4 pull request
This is the NFC pull request for 4.4.
It's a bit bigger than usual, the 3 main culprits being:
- A new driver for Intel's Fields Peak NCI chipset. In order to
support this chipset we had to export a few NCI routines and
extend the driver NCI ops to not only support proprietary
commands but also core ones.
- Support for vendor commands for both STM drivers, st-nci
and st21nfca. Those vendor commands allow to run factory tests
through the NFC netlink interface.
- New i2c and SPI support for the Marvell driver, together with
firmware download support for this driver's core.
Besides that we also have:
- A few file renames in the STM drivers, to keep the naming
consistent between drivers.
- Some improvements and fixes on the NCI HCI layer, mostly to
properly reach a secure element over a legacy HCI link.
- A few fixes for the s3fwrn5 and trf7970a drivers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2015-10-28
Here are a some more Bluetooth patches for 4.4 which collected up during
the past week. The most important ones are from Kuba Pawlak for fixing
locking issues with SCO sockets. There's also a fix from Alexander Aring
for 6lowpan, a memleak fix from Julia Lawall for the btmrvl driver and
some cleanup patches from Marcel.
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes a build error that seems to be toochain
dependent (Not seen with gcc v5.1):
In file included from net/nfc/nci/rsp.c:36:0:
net/nfc/nci/rsp.c: In function ‘nci_rsp_packet’:
include/net/nfc/nci_core.h:355:12: error: inlining failed in call to
always_inline ‘nci_prop_rsp_packet’: function body not available
inline int nci_prop_rsp_packet(struct nci_dev *ndev, __u16 opcode,
Signed-off-by: Robert Dolca <robert.dolca@intel.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Adding IPv6 for the TSO helper API is trivial:
* Don't play with the id (which doesn't exist in IPv6)
* Correctly update the payload_len (don't include the
length of the IP header itself)
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In some cases low level drivers might want to update the
SPI transfer clock (e.g. during firmware download).
This patch adds this support. Without any modification the
driver will use the default SPI clock (from pdata or device tree).
Signed-off-by: Vincent Cuissard <cuissard@marvell.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
This driver adds the support of I2C-based Marvell NFC controller.
Signed-off-by: Vincent Cuissard <cuissard@marvell.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Export nci_send_frame and nci_send_cmd symbols to allow drivers
to use it. This is needed for example if NCI is used during
firmware download phase.
Signed-off-by: Vincent Cuissard <cuissard@marvell.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
In order to manage in a better way the nci poll mode state machine,
add mode parameter to deactivate_target functions.
This way we can manage different target state.
mode parameter make sense only in nci core.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Add support for proprietary commands useful mainly for
factory testings. Here is a list:
- FACTORY_MODE: Allow to set the driver into a mode where
no secure element are activated. It does not consider any
NFC_ATTR_VENDOR_DATA.
- HCI_CLEAR_ALL_PIPES: Allow to execute a HCI clear all pipes
command. It does not consider any NFC_ATTR_VENDOR_DATA.
- HCI_DM_PUT_DATA: Allow to configure specific CLF registry
like for example RF trimmings or low level drivers
configurations (I2C, SPI, SWP).
- HCI_DM_UPDATE_AID: Allow to configure an AID routing into the
CLF routing table following RF technology, CLF mode or protocol.
- HCI_DM_GET_INFO: Allow to retrieve CLF information.
- HCI_DM_GET_DATA: Allow to retrieve CLF configurable data such as
low level drivers configurations or RF trimmings.
- HCI_DM_DIRECT_LOAD: Allow to load a firmware into the CLF.
A complete packet can be more than 8KB.
- HCI_DM_RESET: Allow to run a CLF reset in order to "commit" CLF
configuration changes without CLF power off.
- HCI_GET_PARAM: Allow to retrieve an HCI CLF parameter (for example
the white list).
- HCI_DM_FIELD_GENERATOR: Allow to generate different kind of RF
technology. When using this command to anti-collision is done.
- HCI_LOOPBACK: Allow to echo a command and test the Dh to CLF
connectivity.
- HCI_DM_VDC_MEASUREMENT_VALUE: Allow to measure the field applied
on the CLF antenna. A value between 0 and 0x0f is returned. 0 is
maximum.
- HCI_DM_FWUPD_START: Allow to put CLF into firmware update mode.
It is a specific CLF command as there is no GPIO for this.
- HCI_DM_FWUPD_END: Allow to complete firmware update.
- HCI_DM_VDC_VALUE_COMPARISON: Allow to compare the field applied
on the CLF antenna to a reference value.
- MANUFACTURER_SPECIFIC: Allow to retrieve manufacturer specific data
received during a NCI_CORE_INIT_CMD.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The SKB context buffer for HCI request is really not just for requests,
information in their are preserved for the whole HCI layer. So it makes
more sense to actually rename it into bt_cb()->hci and also call it then
struct hci_ctrl.
In addition that allows moving the decoded opcode for outgoing packets
into that struct. So far it was just consuming valuable space from the
main shared items. And opcode are not valid for L2CAP packets.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
NCI_HCI_IDENTITY_MGMT_GATE might be useful to get information
about hardware or firmware version.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
nci_hci_clear_all_pipes might be use full in some cases
for example after a firmware update.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
This functin takes as a parameter a pointer to the nci_dev
struct and the first byte from the values of the first domain
specific parameter that was used for the connection creation.
Signed-off-by: Robert Dolca <robert.dolca@intel.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Initially it was used to create hooks in the driver for
proprietary operations. Currently it is being used for hooks
for both proprietary and generic operations.
Signed-off-by: Robert Dolca <robert.dolca@intel.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The driver may be required to act when some responses or
notifications arrive. For example the NCI core does not have a
handler for NCI_OP_CORE_GET_CONFIG_RSP. The NFCC can send a
config response that has to be read by the driver and the packet
may contain vendor specific data.
The Fields Peak driver needs to take certain actions when a reset
notification arrives (packet also not handled by the nfc core).
The driver handlers do not interfere with the core and they are
called after the core processes the packet.
Signed-off-by: Robert Dolca <robert.dolca@intel.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
This allows sending core commands from the driver. The driver
should be able to send NCI core commands like CORE_GET_CONFIG_CMD.
Signed-off-by: Robert Dolca <robert.dolca@intel.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Add NCI_OP_CORE_GET_CONFIG_CMD, NCI_OP_CORE_GET_CONFIG_RSP
and NCI_OP_CORE_RESET_NTF.
Signed-off-by: Robert Dolca <robert.dolca@intel.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
FDP driver needs to send the firmware as regular packets
(not fragmented). The driver should have a way to
get the max packet size for a given connection.
Signed-off-by: Robert Dolca <robert.dolca@intel.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Conflicts:
net/ipv6/xfrm6_output.c
net/openvswitch/flow_netlink.c
net/openvswitch/vport-gre.c
net/openvswitch/vport-vxlan.c
net/openvswitch/vport.c
net/openvswitch/vport.h
The openvswitch conflicts were overlapping changes. One was
the egress tunnel info fix in 'net' and the other was the
vport ->send() op simplification in 'net-next'.
The xfrm6_output.c conflicts was also a simplification
overlapping a bug fix.
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2015-10-22
Here's probably the last bluetooth-next pull request for 4.4. Among
several other changes it contains the rest of the fixes & cleanups from
the Bluetooth UnplugFest (that didn't need to be hurried to 4.3).
- Refactoring & cleanups to 6lowpan code
- New USB ids for two Atheros controllers and BCM43142A0 from Broadcom
- Fix (quirk) for broken Broadcom BCM2045 controllers
- Support for latest Apple controllers
- Improvements to the vendor diagnostic message support
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for MPLS multipath routes.
Includes following changes to support multipath:
- splits struct mpls_route into 'struct mpls_route + struct mpls_nh'
- 'struct mpls_nh' represents a mpls nexthop label forwarding entry
- moves mpls route and nexthop structures into internal.h
- A mpls_route can point to multiple mpls_nh structs
- the nexthops are maintained as a array (similar to ipv4 fib)
- In the process of restructuring, this patch also consistently changes
all labels to u8
- Adds support to parse/fill RTA_MULTIPATH netlink attribute for
multipath routes similar to ipv4/v6 fib
- In this patch, the multipath route nexthop selection algorithm
simply returns the first nexthop. It is replaced by a
hash based algorithm from Robert Shearman in the next patch
- mpls_route_update cleanup: remove 'dev' handling in mpls_route_update.
mpls_route_update though implemented to update based on dev, it was
never used that way. And the dev handling gets tricky with multiple
nexthops. Cannot match against any single nexthops dev. So, this patch
removes the unused 'dev' handling in mpls_route_update.
- dead route/path handling will be implemented in a subsequent patch
Example:
$ip -f mpls route add 100 nexthop as 200 via inet 10.1.1.2 dev swp1 \
nexthop as 700 via inet 10.1.1.6 dev swp2 \
nexthop as 800 via inet 40.1.1.2 dev swp3
$ip -f mpls route show
100
nexthop as to 200 via inet 10.1.1.2 dev swp1
nexthop as to 700 via inet 10.1.1.6 dev swp2
nexthop as to 800 via inet 40.1.1.2 dev swp3
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Multiple cpus can process duplicates of incoming ACK messages
matching a SYN_RECV request socket. This is a rare event under
normal operations, but definitely can happen.
Only one must win the race, otherwise corruption would occur.
To fix this without adding new atomic ops, we use logic in
inet_ehash_nolisten() to detect the request was present in the same
ehash bucket where we try to insert the new child.
If request socket was not found, we have to undo the child creation.
This actually removes a spin_lock()/spin_unlock() pair in
reqsk_queue_unlink() for the fast path.
Fixes: e994b2f0fb ("tcp: do not lock listener to process SYN packets")
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While transitioning to netdev based vport we broke OVS
feature which allows user to retrieve tunnel packet egress
information for lwtunnel devices. Following patch fixes it
by introducing ndo operation to get the tunnel egress info.
Same ndo operation can be used for lwtunnel devices and compat
ovs-tnl-vport devices. So after adding such device operation
we can remove similar operation from ovs-vport.
Fixes: 614732eaa1 ("openvswitch: Use regular VXLAN net_device device").
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No driver implements port_fdb_getnext anymore, and port_fdb_dump is
preferred anyway, so remove this function from DSA.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Not all switch chips support a Get Next operation to iterate on its FDB.
So add a more simple port_fdb_dump function for them.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* I merged net-next back to avoid a conflict with the
* cfg80211 scheduled scan API extensions
* preparations for better scan result timestamping
* regulatory cleanups
* mac80211 statistics cleanups
* a few other small cleanups and fixes
-----BEGIN PGP SIGNATURE-----
iQIcBAABCAAGBQJWJ6lbAAoJEDBSmw7B7bqraasP/Ryaa7zL10E+dOQtqBQHQeMe
olbrCUtTYltr4nnuESzh5WPeIVZBQ0DIduoLLF0IDSPVwE/NrbpFUVIMHvJvr+s7
rE9k8RB4P7BMTjf+mkDX1Od9kCKGkt4ezcyt/oNIsqM12SN9JQ99itwz6Mp94xCs
XKsiXJRh9f/8Qwd/74qQq1Va3UfGAVuKO8WpUe/A7TYTla8ZY20pv1D8kQKQzrFg
DwsMirjmHcUpobSjnPAAmZevRxdk6o0E+P7DYG172H2Tm8/EIMR/gYMnQeYW6HkA
lfMMDfAGmNvyRm8v1iuBLodREP4kn4VbhMSZDtH7D6FYfmJh5fSeG09bSe51G5Xh
zv/B8A1cCbWFqtQHp3wI6ml8VDyAhDc2Hvqb75KRn6FplIkEiszVP0y3cNHWiJVt
Ix6Sysoa6kQDXEgR50APeLJ3VI+/mhXmvIila4jP9PKhO14SDHrCoRQO62Z0COJ7
2E5Ir2KE8T+O9mSeuB7m8xD/t60HDd3q3tLZmH0Ps6xfxKf9y2hdZacbX4Hi5Mqk
2XxXZYnhAXUqZmZhmG3ajnEiB4UGMt21R7dIqNTaQ9chOGBkHqIZxPm82XtNb13h
yHILavGpUDT0z6OB2z8fxUcj4a4SrrK+aiIGh4iFpDR0Nu0IyZ5cPHXY2FfvJWmD
ZO74RMEpBodYR8BsV4yP
=uZ5N
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2015-10-21' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
Here's another set of patches for the current cycle:
* I merged net-next back to avoid a conflict with the
* cfg80211 scheduled scan API extensions
* preparations for better scan result timestamping
* regulatory cleanups
* mac80211 statistics cleanups
* a few other small cleanups and fixes
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
if_nlmsg_size() overestimates the minimum allocation size of netlink
dump request (when called from rtnl_calcit()) or the size of the
message (when called from rtnl_getlink()). This is because
ext_filter_mask is not supported by rtnl_link_get_af_size() and
rtnl_link_get_size().
The over-estimation is significant when at least one netdev has many
VLANs configured (8 bytes for each configured VLAN).
This patch-set "rightsizes" the protocol specific attribute size
calculation by propagating ext_filter_mask to rtnl_link_get_af_size()
and adding this a argument to get_link_af_size op in rtnl_af_ops.
Bridge module already used filtering aware sizing for notifications.
br_get_link_af_size_filtered() is consistent with the modified
get_link_af_size op so it replaces br_get_link_af_size() in br_af_ops.
br_get_link_af_size() becomes unused and thus removed.
Signed-off-by: Ronen Arad <ronen.arad@intel.com>
Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There's only one user of this helper which can be replaces with a call
to hci_pend_le_action_lookup() and a check for params->explicit_connect.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Many of the existing LE connection lookups are forced to use
hci_conn_hash_lookup_ba() which doesn't take into account the address
type. What's worse, most of the users don't bother checking that the
returned address type matches what was wanted.
This patch adds a new helper API to look up LE connections based on
their address and address type, paving the way to have the
hci_conn_hash_lookup_ba() users converted to do more precise lookups.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch implements the second half of RACK that uses the the most
recent transmit time among all delivered packets to detect losses.
tcp_rack_mark_lost() is called upon receiving a dubious ACK.
It then checks if an not-yet-sacked packet was sent at least
"reo_wnd" prior to the sent time of the most recently delivered.
If so the packet is deemed lost.
The "reo_wnd" reordering window starts with 1msec for fast loss
detection and changes to min-RTT/4 when reordering is observed.
We found 1msec accommodates well on tiny degree of reordering
(<3 pkts) on faster links. We use min-RTT instead of SRTT because
reordering is more of a path property but SRTT can be inflated by
self-inflicated congestion. The factor of 4 is borrowed from the
delayed early retransmit and seems to work reasonably well.
Since RACK is still experimental, it is now used as a supplemental
loss detection on top of existing algorithms. It is only effective
after the fast recovery starts or after the timeout occurs. The
fast recovery is still triggered by FACK and/or dupack threshold
instead of RACK.
We introduce a new sysctl net.ipv4.tcp_recovery for future
experiments of loss recoveries. For now RACK can be disabled by
setting it to 0.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is the first half of the RACK loss recovery.
RACK loss recovery uses the notion of time instead
of packet sequence (FACK) or counts (dupthresh). It's inspired by the
previous FACK heuristic in tcp_mark_lost_retrans(): when a limited
transmit (new data packet) is sacked, then current retransmitted
sequence below the newly sacked sequence must been lost,
since at least one round trip time has elapsed.
But it has several limitations:
1) can't detect tail drops since it depends on limited transmit
2) is disabled upon reordering (assumes no reordering)
3) only enabled in fast recovery ut not timeout recovery
RACK (Recently ACK) addresses these limitations with the notion
of time instead: a packet P1 is lost if a later packet P2 is s/acked,
as at least one round trip has passed.
Since RACK cares about the time sequence instead of the data sequence
of packets, it can detect tail drops when later retransmission is
s/acked while FACK or dupthresh can't. For reordering RACK uses a
dynamically adjusted reordering window ("reo_wnd") to reduce false
positives on ever (small) degree of reordering.
This patch implements tcp_advanced_rack() which tracks the
most recent transmission time among the packets that have been
delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp
is the key to determine which packet has been lost.
Consider an example that the sender sends six packets:
T1: P1 (lost)
T2: P2
T3: P3
T4: P4
T100: sack of P2. rack.mstamp = T2
T101: retransmit P1
T102: sack of P2,P3,P4. rack.mstamp = T4
T205: ACK of P4 since the hole is repaired. rack.mstamp = T101
We need to be careful about spurious retransmission because it may
falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK
to falsely mark all packets lost, just like a spurious timeout.
We identify spurious retransmission by the ACK's TS echo value.
If TS option is not applicable but the retransmission is acknowledged
less than min-RTT ago, it is likely to be spurious. We refrain from
using the transmission time of these spurious retransmissions.
The second half is implemented in the next patch that marks packet
lost using RACK timestamp.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kathleen Nichols' algorithm for tracking the minimum RTT of a
data stream over some measurement window. It uses constant space
and constant time per update. Yet it almost always delivers
the same minimum as an implementation that has to keep all
the data in the window. The measurement window is tunable via
sysctl.net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes.
The algorithm keeps track of the best, 2nd best & 3rd best min
values, maintaining an invariant that the measurement time of
the n'th best >= n-1'th best. It also makes sure that the three
values are widely separated in the time window since that bounds
the worse case error when that data is monotonically increasing
over the window.
Upon getting a new min, we can forget everything earlier because
it has no value - the new min is less than everything else in the
window by definition and it's the most recent. So we restart fresh
on every new min and overwrites the 2nd & 3rd choices. The same
property holds for the 2nd & 3rd best.
Therefore we have to maintain two invariants to maximize the
information in the samples, one on values (1st.v <= 2nd.v <=
3rd.v) and the other on times (now-win <=1st.t <= 2nd.t <= 3rd.t <=
now). These invariants determine the structure of the code
The RTT input to the windowed filter is the minimum RTT measured
from ACK or SACK, or as the last resort from TCP timestamps.
The accessor tcp_min_rtt() returns the minimum RTT seen in the
window. ~0U indicates it is not available. The minimum is 1usec
even if the true RTT is below that.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The hci_conn objects don't have a dedicated lock themselves but rely
on the caller to hold the hci_dev lock for most types of access. The
hci_conn_timeout() function has so far sent certain HCI commands based
on the hci_conn state which has been possible without holding the
hci_dev lock.
The recent changes to do LE scanning before connect attempts added
even more operations to hci_conn and hci_dev from hci_conn_timeout,
thereby exposing potential race conditions with the hci_dev and
hci_conn states.
As an example of such a race, here there's a timeout but an
l2cap_sock_connect() call manages to race with the cleanup routine:
[Oct21 08:14] l2cap_chan_timeout: chan ee4b12c0 state BT_CONNECT
[ +0.000004] l2cap_chan_close: chan ee4b12c0 state BT_CONNECT
[ +0.000002] l2cap_chan_del: chan ee4b12c0, conn f3141580, err 111, state BT_CONNECT
[ +0.000002] l2cap_sock_teardown_cb: chan ee4b12c0 state BT_CONNECT
[ +0.000005] l2cap_chan_put: chan ee4b12c0 orig refcnt 4
[ +0.000010] hci_conn_drop: hcon f53d56e0 orig refcnt 1
[ +0.000013] l2cap_chan_put: chan ee4b12c0 orig refcnt 3
[ +0.000063] hci_conn_timeout: hcon f53d56e0 state BT_CONNECT
[ +0.000049] hci_conn_params_del: addr ee:0d:30:09:53:1f (type 1)
[ +0.000002] hci_chan_list_flush: hcon f53d56e0
[ +0.000001] hci_chan_del: hci0 hcon f53d56e0 chan f4e7ccc0
[ +0.004528] l2cap_sock_create: sock e708fc00
[ +0.000023] l2cap_chan_create: chan ee4b1770
[ +0.000001] l2cap_chan_hold: chan ee4b1770 orig refcnt 1
[ +0.000002] l2cap_sock_init: sk ee4b3390
[ +0.000029] l2cap_sock_bind: sk ee4b3390
[ +0.000010] l2cap_sock_setsockopt: sk ee4b3390
[ +0.000037] l2cap_sock_connect: sk ee4b3390
[ +0.000002] l2cap_chan_connect: 00:02:72:d9:e5:8b -> ee:0d:30:09:53:1f (type 2) psm 0x00
[ +0.000002] hci_get_route: 00:02:72:d9:e5:8b -> ee:0d:30:09:53:1f
[ +0.000001] hci_dev_hold: hci0 orig refcnt 8
[ +0.000003] hci_conn_hold: hcon f53d56e0 orig refcnt 0
Above the l2cap_chan_connect() shouldn't have been able to reach the
hci_conn f53d56e0 anymore but since hci_conn_timeout didn't do proper
locking that's not the case. The end result is a reference to hci_conn
that's not in the conn_hash list, resulting in list corruption when
trying to remove it later:
[Oct21 08:15] l2cap_chan_timeout: chan ee4b1770 state BT_CONNECT
[ +0.000004] l2cap_chan_close: chan ee4b1770 state BT_CONNECT
[ +0.000003] l2cap_chan_del: chan ee4b1770, conn f3141580, err 111, state BT_CONNECT
[ +0.000001] l2cap_sock_teardown_cb: chan ee4b1770 state BT_CONNECT
[ +0.000005] l2cap_chan_put: chan ee4b1770 orig refcnt 4
[ +0.000002] hci_conn_drop: hcon f53d56e0 orig refcnt 1
[ +0.000015] l2cap_chan_put: chan ee4b1770 orig refcnt 3
[ +0.000038] hci_conn_timeout: hcon f53d56e0 state BT_CONNECT
[ +0.000003] hci_chan_list_flush: hcon f53d56e0
[ +0.000002] hci_conn_hash_del: hci0 hcon f53d56e0
[ +0.000001] ------------[ cut here ]------------
[ +0.000461] WARNING: CPU: 0 PID: 1782 at lib/list_debug.c:56 __list_del_entry+0x3f/0x71()
[ +0.000839] list_del corruption, f53d56e0->prev is LIST_POISON2 (00000200)
The necessary fix is unfortunately more complicated than just adding
hci_dev_lock/unlock calls to the hci_conn_timeout() call path.
Particularly, the hci_conn_del() API, which expects the hci_dev lock to
be held, performs a cancel_delayed_work_sync(&hcon->disc_work) which
would lead to a deadlock if the hci_conn_timeout() call path tries to
acquire the same lock.
This patch solves the problem by deferring the cleanup work to a
separate work callback. To protect against the hci_dev or hci_conn
going away meanwhile temporary references are taken with the help of
hci_dev_hold() and hci_conn_get().
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Cc: stable@vger.kernel.org # 4.3
Some drivers might have to restore certain settings after the init
procedure has been completed. This driver callback allows them to hook
into that stage. This callback is run just before the controller is
declared as powered up.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
This macro is used at 802.15.4 6LoWPAN only and can be replaced by
memcmp with the interface broadcast address.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch removes the IPHC related defines for doing bit manipulation
from global 6lowpan header to the iphc file which should the only one
implementation which use these defines.
Also move next header compression defines to their nhc implementation.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch removes the lowpan_fetch_skb_u8 function for getting the iphc
bytes. Instead we using the generic which has a len parameter to tell
the amount of bytes to fetch.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch changes the lowpan_header_decompress function by removing
inklayer related information from parameters. This is currently for
supporting short and extended address for iphc handling in 802154.
We don't support short address handling anyway right now, but there
exists already code for handling short addresses in
lowpan_header_decompress.
The address parameters are also changed to a void pointer, so 6LoWPAN
linklayer specific code can put complex structures as these parameters
and cast it again inside the generic code by evaluating linklayer type
before. The order is also changed by destination address at first and
then source address, which is the same like all others functions where
destination is always the first, memcpy, dev_hard_header,
lowpan_header_compress, etc.
This patch also moves the fetching of iphc values from 6LoWPAN linklayer
specific code into the generic branch.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch changes the lowpan_header_compress function by removing
unused parameters like "len" and drop static value parameters of
protocol type. Instead we really check the protocol type inside inside
the skb structure. Also we drop the use of IEEE802154_ADDR_LEN which is
link-layer specific. Instead we using EUI64_ADDR_LEN which should always
the default case for now.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch introduces the LOWPAN_IPHC_MAX_HC_BUF_LEN define which
represent the worst-case supported IPHC buffer length. It's used to
allocate the stack buffer space for creating the IPHC header.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Before the vendor specific setup stage is triggered call back into the
core to trigger an internal notification event. That event is used to
send an index update to the monitor interface. With that specific event
it is possible to update userspace with manufacturer information before
any HCI command has been executed. This is useful for early stage
debugging of vendor specific initialization sequences.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
If the diagnostic settings are not persistent over HCI Reset, then this
quirk can be used to tell the Bluetoth core about it. This will ensure
that the settings are programmed correctly when the controller is
powered up.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
There are LE devices on the market that start off by announcing their
public address and then once paired switch to using private address.
To be interoperable with such devices we should simply trust the fact
that we're receiving an IRK from them to indicate that they may use
private addresses in the future. Instead, simply tie the persistency
to the bonding/no-bonding information the same way as for LTKs and
CSRKs.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Conflicts:
drivers/net/usb/asix_common.c
net/ipv4/inet_connection_sock.c
net/switchdev/switchdev.c
In the inet_connection_sock.c case the request socket hashing scheme
is completely different in net-next.
The other two conflicts were overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for your net-next
tree. Most relevantly, updates for the nfnetlink_log to integrate with
conntrack, fixes for cttimeout and improvements for nf_queue core, they are:
1) Remove useless ifdef around static inline function in IPVS, from
Eric W. Biederman.
2) Simplify the conntrack support for nfnetlink_queue: Merge
nfnetlink_queue_ct.c file into nfnetlink_queue_core.c, then rename it back
to nfnetlink_queue.c
3) Use y2038 safe timestamp from nfnetlink_queue.
4) Get rid of dead function definition in nf_conntrack, from Flavio
Leitner.
5) Attach conntrack support for nfnetlink_log.c, from Ken-ichirou MATSUZAWA.
This adds a new NETFILTER_NETLINK_GLUE_CT Kconfig switch that
controls enabling both nfqueue and nflog integration with conntrack.
The userspace application can request this via NFULNL_CFG_F_CONNTRACK
configuration flag.
6) Remove unused netns variables in IPVS, from Eric W. Biederman and
Simon Horman.
7) Don't put back the refcount on the cttimeout object from xt_CT on success.
8) Fix crash on cttimeout policy object removal. We have to flush out
the cttimeout extension area of the conntrack not to refer to an unexisting
object that was just removed.
9) Make sure rcu_callback completion before removing nfnetlink_cttimeout
module removal.
10) Fix compilation warning in br_netfilter when no nf_defrag_ipv4 and
nf_defrag_ipv6 are enabled. Patch from Arnd Bergmann.
11) Autoload ctnetlink dependencies when NFULNL_CFG_F_CONNTRACK is
requested. Again from Ken-ichirou MATSUZAWA.
12) Don't use pointer to previous hook when reinjecting traffic via
nf_queue with NF_REPEAT verdict since it may be already gone. This
also avoids a deadloop if the userspace application keeps returning
NF_REPEAT.
13) A bunch of cleanups for netfilter IPv4 and IPv6 code from Ian Morris.
14) Consolidate logger instance existence check in nfulnl_recv_config().
15) Fix broken atomicity when applying configuration updates to logger
instances in nfnetlink_log.
16) Get rid of the .owner attribute in our hook object. We don't need
this anymore since we're dropping pending packets that have escaped
from the kernel when unremoving the hook. Patch from Florian Westphal.
17) Remove unnecessary rcu_read_lock() from nf_reinject code, we always
assume RCU read side lock from .call_rcu in nfnetlink. Also from Florian.
18) Use static inline function instead of macros to define NF_HOOK() and
NF_HOOK_COND() when no netfilter support in on, from Arnd Bergmann.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
At the time of commit fff3269907 ("tcp: reflect SYN queue_mapping into
SYNACK packets") we had little ways to cope with SYN floods.
We no longer need to reflect incoming skb queue mappings, and instead
can pick a TX queue based on cpu cooking the SYNACK, with normal XPS
affinities.
Note that all SYNACK retransmits were picking TX queue 0, this no longer
is a win given that SYNACK rtx are now distributed on all cpus.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This merge resolves conflicts with 75aec9df3a ("bridge: Remove
br_nf_push_frag_xmit_sk") as part of Eric Biederman's effort to improve
netns support in the network stack that reached upstream via David's
net-next tree.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Conflicts:
net/bridge/br_netfilter_hooks.c
Greg reported crashes hitting the following check in __sk_backlog_rcv()
BUG_ON(!sock_flag(sk, SOCK_MEMALLOC));
The pfmemalloc bit is currently checked in sk_filter().
This works correctly for TCP, because sk_filter() is ran in
tcp_v[46]_rcv() before hitting the prequeue or backlog checks.
For UDP or other protocols, this does not work, because the sk_filter()
is ran from sock_queue_rcv_skb(), which might be called _after_ backlog
queuing if socket is owned by user by the time packet is processed by
softirq handler.
Fixes: b4b9e35585 ("netvm: set PF_MEMALLOC as appropriate during SKB processing")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Greg Thelen <gthelen@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We don't care if module is being unloaded anymore since hook unregister
handling will destroy queue entries using that hook.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Under stress, a close() on a listener can trigger the
WARN_ON(sk->sk_ack_backlog) in inet_csk_listen_stop()
We need to test if listener is still active before queueing
a child in inet_csk_reqsk_queue_add()
Create a common inet_child_forget() helper, and use it
from inet_csk_reqsk_queue_add() and inet_csk_listen_stop()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Let's reduce the confusion about inet_csk_reqsk_queue_drop() :
In many cases we also need to release reference on request socket,
so add a helper to do this, reducing code size and complexity.
Fixes: 4bdc3d6614 ("tcp/dccp: fix behavior of stale SYN_RECV request sockets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to the attr usecase, the caller knows if he is holding RTNL and is
in atomic section. So let the called to decide the correct call variant.
This allows drivers to sleep inside their ops and wait for hw to get the
operation status. Then the status is propagated into switchdev core.
This avoids silent errors in drivers.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When object is used in deferred work, we cannot use pointers in
switchdev object structures because the memory they point at may be already
used by someone else. So rather do local copy of the value.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Caller should know if he can call attr_set directly (when holding RTNL)
or if he has to defer the att_set processing for later.
This also allows drivers to sleep inside attr_set and report operation
status back to switchdev core. Switchdev core then warns if status is
not ok, instead of silent errors happening in drivers.
Benefit from newly introduced switchdev deferred ops infrastructure.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce infrastructure which will be used internally to defer ops.
Note that the deferred ops are queued up and either are processed by
scheduled work or explicitly by user calling deferred_process function.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
At listen() time, there is a small window where listener is visible with
a zero backlog, triggering a spurious "Possible SYN flooding on port"
message.
Nothing prevents us from setting the correct backlog.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As this API has never really seen any use and most drivers don't
ever use the value derived from it, remove it.
Change the only driver using it (rt2x00) to simply use the DTIM
period instead of the "max sleep" time.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Revert the commit e2ca690b65 ("ipv4/icmp: redirect messages
can use the ingress daddr as source"), which tried to introduce a more
suitable behaviour for ICMP redirect messages generated by VRRP routers.
However RFC 5798 section 8.1.1 states:
The IPv4 source address of an ICMP redirect should be the address
that the end-host used when making its next-hop routing decision.
while said commit used the generating packet destination
address, which do not match the above and in most cases leads to
no redirect packets to be generated.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add operations to retrieve cached IPv6 dst entry from l3mdev device
and lookup IPv6 source address.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the option to configure multiple 'scan plans' for scheduled scan.
Each 'scan plan' defines the number of scan cycles and the interval
between scans. The scan plans are executed in the order they were
configured. The last scan plan will always run infinitely and thus
defines only the interval between scans.
The maximum number of scan plans supported by the device and the
maximum number of iterations in a single scan plan are advertised
to userspace so it can configure the scan plans appropriately.
When scheduled scan results are received there is no way to know which
scan plan is being currently executed, so there is no way to know when
the next scan iteration will start. This is not a problem, however.
The scan start timestamp is only used for flushing old scan results,
and there is no difference between flushing all results received until
the end of the previous iteration or the start of the current one,
since no results will be received in between.
Signed-off-by: Avraham Stern <avraham.stern@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
For location and connectivity services, userspace would often like
to know the time when the BSS was last seen. The current "last seen"
value is calculated in a way that makes it less useful, especially
if the system suspended in the meantime.
Add the ability for the driver to report a real CLOCK_BOOTTIME stamp
that can then be reported to userspace (if present).
Drivers wishing to use this must be converted to the new API to call
cfg80211_inform_bss_data() or cfg80211_inform_bss_frame_data(). They
need to ensure the reported value is accurate enough even when the
frame might have been buffered in the device (e.g. firmware.)
Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>
[modified to use struct, inlines]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This reverts commit 5c48f12017.
Some device drivers (ath10k) offload part of aggregation including AddBA/DelBA
negotiations to firmware. In such scenario, the PMF configuration of
the station needs to be provided to driver to enable encryption of
AddBA/DelBA action frames.
Signed-off-by: Tamizh chelvam <c_traja@qti.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The function nf_ct_frag6_gather is called on both the input and the
output paths of the networking stack. In particular ipv6_defrag which
calls nf_ct_frag6_gather is called from both the the PRE_ROUTING chain
on input and the LOCAL_OUT chain on output.
The addition of a net parameter makes it explicit which network
namespace the packets are being reassembled in, and removes the need
for nf_ct_frag6_gather to guess.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function ip_defrag is called on both the input and the output
paths of the networking stack. In particular conntrack when it is
tracking outbound packets from the local machine calls ip_defrag.
So add a struct net parameter and stop making ip_defrag guess which
network namespace it needs to defragment packets in.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows configuring how the source address of ICMP
redirect messages is selected; by default the old behaviour is
retained, while setting icmp_redirects_use_orig_daddr force the
usage of the destination address of the packet that caused the
redirect.
The new behaviour fits closely the RFC 5798 section 8.1.1, and fix the
following scenario:
Two machines are set up with VRRP to act as routers out of a subnet,
they have IPs x.x.x.1/24 and x.x.x.2/24, with VRRP holding on to
x.x.x.254/24.
If a host in said subnet needs to get an ICMP redirect from the VRRP
router, i.e. to reach a destination behind a different gateway, the
source IP in the ICMP redirect is chosen as the primary IP on the
interface that the packet arrived at, i.e. x.x.x.1 or x.x.x.2.
The host will then ignore said redirect, due to RFC 1122 section 3.2.2.2,
and will continue to use the wrong next-op.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reducing tcp_timewait_sock from 280 bytes to 272 bytes
allows SLAB to pack 15 objects per page instead of 14 (on x86)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
One 32bit hole is following skc_refcnt, use it.
skc_incoming_cpu can also be an union for request_sock rcv_wnd.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk->sk_refcnt is dirtied for every TCP/UDP incoming packet.
This is a performance issue if multiple cpus hit a common socket,
or multiple sockets are chained due to SO_REUSEPORT.
By moving sk_refcnt 8 bytes further, first 128 bytes of sockets
are mostly read. As they contain the lookup keys, this has
a considerable performance impact, as cpus can cache them.
These 8 bytes are not wasted, we use them as a place holder
for various fields, depending on the socket type.
Tested:
SYN flood hitting a 16 RX queues NIC.
TCP listener using 16 sockets and SO_REUSEPORT
and SO_INCOMING_CPU for proper siloing.
Could process 6.0 Mpps SYN instead of 4.2 Mpps
Kernel profile looked like :
11.68% [kernel] [k] sha_transform
6.51% [kernel] [k] __inet_lookup_listener
5.07% [kernel] [k] __inet_lookup_established
4.15% [kernel] [k] memcpy_erms
3.46% [kernel] [k] ipt_do_table
2.74% [kernel] [k] fib_table_lookup
2.54% [kernel] [k] tcp_make_synack
2.34% [kernel] [k] tcp_conn_request
2.05% [kernel] [k] __netif_receive_skb_core
2.03% [kernel] [k] kmem_cache_alloc
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SO_INCOMING_CPU as added in commit 2c8c56e15d was a getsockopt() command
to fetch incoming cpu handling a particular TCP flow after accept()
This commits adds setsockopt() support and extends SO_REUSEPORT selection
logic : If a TCP listener or UDP socket has this option set, a packet is
delivered to this socket only if CPU handling the packet matches the specified
one.
This allows to build very efficient TCP servers, using one listener per
RX queue, as the associated TCP listener should only accept flows handled
in softirq by the same cpu.
This provides optimal NUMA behavior and keep cpu caches hot.
Note that __inet_lookup_listener() still has to iterate over the list of
all listeners. Following patch puts sk_refcnt in a different cache line
to let this iteration hit only shared and read mostly cache lines.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's useful to allow users to set fwmark for an individual packet,
without changing the socket state. The function this patch adds in
sock layer can be used by the protocols that need such a feature.
Signed-off-by: Edward Hyunkoo Jee <edjee@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The object and module refcounts are updated for each conntrack template,
however, if we delete the iptables rules and we flush the timeout
database, we may end up with invalid references to timeout object that
are just gone.
Resolve this problem by setting the timeout reference to NULL when the
custom timeout entry is removed from our base. This patch requires some
RCU trickery to ensure safe pointer handling.
This handling is similar to what we already do with conntrack helpers,
the idea is to avoid bumping the timeout object reference counter from
the packet path to avoid the cost of atomic ops.
Reported-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This allows us to recurse over all the ports, skipping over unsupporting
ports. Without the change, the recursion would stop at first unsupported
port.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Setting the stage to push bridge-level attributes down to port driver so
hardware can be programmed accordingly. Bridge-level attribute example is
ageing_time. This is a per-bridge attribute, not a per-bridge-port attr.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For consistency with the FDB add operation, propagate the
switchdev_obj_port_fdb structure in the DSA drivers.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that the prepare phase is pushed down to the DSA drivers, propagate
it to the port_fdb_add function.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Push the prepare phase for FDB operations down to the DSA drivers, with
a new port_fdb_prepare function. Currently only mv88e6xxx is affected.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2015-10-08
Here's another set of Bluetooth & 802.15.4 patches for the 4.4 kernel.
802.15.4:
- Many improvements & fixes to the mrf24j40 driver
- Fixes and cleanups to nl802154, mac802154 & ieee802154 code
Bluetooth:
- New chipset support in btmrvl driver
- Fixes & cleanups to btbcm, btmrvl, bpa10x & btintel drivers
- Support for vendor specific diagnostic data through common API
- Cleanups to the 6lowpan code
- New events & message types for monitor channel
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
selinux needs few changes to accommodate fact that SYNACK messages
can be attached to a request socket, lacking sk_security pointer
(Only syncookies are still attached to a TCP_LISTEN socket)
Adds a new sk_listener() helper, and use it in selinux and sch_fq
Fixes: ca6fb06518 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported by: kernel test robot <ying.huang@linux.intel.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Eric Paris <eparis@parisplace.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves values for all lowpan interface to the shared
implementation of 6lowpan. This patch also quietly fixes the forgotten
IFF_NO_QUEUE flag for the bluetooth 6LoWPAN interface. An identically
commit is 4afbc0d ("net: 6lowpan: convert to using IFF_NO_QUEUE") which
wasn't changed for bluetooth 6lowpan.
All 6lowpan interfaces should be virtual with IFF_NO_QUEUE, using EUI64
address length, the mtu size is 1280 (IPV6_MIN_MTU) and the netdev type
is ARPHRD_6LOWPAN.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The network namespace is already passed into dst_output pass it into
dst->output lwt->output and friends.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stop hidding the sk parameter with an inline helper function and make
all of the callers pass it, so that it is clear what the function is
doing.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only __ip6_local_out_sk has callers so rename __ip6_local_out_sk
__ip6_local_out and remove the previous __ip6_local_out.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is confusing and silly hiding a parameter so modify all of
the callers to pass in the appropriate socket or skb->sk if
no socket is known.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For consistency with the other similar methods in the kernel pass a
struct sock into the dst_ops .local_out method.
Simplifying the socket passing case is needed a prequel to passing a
struct net reference into .local_out.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace dst_output_okfn with dst_output
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make unix_sk() just like inet[6]_sk() by constify'ing the sock
parameter.
Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds a new debugfs entry for enabling and disabling the vendor
diagnostic mode. It is only exposed for drivers that provide the
set_diag driver callback and actually have an option for vendor
specific diagnostic information.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Introduce hci_recv_diag function for HCI drivers to allow sending vendor
specific diagnostic messages into the Bluetooth core stack. The messages
are not processed, but they are forwarded to the monitor channel and can
be retrieved by user space diagnostic tools.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
The Bluetooth public device address might change during controller setup
and it makes it a lot simpler for monitoring tools if they just get told
what the new address is. In addition include the manufacturer / company
information of the controller. That allows for easy vendor specific HCI
command and event handling.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
* many internal fixes, API improvements, cleanups, etc.
* full AP client state tracking in cfg80211/mac80211 from Ayala
* VHT support (in mac80211) for mesh
* some A-MSDU in A-MPDU support from Emmanuel
* show current TX power to userspace (from Rafał)
* support for netlink dump in vendor commands (myself)
-----BEGIN PGP SIGNATURE-----
iQIcBAABCAAGBQJWEp5XAAoJEDBSmw7B7bqr8DsQAICgQL7gSkHUlc6rbMJ9MzX+
9W0SNpZHSmfE0ZsL3cCoeHbk5dGhX82GumIz4GeqtvIKUNHkC8qlnXJIKTEva+sp
PjcF1wS0qQFdt6sg/Zxq+4Q8lZrZf1xP9W0x0ORYi9d9qej07JAZku8zYt4agpNV
R4nCl/gKVF375aV8y+qi+WSZXx4j80dJkokoVk4hzotWjd0bGVL1T9YwDRzxg4FI
S0DnkxlsD3MRHJXq+9+DbF5cuTjCG2LZNcDIBy455eWN27j9CWgEPVXoySQjDgQc
ayf2siw7BccqnV84et0vi+0WYXdZCHm3zCen44s4vaCflhdGxdx48V+Lib6mluR3
OEM1V1l9uV97UyORPljRKvDURq2IUdLQw00of26CTX8qEnmQIfxC7qaRg0rYEiGW
SbTClbEiEkBLV+sCStnkv8GJHNpvtI/2VQXH1ydrHsrWC3Sl9bpPOWYlNBPwdzM9
U4zgpxf6gLqlsukQKmMDmoKW7T04Fs0qgE99ThU2x6uTGsux8bfbxgzPCfUdeY8M
HmCB5oBCZKJ5pzv6z6lUGc0cO42IL50aBrrlatrEekjevUXW3MMOZCwGrUXxpMw1
gd+2PnLCCUeDyKNvkpXEgr4uS9Egc0sWH1RlpDPaAA5gRdRHiDn7MK7Z+s5OpNIC
wnFCQKB+KrNNrQFuXz9k
=BF9F
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2015-10-05' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
For the current cycle, we have the following right now:
* many internal fixes, API improvements, cleanups, etc.
* full AP client state tracking in cfg80211/mac80211 from Ayala
* VHT support (in mac80211) for mesh
* some A-MSDU in A-MPDU support from Emmanuel
* show current TX power to userspace (from Rafał)
* support for netlink dump in vendor commands (myself)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add operation to l3mdev to lookup source address for a given flow.
Add support for the operation to VRF driver and convert existing
IPv4 hooks to use the new lookup.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
VRF device needs the same path selection following lookup to set source
address. Rather than duplicating code, move existing code into a
function that is exported to modules.
Code move only; no functional change.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I’m using the compilation flag -Werror=old-style-declaration, which
requires that the “inline” word would come at the beginning of the code
line.
$ make drivers/net/ethernet/intel/e1000e/e1000e.ko
...
include/net/inet_timewait_sock.h:116:1: error: ‘inline’ is not at
beginning of declaration [-Werror=old-style-declaration]
static void inline inet_twsk_schedule(struct inet_timewait_sock *tw, int
timeo)
include/net/inet_timewait_sock.h:121:1: error: ‘inline’ is not at
beginning of declaration [-Werror=old-style-declaration]
static void inline inet_twsk_reschedule(struct inet_timewait_sock *tw,
int timeo)
Fixes: ed2e923945 ("tcp/dccp: fix timewait races in timer handling")
Signed-off-by: Raanan Avargil <raanan.avargil@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric W. Biederman says:
====================
net: Pass net through ip fragmention
This is the next installment of my work to pass struct net through the
output path so the code does not need to guess how to figure out which
network namespace it is in, and ultimately routes can have output
devices in another network namespace.
This round focuses on passing net through ip fragmentation which we seem
to call from about everywhere. That is the main ip output paths, the
bridge netfilter code, and openvswitch. This has to happend at once
accross the tree as function pointers are involved.
First some prep work is done, then ipv4 and ipv6 are converted and then
temporary helper functions are removed.
====================
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ICMP packets are inspected to let them route together with the flow they
belong to, minimizing the chance that a problematic path will affect flows
on other paths, and so that anycast environments can work with ECMP.
Signed-off-by: Peter Nørlund <pch@ordbogen.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replaces the per-packet multipath with a hash-based multipath using
source and destination address.
Signed-off-by: Peter Nørlund <pch@ordbogen.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet_reqsk_alloc() is used to allocate a temporary request
in order to generate a SYNACK with a cookie. Then later,
syncookie validation also uses a temporary request.
These paths already took a reference on listener refcount,
we can avoid a couple of atomic operations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SYN_RECV & TIMEWAIT sockets are not full blown, they do not have a
sk_dst_cache pointer.
Fixes: ca6fb06518 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SYN_RECV & TIMEWAIT sockets are not full blown,
do not even try to call ip_sk_use_pmtu() on them.
Fixes: ca6fb06518 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the core starts or shuts down the actual HCI transport, send a new
monitor event that indicates that this is happening. These new events
correspond to HCI_DEV_OPEN and HCI_DEV_CLOSE events.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
When opening the HCI transport via hdev->open send HCI_DEV_OPEN event
and when closing the HCI transport via hdev->close send HCI_DEV_CLOSE.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
The original intention was to avoid dependencies between nfnetlink_queue and
conntrack without ifdef pollution. However, we can achieve this by moving the
conntrack dependent code into ctnetlink and keep some glue code to access the
nfq_ct indirection from nfqueue.
After this patch, the nfq_ct indirection is always compiled in the netfilter
core to avoid polluting nfqueue with ifdefs. Thus, if nf_conntrack is not
compiled this results in only 8-bytes of memory waste in x86_64.
This patch also adds ctnetlink_nfqueue_seqadj() to avoid that the nf_conn
structure layout if exposed to nf_queue, which creates another dependency with
nf_conntrack at compilation time.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Before letting request sockets being put in TCP/DCCP regular
ehash table, we need to add either :
- SLAB_DESTROY_BY_RCU flag to their kmem_cache
- add RCU grace period before freeing them.
Since we carefully respected the SLAB_DESTROY_BY_RCU protocol
like ESTABLISH and TIMEWAIT sockets, use it here.
req_prot_init() being only used by TCP and DCCP, I did not add
a new slab_flags into their rsk_prot, but reuse prot->slab_flags
Since all reqsk_alloc() users are correctly dealing with a failure,
add the __GFP_NOWARN flag to avoid traces under pressure.
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Suggested-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace "void *obj" with a generic structure. Introduce couple of
helpers along that.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make the struct name in sync with object id name.
Suggested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make the struct name in sync with object id name.
Suggested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To be aligned with obj.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Suggested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This control variable was set at first listen(fd, backlog)
call, but not updated if application tried to increase or decrease
backlog. It made sense at the time listener had a non resizeable
hash table.
Also rounding to powers of two was not very friendly.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is enough to check listener sk_state, no need for an extra
condition.
max_qlen_log can be moved into struct request_sock_queue
We can remove syn_wait_lock and the alignment it enforced.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a listen backlog is very big (to avoid syncookies), then
the listener sk->sk_wmem_alloc is the main source of false
sharing, as we need to touch it twice per SYNACK re-transmit
and TX completion.
(One SYN packet takes listener lock once, but up to 6 SYNACK
are generated)
By attaching the skb to the request socket, we remove this
source of contention.
Tested:
listen(fd, 10485760); // single listener (no SO_REUSEPORT)
16 RX/TX queue NIC
Sustain a SYNFLOOD attack of ~320,000 SYN per second,
Sending ~1,400,000 SYNACK per second.
Perf profiles now show listener spinlock being next bottleneck.
20.29% [kernel] [k] queued_spin_lock_slowpath
10.06% [kernel] [k] __inet_lookup_established
5.12% [kernel] [k] reqsk_timer_handler
3.22% [kernel] [k] get_next_timer_interrupt
3.00% [kernel] [k] tcp_make_synack
2.77% [kernel] [k] ipt_do_table
2.70% [kernel] [k] run_timer_softirq
2.50% [kernel] [k] ip_finish_output
2.04% [kernel] [k] cascade
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet6_csk_search_req() and inet6_csk_reqsk_queue_hash_add()
no longer exist.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We no longer use hash_rnd, nr_table_entries and syn_table[]
For a listener with a backlog of 10 millions sockets, this
saves 80 MBytes of vmalloced memory.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In this patch, we insert request sockets into TCP/DCCP
regular ehash table (where ESTABLISHED and TIMEWAIT sockets
are) instead of using the per listener hash table.
ACK packets find SYN_RECV pseudo sockets without having
to find and lock the listener.
In nominal conditions, this halves pressure on listener lock.
Note that this will allow for SO_REUSEPORT refinements,
so that we can select a listener using cpu/numa affinities instead
of the prior 'consistent hash', since only SYN packets will
apply this selection logic.
We will shrink listen_sock in the following patch to ease
code review.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ying Cai <ycai@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When request sockets are no longer in a per listener hash table
but on regular TCP ehash, we need to access listener uid
through req->rsk_listener
get_openreq6() also gets a const for its request socket argument.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We plan to use generic functions to insert request sockets
into ehash table.
sk_prot needs to be set (to retrieve sk_prot->h.hashinfo)
sk_node needs to be cleared.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
long term plan is to remove struct listen_sock when its hash
table is no longer there.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
qlen_inc & young_inc were protected by listener lock,
while qlen_dec & young_dec were atomic fields.
Everything needs to be atomic for upcoming lockless listener.
Also move qlen/young in request_sock_queue as we'll get rid
of struct listen_sock eventually.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
struct request_sock_queue fields are currently protected
by the listener 'lock' (not a real spinlock)
We need to add a private spinlock instead, so that softirq handlers
creating children do not have to worry with backlog notion
that the listener 'lock' carries.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch change the length check to len instead of mac_len for
checking if the frame control field is available to dereference.
We need to change it because I saw issues with af_packet raw sockets
and the mrf24j40 which calls this functionality. The raw socket
functionality doesn't set the mac_len but resets the skb_mac_header to
skb->data which is still correct. The issue occur at mrf24j40 only,
because the driver need to evaluate the fc fields.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch adds support for accessing mac802154 llsec implementation
over nl802154. I added for a new Kconfig entry to provide this
functionality CONFIG_IEEE802154_NL802154_EXPERIMENTAL. This interface is
still in development. It provides to change security parameters and
add/del/dump entries of security tables. Later we can add also a get to
get an entry by unique identifier.
Cc: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch adds missing inline wrappers for nla_get_le32 and
nla_get_le64. The 802.15.4 MAC byteorder is little endian and we keep
the byteorder for fields like address configuration in the same
byteorder as it comes from the MAC layer.
To provide these fields for nl802154 userspace applications, we need
these inline wrappers for netlink.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following pull request contains Netfilter/IPVS updates for net-next
containing 90 patches from Eric Biederman.
The main goal of this batch is to avoid recurrent lookups for the netns
pointer, that happens over and over again in our Netfilter/IPVS code. The idea
consists of passing netns pointer from the hook state to the relevant functions
and objects where this may be needed.
You can find more information on the IPVS updates from Simon Horman's commit
merge message:
c3456026ad ("Merge tag 'ipvs2-for-v4.4' of https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next").
Exceptionally, this time, I'm not posting the patches again on netdev, Eric
already Cc'ed this mailing list in the original submission. If you need me to
make, just let me know.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that switchdev and its drivers directly use specific switchdev_obj_*
structures, move them out of the switchdev_obj union and get rif of this
outer structure.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to the notifier_call callback of a notifier_block, change the
function signature of switchdev add and del operations to:
int switchdev_port_obj_add/del(struct net_device *dev,
enum switchdev_obj_id id, void *obj);
This allows the caller to pass a specific switchdev_obj_* structure
instead of the generic switchdev_obj one.
Drivers implementation of these operations and switchdev have been
changed accordingly.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to the notifier_call callback of a notifier_block, change the
function signature of switchdev dump operation to:
int switchdev_port_obj_dump(struct net_device *dev,
enum switchdev_obj_id id, void *obj,
int (*cb)(void *obj));
This allows the caller to pass and expect back a specific
switchdev_obj_* structure instead of the generic switchdev_obj one.
Drivers implementation of dump operation can now expect this specific
structure and call the callback with it. Drivers have been changed
accordingly.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The net_device associated to a dump operation does not have to be passed
to the callback. switchdev stores it in a superset struct, if needed.
Also some drivers (such as DSA drivers) may not have easy access to it.
This will simplify pushing the callback function down to the drivers.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change CONFIG dependency to CONFIG_NET_L3_MASTER_DEV as well.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move remaining structs to VRF driver and delete the vrf header file.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace calls to vrf_dev_get_rth with l3mdev_get_rtable.
The check on the flow flags is handled in the l3mdev operation.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace calls to vrf_dev_table and friends with l3mdev_fib_table
and kin.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace calls to vrf_master_ifindex_rcu and vrf_master_ifindex with either
l3mdev_master_ifindex_rcu or l3mdev_master_ifindex.
The pattern:
oif = vrf_master_ifindex(dev) ? : dev->ifindex;
is replaced with
oif = l3mdev_fib_oif(dev);
And remove the now unused vrf macros.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
L3 master devices allow users of the abstraction to influence FIB lookups
for enslaved devices. Current API provides a means for the master device
to return a specific FIB table for an enslaved device, to return an
rtable/custom dst and influence the OIF used for fib lookups.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename IFF_VRF_MASTER to IFF_L3MDEV_MASTER and update the name of the
netif_is_vrf and netif_index_is_vrf macros.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While auditing TCP stack for upcoming 'lockless' listener changes,
I found I had to change fastopen_init_queue() to properly init the object
before publishing it.
Otherwise an other cpu could try to lock the spinlock before it gets
properly initialized.
Instead of adding appropriate barriers, just remove dynamic memory
allocations :
- Structure is 28 bytes on 64bit arches. Using additional 8 bytes
for holding a pointer seems overkill.
- Two listeners can share same cache line and performance would suffer.
If we really want to save few bytes, we would instead dynamically allocate
whole struct request_sock_queue in the future.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_syn_flood_action() will soon be called with unlocked socket.
In order to avoid SYN flood warning being emitted multiple times,
use xchg().
Extend max_qlen_log and synflood_warned fields in struct listen_sock
to u32
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These functions do not change the listener socket.
Goal is to make sure tcp_conn_request() is not messing with
listener in a racy way.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some common IPv4/IPv6 code can be factorized.
Also constify cookie_init_sequence() socket argument.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We'll soon no longer hold listener socket lock, these
functions do not modify the socket in any way.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before changing dccp_v6_request_recv_sock() sock argument
to const, we need to get rid of security_sk_classify_flow(),
and it seems doable by reusing inet6_csk_route_req() helper.
We need to add a proto parameter to inet6_csk_route_req(),
not assume it is TCP.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Factorize code to get tcp header from skb. It makes no sense
to duplicate code in callers.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Once we realize tcp_rcv_synsent_state_process() does not use
its 'len' argument and we get rid of it, then it becomes clear
this argument is no longer used in tcp_rcv_state_process()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
None of these functions need to change the socket, make it
const.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As suggested by Eric Dumazet this change replaces the
#define with a static inline function to enjoy
complaints by the compiler when misusing the API.
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The network namespace is easiliy available in state->net so use it.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This is needed so struct net can be pushed down into
ip_route_me_harder.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently, cfg80211 rejects capability updates for existing entries
and as a result it's impossible to update entries that were added
unassociated, but that is necessary to go through the full station
states from userspace, adding a station before authentication etc.
Fix this by allowing updates to capabilities for stations that the
driver (or mac80211) assigned unassociated state. Drivers setting
the full station state support flag must use the new station type
for proper operation.
Signed-off-by: Ayala Beker <ayala.beker@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When debugging wireless powersave issues on the AP side it's quite helpful
to see our own beacons that are transmitted by the hardware/driver. However,
this is not that easy since beacons don't pass through the regular TX queues.
Preferably drivers would call ieee80211_tx_status also for tx'ed beacons
but that's not always possible. Hence, just send a copy of each beacon
generated by ieee80211_beacon_get_tim to monitor devices when they are
getting fetched by the driver.
Also add a HW flag IEEE80211_HW_BEACON_TX_STATUS that can be used by
drivers to indicate that they report TX status for beacons.
Signed-off-by: Helmut Schaa <helmut.schaa@googlemail.com>
(with a fix from Christian Lamparted rolled in)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Send a HCI command and wait for command complete event.
This function serializes the requests by grabbing the req_lock.
Signed-off-by: Loic Poulain <loic.poulain@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
We found that a TCP Fast Open passive connection was vulnerable
to reorders, as the exchange might look like
[1] C -> S S <FO ...> <request>
[2] S -> C S. ack request <options>
[3] S -> C . <answer>
packets [2] and [3] can be generated at almost the same time.
If C receives the 3rd packet before the 2nd, it will drop it as
the socket is in SYN_SENT state and expects a SYNACK.
S will have to retransmit the answer.
Current OOO avoidance in linux is defeated because SYNACK
packets are attached to the LISTEN socket, while DATA packets
are attached to the children. They might be sent by different cpus,
and different TX queues might be selected.
It turns out that for TFO, we created a child, which is a
full blown socket in TCP_SYN_RECV state, and we simply can attach
the SYNACK packet to this socket.
This means that at the time tcp_sendmsg() pushes DATA packet,
skb->ooo_okay will be set iff the SYNACK packet had been sent
and TX completed.
This removes the reorder source at the host level.
We also removed the export of tcp_try_fastopen(), as it is no
longer called from IPv6.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For metadata based vxlan interface, open both IPv4 and IPv6 socket. This is
much more user friendly: it's not necessary to create two vxlan interfaces
and pay attention to using the right one in routing rules.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/ipv4/arp.c
The net/ipv4/arp.c conflict was one commit adding a new
local variable while another commit was deleting one.
Signed-off-by: David S. Miller <davem@davemloft.net>
SYNACK packets are sent on behalf on unlocked listeners
or fastopen sockets. Mark socket as const to catch future changes
that might break the assumption.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is done to make sure we do not change listener socket
while sending SYNACK packets while socket lock is not held.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This documents fact that listener lock might not be held
at the time SYNACK are sent.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is to document that socket lock might not be held at this point.
skb_set_owner_w() and ipv6_local_error() are using proper atomic ops
or spinlocks, so we promote the socket to non const when calling them.
netfilter hooks should never assume socket lock is held,
we also promote the socket to non const.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
listener socket is not locked when tcp_make_synack() is called.
We better make sure no field is written.
There is one exception : Since SYNACK packets are attached to the listener
at this moment (or SYN_RECV child in case of Fast Open),
sock_wmalloc() needs to update sk->sk_wmem_alloc, but this is done using
atomic operations so this is safe.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This function is used to build and send SYNACK packets,
possibly on behalf of unlocked listener socket.
Make sure we did not miss a write by making this socket const.
We no longer can use ip_select_ident() and have to either
set iph->id to 0 or directly call __ip_select_ident()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When TCP new listener is done, these functions will be called
without socket lock being held. Make sure they don't change
anything.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_dont_fragment() can accept const socket and dst
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
socket is not modified, make it const so that callers can
do the same if they need.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip6_dst_lookup_flow() and ip6_dst_lookup_tail() do not touch
socket, lets add a const qualifier.
This will permit the same change in inet6_csk_route_req()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is used by TCP listener core, and listener socket shall
not be modified by inet_csk_route_req().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Very soon, TCP stack might call inet_csk_route_req(), which
calls inet_csk_route_req() with an unlocked listener socket,
so we need to make sure ip_route_output_flow() is not trying to
change any field from its socket argument.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Soon, listener socket wont be locked when tcp_openreq_init_rwin()
is called. We need to read socket fields once, as their value
could change under us.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Soon, listener socket spinlock will no longer be held,
add const arguments to tcp_v[46]_init_req() to make clear these
functions can not mess socket fields.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now, since we have only 2 values for transaction phase, just use bool.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No longer used by drivers, as transaction queue with item destructors
takes care of abort phase internally in switchdev code. So kill it.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shouldn't have been there in the first place. Now it is unused, kill it.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add helpers which should be used int attr_set/obj_add switchdev ops to
check the phase of transaction.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before it disappears completely, move transaction phase enum under
transaction structure and make attr/obj structures a bit cleaner.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now, the memory allocation in prepare/commit state is done separatelly
in each driver (rocker). Introduce the similar mechanism in generic
switchdev code, in form of queue. That can be used not only for memory
allocations, but also for different items. Abort item destruction
is handled as well.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is temporary, name "trans" will be used for something else and
"trans_ph" will eventually disappear.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simon Horman says:
====================
Second Round of IPVS Updates for v4.4
please consider these bug fixes and extensive clean-ups of IPVS
from Eric Biederman for v4.4.
His excellent description of the changes, which is part of an even larger
set of clean-up work, is as follows:
I am gradually working my way through the netfilter stack passing struct
down into the netfilter hooks and from the netfilter hooks and from there
down into the functions that actually care. This removes the need for
netfilter functions to guess how to figure out how to compute which
network namespace they are in and instead provides a simple and reliable
method to do so.
The cleanups stand on their own but this is part of a larger effort to
have routes with an output device that is not in the current network
namespace.
The IPVS code has been a bit more of a challenge than most. Just passing
struct net through to where it is needed did not feel clean to me. The
practical issue is that the ipvs code in most places actually wants
struct netns_ipvs and not struct net.
So as part of this process I have turned the relationship between struct
net and the structs netns_ipvs, ip_vs_conn_param, ip_vs_conn, and
ip_vs_service inside out. I have modified the ipvs functions to take a
struct netns_ipvs not a struct net. The net is code with fewer
conversions from one type of structure to another. I did wind up adding
a struct netns_ipvs parameter to quite a few functions that did not have
it before so I could pass the structure down from the netfilter hooks to
where it is actually needed to avoid guessing.
I have broken up the work in a bunch of small patches so there is at
least a chance and reviewing that each step I took is correct. The
series compiles at each step so bisecting it should not be a problem if
something weird comes up.
The first two changes in this series are actually bug fixes. The first
is a compile fix for a bug in sctp that came in, in the last round of
ipvs changes merged into nf-next. The second fixes an older bug where in
pathological circumstances the wrong network namespace could be used when
a proc file is written to.
The rest of the patchset is a bunch of boring changes getting pushing
struct netns_ipvs (and by extension ipvs->net) where it needs to be.
Either by replacing struct net pointers or adding new struct netns_ipvs
pointers. With a handful of other minor cleanups (like removing
skb_net).
I have decided include the bug fixes in this pull request. Patch one
relates to a bug that was added to nf-next recently and is thus not
applicable to nf . Patch two could arguably be promoted to a fix for v4.3
and stable though it does not appear to be severe enough to warrant that
course of action; let me know if you would like me to reconsider.
====================
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When using ip lwtunnels, the additional data for xmit (basically, the actual
tunnel to use) are carried in ip_tunnel_info either in dst->lwtstate or in
metadata dst. When replying to ARP requests, we need to send the reply to
the same tunnel the request came from. This means we need to construct
proper metadata dst for ARP replies.
We could perform another route lookup to get a dst entry with the correct
lwtstate. However, this won't always ensure that the outgoing tunnel is the
same as the incoming one, and it won't work anyway for IPv4 duplicate
address detection.
The only thing to do is to "reverse" the ip_tunnel_info.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 12fd84f438 ("ipv6: Remove unused neigh argument for
icmp6_dst_alloc() and its callers."), the neigh parameter of ndisc_send_na
and ndisc_send_ns is unused.
CC: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The genl_notify function has too many arguments for no real reason - all
callers use genl_info to get them anyway. Just pass the genl_info down to
genl_notify.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This function adds no real value and it obscures what the code is doing.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
This hack has no more users so remove it.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
The argument is unnecessary and in practice confusing,
and has caused the callers to do all manner of silly things.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
With sysctl_cache_bypass now a compile time constant the compiler can
figue out that it can elimiate all of the code that depends on
sysctl_cache_bypass being true.
Also remove the duplicate computation of net previously necessitated
by #ifdef CONFIG_SYSCTL
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
This moves the hack "net_ipvs(skb_net(skb))" up one level where it
will be easier to remove.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Move the hack of relying on "net_ipvs(skb_net(skb))" to derive the
ipvs up a layer.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Stop relying on "net_ipvs(skb_net(skb))" to derive the ipvs as
skb_net is a hack.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Also move the tests for net_ipvs being NULL into __ip_vs_ftp_init
and __ip_vs_ftp_exit. The only places where they possibly make
sense.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
In practice struct netns_ipvs is as meaningful as struct net and more
useful as it holds the ipvs specific data. So store a pointer to
struct netns_ipvs.
Update the accesses of param->net to access param->ipvs->net instead.
In functions where we are searching for an svc and filtering by net
filter by ipvs instead.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
ipvs is what is actually desired so change the parameter and the modify
the callers to pass struct netns_ipvs.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
In practice struct netns_ipvs is as meaningful as struct net and more
useful as it holds the ipvs specific data. So store a pointer to
struct netns_ipvs.
Update the accesses of param->net to access param->ipvs->net instead.
When lookup up struct ip_vs_conn in a hash table replace comparisons
of cp->net with comparisons of cp->ipvs which is possible
now that ipvs is present in ip_vs_conn_param.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
In practice struct netns_ipvs is as meaningful as struct net and more
useful as it holds the ipvs specific data. So store a pointer to
struct netns_ipvs.
Update the accesses of conn->net to access conn->ipvs->net instead.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for your net-next tree
in this 4.4 development cycle, they are:
1) Schedule ICMP traffic to IPVS instances, this introduces a new schedule_icmp
proc knob to enable/disable it. By default is off to retain the old
behaviour. Patchset from Alex Gartrell.
I'm also including what Alex originally said for the record:
"The configuration of ipvs at Facebook is relatively straightforward. All
ipvs instances bgp advertise a set of VIPs and the network prefers the
nearest one or uses ECMP in the event of a tie. For the uninitiated, ECMP
deterministically and statelessly load balances by hashing the packet
(usually a 5-tuple of protocol, saddr, daddr, sport, and dport) and using
that number as an index (basic hash table type logic).
The problem is that ICMP packets (which contain really important
information like whether or not an MTU has been exceeded) will get a
different hash value and may end up at a different ipvs instance. With no
information about where to route these packets, they are dropped, creating
ICMP black holes and breaking Path MTU discovery. Suddenly, my mom's
pictures can't load and I'm fielding midday calls that I want nothing to do
with.
To address this, this patch set introduces the ability to schedule icmp
packets which is gated by a sysctl net.ipv4.vs.schedule_icmp. If set to 0,
the old behavior is maintained -- otherwise ICMP packets are scheduled."
2) Add another proc entry to ignore tunneled packets to avoid routing loops
from IPVS, also from Alex.
3) Fifteen patches from Eric Biederman to:
* Stop passing nf_hook_ops as parameter to the hook and use the state hook
object instead all around the netfilter code, so only the private data
pointer is passed to the registered hook function.
* Now that we've got state->net, propagate the netns pointer to netfilter hook
clients to avoid its computation over and over again. A good example of how
this has been simplified is the former TEE target (now nf_dup infrastructure)
since it has killed the ugly pick_net() function.
There's another round of netns updates from Eric Biederman making the line. To
avoid the patchbomb again to almost all the networking mailing list (that is 84
patches) I'd suggest we send you a pull request with no patches or let me know
if you prefer a better way.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit f9a060f4b2.
No driver has turned up needing this functionality, and I've just
implemented the functionality I wanted this for in a different
way. Thus, remove it again, until somebody shows up with a need
for having it.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Drivers may be interested in receiving A-MSDU within A-MDPU.
Not all the devices may be able to do so, make it configurable.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Advertise the capability to send A-MSDU within A-MPDU
in the AddBA request sent by mac80211. Let the driver
know about the peer's capabilities.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently the cfg80211's frame registration api receives wdev, however
mac80211 assumes per device filter configuration and ignores wdev.
Per device filtering is too wasteful, especially for multi-channel
devices.
Introduce new per vif frame registration API and use it for probe
request registrations in ieee80211_mgmt_frame_register()
Also call directly to ieee80211_configure_filter instead of using a work
since it is now allowed to sleep in ieee80211_mgmt_frame_register.
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This patch cleanups needed_headroom, needed_tailroom and hard_header_len
fields for wpan and lowpan interfaces.
For wpan interfaces the worst case mac header len should be part of
needed_headroom, currently this is set as hard_header_len, but
hard_header_len should be set to the minimum header length which xmit
call assumes and this is the minimum frame length of 802.15.4.
The hard_header_len value will check inside send callbacl of AF_PACKET
raw sockets.
For lowpan interfaces, if fragmentation isn't needed the skb will
call dev_hard_header for 802154 layer and queue it afterwards. This
happens without new skb allocation, so we need the same headroom and
tailroom lengths like 802154 inside 802154 6lowpan layer. At least we
assume as minimum header length an ipv6 header size.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The current header_ops callback structure of net device are used mostly
from 802.15.4 upper-layers. Because this callback structure is a very
generic one, which is also used by e.g. DGRAM AF_PACKET sockets, we
can't make this callback structure 802.15.4 specific which is currently
is.
I saw the smallest "constraint" for calling this callback with
dev_hard_header/dev_parse_header by AF_PACKET which assign a 8 byte
array for address void pointers. Currently 802.15.4 specific protocols
like af802154 and 6LoWPAN will assign the "struct ieee802154_addr" as
these parameters which is greater than 8 bytes. The current callback
implementation for header_ops.create assumes always a complete
"struct ieee802154_addr" which AF_PACKET can't never handled and is
greater than 8 bytes.
For that reason we introduce now a "generic" create/parse header_ops
callback which allows handling with intra-pan extended addresses only.
This allows a small use-case with AF_PACKET to send "somehow" a valid
dataframe over DGRAM.
To keeping the current dev_hard_header behaviour we introduce a similar
callback structure "wpan_dev_header_ops" which contains 802.15.4 specific
upper-layer header creation functionality, which can be called by
wpan_dev_hard_header.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Sometimes upper-layer protocols wants to generate a new mac header by
filling "struct ieee802154_hdr" only. These upper-layers sets for the
address settings the source and dest fields, but not the fc fields for
indicate the source and dest address mode. This patch changes the
"ieee802154_hdr_push" function so the fc address fields are set
according the source and dest fields of "struct ieee802154_hdr".
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
When creating a timewait socket, we need to arm the timer before
allowing other cpus to find it. The signal allowing cpus to find
the socket is setting tw_refcnt to non zero value.
As we set tw_refcnt in __inet_twsk_hashdance(), we therefore need to
call inet_twsk_schedule() first.
This also means we need to remove tw_refcnt changes from
inet_twsk_schedule() and let the caller handle it.
Note that because we use mod_timer_pinned(), we have the guarantee
the timer wont expire before we set tw_refcnt as we run in BH context.
To make things more readable I introduced inet_twsk_reschedule() helper.
When rearming the timer, we can use mod_timer_pending() to make sure
we do not rearm a canceled timer.
Note: This bug can possibly trigger if packets of a flow can hit
multiple cpus. This does not normally happen, unless flow steering
is broken somehow. This explains this bug was spotted ~5 months after
its introduction.
A similar fix is needed for SYN_RECV sockets in reqsk_queue_hash_req(),
but will be provided in a separate patch for proper tracking.
Fixes: 789f558cfb ("tcp/dccp: get rid of central timewait timer")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Ying Cai <ycai@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently SYN/ACK RTT is measured in jiffies. For LAN the SYN/ACK
RTT is often measured as 0ms or sometimes 1ms, which would affect
RTT estimation and min RTT samping used by some congestion control.
This patch improves SYN/ACK RTT to be usec resolution if platform
supports it. While the timestamping of SYN/ACK is done in request
sock, the RTT measurement is carefully arranged to avoid storing
another u64 timestamp in tcp_sock.
For regular handshake w/o SYNACK retransmission, the RTT is sampled
right after the child socket is created and right before the request
sock is released (tcp_check_req() in tcp_minisocks.c)
For Fast Open the child socket is already created when SYN/ACK was
sent, the RTT is sampled in tcp_rcv_state_process() after processing
the final ACK an right before the request socket is released.
If the SYN/ACK was retransmistted or SYN-cookie was used, we rely
on TCP timestamps to measure the RTT. The sample is taken at the
same place in tcp_rcv_state_process() after the timestamp values
are validated in tcp_validate_incoming(). Note that we do not store
TS echo value in request_sock for SYN-cookies, because the value
is already stored in tp->rx_opt used by tcp_ack_update_rtt().
One side benefit is that the RTT measurement now happens before
initializing congestion control (of the passive side). Therefore
the congestion control can use the SYN/ACK RTT.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The iucv code uses arrays as arguments. Even though this does not
really cause a problem, it could be misleading, since the compiler
turns array arguments into just a pointer argument. To be more
precise this patch changes the array arguments into pointers.
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2015-09-18
Here's the first bluetooth-next pull request for the 4.4 kernel:
- ieee802154 cleanups & fixes
- debugfs support for the at86rf230 driver
- Support for quirky (seemingly counterfeit) CSR Bluetooth controllers
- Power management and device config improvements for Intel controllers
- Fix for devices with incorrect advertising data length
- Fix for closing HCI user channel socket
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Man page of ip-route(8) says following about route types:
unreachable - these destinations are unreachable. Packets are dis‐
carded and the ICMP message host unreachable is generated. The local
senders get an EHOSTUNREACH error.
blackhole - these destinations are unreachable. Packets are dis‐
carded silently. The local senders get an EINVAL error.
prohibit - these destinations are unreachable. Packets are discarded
and the ICMP message communication administratively prohibited is
generated. The local senders get an EACCES error.
In the inet6 address family, this was correct, except the local senders
got ENETUNREACH error instead of EHOSTUNREACH in case of unreachable route.
In the inet address family, all three route types generated ICMP message
net unreachable, and the local senders got ENETUNREACH error.
In both address families all three route types now behave consistently
with documentation.
Signed-off-by: Nikola Forró <nforro@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of calling dev_net on a likley looking network device
pass state->net into nf_xfrm_me_harder.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Only pass the void *priv parameter out of the nf_hook_ops. That is
all any of the functions are interested now, and by limiting what is
passed it becomes simpler to change implementation details.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
As gre does not have the srckey in the packet gre_pkt_to_tuple
needs to perform a lookup in it's per network namespace tables.
Pass in the proper network namespace to all pkt_to_tuple
implementations to ensure gre (and any similar protocols) can get this
right.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Stop guessing the struct net instead of remember it. Guessing is just
silly and will be problematic in the future when I implement routes
between network namespaces.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This allows them to stop guessing the network namespace with pick_net.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nft_pktinfo is passed on the stack so this does not bloat any in core
data structures.
By centrally computing this information this makes maintence of the code
simpler, and understading of the code easier.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
As xt_action_param lives on the stack this does not bloat any
persistent data structures.
This is a first step in making netfilter code that needs to know
which network namespace it is executing in simpler.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
- Add nft_pktinfo.pf to replace ops->pf
- Add nft_pktinfo.hook to replace ops->hooknum
This simplifies the code, makes it more readable, and likely reduces
cache line misses. Maintainability is enhanced as the details of
nft_hook_ops are of no concern to the recpients of nft_pktinfo.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Simon Horman says:
====================
IPVS Updates for v4.4
please consider these IPVS Updates for v4.4.
The updates include the following from Alex Gartrell:
* Scheduling of ICMP
* Sysctl to ignore tunneled packets; and hence some packet-looping scenarios
====================
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds ratelimited version of the BT_ERR macro.
Signed-off-by: Szymon Janc <ext.szymon.janc@tieto.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Existing bpf_clone_redirect() helper clones skb before redirecting
it to RX or TX of destination netdev.
Introduce bpf_redirect() helper that does that without cloning.
Benchmarked with two hosts using 10G ixgbe NICs.
One host is doing line rate pktgen.
Another host is configured as:
$ tc qdisc add dev $dev ingress
$ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
action bpf run object-file tcbpf1_kern.o section clone_redirect_xmit drop
so it receives the packet on $dev and immediately xmits it on $dev + 1
The section 'clone_redirect_xmit' in tcbpf1_kern.o file has the program
that does bpf_clone_redirect() and performance is 2.0 Mpps
$ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
action bpf run object-file tcbpf1_kern.o section redirect_xmit drop
which is using bpf_redirect() - 2.4 Mpps
and using cls_bpf with integrated actions as:
$ tc filter add dev $dev root pref 10 \
bpf run object-file tcbpf1_kern.o section redirect_xmit integ_act classid 1
performance is 2.5 Mpps
To summarize:
u32+act_bpf using clone_redirect - 2.0 Mpps
u32+act_bpf using redirect - 2.4 Mpps
cls_bpf using redirect - 2.5 Mpps
For comparison linux bridge in this setup is doing 2.1 Mpps
and ixgbe rx + drop in ip_rcv - 7.8 Mpps
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Often cls_bpf classifier is used with single action drop attached.
Optimize this use case and let cls_bpf return both classid and action.
For backwards compatibility reasons enable this feature under
TCA_BPF_FLAG_ACT_DIRECT flag.
Then more interesting programs like the following are easier to write:
int cls_bpf_prog(struct __sk_buff *skb)
{
/* classify arp, ip, ipv6 into different traffic classes
* and drop all other packets
*/
switch (skb->protocol) {
case htons(ETH_P_ARP):
skb->tc_classid = 1;
break;
case htons(ETH_P_IP):
skb->tc_classid = 2;
break;
case htons(ETH_P_IPV6):
skb->tc_classid = 3;
break;
default:
return TC_ACT_SHOT;
}
return TC_ACT_OK;
}
Joint work with Daniel Borkmann.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit b73c3d0e4f ("net: Save TX flow hash in sock and set in skbuf
on xmit"), Tom provided a l4 hash to most outgoing TCP packets.
We'd like to provide one as well for SYNACK packets, so that all packets
of a given flow share same txhash, to later enable bonding driver to
also use skb->hash to perform slave selection.
Note that a SYNACK retransmit shuffles the tx hash, as Tom did
in commit 265f94ff54 ("net: Recompute sk_txhash on negative routing
advice") for established sockets.
This has nice effect making TCP flows resilient to some kind of black
holes, even at connection establish phase.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is immediately motivated by the bridge code that chains functions that
call into netfilter. Without passing net into the okfns the bridge code would
need to guess about the best expression for the network namespace to process
packets in.
As net is frequently one of the first things computed in continuation functions
after netfilter has done it's job passing in the desired network namespace is in
many cases a code simplification.
To support this change the function dst_output_okfn is introduced to
simplify passing dst_output as an okfn. For the moment dst_output_okfn
just silently drops the struct net.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a sock paramter to dst_output making dst_output_sk superfluous.
Add a skb->sk parameter to all of the callers of dst_output
Have the callers of dst_output_sk call dst_output.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen reported that the recent change to add oif to dst lookups breaks
the VTI use case. The problem is that with the oif set in the flow struct
the comparison to the nh_oif is triggered. Fix by splitting the
FLOWI_FLAG_VRFSRC into 2 flags -- one that triggers the vrf device cache
bypass (FLOWI_FLAG_VRFSRC) and another telling the lookup to not compare
nh oif (FLOWI_FLAG_SKIP_NH_OIF).
Fixes: 42a7b32b73 ("xfrm: Add oif to dst lookups")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds NLM_F_REPLACE flag to ipv6 route replace notifications.
This makes nlm_flags in ipv6 replace notifications consistent
with ipv4.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes a workaround for datagram_size calculation while
doing fragmentation on transmit.
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch adds frame control checks to check if the received frame is
something which could contain a 6LoWPAN packet.
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch complete reworks the evaluation of 6lowpan dispatch value by
introducing a receive handler mechanism for each dispatch value.
A list of changes:
- Doing uncompression on-the-fly when FRAG1 is received, this require
some special handling for 802.15.4 lltype in generic 6lowpan branch
for setting the payload length correct.
- Fix dispatch mask for fragmentation.
- Add IPv6 dispatch evaluation for FRAG1.
- Add skb_unshare for dispatch which might manipulate the skb data
buffer.
Cc: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
With 9380f9eacf the order of unsetting
the HCI_USER_CHANNEL flag of the HCI device was reverted to ensure
the device is first closed before making it available again.
Due to hci_dev_close checking for HCI_USER_CHANNEL being set on the
device it was never really closed and was kept opened. We're now
calling hci_dev_do_close directly to make sure the device is correctly
closed and we keep the correct order to unset the flag on our device
object.
Signed-off-by: Simon Fels <simon.fels@canonical.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Add specific bluetooth device logging macros since hci device name is
repeatedly referred in bluetooth subsystem logs.
Signed-off-by: Loic Poulain <loic.poulain@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This is a way to avoid nasty routing loops when multiple ipvs instances can
forward to eachother.
Signed-off-by: Alex Gartrell <agartrell@fb.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Many commonly used functions like getifaddrs() invoke RTM_GETLINK
to dump the interface information, and do not need the
the AF_INET6 statististics that are always returned by default
from rtnl_fill_ifinfo().
Computing the statistics can be an expensive operation that impacts
scaling, so it is desirable to avoid this if the information is
not needed.
This patch adds a the RTEXT_FILTER_SKIP_STATS extended info flag that
can be passed with netlink_request() to avoid statistics computation
for the ifinfo path.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch uses a seqlock to ensure consistency between idst->dst and
idst->cookie. It also makes dst freeing from fib tree to undergo a
rcu grace period.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Problems in the current dst_entry cache in the ip6_tunnel:
1. ip6_tnl_dst_set is racy. There is no lock to protect it:
- One major problem is that the dst refcnt gets messed up. F.e.
the same dst_cache can be released multiple times and then
triggering the infamous dst refcnt < 0 warning message.
- Another issue is the inconsistency between dst_cache and
dst_cookie.
It can be reproduced by adding and removing the ip6gre tunnel
while running a super_netperf TCP_CRR test.
2. ip6_tnl_dst_get does not take the dst refcnt before returning
the dst.
This patch:
1. Create a percpu dst_entry cache in ip6_tnl
2. Use a spinlock to protect the dst_cache operations
3. ip6_tnl_dst_get always takes the dst refcnt before returning
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is a prep work to fix the dst_entry refcnt bugs in
ip6_tunnel.
This patch rename:
1. ip6_tnl_dst_check() to ip6_tnl_dst_get() to better
reflect that it will take a dst refcnt in the next patch.
2. ip6_tnl_dst_store() to ip6_tnl_dst_set() to have a more
conventional name matching with ip6_tnl_dst_get().
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the FIB table id to rtable to make the information available for
IPv4 as it is for IPv6.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix out-of-bounds array access in netfilter ipset, from Jozsef
Kadlecsik.
2) Use correct free operation on netfilter conntrack templates, from
Daniel Borkmann.
3) Fix route leak in SCTP, from Marcelo Ricardo Leitner.
4) Fix sizeof(pointer) in mac80211, from Thierry Reding.
5) Fix cache pointer comparison in ip6mr leading to missed unlock of
mrt_lock. From Richard Laing.
6) rds_conn_lookup() needs to consider network namespace in key
comparison, from Sowmini Varadhan.
7) Fix deadlock in TIPC code wrt broadcast link wakeups, from Kolmakov
Dmitriy.
8) Fix fd leaks in bpf syscall, from Daniel Borkmann.
9) Fix error recovery when installing ipv6 multipath routes, we would
delete the old route before we would know if we could fully commit
to the new set of nexthops. Fix from Roopa Prabhu.
10) Fix run-time suspend problems in r8152, from Hayes Wang.
11) In fec, don't program the MAC address into the chip when the clocks
are gated off. From Fugang Duan.
12) Fix poll behavior for netlink sockets when using rx ring mmap, from
Daniel Borkmann.
13) Don't allocate memory with GFP_KERNEL from get_stats64 in r8169
driver, from Corinna Vinschen.
14) In TCP Cubic congestion control, handle idle periods better where we
are application limited, in order to keep cwnd from growing out of
control. From Eric Dumzet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (65 commits)
tcp_cubic: better follow cubic curve after idle period
tcp: generate CA_EVENT_TX_START on data frames
xen-netfront: respect user provided max_queues
xen-netback: respect user provided max_queues
r8169: Fix sleeping function called during get_stats64, v2
ether: add IEEE 1722 ethertype - TSN
netlink, mmap: fix edge-case leakages in nf queue zero-copy
netlink, mmap: don't walk rx ring on poll if receive queue non-empty
cxgb4: changes for new firmware 1.14.4.0
net: fec: add netif status check before set mac address
r8152: fix the runtime suspend issues
r8152: split DRIVER_VERSION
ipv6: fix ifnullfree.cocci warnings
add microchip LAN88xx phy driver
stmmac: fix check for phydev being open
net: qlcnic: delete redundant memsets
net: mv643xx_eth: use kzalloc
net: jme: use kzalloc() instead of kmalloc+memset
net: cavium: liquidio: use kzalloc in setup_glist()
net: ipv6: use common fib_default_rule_pref
...
This switches IPv6 policy routing to use the shared
fib_default_rule_pref() function of IPv4 and DECnet. It is also used in
multicast routing for IPv4 as well as IPv6.
The motivation for this patch is a complaint about iproute2 behaving
inconsistent between IPv4 and IPv6 when adding policy rules: Formerly,
IPv6 rules were assigned a fixed priority of 0x3FFF whereas for IPv4 the
assigned priority value was decreased with each rule added.
Since then all users of the default_pref field have been converted to
assign the generic function fib_default_rule_pref(), fib_nl_newrule()
may just use it directly instead. Therefore get rid of the function
pointer altogether and make fib_default_rule_pref() static, as it's not
used outside fib_rules.c anymore.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Create drivers/staging/rdma
- Move amso1100 driver to staging/rdma and schedule for deletion
- Move ipath driver to staging/rdma and schedule for deletion
- Add hfi1 driver to staging/rdma and set TODO for move to regular tree
- Initial support for namespaces to be used on RDMA devices
- Add RoCE GID table handling to the RDMA core caching code
- Infrastructure to support handling of devices with differing
read and write scatter gather capabilities
- Various iSER updates
- Kill off unsafe usage of global mr registrations
- Update SRP driver
- Misc. mlx4 driver updates
- Support for the mr_alloc verb
- Support for a netlink interface between kernel and user space cache
daemon to speed path record queries and route resolution
- Ininitial support for safe hot removal of verbs devices
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJV7v8wAAoJELgmozMOVy/d2dcP/3PXnGFPgFGJODKE6VCZtTvj
nooNXRKXjxv470UT5DiAX7SNcBxzzS7Zl/Lj+831H9iNXUyzuH31KtBOAZ3W03vZ
yXwCB2caOStSldTRSUUvPe2aIFPnyNmSpC4i6XcJLJMCFijKmxin5pAo8qE44BQU
yjhT+wC9P6LL5wZXsn/nFIMLjOFfu0WBFHNp3gs5j59paxlx5VeIAZk16aQZH135
m7YCyicwrS8iyWQl2bEXRMon2vlCHlX2RHmOJ4f/P5I0quNcGF2+d8Yxa+K1VyC5
zcb3OBezz+wZtvh16yhsDfSPqHWirljwID2VzOgRSzTJWvQjju8VkwHtkq6bYoBW
egIxGCHcGWsD0R5iBXLYr/tB+BmjbDObSm0AsR4+JvSShkeVA1IpeoO+19162ixE
n6CQnk2jCee8KXeIN4PoIKsjRSbIECM0JliWPLoIpuTuEhhpajftlSLgL5hf1dzp
HrSy6fXmmoRj7wlTa7DnYIC3X+ffwckB8/t1zMAm2sKnIFUTjtQXF7upNiiyWk4L
/T1QEzJ2bLQckQ9yY4v528SvBQwA4Dy1amIQB7SU8+2S//bYdUvhysWPkdKC4oOT
WlqS5PFDCI31MvNbbM3rUbMAD8eBAR8ACw9ZpGI/Rffm5FEX5W3LoxA8gfEBRuqt
30ZYFuW8evTL+YQcaV65
=EHLg
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull inifiniband/rdma updates from Doug Ledford:
"This is a fairly sizeable set of changes. I've put them through a
decent amount of testing prior to sending the pull request due to
that.
There are still a few fixups that I know are coming, but I wanted to
go ahead and get the big, sizable chunk into your hands sooner rather
than waiting for those last few fixups.
Of note is the fact that this creates what is intended to be a
temporary area in the drivers/staging tree specifically for some
cleanups and additions that are coming for the RDMA stack. We
deprecated two drivers (ipath and amso1100) and are waiting to hear
back if we can deprecate another one (ehca). We also put Intel's new
hfi1 driver into this area because it needs to be refactored and a
transfer library created out of the factored out code, and then it and
the qib driver and the soft-roce driver should all be modified to use
that library.
I expect drivers/staging/rdma to be around for three or four kernel
releases and then to go away as all of the work is completed and final
deletions of deprecated drivers are done.
Summary of changes for 4.3:
- Create drivers/staging/rdma
- Move amso1100 driver to staging/rdma and schedule for deletion
- Move ipath driver to staging/rdma and schedule for deletion
- Add hfi1 driver to staging/rdma and set TODO for move to regular
tree
- Initial support for namespaces to be used on RDMA devices
- Add RoCE GID table handling to the RDMA core caching code
- Infrastructure to support handling of devices with differing read
and write scatter gather capabilities
- Various iSER updates
- Kill off unsafe usage of global mr registrations
- Update SRP driver
- Misc mlx4 driver updates
- Support for the mr_alloc verb
- Support for a netlink interface between kernel and user space cache
daemon to speed path record queries and route resolution
- Ininitial support for safe hot removal of verbs devices"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (136 commits)
IB/ipoib: Suppress warning for send only join failures
IB/ipoib: Clean up send-only multicast joins
IB/srp: Fix possible protection fault
IB/core: Move SM class defines from ib_mad.h to ib_smi.h
IB/core: Remove unnecessary defines from ib_mad.h
IB/hfi1: Add PSM2 user space header to header_install
IB/hfi1: Add CSRs for CONFIG_SDMA_VERBOSITY
mlx5: Fix incorrect wc pkey_index assignment for GSI messages
IB/mlx5: avoid destroying a NULL mr in reg_user_mr error flow
IB/uverbs: reject invalid or unknown opcodes
IB/cxgb4: Fix if statement in pick_local_ip6adddrs
IB/sa: Fix rdma netlink message flags
IB/ucma: HW Device hot-removal support
IB/mlx4_ib: Disassociate support
IB/uverbs: Enable device removal when there are active user space applications
IB/uverbs: Explicitly pass ib_dev to uverbs commands
IB/uverbs: Fix race between ib_uverbs_open and remove_one
IB/uverbs: Fix reference counting usage of event files
IB/core: Make ib_dealloc_pd return void
IB/srp: Create an insecure all physical rkey only if needed
...
The only user is sock_update_memcg which is living in memcontrol.c so it
doesn't make much sense to pollute sock.h by this inline helper. Move it
to memcontrol.c and open code it into its only caller.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mem_cgroup structure is defined in mm/memcontrol.c currently which means
that the code outside of this file has to use external API even for
trivial access stuff.
This patch exports mm_struct with its dependencies and makes some of the
exported functions inlines. This even helps to reduce the code size a bit
(make defconfig + CONFIG_MEMCG=y)
text data bss dec hex filename
12355346 1823792 1089536 15268674 e8fb42 vmlinux.before
12354970 1823792 1089536 15268298 e8f9ca vmlinux.after
This is not much (370B) but better than nothing.
We also save a function call in some hot paths like callers of
mem_cgroup_count_vm_event which is used for accounting.
The patch doesn't introduce any functional changes.
[vdavykov@parallels.com: inline memcg_kmem_is_active]
[vdavykov@parallels.com: do not expose type outside of CONFIG_MEMCG]
[akpm@linux-foundation.org: memcontrol.h needs eventfd.h for eventfd_ctx]
[akpm@linux-foundation.org: export mem_cgroup_from_task() to modules]
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fix for the sizeof() pointer type issue
* a fix for regulatory getting into a restore loop
* a fix for rfkill global 'all' state, it needs to be stored
everywhere to apply correctly to new rfkill instances
* properly refuse CQM RSSI when it cannot actually be used
* protect HT TDLS traffic properly in non-HT networks
* don't incorrectly advertise 80 MHz support when not allowed
-----BEGIN PGP SIGNATURE-----
iQIcBAABCAAGBQJV6av+AAoJEDBSmw7B7bqrNYYP/1kZjdk9s4wPHpQndfSWK/sj
yTjysh/VXZ5Jyzkcmg337AjEuuJ/SntDXjMXEKTxHHrt5fmHFBvmPErPUY7p7I/t
MJqxVnYGypGW4LImyXYRJbejabmPrxtHC+jcWibw24yBCzJGCXF/MeKCotksnvIn
IoJDvUHrVFZhfzFRepdozJi1Qy2Y04Q7zXywplW6XFYWxdfDWaWyno/xh+rM8oO7
xSgpkuj4FDfimDBjzJOHwBH3xOlt/uHoB97G0TJbJ6k0R6a8aTz8eK5OWB4rFcn+
BsuQfaWcrGg9fJsoqz4mYlWf/F4Qj/gU9Gr5sBUWWa0xQ7HctcgJ/t27XjUWP7oV
VnQ9YQt/SukG3BVf8GuNqaSo7/7oprWHdDYRH0tnvImXCWsUlHM8gv8MrG0UOHGC
4I4qgPLHon6Ui3lNHfivl8iSDFT+jfsMDMWiB6kKZRpUOdEicRrepMoDARVbTYck
h6uY4Tx6OQ39evLSXjyrCWb0aOr2WoDCypxEkpE+fgDsWr8yTV0Ti0QZfVqatO3d
d4GQb6U5Zkd14ymKtzULa5qjOw1TmnKpxVzSSdrONaPBVdUDsl4QH38gIppFjGsV
61Gn3PXRRgy+VAib/YyIYqilvdhZ2TdXaSVAEyKv6WjKavOFA/8KsDPmJsyKoCSn
1dDOwaiqgfu2bK1PTU9A
=xw8r
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2015-09-04' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
For the first round of fixes, we have this:
* fix for the sizeof() pointer type issue
* a fix for regulatory getting into a restore loop
* a fix for rfkill global 'all' state, it needs to be stored
everywhere to apply correctly to new rfkill instances
* properly refuse CQM RSSI when it cannot actually be used
* protect HT TDLS traffic properly in non-HT networks
* don't incorrectly advertise 80 MHz support when not allowed
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
include/net/netfilter/nf_conntrack.h
The conflict was an overlap between changing the type of the zone
argument to nf_ct_tmpl_alloc() whilst exporting nf_ct_tmpl_free.
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net, they are:
1) Oneliner to restore maps in nf_tables since we support addressing registers
at 32 bits level.
2) Restore previous default behaviour in bridge netfilter when CONFIG_IPV6=n,
oneliner from Bernhard Thaler.
3) Out of bound access in ipset hash:net* set types, reported by Dave Jones'
KASan utility, patch from Jozsef Kadlecsik.
4) Fix ipset compilation with gcc 4.4.7 related to C99 initialization of
unnamed unions, patch from Elad Raz.
5) Add a workaround to address inconsistent endianess in the res_id field of
nfnetlink batch messages, reported by Florian Westphal.
6) Fix error paths of CT/synproxy since the conntrack template was moved to use
kmalloc, patch from Daniel Borkmann.
All of them look good to me to reach 4.2, I can route this to -stable myself
too, just let me know what you prefer.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
HT TDLS traffic should be protected in a non-HT BSS to avoid
collisions. Therefore, when TDLS peers join/leave, check if
protection is (now) needed and set the ht_operation_mode of
the virtual interface according to the HT capabilities of the
TDLS peer(s).
This works because a non-HT BSS connection never sets (or
otherwise uses) the ht_operation_mode; it just means that
drivers must be aware that this field applies to all HT
traffic for this virtual interface, not just the traffic
within the BSS. Document that.
Signed-off-by: Avri Altman <avri.altman@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Fengguang reported, that some randconfig generated the following linker
issue with nf_ct_zone_dflt object involved:
[...]
CC init/version.o
LD init/built-in.o
net/built-in.o: In function `ipv4_conntrack_defrag':
nf_defrag_ipv4.c:(.text+0x93e95): undefined reference to `nf_ct_zone_dflt'
net/built-in.o: In function `ipv6_defrag':
nf_defrag_ipv6_hooks.c:(.text+0xe3ffe): undefined reference to `nf_ct_zone_dflt'
make: *** [vmlinux] Error 1
Given that configurations exist where we have a built-in part, which is
accessing nf_ct_zone_dflt such as the two handlers nf_ct_defrag_user()
and nf_ct6_defrag_user(), and a part that configures nf_conntrack as a
module, we must move nf_ct_zone_dflt into a fixed, guaranteed built-in
area when netfilter is configured in general.
Therefore, split the more generic parts into a common header under
include/linux/netfilter/ and move nf_ct_zone_dflt into the built-in
section that already holds parts related to CONFIG_NF_CONNTRACK in the
netfilter core. This fixes the issue on my side.
Fixes: 308ac9143e ("netfilter: nf_conntrack: push zone object into functions")
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Just have a flags member instead.
In file included from include/linux/linkage.h:4:0,
from include/linux/kernel.h:6,
from net/core/flow_dissector.c:1:
In function 'flow_keys_hash_start',
inlined from 'flow_hash_from_keys' at net/core/flow_dissector.c:553:34:
>> include/linux/compiler.h:447:38: error: call to '__compiletime_assert_459' declared with attribute error: BUILD_BUG_ON failed: FLOW_KEYS_HASH_OFFSET % sizeof(u32)
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an input flag to flow dissector on rather dissection should stop
when encapsulation is detected (IP/IP or GRE). Also, add a key_control
flag that indicates encapsulation was encountered during the
dissection.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an input flag to flow dissector on rather dissection should be
stopped when a flow label is encountered. Presumably, the flow label
is derived from a sufficient hash of an inner transport packet so
further dissection is not needed (that is ports are not included in
the flow hash). Using the flow label instead of ports has the additional
benefit that packet fragments should hash to same value as non-fragments
for a flow (assuming that the same flow label is used).
We set this flag by default in for skb_get_hash.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an input flag to flow dissector on rather dissection should be
stopped when an L3 packet is encountered. This would be useful if a
caller just wanted to get IP addresses of the outermost header (e.g.
to do an L3 hash).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an input flag to flow dissector on rather dissection should be
attempted on a first fragment. Also add key_control flags to indicate
that a packet is a fragment or first fragment.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create __get_hash_from_flowi6 and __get_hash_from_flowi4 to get the
flow keys and hash based on flowi structures. These are called by
__skb_get_hash_flowi6 and __skb_get_hash_flowi4. Also, created
get_hash_from_flowi6 and get_hash_from_flowi4 which can be called
when just the hash value for a flowi is needed.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move __skb_set_sw_hash to skbuff.h and add __skb_set_hash which is
a common method (between __skb_set_sw_hash and skb_set_hash) to set
the hash in an skbuff.
Also, move skb_clear_hash to be closer to __skb_set_hash.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the flow dissector functions that are specific to skbuffs into
skbuff.h out of flow_dissector.h. This makes flow_dissector.h have
no dependencies on skbuff.h.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A number of VRF patches used 'int' for table id. It should be u32 to be
consistent with the rest of the stack.
Fixes:
4e3c89920c ("net: Introduce VRF related flags and helpers")
15be405eb2 ("net: Add inet_addr lookup by table")
30bbaa1950 ("net: Fix up inet_addr_type checks")
021dd3b8a1 ("net: Add routes to the table associated with the device")
dc028da54e ("inet: Move VRF table lookup to inlined function")
f6d3c19274 ("net: FIB tracepoints")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 0838aa7fcf ("netfilter: fix netns dependencies with conntrack
templates") migrated templates to the new allocator api, but forgot to
update error paths for them in CT and synproxy to use nf_ct_tmpl_free()
instead of nf_conntrack_free().
Due to that, memory is being freed into the wrong kmemcache, but also
we drop the per net reference count of ct objects causing an imbalance.
In Brad's case, this leads to a wrap-around of net->ct.count and thus
lets __nf_conntrack_alloc() refuse to create a new ct object:
[ 10.340913] xt_addrtype: ipv6 does not support BROADCAST matching
[ 10.810168] nf_conntrack: table full, dropping packet
[ 11.917416] r8169 0000:07:00.0 eth0: link up
[ 11.917438] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 12.815902] nf_conntrack: table full, dropping packet
[ 15.688561] nf_conntrack: table full, dropping packet
[ 15.689365] nf_conntrack: table full, dropping packet
[ 15.690169] nf_conntrack: table full, dropping packet
[ 15.690967] nf_conntrack: table full, dropping packet
[...]
With slab debugging, it also reports the wrong kmemcache (kmalloc-512 vs.
nf_conntrack_ffffffff81ce75c0) and reports poison overwrites, etc. Thus,
to fix the problem, export and use nf_ct_tmpl_free() instead.
Fixes: 0838aa7fcf ("netfilter: fix netns dependencies with conntrack templates")
Reported-by: Brad Jackson <bjackson0971@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
opts_size is only written and never read. Following patch
removes this unused variable.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This sysctl will be used to enable the scheduling of icmp packets.
Signed-off-by: Alex Gartrell <agartrell@fb.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
No longer necessary since the information is included in the ip_vs_iphdr
itself.
Signed-off-by: Alex Gartrell <agartrell@fb.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
These flags contain information like whether or not the addresses are
inverted or from icmp. The first will allow us to drop an inverse param
all over the place, and the second will later be useful in scheduling icmp.
Signed-off-by: Alex Gartrell <agartrell@fb.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
This removes some duplicated code and makes the ICMPv6 path look more like
the ICMP path.
Signed-off-by: Alex Gartrell <agartrell@fb.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
As David pointed out, spinlock are no longer needed
to protect the per cpu queues used in gro cells infrastructure.
Also use new napi_complete_done() API so that gro_flush_timeout
tweaks have an effect.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the following case doesn't use DCTCP, even if it should:
A responder has f.e. Cubic as system wide default, but for a specific
route to the initiating host, DCTCP is being set in RTAX_CC_ALGO. The
initiating host then uses DCTCP as congestion control, but since the
initiator sets ECT(0), tcp_ecn_create_request() doesn't set ecn_ok,
and we have to fall back to Reno after 3WHS completes.
We were thinking on how to solve this in a minimal, non-intrusive
way without bloating tcp_ecn_create_request() needlessly: lets cache
the CA ecn option flag in RTAX_FEATURES. In other words, when ECT(0)
is set on the SYN packet, set ecn_ok=1 iff route RTAX_FEATURES
contains the unexposed (internal-only) DST_FEATURE_ECN_CA. This allows
to only do a single metric feature lookup inside tcp_ecn_create_request().
Joint work with Florian Westphal.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently tun-info options pointer is used in few cases to
pass options around. But tunnel options can be accessed using
ip_tunnel_info_opts() API without using the pointer. Following
patch removes the redundant pointer and consistently make use
of API.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Reviewed-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some consumers of the netdev events API would like to know who is the
active slave when a NETDEV_CHANGEUPPER or NETDEV_BONDING_FAILOVER
events occur. For example, when managing RoCE GIDs, GIDs based on the
bond's ips should only be set on the port which corresponds to active
slave netdevice.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
For loopback purposes, RoCE devices should have a default GID in the
port GID table, even when the interface is down. In order to do so,
we use the IPv6 link local address which would have been genenrated
for the related Ethernet netdevice when it goes up as a default GID.
addrconf_ifid_eui48 is used to gernerate this address, export it.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
By default (subject to the sysctl settings), IPv6 sockets listen also for
IPv4 traffic. Vxlan is not prepared for that and expects IPv6 header in
packets received through an IPv6 socket.
In addition, it's currently not possible to have both IPv4 and IPv6 vxlan
tunnel on the same port (unless bindv6only sysctl is enabled), as it's not
possible to create and bind both IPv4 and IPv6 vxlan interfaces and there's
no way to specify both IPv4 and IPv6 remote/group IP addresses.
Set IPV6_V6ONLY on vxlan sockets to fix both of these issues. This is not
done globally in udp_tunnel, as l2tp and tipc seems to work okay when
receiving IPv4 packets on IPv6 socket and people may rely on this behavior.
The other tunnels (geneve and fou) do not support IPv6.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There's currently nothing preventing directing packets with IPv6
encapsulation data to IPv4 tunnels (and vice versa). If this happens,
IPv6 addresses are incorrectly interpreted as IPv4 ones.
Track whether the given ip_tunnel_key contains IPv4 or IPv6 data. Store this
in ip_tunnel_info. Reject packets at appropriate places if they are supposed
to be encapsulated into an incompatible protocol.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The mode field holds a single bit of information only (whether the
ip_tunnel_info struct is for rx or tx). Change the mode field to bit flags.
This allows more mode flags to be added.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter/IPVS updates for your net-next tree.
In sum, patches to address fallout from the previous round plus updates from
the IPVS folks via Simon Horman, they are:
1) Add a new scheduler to IPVS: The weighted overflow scheduling algorithm
directs network connections to the server with the highest weight that is
currently available and overflows to the next when active connections exceed
the node's weight. From Raducu Deaconu.
2) Fix locking ordering in IPVS, always take rtnl_lock in first place. Patch
from Julian Anastasov.
3) Allow to indicate the MTU to the IPVS in-kernel state sync daemon. From
Julian Anastasov.
4) Enhance multicast configuration for the IPVS state sync daemon. Also from
Julian.
5) Resolve sparse warnings in the nf_dup modules.
6) Fix a linking problem when CONFIG_NF_DUP_IPV6 is not set.
7) Add ICMP codes 5 and 6 to IPv6 REJECT target, they are more informative
subsets of code 1. From Andreas Herz.
8) Revert the jumpstack size calculation from mark_source_chains due to chain
depth miscalculations, from Florian Westphal.
9) Calm down more sparse warning around the Netfilter tree, again from Florian
Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
inetpeer caches based on address only, so duplicate IP addresses within
a namespace return the same cached entry. Enhance the ipv4 address key
to contain both the IPv4 address and VRF device index.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the inetpeer_addr_base union to inetpeer_addr and drop
inetpeer_addr_base.
Both the a6 and in6_addr overlays are not needed; drop the __be32 version
and rename in6 to a6 for consistency with ipv4. Add a new u32 array to
the union which removes the need for the typecast in the compare function
and the use of a consistent arg for both ipv4 and ipv6 addresses which
makes the compare function more readable.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_metrics and inetpeer both have functions to compare inetpeer
addresses. Consolidate into 1 version.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use inetpeer set,get helpers in tcp_metrics rather than peeking into
the inetpeer_addr struct.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactors a common line into helper function.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This way users can attach noqueue just like any other qdisc using tc
without having to mess with tx_queue_len first.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
geneve_core module handles send and receive functionality.
This way OVS could use the Geneve API. Now with use of
tunnel meatadata mode OVS can directly use Geneve netdevice.
So there is no need for separate module for Geneve. Following
patch consolidates Geneve protocol processing in single module.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Reviewed-by: Jesse Gross <jesse@nicira.com>
Acked-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following patch create new tunnel flag which enable
tunnel metadata collection on given device. These devices
can be used by tunnel metadata based routing or by OVS.
Geneve Consolidation patch get rid of collect_md_tun to
simplify tunnel lookup further.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Reviewed-by: Jesse Gross <jesse@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce function udp_tun_rx_dst() to initialize tunnel dst on
receive path.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Reviewed-by: Jesse Gross <jesse@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
For classifiers getting invoked via tc_classify(), we always need an
extra function call into tc_classify_compat(), as both are being
exported as symbols and tc_classify() itself doesn't do much except
handling of reclassifications when tp->classify() returned with
TC_ACT_RECLASSIFY.
CBQ and ATM are the only qdiscs that directly call into tc_classify_compat(),
all others use tc_classify(). When tc actions are being configured
out in the kernel, tc_classify() effectively does nothing besides
delegating.
We could spare this layer and consolidate both functions. pktgen on
single CPU constantly pushing skbs directly into the netif_receive_skb()
path with a dummy classifier on ingress qdisc attached, improves
slightly from 22.3Mpps to 23.1Mpps.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add functions to change connlabel length into nf_conntrack_labels.c so
they may be reused by other modules like OVS and nftables without
needing to jump through xt_match_check() hoops.
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Florian Westphal <fw@strlen.de>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This variation on skb_dst_copy() doesn't require two skbs.
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to act_gact/act_mirred, act_bpf can be lockless in packet processing
with extra care taken to free bpf programs after rcu grace period.
Replacement of existing act_bpf (very rare) is done with synchronize_rcu()
and final destruction is done from tc_action_ops->cleanup() callback that is
called from tcf_exts_destroy()->tcf_action_destroy()->__tcf_hash_release() when
bind and refcnt reach zero which is only possible when classifier is destroyed.
Previous two patches fixed the last two classifiers (tcindex and rsvp) to
call tcf_exts_destroy() from rcu callback.
Similar to gact/mirred there is a race between prog->filter and
prog->tcf_action. Meaning that the program being replaced may use
previous default action if it happened to return TC_ACT_UNSPEC.
act_mirred race betwen tcf_action and tcfm_dev is similar.
In all cases the race is harmless.
Long term we may want to improve the situation by replacing the whole
tc_action->priv as single pointer instead of updating inner fields one by one.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcf_hash_destroy() used once. Make it static.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The vxlan_get_sk_family inline function was added after the last #endif,
making multiple inclusion of net/vxlan.h fail. Move it to the proper place.
Reported-by: Mark Rustad <mark.d.rustad@intel.com>
Fixes: 705cc62f67 ("vxlan: provide access function for vxlan socket address family")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove various inlined functions not referenced in the kernel.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When TCP pacing was added back in linux-3.12, we chose
to apply a fixed ratio of 200 % against current rate,
to allow probing for optimal throughput even during
slow start phase, where cwnd can be doubled every other gRTT.
At Google, we found it was better applying a different ratio
while in Congestion Avoidance phase.
This ratio was set to 120 %.
We've used the normal tcp_in_slow_start() helper for a while,
then tuned the condition to select the conservative ratio
as soon as cwnd >= ssthresh/2 :
- After cwnd reduction, it is safer to ramp up more slowly,
as we approach optimal cwnd.
- Initial ramp up (ssthresh == INFINITY) still allows doubling
cwnd every other RTT.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
slow start after idle might reduce cwnd, but we perform this
after first packet was cooked and sent.
With TSO/GSO, it means that we might send a full TSO packet
even if cwnd should have been reduced to IW10.
Moving the SSAI check in skb_entail() makes sense, because
we slightly reduce number of times this check is done,
especially for large send() and TCP Small queue callbacks from
softirq context.
As Neal pointed out, we also need to perform the check
if/when receive window opens.
Tested:
Following packetdrill test demonstrates the problem
// Test of slow start after idle
`sysctl -q net.ipv4.tcp_slow_start_after_idle=1`
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
+.100 < . 1:1(0) ack 1 win 511
+0 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_SOCKET, SO_SNDBUF, [200000], 4) = 0
+0 write(4, ..., 26000) = 26000
+0 > . 1:5001(5000) ack 1
+0 > . 5001:10001(5000) ack 1
+0 %{ assert tcpi_snd_cwnd == 10 }%
+.100 < . 1:1(0) ack 10001 win 511
+0 %{ assert tcpi_snd_cwnd == 20, tcpi_snd_cwnd }%
+0 > . 10001:20001(10000) ack 1
+0 > P. 20001:26001(6000) ack 1
+.100 < . 1:1(0) ack 26001 win 511
+0 %{ assert tcpi_snd_cwnd == 36, tcpi_snd_cwnd }%
+4 write(4, ..., 20000) = 20000
// If slow start after idle works properly, we should send 5 MSS here (cwnd/2)
+0 > . 26001:31001(5000) ack 1
+0 %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }%
+0 > . 31001:36001(5000) ack 1
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add cfg and family arguments to lwt build state functions. cfg is a void
pointer and will either be a pointer to a fib_config or fib6_config
structure. The family parameter indicates which one (either AF_INET
or AF_INET6).
LWT encpasulation implementation may use the fib configuration to build
the LWT state.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the NFC pull request for 4.3.
With this one we have:
- A new driver for Samsung's S3FWRN5 NFC chipset. In order to
properly support this driver, a few NCI core routines needed
to be exported. Future drivers like Intel's Fields Peak will
benefit from this.
- SPI support as a physical transport for STM st21nfcb.
- An additional netlink API for sending replies back to userspace
from vendor commands.
- 2 small fixes for TI's trf7970a
- A few st-nci fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJV1l7oAAoJEIqAPN1PVmxK7DwP+QF3A6wCBzaNQKcla6LOl+Ru
lGquPpFyihlDT/916IP7MnMNZYOP3ENdGll5lKts2yKxuty327Bb2UWNkaFP63Ei
+zZwhoZXwh6dbK35kwnd87Cwgn0E8vTF+zHhC2MmP8uGDIgOuevb//2GBDayG88T
rw4QMZsunT4o9x/bNK1uTlYaKPDs5pxXYFYUPXOQ2F5GqpUVFag3pLbZcJhpSZg7
9Y1KNLxwi/1rwO90JTXH9CQ+oWgVcH86nIlzqGznxgNdOoCF/V/0hdGlTzj8sE6T
A3a0Qy1gaWQw9+9QoqE7YWu6JfEIgEhRgFx8dm4SsmUbrnlmgYBavEFeG7++1AG5
QByWh/h2po0MysNRCfhey2EExlZgdvc1WyLQlS+0w+aWmM5MYq+J4lx+sqXdUDdu
MZyNDytRfGgRimBPSxyjuHtzrBWR8MyenjsbpraNoVDlFD8wVnGa5OcAv2gi7dkD
wloOSZj5jJq9WoMeGWEwp2TsQMySJDydBSTvgqovlk2K8gmeY69g2YUUmwR2Truf
fulBsO8upet/v2cRbCepI2X4NvS37wZBRcJtPpGcCoXUagA1vWH8eBKfjoduGERt
vc5c9DYjSA0VEJ3kzu3Atro/oNrZDDqvX1wEn+64fk1B8v53Lvf2X0ESQQUcfWpz
k7hPYzOt2IOA7d41CY59
=KMsn
-----END PGP SIGNATURE-----
Merge tag 'nfc-next-4.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next
Samuel Ortiz says:
====================
NFC 4.3 pull request
This is the NFC pull request for 4.3.
With this one we have:
- A new driver for Samsung's S3FWRN5 NFC chipset. In order to
properly support this driver, a few NCI core routines needed
to be exported. Future drivers like Intel's Fields Peak will
benefit from this.
- SPI support as a physical transport for STM st21nfcb.
- An additional netlink API for sending replies back to userspace
from vendor commands.
- 2 small fixes for TI's trf7970a
- A few st-nci fixes.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
__recnt and related fields need to be in its own cacheline for performance
reasons. Commit 61adedf3e3 ("route: move lwtunnel state to dst_entry")
broke that on 32bit archs, causing BUILD_BUG_ON in dst_hold to be triggered.
This patch fixes the breakage by moving the lwtunnel state to the end of
dst_entry on 32bit archs. Unfortunately, this makes it share the cacheline
with __refcnt and may affect performance, thus further patches may be
needed.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Fixes: 61adedf3e3 ("route: move lwtunnel state to dst_entry")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add calls to gro_cells infrastructure to do GRO when receiving on a tunnel.
Testing:
Ran 200 netperf TCP_STREAM instance
- With fix (GRO enabled on VXLAN interface)
Verify GRO is happening.
9084 MBps tput
3.44% CPU utilization
- Without fix (GRO disabled on VXLAN interface)
Verified no GRO is happening.
9084 MBps tput
5.54% CPU utilization
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- mcast_group: configure the multicast address, now IPv6
is supported too
- mcast_port: configure the multicast port
- mcast_ttl: configure the multicast TTL/HOP_LIMIT
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Allow setups with large MTU to send large sync packets by
adding sync_maxlen parameter. The default value is now based
on MTU but no more than 1500 for compatibility reasons.
To avoid problems if MTU changes allow fragmentation by
sending packets with DF=0. Problem reported by Dan Carpenter.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
This is second pull request includes the conflict resolution patch that
resulted from the updates that we got for the conntrack template through
kmalloc. No changes with regards to the previously sent 15 patches.
The following patchset contains Netfilter updates for your net-next tree, they
are:
1) Rework the existing nf_tables counter expression to make it per-cpu.
2) Prepare and factor out common packet duplication code from the TEE target so
it can be reused from the new dup expression.
3) Add the new dup expression for the nf_tables IPv4 and IPv6 families.
4) Convert the nf_tables limit expression to use a token-based approach with
64-bits precision.
5) Enhance the nf_tables limit expression to support limiting at packet byte.
This comes after several preparation patches.
6) Add a burst parameter to indicate the amount of packets or bytes that can
exceed the limiting.
7) Add netns support to nfacct, from Andreas Schultz.
8) Pass the nf_conn_zone structure instead of the zone ID in nf_tables to allow
accessing more zone specific information, from Daniel Borkmann.
9) Allow to define zone per-direction to support netns containers with
overlapping network addressing, also from Daniel.
10) Extend the CT target to allow setting the zone based on the skb->mark as a
way to support simple mappings from iptables, also from Daniel.
11) Make the nf_tables payload expression aware of the fact that VLAN offload
may have removed a vlan header, from Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Use flowi_tunnel in flowi6 similarly to what is done with IPv4.
This complements commit 1b7179d3ad ("route: Extend flow representation
with tunnel key").
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
If output device wants to see the dst, inherit the dst of the original skb
in the ndisc request.
This is an IPv6 counterpart of commit 0accfc268f ("arp: Inherit metadata
dst when creating ARP requests").
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the lwtunnel state resides in per-protocol data. This is
a problem if we encapsulate ipv6 traffic in an ipv4 tunnel (or vice versa).
The xmit function of the tunnel does not know whether the packet has been
routed to it by ipv4 or ipv6, yet it needs the lwtstate data. Moving the
lwtstate data to dst_entry makes such inter-protocol tunneling possible.
As a bonus, this brings a nice diffstat.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename the ipv4_tos and ipv4_ttl fields to just 'tos' and 'ttl', as they'll
be used with IPv6 tunnels, too.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the IPv6 addresses as an union with IPv4 ones. When using IPv4, the
newly introduced padding after the IPv4 addresses needs to be zeroed out.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ip_tunnels.h include file uses mixture of __u16 and u16 (etc.) types.
Unify it to the non-underscore variants.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The custom alignment of struct ip_tunnel_key is unnecessary. In struct
sw_flow_key, it starts at offset 256, in struct ip_tunnel_info it's the
first field.
The structure is also packed even without the __packed keyword.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
A proprietary vendor command may send back useful data to the user
application.
For example, the field level applied on the NFC router antenna.
Still based on net/wireless/nl80211.c implementation,
add nfc_vendor_cmd_alloc_reply_skb and nfc_vendor_cmd_reply in
order to send back over netlink data generated by a proprietary
command.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Some drivers needs to have ability to reinit NCI core, for example
after updating firmware in setup() of post_setup() callback. This
patch makes nci_core_reset() and nci_core_init() functions public,
to make it possible.
Signed-off-by: Robert Baldyga <r.baldyga@samsung.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Some drivers require non-standard configuration after NCI_CORE_INIT
request, because they need to know ndev->manufact_specific_info or
ndev->manufact_id. This patch adds post_setup handler allowing to do
such custom configuration.
Signed-off-by: Robert Baldyga <r.baldyga@samsung.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
When CONFIG_LWTUNNEL config is not enabled, the lwtstate_free() is not
declared in lwtunnel.h at all. However, even in this case, the function
is still referenced in fib_semantics.c so that there appears the
following sparse warnings:
net/ipv4/fib_semantics.c:553:17: error: undefined identifier 'lwtstate_free'
CC net/ipv4/fib_semantics.o
net/ipv4/fib_semantics.c: In function ‘fib_encap_match’:
net/ipv4/fib_semantics.c:553:3: error: implicit declaration of function ‘lwtstate_free’ [-Werror=implicit-function-declaration]
cc1: some warnings being treated as errors
make[1]: *** [net/ipv4/fib_semantics.o] Error 1
make: *** [net/ipv4/fib_semantics.o] Error 2
To eliminate the error, we define an empty function for lwtstate_free()
in lwtunnel.h when CONFIG_LWTUNNEL is disabled.
Fixes: df383e6240 ("lwtunnel: fix memory leak")
Cc: Jiri Benc <jbenc@redhat.com>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
230ac490f7 introduced a dependency to CONFIG_IPV6 which breaks bridging
of IPv6 packets on a bridge with CONFIG_IPV6=n.
Sysctl entry /proc/sys/net/bridge/bridge-nf-call-ip6tables defaults to 1,
for this reason packets are handled by br_nf_pre_routing_ipv6(). When compiled
with CONFIG_IPV6=n this function returns NF_DROP but should return NF_ACCEPT
to let packets through.
Change CONFIG_IPV6=n br_nf_pre_routing_ipv6() return value to NF_ACCEPT.
Tested with a simple bridge with two interfaces and IPv6 packets trying
to pass from host on left side to host on right side of the bridge.
Fixes: 230ac490f7 ("netfilter: bridge: split ipv6 code into separated file")
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nft_type_to_reg() needs to return the register in the new 32 bit addressing,
otherwise we hit EINVAL when using mappings.
Fixes: 49499c3 ("netfilter: nf_tables: switch registers to 32 bit addressing")
Reported-by: Andreas Schultz <aschultz@tpip.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
slave_queue has a num_slaves member which is unused, drop it.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The built lwtunnel_state struct has to be freed after comparison.
Fixes: 571e722676 ("ipv4: support for fib route lwtunnel encap attributes")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an inline helper for determining is a port is a DSA port.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This function updates a checksum field value and skb->csum based on
a value which is the difference between the old and new checksum.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet_proto_csum_replace4,2,16 take a pseudohdr argument which indicates
the checksum field carries a pseudo header. This argument should be a
boolean instead of an int.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the capability to redirect dst input in the same way
that dst output is redirected by LWT.
Also, save the original dst.input and and dst.out when setting up
lwtunnel redirection. These can be called by the client as a pass-
through.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This work adds the possibility of deriving the zone id from the skb->mark
field in a scalable manner. This allows for having only a single template
serving hundreds/thousands of different zones, for example, instead of the
need to have one match for each zone as an extra CT jump target.
Note that we'd need to have this information attached to the template as at
the time when we're trying to lookup a possible ct object, we already need
to know zone information for a possible match when going into
__nf_conntrack_find_get(). This work provides a minimal implementation for
a possible mapping.
In order to not add/expose an extra ct->status bit, the zone structure has
been extended to carry a flag for deriving the mark.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This work adds a direction parameter to netfilter zones, so identity
separation can be performed only in original/reply or both directions
(default). This basically opens up the possibility of doing NAT with
conflicting IP address/port tuples from multiple, isolated tenants
on a host (e.g. from a netns) without requiring each tenant to NAT
twice resp. to use its own dedicated IP address to SNAT to, meaning
overlapping tuples can be made unique with the zone identifier in
original direction, where the NAT engine will then allocate a unique
tuple in the commonly shared default zone for the reply direction.
In some restricted, local DNAT cases, also port redirection could be
used for making the reply traffic unique w/o requiring SNAT.
The consensus we've reached and discussed at NFWS and since the initial
implementation [1] was to directly integrate the direction meta data
into the existing zones infrastructure, as opposed to the ct->mark
approach we proposed initially.
As we pass the nf_conntrack_zone object directly around, we don't have
to touch all call-sites, but only those, that contain equality checks
of zones. Thus, based on the current direction (original or reply),
we either return the actual id, or the default NF_CT_DEFAULT_ZONE_ID.
CT expectations are direction-agnostic entities when expectations are
being compared among themselves, so we can only use the identifier
in this case.
Note that zone identifiers can not be included into the hash mix
anymore as they don't contain a "stable" value that would be equal
for both directions at all times, f.e. if only zone->id would
unconditionally be xor'ed into the table slot hash, then replies won't
find the corresponding conntracking entry anymore.
If no particular direction is specified when configuring zones, the
behaviour is exactly as we expect currently (both directions).
Support has been added for the CT netlink interface as well as the
x_tables raw CT target, which both already offer existing interfaces
to user space for the configuration of zones.
Below a minimal, simplified collision example (script in [2]) with
netperf sessions:
+--- tenant-1 ---+ mark := 1
| netperf |--+
+----------------+ | CT zone := mark [ORIGINAL]
[ip,sport] := X +--------------+ +--- gateway ---+
| mark routing |--| SNAT |-- ... +
+--------------+ +---------------+ |
+--- tenant-2 ---+ | ~~~|~~~
| netperf |--+ +-----------+ |
+----------------+ mark := 2 | netserver |------ ... +
[ip,sport] := X +-----------+
[ip,port] := Y
On the gateway netns, example:
iptables -t raw -A PREROUTING -j CT --zone mark --zone-dir ORIGINAL
iptables -t nat -A POSTROUTING -o <dev> -j SNAT --to-source <ip> --random-fully
iptables -t mangle -A PREROUTING -m conntrack --ctdir ORIGINAL -j CONNMARK --save-mark
iptables -t mangle -A POSTROUTING -m conntrack --ctdir REPLY -j CONNMARK --restore-mark
conntrack dump from gateway netns:
netperf -H 10.1.1.2 -t TCP_STREAM -l60 -p12865,5555 from each tenant netns
tcp 6 431995 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=5555 dport=12865 zone-orig=1
src=10.1.1.2 dst=10.1.1.1 sport=12865 dport=1024
[ASSURED] mark=1 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp 6 431994 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=5555 dport=12865 zone-orig=2
src=10.1.1.2 dst=10.1.1.1 sport=12865 dport=5555
[ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp 6 299 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=39438 dport=33768 zone-orig=1
src=10.1.1.2 dst=10.1.1.1 sport=33768 dport=39438
[ASSURED] mark=1 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp 6 300 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=32889 dport=40206 zone-orig=2
src=10.1.1.2 dst=10.1.1.1 sport=40206 dport=32889
[ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 use=2
Taking this further, test script in [2] creates 200 tenants and runs
original-tuple colliding netperf sessions each. A conntrack -L dump in
the gateway netns also confirms 200 overlapping entries, all in ESTABLISHED
state as expected.
I also did run various other tests with some permutations of the script,
to mention some: SNAT in random/random-fully/persistent mode, no zones (no
overlaps), static zones (original, reply, both directions), etc.
[1] http://thread.gmane.org/gmane.comp.security.firewalls.netfilter.devel/57412/
[2] https://paste.fedoraproject.org/242835/65657871/
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Table lookup compiles out when VRF is not enabled.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2015-08-16
Here's what's likely the last bluetooth-next pull request for 4.3:
- 6lowpan/802.15.4 refactoring, cleanups & fixes
- Document 6lowpan netdev usage in Documentation/networking/6lowpan.txt
- Support for UART based QCA Bluetooth controllers
- Power management support for Broeadcom Bluetooth controllers
- Change LE connection initiation to always use passive scanning first
- Support for new Silicon Wave USB ID
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
a bit of content:
* mesh fixes/improvements from Alexis, Bob, Chun-Yeow and Jesse
* TDLS higher bandwidth support (Arik)
* OCB fixes from Bertold Van den Bergh
* suspend/resume fixes from Eliad
* dynamic SMPS support for minstrel-HT (Krishna Chaitanya)
* VHT bitrate mask support (Lorenzo Bianconi)
* better regulatory support for 5/10 MHz channels (Matthias May)
* basic support for MU-MIMO to avoid the multi-vif issue (Sara Sharon)
along with a number of other cleanups.
-----BEGIN PGP SIGNATURE-----
iQIcBAABCAAGBQJVzg5bAAoJEDBSmw7B7bqr3PAP/1r8wyZXxtySzz6P5Z9k0+2I
52NiSUISgmtnaQUyahf4n90eMU+gGJWQwPwIZFvMKg6bD4RW2XI4MdKmviKx8skU
4sDlDxMFrVMfV/ySwiPDAONWPtwwgKllIt0IDDnKs6kPdDlUcbKOTEFYhzZ1HhTZ
7Og4rJm7M90QpdMU7hmxmE5KRkp1hW0Yce1KPTW5U0j9yl9zbi4eLVWT+ac1WnZs
GpItajd0BFtBy7DRHzX8RiRJ4pi+aWxhuYNqiSxUm0BqPWCzT7PP15M1kCGwrXtm
/TTSVJl7WkLbOYI0PE0Y0XcJfZUg1c9aecCR3ubmRrQrGfOBFpN01jUANIRwqvZ3
3QRq1RZNLac0+zlBPjoFdOHmoaVX6UcJQKSgOhcfuM1BcNFnXZEcHFN4/SaEUfvJ
1ltybEeOEAckCMqqfHb1g/nVfJnlBjy811GzIrsHXqKqb7rRfGkfxmBxLrRzVknS
PC970pbuhxICeeryKdVgK5BClWeT3TB1srt6OZ0QR1zlcfZbLZ8jqJlHJcy3szFi
P43X9w8I6ZNTzkBU+lsCt9gbveYS+rSaJ+zm/SaF21ro33+FEdZ+p1ujjzp729Tz
PnKobaOrku38Be7CSwJ760WvngC7gbZqGybGknBsws4dqDXJste0UjxulZeyaOkN
nVmHDL45jc5rd8qjoPQV
=kV1a
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2015-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
Another pull request for the next cycle, this time with quite
a bit of content:
* mesh fixes/improvements from Alexis, Bob, Chun-Yeow and Jesse
* TDLS higher bandwidth support (Arik)
* OCB fixes from Bertold Van den Bergh
* suspend/resume fixes from Eliad
* dynamic SMPS support for minstrel-HT (Krishna Chaitanya)
* VHT bitrate mask support (Lorenzo Bianconi)
* better regulatory support for 5/10 MHz channels (Matthias May)
* basic support for MU-MIMO to avoid the multi-vif issue (Sara Sharon)
along with a number of other cleanups.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2015-08-17
1) Fix IPv6 ECN decapsulation for IPsec interfamily tunnels.
From Thomas Egerer.
2) Use kmemdup instead of duplicating it in xfrm_dump_sa().
From Andrzej Hajda.
3) Pass oif to the xfrm lookups so that it gets set on the flow
and the resolver routines can match based on oif.
From David Ahern.
4) Add documentation for the new xfrm garbage collector threshold.
From Alexander Duyck.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently inet_addr_type and inet_dev_addr_type expect local addresses
to be in the local table. With the VRF device local routes for devices
associated with a VRF will be in the table associated with the VRF.
Provide an alternate inet_addr lookup to use a specific table rather
than defaulting to the local table.
inet_addr_type_dev_table keeps the same semantics as inet_addr_type but
if the passed in device is enslaved to a VRF then the table for that VRF
is used for the lookup.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently inet_addr_type and inet_dev_addr_type expect local addresses
to be in the local table. With the VRF device local routes for devices
associated with a VRF will be in the table associated with the VRF.
Provide an alternate inet_addr lookup to use a specific table rather
than defaulting to the local table.
Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As with ingress use the index of VRF master device for route lookups on
egress. However, the oif should only be used to direct the lookups to a
specific table. Routes in the table are not based on the VRF device but
rather interfaces that are part of the VRF so do not consider the oif for
lookups within the table. The FLOWI_FLAG_VRFSRC is used to control this
latter part.
Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a VRF_MASTER flag for interfaces and helper functions for determining
if a device is a VRF_MASTER.
Add link attribute for passing VRF_TABLE id.
Add vrf_ptr to netdevice.
Add various macros for determining if a device is a VRF device, the index
of the master VRF device and table associated with VRF device.
Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add new functions in DSA drivers to access hardware VLAN entries through
SWITCHDEV_OBJ_PORT_VLAN objects:
- port_pvid_get() and vlan_getnext() to dump a VLAN
- port_vlan_del() to exclude a port from a VLAN
- port_pvid_set() and port_vlan_add() to join a port to a VLAN
The DSA infrastructure will ensure that each VLAN of the given range
does not already belong to another bridge. If it does, it will fallback
to software VLAN and won't program the hardware.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The iwlwifi driver was the only driver that used this, but as
it turns out it never needed it, so we can remove it.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This patch introduced the 6lowpan netdev private data struct. We name it
lowpan_priv and it's placed at the beginning of netdev private data. All
lowpan interfaces should allocate this room at first of netdev private
data. 6LoWPAN LL private data can be allocate by additional netdev private
data, e.g. dev->priv_size should be "sizeof(struct lowpan_priv) +
sizeof(LL_LOWPAN_PRIVATE_DATA)".
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch adds an ndm_state member to the switchdev_obj_fdb structure,
in order to support static FDB addresses.
Set Rocker ndm_state to NUD_REACHABLE.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change the prototype of port_getnext to include a vid parameter.
This is necessary to introduce the support for VLAN.
Also rename the fdb_{add,del,getnext} function pointers to
port_fdb_{add,del,getnext} since they are specific to a given port.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rules can be installed that direct route lookups to specific tables based
on oif. Plumb the oif through the xfrm lookups so it gets set in the flow
struct and passed to the resolver routines.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This patch replaces the zone id which is pushed down into functions
with the actual zone object. It's a bigger one-time change, but
needed for later on extending zones with a direction parameter, and
thus decoupling this additional information from all call-sites.
No functional changes in this patch.
The default zone becomes a global const object, namely nf_ct_zone_dflt
and will be returned directly in various cases, one being, when there's
f.e. no zoning support.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Support for sharing GREPROTO_CISCO port was added so that
OVS gre port and kernel GRE devices can co-exist. After
flow-based tunneling patches OVS GRE protocol processing
is completely moved to ip_gre module. so there is no need
for GRE protocol hook. Following patch consolidates
GRE protocol related functions into ip_gre module.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using GRE tunnel meta data collection feature, we can implement
OVS GRE vport. This patch removes all of the OVS
specific GRE code and make OVS use a ip_gre net_device.
Minimal GRE vport is kept to handle compatibility with
current userspace application.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following patch create new tunnel flag which enable
tunnel metadata collection on given device.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an explicit neighbour table overflow message (ratelimited) and
statistic to make diagnosing neighbour table overflows tractable in
the wild.
Diagnosing a neighbour table overflow can be quite difficult in the wild
because there is no explicit dmesg logged. Callers to neighbour code
seem to use net_dbg_ratelimit when the neighbour call fails which means
the "base message" is not emitted and the callback suppressed messages
from the ratelimiting can end-up juxtaposed with unrelated messages.
Further, a forced garbage collection will increment a stat on each call
whether it was successful in freeing-up a table entry or not, so that
statistic is only a hint. So, add a net_info_ratelimited message and
explicit statistic to the neighbour code.
Signed-off-by: Rick Jones <rick.jones2@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, when trying to connect to already paired device that just
rotated its RPA MAC address, old address would be used and connection
would fail. In order to fix that, kernel must scan and receive
advertisement with fresh RPA before connecting.
This patch adds hci_connect_le_scan with dependencies, new method that
will be used to connect to remote LE devices. Instead of just sending
connect request, it adds a device to whitelist. Later patches will make
use of this whitelist to send conenct request when advertisement is
received, and properly handle timeouts.
Signed-off-by: Jakub Pawlowski <jpawlowski@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch adds hci_lookup_le_connect method, that will be used to check
wether outgoing le connection attempt is in progress.
Signed-off-by: Jakub Pawlowski <jpawlowski@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Currently, when trying to connect to already paired device that just
rotated its RPA MAC address, old address would be used and connection
would fail. In order to fix that, kernel must scan and receive
advertisement with fresh RPA before connecting.
This patch adds some fields to hci_conn_params, in preparation to new
connect procedure.
explicit_connect will be used to override any current auto_connect action,
and connect to device when ad is received.
HCI_AUTO_CONN_EXPLICIT was added to auto_connect enum. When this value
will be used, explicit connect is the only action, and params can be
removed after successful connection.
HCI_CONN_SCANNING is added to hci_conn flags. When it's set, connect is
scan phase. It gets cleared when advertisement is received, and
HCI_OP_LE_CREATE_CONN is sent.
Signed-off-by: Jakub Pawlowski <jpawlowski@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch introduce a new mib entry which isn't part of 802.15.4 but
useful as default behaviour to set the ack request bit or not if we
don't know if the ack request bit should set. This is currently used for
stacks like IEEE 802.15.4 6LoWPAN.
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
We currently supports multiple lowpan interfaces per wpan interface. I
never saw any use case into such functionality. We drop this feature now
because it's much easier do deal with address changes inside the under
laying wpan interface.
This patch removes the multiple lowpan interface and adds a lowpan_dev
netdev pointer into the wpan_dev, if this pointer isn't null the wpan
interface belongs to the assigned lowpan interface.
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Tested-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Remove the fdb_{add,del,getnext} function pointer in favor of new
port_fdb_{add,del,getnext}.
Implement the switchdev_port_obj_{add,del,dump} functions in DSA to
support the SWITCHDEV_OBJ_PORT_FDB objects.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a is_static boolean to the switchdev_obj_fdb structure,
in order to set the ndm_state to either NUD_NOARP or NUD_REACHABLE.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The address in the switchdev_obj_fdb structure is currently represented
as a pointer. Replacing it for a 6-byte array allows switchdev to carry
addresses directly read from hardware registers, not stored by the
switch chip driver (as in Rocker).
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IFLA_VXLAN_FLOWBASED is useless without IFLA_VXLAN_COLLECT_METADATA,
so combine them into single IFLA_VXLAN_COLLECT_METADATA flag.
'flowbased' doesn't convey real meaning of the vxlan tunnel mode.
This mode can be used by routing, tc+bpf and ovs.
Only ovs is strictly flow based, so 'collect metadata' is a better
name for this tunnel mode.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Move the nfnl_acct_list into the network namespace, initialize
and destroy it per namespace
- Keep track of refcnt on nfacct objects, the old logic does not
longer work with a per namespace list
- Adjust xt_nfacct to pass the namespace when registring objects
Signed-off-by: Andreas Schultz <aschultz@tpip.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This new expression uses the nf_dup engine to clone packets to a given gateway.
Unlike xt_TEE, we use an index to indicate output interface which should be
fine at this stage.
Moreover, change to the preemtion-safe this_cpu_read(nf_skb_duplicated) from
nf_dup_ipv{4,6} to silence a lockdep splat.
Based on the original tee expression from Arturo Borrero Gonzalez, although
this patch has diverted quite a bit from this initial effort due to the
change to support maps.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Extracted from the xtables TEE target. This creates two new modules for IPv4
and IPv6 that are shared between the TEE target and the new nf_tables dup
expressions.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next, they are:
1) A couple of cleanups for the netfilter core hook from Eric Biederman.
2) Net namespace hook registration, also from Eric. This adds a dependency with
the rtnl_lock. This should be fine by now but we have to keep an eye on this
because if we ever get the per-subsys nfnl_lock before rtnl we have may
problems in the future. But we have room to remove this in the future by
propagating the complexity to the clients, by registering hooks for the init
netns functions.
3) Update nf_tables to use the new net namespace hook infrastructure, also from
Eric.
4) Three patches to refine and to address problems from the new net namespace
hook infrastructure.
5) Switch to alternate jumpstack in xtables iff the packet is reentering. This
only applies to a very special case, the TEE target, but Eric Dumazet
reports that this is slowing down things for everyone else. So let's only
switch to the alternate jumpstack if the tee target is in used through a
static key. This batch also comes with offline precalculation of the
jumpstack based on the callchain depth. From Florian Westphal.
6) Minimal SCTP multihoming support for our conntrack helper, from Michal
Kubecek.
7) Reduce nf_bridge_info per skbuff scratchpad area to 32 bytes, from Florian
Westphal.
8) Fix several checkpatch errors in bridge netfilter, from Bernhard Thaler.
9) Get rid of useless debug message in ip6t_REJECT, from Subash Abhinov.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
arch/s390/net/bpf_jit_comp.c
drivers/net/ethernet/ti/netcp_ethss.c
net/bridge/br_multicast.c
net/ipv4/ip_fragment.c
All four conflicts were cases of simple overlapping
changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Initialize auto_flowlabels to one. This enables automatic flow labels,
individual socket may disable them using the IPV6_AUTOFLOWLABEL socket
option.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change the meaning of net.ipv6.auto_flowlabels to provide a mode for
automatic flow labels generation. There are four modes:
0: flow labels are disabled
1: flow labels are enabled, sockets can opt-out
2: flow labels are allowed, sockets can opt-in
3: flow labels are enabled and enforced, no opt-out for sockets
np->autoflowlabel is initialized according to the sysctl value.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can't call skb_get_hash here since the packet is not complete to do
flow_dissector. Create hash based on flowi6 instead.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds net argument to ipv6_stub_impl.ipv6_dst_lookup
for use cases where sk is not available (like mpls).
sk appears to be needed to get the namespace 'net' and is optional
otherwise. This patch series changes ipv6_stub_impl.ipv6_dst_lookup
to take net argument. sk remains optional.
All callers of ipv6_stub_impl.ipv6_dst_lookup have been modified
to pass net. I have modified them to use already available
'net' in the scope of the call. I can change them to
sock_net(sk) to avoid any unintended change in behaviour if sock
namespace is different. They dont seem to be from code inspection.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce helpers to let eBPF programs attached to TC manipulate tunnel metadata:
bpf_skb_[gs]et_tunnel_key(skb, key, size, flags)
skb: pointer to skb
key: pointer to 'struct bpf_tunnel_key'
size: size of 'struct bpf_tunnel_key'
flags: room for future extensions
First eBPF program that uses these helpers will allocate per_cpu
metadata_dst structures that will be used on TX.
On RX metadata_dst is allocated by tunnel driver.
Typical usage for TX:
struct bpf_tunnel_key tkey;
... populate tkey ...
bpf_skb_set_tunnel_key(skb, &tkey, sizeof(tkey), 0);
bpf_clone_redirect(skb, vxlan_dev_ifindex, 0);
RX:
struct bpf_tunnel_key tkey = {};
bpf_skb_get_tunnel_key(skb, &tkey, sizeof(tkey), 0);
... lookup or redirect based on tkey ...
'struct bpf_tunnel_key' will be extended in the future by adding
elements to the end and the 'size' argument will indicate which fields
are populated, thereby keeping backwards compatibility.
The 'flags' argument may be used as well when the 'size' is not enough or
to indicate completely different layout of bpf_tunnel_key.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2015-07-30
Here's a set of Bluetooth & 802.15.4 patches intended for the 4.3 kernel.
- Cleanups & fixes to mac802154
- Refactoring of Intel Bluetooth HCI driver
- Various coding style fixes to Bluetooth HCI drivers
- Support for Intel Lightning Peak Bluetooth devices
- Generic class code in interface descriptor in btusb to match more HW
- Refactoring of Bluetooth HS code together with a new config option
- Support for BCM4330B1 Broadcom UART controller
Let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 55334a5db5 ("net_sched: act: refuse to remove bound action
outside"), we end up with a wrong reference count for a tc action.
Test case 1:
FOO="1,6 0 0 4294967295,"
BAR="1,6 0 0 4294967294,"
tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 \
action bpf bytecode "$FOO"
tc actions show action bpf
action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
index 1 ref 1 bind 1
tc actions replace action bpf bytecode "$BAR" index 1
tc actions show action bpf
action order 0: bpf bytecode '1,6 0 0 4294967294' default-action pipe
index 1 ref 2 bind 1
tc actions replace action bpf bytecode "$FOO" index 1
tc actions show action bpf
action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
index 1 ref 3 bind 1
Test case 2:
FOO="1,6 0 0 4294967295,"
tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 action ok
tc actions show action gact
action order 0: gact action pass
random type none pass val 0
index 1 ref 1 bind 1
tc actions add action drop index 1
RTNETLINK answers: File exists [...]
tc actions show action gact
action order 0: gact action pass
random type none pass val 0
index 1 ref 2 bind 1
tc actions add action drop index 1
RTNETLINK answers: File exists [...]
tc actions show action gact
action order 0: gact action pass
random type none pass val 0
index 1 ref 3 bind 1
What happens is that in tcf_hash_check(), we check tcf_common for a given
index and increase tcfc_refcnt and conditionally tcfc_bindcnt when we've
found an existing action. Now there are the following cases:
1) We do a late binding of an action. In that case, we leave the
tcfc_refcnt/tcfc_bindcnt increased and are done with the ->init()
handler. This is correctly handeled.
2) We replace the given action, or we try to add one without replacing
and find out that the action at a specific index already exists
(thus, we go out with error in that case).
In case of 2), we have to undo the reference count increase from
tcf_hash_check() in the tcf_hash_check() function. Currently, we fail to
do so because of the 'tcfc_bindcnt > 0' check which bails out early with
an -EPERM error.
Now, while commit 55334a5db5 prevents 'tc actions del action ...' on an
already classifier-bound action to drop the reference count (which could
then become negative, wrap around etc), this restriction only accounts for
invocations outside a specific action's ->init() handler.
One possible solution would be to add a flag thus we possibly trigger
the -EPERM ony in situations where it is indeed relevant.
After the patch, above test cases have correct reference count again.
Fixes: 55334a5db5 ("net_sched: act: refuse to remove bound action outside")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Any external user should use the registration API instead of
accessing this directly.
Cc: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a connection is failing a transport protocol calls
dst_negative_advice to try to get a better route. This patch includes
changing the sk_txhash in that function. This provides a rudimentary
method to try to find a different path in the network since sk_txhash
affects ECMP on the local host and through the network (via flow labels
or UDP source port in encapsulation).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch creates sk_set_txhash and eliminates protocol specific
inet_set_txhash and ip6_set_txhash. sk_set_txhash simply sets a
random number instead of performing flow dissection. sk_set_txash
is also allowed to be called multiple times for the same socket,
we'll need this when redoing the hash for negative routing advice.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, tcp_recvmsg enters a busy loop in sk_wait_data if called
with flags = MSG_WAITALL | MSG_PEEK.
sk_wait_data waits for sk_receive_queue not empty, but in this case,
the receive queue is not empty, but does not contain any skb that we
can use.
Add a "last skb seen on receive queue" argument to sk_wait_data, so
that it sleeps until the receive queue has new skbs.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=99461
Link: https://sourceware.org/bugzilla/show_bug.cgi?id=18493
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1205258
Reported-by: Enrico Scholz <rh-bugzilla@ensc.de>
Reported-by: Dan Searle <dan@censornet.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
num_grat_arp wasn't converted to the new bonding option API, so do this
now and remove the specific sysfs store option in order to use the
standard one. num_grat_arp is the same as num_unsol_na so add it as an
alias with the same option settings. An important difference is the option
name which is matched in bond_sysfs_store_option().
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It saves some lines and simplify a bit the code when the state is returning
by this function. It's also useful to handle a NULL entry.
To avoid too long lines, I've also renamed lwtunnel_state_get() and
lwtunnel_state_put() to lwtstate_get() and lwtstate_put().
CC: Thomas Graf <tgraf@suug.ch>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can simply remove the INET_FRAG_EVICTED flag to avoid all the flags
race conditions with the evictor and use a participation test for the
evictor list, when we're at that point (after inet_frag_kill) in the
timer there're 2 possible cases:
1. The evictor added the entry to its evictor list while the timer was
waiting for the chainlock
or
2. The timer unchained the entry and the evictor won't see it
In both cases we should be able to see list_evictor correctly due
to the sync on the chainlock.
Joint work with Florian Westphal.
Tested-by: Frank Schreuder <fschreuder@transip.nl>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Followup patch will call it after inet_frag_queue was freed, so q->net
doesn't work anymore (but netf = q->net; free(q); mem_limit(netf) would).
Tested-by: Frank Schreuder <fschreuder@transip.nl>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 65ba1f1ec0 ("inet: frags: fix a race between inet_evict_bucket
and inet_frag_kill") describes the bug, but the fix doesn't work reliably.
Problem is that ->flags member can be set on other cpu without chainlock
being held by that task, i.e. the RMW-Cycle can clear INET_FRAG_EVICTED
bit after we put the element on the evictor private list.
We can crash when walking the 'private' evictor list since an element can
be deleted from list underneath the evictor.
Join work with Nikolay Alexandrov.
Fixes: b13d3cbfb8 ("inet: frag: move eviction of queues to work queue")
Reported-by: Johan Schuijt <johan@transip.nl>
Tested-by: Frank Schreuder <fschreuder@transip.nl>
Signed-off-by: Nikolay Alexandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains ten Netfilter/IPVS fixes, they are:
1) Address refcount leak when creating an expectation from the ctnetlink
interface.
2) Fix bug splat in the IDLETIMER target related to sysfs, from Dmitry
Torokhov.
3) Resolve panic for unreachable route in IPVS with locally generated
traffic in the output path, from Alex Gartrell.
4) Fix wrong source address in rare cases for tunneled traffic in IPVS,
from Julian Anastasov.
5) Fix crash if scheduler is changed via ipvsadm -E, again from Julian.
6) Make sure skb->sk is unset for forwarded traffic through IPVS, again from
Alex Gartrell.
7) Fix crash with IPVS sync protocol v0 and FTP, from Julian.
8) Reset sender cpu for forwarded traffic in IPVS, also from Julian.
9) Allocate template conntracks through kmalloc() to resolve netns dependency
problems with the conntrack kmem_cache.
10) Fix zones with expectations that clash using the same tuple, from Joe
Stringer.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
fib_select_default considers alternative routes only when
res->fi is for the first alias in res->fa_head. In the
common case this can happen only when the initial lookup
matches the first alias with highest TOS value. This
prevents the alternative routes to require specific TOS.
This patch solves the problem as follows:
- routes that require specific TOS should be returned by
fib_select_default only when TOS matches, as already done
in fib_table_lookup. This rule implies that depending on the
TOS we can have many different lists of alternative gateways
and we have to keep the last used gateway (fa_default) in first
alias for the TOS instead of using single tb_default value.
- as the aliases are ordered by many keys (TOS desc,
fib_priority asc), we restrict the possible results to
routes with matching TOS and lowest metric (fib_priority)
and routes that match any TOS, again with lowest metric.
For example, packet with TOS 8 can not use gw3 (not lowest
metric), gw4 (different TOS) and gw6 (not lowest metric),
all other gateways can be used:
tos 8 via gw1 metric 2 <--- res->fa_head and res->fi
tos 8 via gw2 metric 2
tos 8 via gw3 metric 3
tos 4 via gw4
tos 0 via gw5
tos 0 via gw6 metric 1
Reported-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Slave latency range has been changed in Core Spec. 4.2 by Erratum 5419
of ESR08_V1.0.0. And it should be applied to Core Spec. 4.0 and 4.1.
Before:
connSlaveLatency <= ((connSupervisionTimeout / connIntervalMax) - 1)
After:
connSlaveLatency <= ((connSupervisionTimeout / (connIntervalMax*2)) - 1)
This patch makes hci_check_conn_params() check the allowable slave
latency range using the changed way.
Signed-off-by: Seungyoun Ju <sy39.ju@samsung.com>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Add a timeout to prevent the do while loop running in an
infinite loop. This ensures that the channel will be
instructed to close within 10 seconds so prevents
l2cap_sock_shutdown() getting stuck forever.
Returns -ENOLINK when the timeout is reached. The channel
will be subequently closed and not all data will be ACK'ed.
Signed-off-by: Dean Jenkins <Dean_Jenkins@mentor.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Use msecs_to_jiffies() instead of using HZ so that it
is easier to specify the time in milliseconds.
Also add a #define L2CAP_WAIT_ACK_POLL_PERIOD to specify the 200ms
polling period so that it is defined in a single place.
Signed-off-by: Dean Jenkins <Dean_Jenkins@mentor.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Right now there are no other users for ieee802154_rx()
in kernel. So lets remove EXPORT_SYMBOL() for this.
Also it moves the function prototype from global header
file to local header file.
Signed-off-by: Varka Bhadram <varkabhadram@gmail.com>
Acked-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch help to implement suspend/resume in mac802154, these
hooks will be run before the device is suspended and after it
resumes.
Signed-off-by: Varka Bhadram <varkab@cdac.in>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Convert the module_init() to a invocation from inet_init() since
ip_tunnel_core is part of the INET built-in.
Fixes: 3093fbe7ff ("route: Per route IP tunnel metadata via lightweight tunnel")
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/bridge/br_mdb.c
br_mdb.c conflict was a function call being removed to fix a bug in
'net' but whose signature was changed in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Account for the configuration FIB_RULES=y && INET=n as FIB_RULES can
be selected by IPV6 or DECNET without INET.
Fixes: e7030878fc ("fib: Add fib rule match on tunnel id")
Fixes: 3093fbe7ff ("route: Per route IP tunnel metadata via lightweight tunnel")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The sk_classid member is only required when CONFIG_CGROUP_NET_CLASSID is
enabled. #ifdefify it to reduce the size of struct sock on 32 bit
systems, at least.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This gets rid of all OVS specific VXLAN code in the receive and
transmit path by using a VXLAN net_device to represent the vport.
Only a small shim layer remains which takes care of handling the
VXLAN specific OVS Netlink configuration.
Unexports vxlan_sock_add(), vxlan_sock_release(), vxlan_xmit_skb()
since they are no longer needed.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This factors out the device configuration out of the RTNL newlink
API which allows for in-kernel creation of VXLAN net_devices.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This add the ability to select a routing table based on the tunnel
id which allows to maintain separate routing tables for each virtual
tunnel network.
ip rule add from all tunnel-id 100 lookup 100
ip rule add from all tunnel-id 200 lookup 200
A new static key controls the collection of metadata at tunnel level
upon demand.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This introduces a new IP tunnel lightweight tunnel type which allows
to specify IP tunnel instructions per route. Only IPv4 is supported
at this point.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new flowi_tunnel structure which is a subset of ip_tunnel_key to
allow routes to match on tunnel metadata. For now, the tunnel id is
added to flowi_tunnel which allows for routes to be bound to specific
virtual tunnels.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allows putting a VXLAN device into a new flow-based mode in which
skbs with a ip_tunnel_info dst metadata attached will be encapsulated
according to the instructions stored in there with the VXLAN device
defaults taken into consideration.
Similar on the receive side, if the VXLAN_F_COLLECT_METADATA flag is
set, the packet processing will populate a ip_tunnel_info struct for
each packet received and attach it to the skb using the new metadata
dst. The metadata structure will contain the outer header and tunnel
header fields which have been stripped off. Layers further up in the
stack such as routing, tc or netfitler can later match on these fields
and perform forwarding. It is the responsibility of upper layers to
ensure that the flag is set if the metadata is needed. The flag limits
the additional cost of metadata collecting based on demand.
This prepares the VXLAN device to be steered by the routing and other
subsystems which allows to support encapsulation for a large number
of tunnel endpoints and tunnel ids through a single net_device which
improves the scalability.
It also allows for OVS to leverage this mode which in turn allows for
the removal of the OVS specific VXLAN code.
Because the skb is currently scrubed in vxlan_rcv(), the attachment of
the new dst metadata is postponed until after scrubing which requires
the temporary addition of a new member to vxlan_metadata. This member
is removed again in a later commit after the indirect VXLAN receive API
has been removed.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduces a new dst_metadata which enables to carry per packet metadata
between forwarding and processing elements via the skb->dst pointer.
The structure is set up to be a union. Thus, each separate type of
metadata requires its own dst instance. If demand arises to carry
multiple types of metadata concurrently, metadata dst entries can be
made stackable.
The metadata dst entry is refcnt'ed as expected for now but a non
reference counted use is possible if the reference is forced before
queueing the skb.
In order to allow allocating dsts with variable length, the existing
dst_alloc() is split into a dst_alloc() and dst_init() function. The
existing dst_init() function to initialize the subsystem is being
renamed to dst_subsys_init() to make it clear what is what.
The check before ip_route_input() is changed to ignore metadata dsts
and drop the dst inside the routing function thus allowing to interpret
metadata in a later commit.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename the tunnel metadata data structures currently internal to
OVS and make them generic for use by all IP tunnels.
Both structures are kernel internal and will stay that way. Their
members are exposed to user space through individual Netlink
attributes by OVS. It will therefore be possible to extend/modify
these structures without affecting user ABI.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This implementation uses lwtunnel infrastructure to register
hooks for mpls tunnel encaps.
It picks cues from iptunnel_encaps infrastructure and previous
mpls iptunnel RFC patches from Eric W. Biederman and Robert Shearman
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces lwtunnel_output function to call corresponding
lwtunnels output function to xmit the packet.
It adds two variants lwtunnel_output and lwtunnel_output6 for ipv4 and
ipv6 respectively today. But this is subject to change when lwtstate will
reside in dst or dst_metadata (as per upstream discussions).
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support in ipv6 fib functions to parse Netlink
RTA encap attributes and attach encap state data to rt6_info.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support in ipv4 fib functions to parse user
provided encap attributes and attach encap state data to fib_nh
and rtable.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Provides infrastructure to parse/dump/store encap information for
light weight tunnels like mpls. Encap information for such tunnels
is associated with fib routes.
This infrastructure is based on previous suggestions from
Eric Biederman to follow the xfrm infrastructure.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1. Arik introduced an rtnl-locked regulatory API to be able
to differentiate between place do/don't have the RTNL;
this fixes missing locking in some of the code paths
2. Two small mesh bugfixes from Bob, one to avoid treating
a certain malformed over-the-air frame and one to avoid
sending a garbage field over the air.
3. A fix for powersave during WoWLAN suspend from Krishna Chaitanya.
4. A fix for a powersave vs. aggregation teardown race, from Michal.
5. Thomas reduced the loglevel of CRDA messages to avoid spamming
the kernel log with mostly irrelevant information.
6. Tom fixed a dangling debugfs directory pointer that could cause
crashes if subsequent addition of the same interface to debugfs
failed for some reason.
7. A fix from myself for a list corruption issue in mac80211 during
combined interface shutdown/removal - shut down interfaces first
and only then remove them to avoid that.
-----BEGIN PGP SIGNATURE-----
iQIcBAABCAAGBQJVqQMwAAoJEDBSmw7B7bqrZzkQAIjMKojlJRouN/N/aF7ym2pC
eAboLMC+XHubQoq2H01k5ZOSrLL1kElhkB7pLas+q00gTFyavLzEcEiFqCNuLwPH
lQEwLXTDUeiaVWekOYJev/ONtaDdwUXoB4BPAA3Ih4EAk9fEBtcWiWeLDgOLOS8P
eYVqcMV733cOTjhYImEQnhnm3qrcwSCF1vTOJaN4Gf/qqw6j2ilq5wU1TvPyh0TA
1EP5Elb9hy1sud5X6shrsOBqkBrPoO1p3V4EeoHkxl8welqxXdqGvmA3K0sloGZT
7RiL8PD4QVyISy1NrBDnNMRRgj6BD1aLC+clmECmmgYvGGcqbzLtB3CWUCV6oQmb
TC4NmgJkKNVTvQnoqxQEL8JYSs/E2ITRKyMi3sfIYAyz1dVuQf1RkZZzB22rQWT2
PaLq/k+vpS7E3OD3XB53flB/k7Y6j/OwJb/rE7i2vqSn3kcbua8H7dzd7p+AE5FA
ZF//u2GBDgZeMaA9BvifByWy2+yvAEcD5/U9XkWqJ7t+HohKteLJj/scHT89pto3
n0NZ7RVRMNQ9mz14UJiVnFOL/81AjmiU123S5UIIMkmVE5Zrn7VTZlN6fVY4Fcsh
AtxHQesOlCw8T4lFLxgyKkEl7bxATQ2OMR6vWsZQraRHSqIuK8JDABRokIlzoFn/
xC/Yn1vTaBuj+2nif/F0
=US5Y
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2015-07-17' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Some fixes for the current cycle:
1. Arik introduced an rtnl-locked regulatory API to be able
to differentiate between place do/don't have the RTNL;
this fixes missing locking in some of the code paths
2. Two small mesh bugfixes from Bob, one to avoid treating
a certain malformed over-the-air frame and one to avoid
sending a garbage field over the air.
3. A fix for powersave during WoWLAN suspend from Krishna Chaitanya.
4. A fix for a powersave vs. aggregation teardown race, from Michal.
5. Thomas reduced the loglevel of CRDA messages to avoid spamming
the kernel log with mostly irrelevant information.
6. Tom fixed a dangling debugfs directory pointer that could cause
crashes if subsequent addition of the same interface to debugfs
failed for some reason.
7. A fix from myself for a list corruption issue in mac80211 during
combined interface shutdown/removal - shut down interfaces first
and only then remove them to avoid that.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
skb->offload_fwd_mark and dev->offload_fwd_mark are 32-bit and should be
unique for device and may even be unique for a sub-set of ports within
device, so add switchdev helper function to generate unique marks based on
port's switch ID and group_ifindex. group_ifindex would typically be the
container dev's ifindex, such as the bridge's ifindex.
The generator uses a global hash table to store offload_fwd_marks hashed by
{switch ID, group_ifindex} key.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Split out retrieving the cgroups net_cls classid retrieval into its
own function, so that it can be reused later on from other parts of
the traffic control subsystem. If there's no skb->sk, then the small
helper returns 0 as well, which in cls_cgroup terms means 'could not
classify'.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Quoting Daniel Borkmann:
"When adding connection tracking template rules to a netns, f.e. to
configure netfilter zones, the kernel will endlessly busy-loop as soon
as we try to delete the given netns in case there's at least one
template present, which is problematic i.e. if there is such bravery that
the priviledged user inside the netns is assumed untrusted.
Minimal example:
ip netns add foo
ip netns exec foo iptables -t raw -A PREROUTING -d 1.2.3.4 -j CT --zone 1
ip netns del foo
What happens is that when nf_ct_iterate_cleanup() is being called from
nf_conntrack_cleanup_net_list() for a provided netns, we always end up
with a net->ct.count > 0 and thus jump back to i_see_dead_people. We
don't get a soft-lockup as we still have a schedule() point, but the
serving CPU spins on 100% from that point onwards.
Since templates are normally allocated with nf_conntrack_alloc(), we
also bump net->ct.count. The issue why they are not yet nf_ct_put() is
because the per netns .exit() handler from x_tables (which would eventually
invoke xt_CT's xt_ct_tg_destroy() that drops reference on info->ct) is
called in the dependency chain at a *later* point in time than the per
netns .exit() handler for the connection tracker.
This is clearly a chicken'n'egg problem: after the connection tracker
.exit() handler, we've teared down all the connection tracking
infrastructure already, so rightfully, xt_ct_tg_destroy() cannot be
invoked at a later point in time during the netns cleanup, as that would
lead to a use-after-free. At the same time, we cannot make x_tables depend
on the connection tracker module, so that the xt_ct_tg_destroy() would
be invoked earlier in the cleanup chain."
Daniel confirms this has to do with the order in which modules are loaded or
having compiled nf_conntrack as modules while x_tables built-in. So we have no
guarantees regarding the order in which netns callbacks are executed.
Fix this by allocating the templates through kmalloc() from the respective
SYNPROXY and CT targets, so they don't depend on the conntrack kmem cache.
Then, release then via nf_ct_tmpl_free() from destroy_conntrack(). This branch
is marked as unlikely since conntrack templates are rarely allocated and only
from the configuration plane path.
Note that templates are not kept in any list to avoid further dependencies with
nf_conntrack anymore, thus, the tmpl larval list is removed.
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tested-by: Daniel Borkmann <daniel@iogearbox.net>
This callback is currently not allowed to sleep, which makes it more
difficult to implement proper driver methods in mac80211 than it has
to be. Instead of doing asynchronous work here in mac80211, make it
possible for the callback to sleep by doing some asynchronous work
in cfg80211. This also enables improvements to other drivers, like
ath6kl, that would like to sleep in this callback.
While at it, also fix the code to call the driver on the implicit
unregistration when an interface is removed, and do that also when
a P2P-Device wdev is destroyed (otherwise we leak the structs.)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Some drivers may need to store data per key, for example for PN
validation. Allow this by adding a pointer to the struct that
the driver can assign.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Allow a device to specify support for the TDLS wider-bandwidth feature.
Indicate this support during TDLS setup in the ext-capab IE and set an
appropriate station flag when our TDLS peer supports it.
This feature gives TDLS peers the ability to use a wider channel than
the base width of the BSS. For instance VHT capable TDLS peers connected
on a 20MHz channel can extend the channel to 80MHz, if regulatory
considerations allow it.
Do not cap the bandwidth of such stations by the current BSS channel width
in mac80211.
Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When there are multiple RX queues, the PN checks in mac80211 cannot be
used since packets might be processed out of order on different CPUs.
Allow the driver to report that the PN has been checked, drivers that
will use multi-queue RX will have to set this flag.
For now, the flag is only valid when the frame has been decrypted, in
theory that restriction doesn't have to be there, but in practice the
hardware will have decrypted the frame already.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
As there's no driver using this capability and reporting zero-length
A-MPDU subframes for radiotap monitoring, remove the capability to
free up two RX flags.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When introducing multiple RX queues, a single NAPI struct will not
be sufficient. Instead of trying to store multiple, simply change
the API to have the NAPI struct passed to the RX function. This of
course means that drivers using rx_irqsafe() cannot use NAPI, but
that seems a reasonable trade-off, particularly since only two of
all drivers are currently using it at all.
While at it, we can now remove the IEEE80211_RX_REORDER_TIMER flag
again since this code path cannot have a napi struct anyway.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The RTNL is required to check for IR-relaxation conditions that allow
more channels to beacon. Export an RTNL locked version of reg_can_beacon
and use it where possible in AP/STA interface type flows, where
IR-relaxation may be applicable.
Fixes: 06f207fc54 ("cfg80211: change GO_CONCURRENT to IR_CONCURRENT for STA")
Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
ip6_datagram_connect() is doing a lot of socket changes without
socket being locked.
This looks wrong, at least for udp_lib_rehash() which could corrupt
lists because of concurrent udp_sk(sk)->udp_portaddr_hash accesses.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Add a new set of functions for registering and unregistering per
network namespace hooks.
- Modify the old global namespace hook functions to use the per
network namespace hooks in their implementation, so their remains a
single list that needs to be walked for any hook (this is important
for keeping the hook priority working and for keeping the code
walking the hooks simple).
- Only allow registering the per netdevice hooks in the network
namespace where the network device lives.
- Dynamically allocate the structures in the per network namespace
hook list in nf_register_net_hook, and unregister them in
nf_unregister_net_hook.
Dynamic allocate is required somewhere as the number of network
namespaces are not fixed so we might as well allocate them in the
registration function.
The chain of registered hooks on any list is expected to be small so
the cost of walking that list to find the entry we are unregistering
should also be small.
Performing the management of the dynamically allocated list entries
in the registration and unregistration functions keeps the complexity
from spreading.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Add support to allow non-local binds similar to how this was done for IPv4.
Non-local binds are very useful in emulating the Internet in a box, etc.
This add the ip_nonlocal_bind sysctl under ipv6.
Testing:
Set up nonlocal binding and receive routing on a host, e.g.:
ip -6 rule add from ::/0 iif eth0 lookup 200
ip -6 route add local 2001:0:0:1::/64 dev lo proto kernel scope host table 200
sysctl -w net.ipv6.ip_nonlocal_bind=1
Set up routing to 2001:0:0:1::/64 on peer to go to first host
ping6 -I 2001:0:0:1::1 peer-address -- to verify
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet_twsk_deschedule() calls are followed by inet_twsk_put().
Only particular case is in inet_twsk_purge() but there is no point
to defer the inet_twsk_put() after re-enabling BH.
Lets rename inet_twsk_deschedule() to inet_twsk_deschedule_put()
and move the inet_twsk_put() inside.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
timewait sockets have a complex refcounting logic.
Once we realize it should be similar to established and
syn_recv sockets, we can use sk_nulls_del_node_init_rcu()
and remove inet_twsk_unhash()
In particular, deferred inet_twsk_put() added in commit
13475a30b6 ("tcp: connect() race with timewait reuse")
looks unecessary : When removing a timewait socket from
ehash or bhash, caller must own a reference on the socket
anyway.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kernel will crash the same if one of the pointer is NULL anyway.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the original design slow start is only used to raise cwnd
when cwnd is stricly below ssthresh. It makes little sense
to slow start when cwnd == ssthresh: especially
when hystart has set ssthresh in the initial ramp, or after
recovery when cwnd resets to ssthresh. Not doing so will
also help reduce the buffer bloat slightly.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper to test the slow start condition in various congestion
control modules and other places. This is to prepare a slight improvement
in policy as to exactly when to slow start.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Like act_gact, act_mirred can be lockless in packet processing
1) Use percpu stats
2) update lastuse only every clock tick to avoid false sharing
3) use rcu to protect tcfm_dev
4) Remove spinlock usage, as it is no longer needed.
Next step : add multi queue capability to ifb device
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Final step for gact RCU operation :
1) Use percpu stats
2) update lastuse only every clock tick to avoid false sharing
3) Remove spinlock acquisition, as it is no longer needed.
Since this is the last contended lock in packet RX when tc gact is used,
this gives impressive gain.
My host with 8 RX queues was handling 5 Mpps before the patch,
and more than 11 Mpps after patch.
Tested:
On receiver :
dev=eth0
tc qdisc del dev $dev ingress 2>/dev/null
tc qdisc add dev $dev ingress
tc filter del dev $dev root pref 10 2>/dev/null
tc filter del dev $dev pref 10 2>/dev/null
tc filter add dev $dev est 1sec 4sec parent ffff: protocol ip prio 1 \
u32 match ip src 7.0.0.0/8 flowid 1:15 action drop
Sender sends packets flood from 7/8 network
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Second step for gact RCU operation :
We want to get rid of the spinlock protecting gact operations.
Stats (packets/bytes) will soon be per cpu.
gact_determ() would not work without a central packet counter,
so lets add it for this mode.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reuse existing percpu infrastructure John Fastabend added for qdisc.
This patch adds a new cpustats parameter to tcf_hash_create() and all
actions pass false, meaning this patch should have no effect yet.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
qdisc_bstats_update_cpu() and other helpers were added to support
percpu stats for qdisc.
We want to add percpu stats for tc action, so this patch add common
helpers.
qdisc_bstats_update_cpu() is renamed to qdisc_bstats_cpu_update()
qdisc_qstats_drop_cpu() is renamed to qdisc_qstats_cpu_drop()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
1) Add TX fast path in mac80211, from Johannes Berg.
2) Add TSO/GRO support to ibmveth, from Thomas Falcon
3) Move away from cached routes in ipv6, just like ipv4, from Martin
KaFai Lau.
4) Lots of new rhashtable tests, from Thomas Graf.
5) Run ingress qdisc lockless, from Alexei Starovoitov.
6) Allow servers to fetch TCP packet headers for SYN packets of new
connections, for fingerprinting. From Eric Dumazet.
7) Add mode parameter to pktgen, for testing receive. From Alexei
Starovoitov.
8) Cache access optimizations via simplifications of build_skb(), from
Alexander Duyck.
9) Move page frag allocator under mm/, also from Alexander.
10) Add xmit_more support to hv_netvsc, from KY Srinivasan.
11) Add a counter guard in case we try to perform endless reclassify
loops in the packet scheduler.
12) Extern flow dissector to be programmable and use it in new "Flower"
classifier. From Jiri Pirko.
13) AF_PACKET fanout rollover fixes, performance improvements, and new
statistics. From Willem de Bruijn.
14) Add netdev driver for GENEVE tunnels, from John W Linville.
15) Add ingress netfilter hooks and filtering, from Pablo Neira Ayuso.
16) Fix handling of epoll edge triggers in TCP, from Eric Dumazet.
17) Add an ECN retry fallback for the initial TCP handshake, from Daniel
Borkmann.
18) Add tail call support to BPF, from Alexei Starovoitov.
19) Add several pktgen helper scripts, from Jesper Dangaard Brouer.
20) Add zerocopy support to AF_UNIX, from Hannes Frederic Sowa.
21) Favor even port numbers for allocation to connect() requests, and
odd port numbers for bind(0), in an effort to help avoid
ip_local_port_range exhaustion. From Eric Dumazet.
22) Add Cavium ThunderX driver, from Sunil Goutham.
23) Allow bpf programs to access skb_iif and dev->ifindex SKB metadata,
from Alexei Starovoitov.
24) Add support for T6 chips in cxgb4vf driver, from Hariprasad Shenai.
25) Double TCP Small Queues default to 256K to accomodate situations
like the XEN driver and wireless aggregation. From Wei Liu.
26) Add more entropy inputs to flow dissector, from Tom Herbert.
27) Add CDG congestion control algorithm to TCP, from Kenneth Klette
Jonassen.
28) Convert ipset over to RCU locking, from Jozsef Kadlecsik.
29) Track and act upon link status of ipv4 route nexthops, from Andy
Gospodarek.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1670 commits)
bridge: vlan: flush the dynamically learned entries on port vlan delete
bridge: multicast: add a comment to br_port_state_selection about blocking state
net: inet_diag: export IPV6_V6ONLY sockopt
stmmac: troubleshoot unexpected bits in des0 & des1
net: ipv4 sysctl option to ignore routes when nexthop link is down
net: track link-status of ipv4 nexthops
net: switchdev: ignore unsupported bridge flags
net: Cavium: Fix MAC address setting in shutdown state
drivers: net: xgene: fix for ACPI support without ACPI
ip: report the original address of ICMP messages
net/mlx5e: Prefetch skb data on RX
net/mlx5e: Pop cq outside mlx5e_get_cqe
net/mlx5e: Remove mlx5e_cq.sqrq back-pointer
net/mlx5e: Remove extra spaces
net/mlx5e: Avoid TX CQE generation if more xmit packets expected
net/mlx5e: Avoid redundant dev_kfree_skb() upon NOP completion
net/mlx5e: Remove re-assignment of wq type in mlx5e_enable_rq()
net/mlx5e: Use skb_shinfo(skb)->gso_segs rather than counting them
net/mlx5e: Static mapping of netdev priv resources to/from netdev TX queues
net/mlx4_en: Use HW counters for rx/tx bytes/packets in PF device
...
Conflicts:
drivers/net/ethernet/mellanox/mlx4/main.c
net/packet/af_packet.c
Both conflicts were cases of simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
This feature is only enabled with the new per-interface or ipv4 global
sysctls called 'ignore_routes_with_linkdown'.
net.ipv4.conf.all.ignore_routes_with_linkdown = 0
net.ipv4.conf.default.ignore_routes_with_linkdown = 0
net.ipv4.conf.lo.ignore_routes_with_linkdown = 0
...
When the above sysctls are set, will report to userspace that a route is
dead and will no longer resolve to this nexthop when performing a fib
lookup. This will signal to userspace that the route will not be
selected. The signalling of a RTNH_F_DEAD is only passed to userspace
if the sysctl is enabled and link is down. This was done as without it
the netlink listeners would have no idea whether or not a nexthop would
be selected. The kernel only sets RTNH_F_DEAD internally if the
interface has IFF_UP cleared.
With the new sysctl set, the following behavior can be observed
(interface p8p1 is link-down):
default via 10.0.5.2 dev p9p1
10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15
70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1
80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1 dead linkdown
90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1 dead linkdown
90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2
90.0.0.1 via 70.0.0.2 dev p7p1 src 70.0.0.1
cache
local 80.0.0.1 dev lo src 80.0.0.1
cache <local>
80.0.0.2 via 10.0.5.2 dev p9p1 src 10.0.5.15
cache
While the route does remain in the table (so it can be modified if
needed rather than being wiped away as it would be if IFF_UP was
cleared), the proper next-hop is chosen automatically when the link is
down. Now interface p8p1 is linked-up:
default via 10.0.5.2 dev p9p1
10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15
70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1
80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1
90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1
90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2
192.168.56.0/24 dev p2p1 proto kernel scope link src 192.168.56.2
90.0.0.1 via 80.0.0.2 dev p8p1 src 80.0.0.1
cache
local 80.0.0.1 dev lo src 80.0.0.1
cache <local>
80.0.0.2 dev p8p1 src 80.0.0.1
cache
and the output changes to what one would expect.
If the sysctl is not set, the following output would be expected when
p8p1 is down:
default via 10.0.5.2 dev p9p1
10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15
70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1
80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1 linkdown
90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1 linkdown
90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2
Since the dead flag does not appear, there should be no expectation that
the kernel would skip using this route due to link being down.
v2: Split kernel changes into 2 patches, this actually makes a
behavioral change if the sysctl is set. Also took suggestion from Alex
to simplify code by only checking sysctl during fib lookup and
suggestion from Scott to add a per-interface sysctl.
v3: Code clean-ups to make it more readable and efficient as well as a
reverse path check fix.
v4: Drop binary sysctl
v5: Whitespace fixups from Dave
v6: Style changes from Dave and checkpatch suggestions
v7: One more checkpatch fixup
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a fib flag called RTNH_F_LINKDOWN to any ipv4 nexthops that are
reachable via an interface where carrier is off. No action is taken,
but additional flags are passed to userspace to indicate carrier status.
This also includes a cleanup to fib_disable_ip to more clearly indicate
what event made the function call to replace the more cryptic force
option previously used.
v2: Split out kernel functionality into 2 patches, this patch simply
sets and clears new nexthop flag RTNH_F_LINKDOWN.
v3: Cleanups suggested by Alex as well as a bug noticed in
fib_sync_down_dev and fib_sync_up when multipath was not enabled.
v5: Whitespace and variable declaration fixups suggested by Dave.
v6: Style fixups noticed by Dave; ran checkpatch to be sure I got them
all.
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use vid_begin/end to be consistent with BRIDGE_VLAN_INFO_RANGE_BEGIN/END.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2015-06-18
Here's the final bluetooth-next pull request for 4.2.
- Cleanups & fixes to 802.15.4 code and related drivers
- Fix btusb driver memory leak
- New USB IDs for Atheros controllers
- Support for BCM4324B3 UART based Broadcom controller
- Fix for Bluetooth encryption key size handling
- Broadcom controller initialization fixes
- Support for Intel controller DDC parameters
- Support for multiple Bluetooth LE advertising instances
- Fix for HCI user channel cleanup path
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull crypto update from Herbert Xu:
"Here is the crypto update for 4.2:
API:
- Convert RNG interface to new style.
- New AEAD interface with one SG list for AD and plain/cipher text.
All external AEAD users have been converted.
- New asymmetric key interface (akcipher).
Algorithms:
- Chacha20, Poly1305 and RFC7539 support.
- New RSA implementation.
- Jitter RNG.
- DRBG is now seeded with both /dev/random and Jitter RNG. If kernel
pool isn't ready then DRBG will be reseeded when it is.
- DRBG is now the default crypto API RNG, replacing krng.
- 842 compression (previously part of powerpc nx driver).
Drivers:
- Accelerated SHA-512 for arm64.
- New Marvell CESA driver that supports DMA and more algorithms.
- Updated powerpc nx 842 support.
- Added support for SEC1 hardware to talitos"
* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (292 commits)
crypto: marvell/cesa - remove COMPILE_TEST dependency
crypto: algif_aead - Temporarily disable all AEAD algorithms
crypto: af_alg - Forbid the use internal algorithms
crypto: echainiv - Only hold RNG during initialisation
crypto: seqiv - Add compatibility support without RNG
crypto: eseqiv - Offer normal cipher functionality without RNG
crypto: chainiv - Offer normal cipher functionality without RNG
crypto: user - Add CRYPTO_MSG_DELRNG
crypto: user - Move cryptouser.h to uapi
crypto: rng - Do not free default RNG when it becomes unused
crypto: skcipher - Allow givencrypt to be NULL
crypto: sahara - propagate the error on clk_disable_unprepare() failure
crypto: rsa - fix invalid select for AKCIPHER
crypto: picoxcell - Update to the current clk API
crypto: nx - Check for bogus firmware properties
crypto: marvell/cesa - add DT bindings documentation
crypto: marvell/cesa - add support for Kirkwood and Dove SoCs
crypto: marvell/cesa - add support for Orion SoCs
crypto: marvell/cesa - add allhwsupport module parameter
crypto: marvell/cesa - add support for all armada SoCs
...
Struct inet_proto no longer exists, so update the
comment which is out of date.
Signed-off-by: Zhaowei Yuan <zhaowei.yuan@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge the mvebu/drivers branch of the arm-soc tree which contains
just a single patch bfa1ce5f38 ("bus:
mvebu-mbus: add mv_mbus_dram_info_nooverlap()") that happens to be
a prerequisite of the new marvell/cesa crypto driver.
This pulls the full hook netfilter definitions from all those that include
net_namespace.h.
Instead let's just include the bare minimum required in the new
linux/netfilter_defs.h file, and use it from the netfilter netns header files.
I also needed to include in.h and in6.h from linux/netfilter.h otherwise we hit
this compilation error:
In file included from include/linux/netfilter_defs.h:4:0,
from include/net/netns/netfilter.h:4,
from include/net/net_namespace.h:22,
from include/linux/netdevice.h:43,
from net/netfilter/nfnetlink_queue_core.c:23:
include/uapi/linux/netfilter.h:76:17: error: field ‘in’ has incomplete type struct in_addr in;
And also explicit include linux/netfilter.h in several spots.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
We don't need to pull the full definitions in that file, a simple forward
declaration is enough.
Moreover, include linux/procfs.h from nf_synproxy_core, otherwise this hits a
compilation error due to missing declarations, ie.
net/netfilter/nf_synproxy_core.c: In function ‘synproxy_proc_init’:
net/netfilter/nf_synproxy_core.c:326:2: error: implicit declaration of function ‘proc_create’ [-Werror=implicit-function-declaration]
if (!proc_create("synproxy", S_IRUGO, net->proc_net_stat,
^
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Include linux/idr.h and linux/skbuff.h since they are required by objects that
are declared in the net structure.
struct net {
...
struct idr netns_ids;
...
struct sk_buff_head wext_nlevents;
...
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Resolve compilation breakage when CONFIG_IPV6 is not set by moving the IPv6
code into a separated br_netfilter_ipv6.c file.
Fixes: efb6de9b4b ("netfilter: bridge: forward IPv6 fragmented packets")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Now that all preconditions are present for actual multi-advertising, the
number of allowed advertising instances can be larger than one. This
patch increases the number of allowed advertising instances to 5.
Signed-off-by: Florian Grandel <fgrandel@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Now that the obsolete adv_instance is no longer being referenced
anywhere in the code it can be removed without breaking the build.
Signed-off-by: Florian Grandel <fgrandel@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The add_advertising() and add_advertising_complete() functions reference
the now obsolete hdev->adv_instance struct. Both methods are being
refactored to access the dynamic advertising instance list instead.
This patch also introduces all logic necessary to actually deal with
multiple instance advertising. Notably the mgmt_adv_inst_expired() and
schedule_adv_inst() method are being referenced to schedule instances in
a round robin fashion.
This patch also introduces a "pending" flag into the adv_info struct.
This is necessary to identify and remove recently added advertising
instances when the HCI commands return with an error status code.
Otherwise new advertising instances could be leaked without properly
informing userspace about their existence.
Signed-off-by: Florian Grandel <fgrandel@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Currently the delayed work managing advertising duration and timeout is
part of the advertising instance structure. This is not correct as only
a single instance can be advertised at any given time. To implement
round robin advertising a single delayed work structure is needed.
To fix this the delayed work structure is being moved to the hci_dev
structure. The instance specific variable is renamed to "remaining_time"
to make it clear that this is the remaining lifetime of the instance and
not the current advertising timeout.
Signed-off-by: Florian Grandel <fgrandel@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The current hci dev structure only supports a single advertising
instance. To support multi-instance advertising it is necessary to
introduce a linked list of advertising instances so that multiple
advertising instances can be dynamically added and/or removed.
In a first step, the existing adv_instance member of the hci_dev
struct is supplemented by a linked list of advertising instances.
This patch introduces the list and supporting list management
infrastructure. The list is not being used yet.
Signed-off-by: Florian Grandel <fgrandel@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
These groups will contain socket-destruction events for
AF_INET/AF_INET6, IPPROTO_TCP/IPPROTO_UDP.
Near the end of socket destruction, a check for listeners is
performed. In the presence of a listener, rather than completely
cleanup the socket, a unit of work will be added to a private
work queue which will first broadcast information about the socket
and then finish the cleanup operation.
Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the NFC pull request for 4.2.
- NCI drivers can now define their own handlers for processing
proprietary NCI responses and notifications.
- NFC vendors can use a dedicated netlink API to send their own
proprietary commands, like e.g. all commands needed to implement
vendor specific manufacturing tools.
- A new generic NCI over UART driver against which any NCI chipset
running on top of a serial interface can register.
- The st21nfcb driver is renamed to st-nci as it can and will support
most of ST Microelectronics NCI chipsets.
- The st21nfcb driver can put its CLF in hibernate mode and save
significant amount of power.
- A few st21nfcb minor fixes.
- The NXP NCI driver now supports ACPI enumeration.
- The Marvell NCI driver now supports both USB and serial
physical interfaces.
- The Marvell NCI drivers also supports NCI frames being muxed
over HCI. This is a setting that can be defined by a DT property.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVe2k1AAoJEIqAPN1PVmxKt88P/11r4VVq57Kh4BHKTa6CAs5m
FBMmQdGlpU+O8VUjQLl7Y+GWoVTmDOUm/uKd6xWIvgBl4/X+UKwhNVldpsWXvw1t
cTnn0BykvvfA4FOQJBqgGTC38oC04REr4uTK3+NnjE6sBpQ86Ljfk7xarMKdDpKI
TKchY4sIOHIHXHVPVjW4fyEVF6pUduTenH73zWPkyKZgOQaSbgR75j12WMEI20kw
POykX9t6UcPDzcJ+doktUgfxhHB1YlML7Z5xNrIkbk4kyAj140Ds9mzEEBllyivc
xX1Cy6h3S+vrnx/CnNa85nA/y7cH0oK8NVQwmYicqKT/W7wASF7JZYLT6A7E81c1
zLGHWVEK0awS/KM+bLI3ixQVNWFDqYPM8R36Ag0NotUZJPKiHRQkyBU3VSS8XT2f
HRBlYgNYSKfR1m9LCPtm+sq6psP7cnvRAfzDrZuWtyqctriZKqCuOXbVAHJtb/3s
gHxMkfyTL1zRW7u/xGid15JiicNAJd2aQmkHzZJCnIgUyTrnvhSNX6sRjBZ6iQCf
z3+aTIaNEApOZ2qtfOrnSTtW0wjOBdpUK3hlSX5ozQ+V6dspoN0ld1JWURQPqkzu
jWwCONO29/9yuoc/qTbWPGL4avAqyCFEzhAk/Ux19jIqoXNzHVF9f4HICc1aRLTf
TKpcrQPYB61fU07CV9uT
=+IG2
-----END PGP SIGNATURE-----
Merge tag 'nfc-next-4.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next
Samuel Ortiz says:
====================
NFC 4.2 pull request
This is the NFC pull request for 4.2.
- NCI drivers can now define their own handlers for processing
proprietary NCI responses and notifications.
- NFC vendors can use a dedicated netlink API to send their own
proprietary commands, like e.g. all commands needed to implement
vendor specific manufacturing tools.
- A new generic NCI over UART driver against which any NCI chipset
running on top of a serial interface can register.
- The st21nfcb driver is renamed to st-nci as it can and will support
most of ST Microelectronics NCI chipsets.
- The st21nfcb driver can put its CLF in hibernate mode and save
significant amount of power.
- A few st21nfcb minor fixes.
- The NXP NCI driver now supports ACPI enumeration.
- The Marvell NCI driver now supports both USB and serial
physical interfaces.
- The Marvell NCI drivers also supports NCI frames being muxed
over HCI. This is a setting that can be defined by a DT property.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In case the net_device is gone, we have to unregister the hooks and put back
the reference on the net_device object. Once it comes back, register them
again. This also covers the device rename case.
This patch also adds a new flag to indicate that the basechain is disabled, so
their hooks are not registered. This flag is used by the netdev family to
handle the case where the net_device object is gone. Currently this flag is not
exposed to userspace.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The device is part of the hook configuration, so instead of a global
configuration per table, set it to each of the basechain that we create.
This patch reworks ebddf1a8d7 ("netfilter: nf_tables: allow to bind table to
net_device").
Note that this adds a dev_name field in the nft_base_chain structure which is
required the netdev notification subscription that follows up in a patch to
handle gone net_devices.
Suggested-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
->auto_asconf_splist is per namespace and mangled by functions like
sctp_setsockopt_auto_asconf() which doesn't guarantee any serialization.
Also, the call to inet_sk_copy_descendant() was backuping
->auto_asconf_list through the copy but was not honoring
->do_auto_asconf, which could lead to list corruption if it was
different between both sockets.
This commit thus fixes the list handling by using ->addr_wq_lock
spinlock to protect the list. A special handling is done upon socket
creation and destruction for that. Error handlig on sctp_init_sock()
will never return an error after having initialized asconf, so
sctp_destroy_sock() can be called without addrq_wq_lock. The lock now
will be take on sctp_close_sock(), before locking the socket, so we
don't do it in inverse order compared to sctp_addr_wq_timeout_handler().
Instead of taking the lock on sctp_sock_migrate() for copying and
restoring the list values, it's preferred to avoid rewritting it by
implementing sctp_copy_descendant().
Issue was found with a test application that kept flipping sysctl
default_auto_asconf on and off, but one could trigger it by issuing
simultaneous setsockopt() calls on multiple sockets or by
creating/destroying sockets fast enough. This is only triggerable
locally.
Fixes: 9f7d653b67 ("sctp: Add Auto-ASCONF support (core).")
Reported-by: Ji Jianwen <jiji@redhat.com>
Suggested-by: Neil Horman <nhorman@tuxdriver.com>
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes commits ("mac802154: cleanup ieee802154 hardware flags")
bcbfd2078d and ("mac802154: cleanup address
filtering flags") 6b70a43c7e by starting
the flags definitions at BIT(0) which is same like the previous behaviour.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch changes the setting of llsec param flag with the BIT macro.
Signed-off-by: Varka Bhadram <varkab@cdac.in>
Acked-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Since Bluetooth 3.0 there's a HCI command available for reading the
encryption key size of an BR/EDR connection. This information is
essential e.g. for generating an LTK using SMP over BR/EDR, so store
it as part of struct hci_conn.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
In commit cd7d8498c9 ("tcp: change tcp_skb_pcount() location") we stored
gso_segs in a temporary cache hot location.
This patch does the same for gso_size.
This allows to save 2 cache line misses in tcp xmit path for
the last packet that is considered but not sent because of
various conditions (cwnd, tso defer, receiver window, TSQ...)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some NFC controller supports UART as host interface.
As with SPI, a lot of code can be shared between vendor
drivers. This patch add the generic support of UART and
provides some extension API for vendor specific needs.
This code is strongly inspired by the Bluetooth HCI ldisc
implementation. NCI UART vendor drivers will have to register
themselves to this layer via nci_uart_register.
Underlying tty will have to be configured from user land
thanks to an ioctl.
Signed-off-by: Vincent Cuissard <cuissard@marvell.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
* mesh fixes from Alexis Green and Chun-Yeow Yeoh,
* a documentation fix from Jakub Kicinski,
* a missing channel release (from Michal Kazior),
* a fix for a signal strength reporting bug (from Sara Sharon),
* handle deauth while associating (myself),
* don't report mangled TX SKB back to userspace for status (myself),
* handle aggregation session timeouts properly in fast-xmit (myself)
However, there are also a few cleanups and one big change that
affects all drivers (and that required me to pull in your tree)
to change the mac80211 HW flags to use an unsigned long bitmap
so that we can extend them more easily - we're running out of
flags even with a cleanup to remove the two unused ones.
-----BEGIN PGP SIGNATURE-----
iQIcBAABCAAGBQJVeEQ1AAoJEDBSmw7B7bqrebcP/3v7I2ZXAeHag2W4hdD4YH6W
tuKfs3JKW3GDh84l2AJs2JBpFxR6Tk0Z7zGKrPLzkBTkkJkSLgKuUKR0+YQU6PYH
VfZ2NkdIHEqouLgMWxGGlp6suqp2yYD9tiIUroICXZ6aFm5trQuZgzv5ePI+lhmX
cWUYCawE2tcpVdg0NJsFExeCJhw81e/Bet1LCGHo0asWNpIK7phMdltzD7e4tgQS
4q475FCIkWxbxKgJRrRkz8J7grsjK1wf2W3acOxKMaoVBeqJVW5BWDrTgo0aDPts
qQ8n8t1s9o/jKQIvaz3RyjkQgX8T4vCMqkouLF4jJOThKIsUSi3Fvm9oKcMg4YhA
Ju5QWfbCBFhpLZeBzWzKyePTnDru1XDFFVdIATLONKTVg1modzFAs3j5gb4Z3Wtg
VYLoLWWpRtHKd9pzfZMhyWq64Xb8C+qlyQHr4r4QRm9ADz0Jq+OCh0rTFt+/bncM
CHxnf0VS9hEOFk0+TxFqi2yXOnv2uMgcN+jnGkEs4QuLfv9ML1Eb23ZjDoHxd1uq
1Yd4R8IDEY/KU6UJMwksz+gV/ekoB32eAhw56pxehgAMuZL4OgNvmeAQHx7Jq9it
0/OfAK2BSNH8odqYQbpg89C8keqSInMwUhFyRhyMJAWSKiPRHypsDBWxMKGJIssI
3mB4d/go+RP1AvZnazeF
=2wTw
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2015-06-10' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
For this round we mostly have fixes:
* mesh fixes from Alexis Green and Chun-Yeow Yeoh,
* a documentation fix from Jakub Kicinski,
* a missing channel release (from Michal Kazior),
* a fix for a signal strength reporting bug (from Sara Sharon),
* handle deauth while associating (myself),
* don't report mangled TX SKB back to userspace for status (myself),
* handle aggregation session timeouts properly in fast-xmit (myself)
However, there are also a few cleanups and one big change that
affects all drivers (and that required me to pull in your tree)
to change the mac80211 HW flags to use an unsigned long bitmap
so that we can extend them more easily - we're running out of
flags even with a cleanup to remove the two unused ones.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
SCM_SECURITY was originally only implemented for datagram sockets,
not for stream sockets. However, SCM_CREDENTIALS is supported on
Unix stream sockets. For consistency, implement Unix stream support
for SCM_SECURITY as well. Also clean up the existing code and get
rid of the superfluous UNIXSID macro.
Motivated by https://bugzilla.redhat.com/show_bug.cgi?id=1224211,
where systemd was using SCM_CREDENTIALS and assumed wrongly that
SCM_SECURITY was also supported on Unix stream sockets.
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As we're running out of hardware capability flags pretty quickly,
convert them to use the regular test_bit() style unsigned long
bitmaps.
This introduces a number of helper functions/macros to set and to
test the bits, along with new debugfs code.
The occurrences of an explicit __clear_bit() are intentional, the
drivers were never supposed to change their supported bits on the
fly. We should investigate changing this to be a per-frame flag.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Merge back net-next to get wireless driver changes (from Kalle)
to be able to create the API change across all trees properly.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
FIF_PROMISC_IN_BSS was removed in commit df1404650c
("mac80211: remove support for IFF_PROMISC").
Signed-off-by: Jakub Kicinski <kubakici@wp.pl>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
SCO/eSCO link is supported by BR/EDR controller, it is
suitable to move them under BT_BREDR config option
Signed-off-by: Arron Wang <arron.wang@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The return value of l2cap_recv_acldata() and sco_recv_scodata()
are not used, then change it to return void
Signed-off-by: Arron Wang <arron.wang@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The encryption key size for LTKs is supposed to be applied only at the
moment of encryption. When generating a Link Key (using LE SC) from
the LTK the full non-shortened value should be used. This patch
modifies the code to always keep the full value around and only apply
the key size when passing the value to HCI.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Together with inline routines to associate a vendor commands
array with an NFC device.
Vendor commands allow vendors to implement their very specific
operations from driver code instead of adding new stack ops
for non NFC generic commands.
Vendors need to select their own unique IDs and use that as a
namespace for defining sub commands.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Handle allowing to send proprietary nci commands anywhere in the nci
state machine.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Some device may need to execute some proprietary commands
in order to "wake-up"; Before the nci state initialization.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Allow for drivers to explicitly define handlers for each
proprietary notifications and responses they expect to support.
Reviewed-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
IPv4 and IPv6 share same implementation of get_cookie_sock(),
and there is no point inlining it.
We add tcp_ prefix to the common helper name and export it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To indicate if it's a coordinator or not a bool is enough. There should
no more values available which represent some other state.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Reviewed-by: Varka Bhadram <varkabhadram@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch removes the priv attribute in ieee802154_hw to the right
section which is commented by attributes which needs to be filled by
driver layer.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Reviewed-by: Varka Bhadram <varkabhadram@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch removed an attribute from ieee802154_hw structure which is
never used inside kernel. Address information are stored inside wpan_dev
nowadays.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Reviewed-by: Varka Bhadram <varkabhadram@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch changes the ieee802154 hardware flags to enums and setting the
flag values with the BIT macro. Additional this patch changes the
commenting style for matching usual kernel style.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Reviewed-by: Varka Bhadram <varkabhadram@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch removes the hardware auto acknowdledge flag which indicates
that the transceiver supports this handling. This flag is never
evaluated inside mac802154 and all transceivers should support this
handling by default per hardware.
Suggested-by: Lennert Buytenhek <buytenh@wantstofly.org>
Cc: Alan Ott <alan@signal11.us>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Reviewed-by: Varka Bhadram <varkabhadram@gmail.com>
Acked-by: Stefan Schmidt <stefan@osg.samsung.com>
Acked-by: Varka Bhadram <varkabhadram@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch changes the address filtering flags to enums and setting the
flag values with the BIT macro. Additional this patch changes the
commenting style for matching usual kernel style.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Reviewed-by: Varka Bhadram <varkabhadram@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch removes the virtual interface structure from sub if data
struct, because it isn't used anywhere. This structure could be useful
for give per interface information at softmac driver layer. Nevertheless
there exist no use case currently and it contains the interface type
information currently. This information is also stored inside wpan dev
which is now used to check on the wpan dev interface type.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Reviewed-by: Varka Bhadram <varkabhadram@gmail.com>
Acked-by: Varka Bhadram <varkabhadram@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
When an application needs to force a source IP on an active TCP socket
it has to use bind(IP, port=x).
As most applications do not want to deal with already used ports, x is
often set to 0, meaning the kernel is in charge to find an available
port.
But kernel does not know yet if this socket is going to be a listener or
be connected.
It has very limited choices (no full knowledge of final 4-tuple for a
connect())
With limited ephemeral port range (about 32K ports), it is very easy to
fill the space.
This patch adds a new SOL_IP socket option, asking kernel to ignore
the 0 port provided by application in bind(IP, port=0) and only
remember the given IP address.
The port will be automatically chosen at connect() time, in a way
that allows sharing a source port as long as the 4-tuples are unique.
This new feature is available for both IPv4 and IPv6 (Thanks Neal)
Tested:
Wrote a test program and checked its behavior on IPv4 and IPv6.
strace(1) shows sequences of bind(IP=127.0.0.2, port=0) followed by
connect().
Also getsockname() show that the port is still 0 right after bind()
but properly allocated after connect().
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
setsockopt(5, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
connect(5, {sa_family=AF_INET, sin_port=htons(53174), sin_addr=inet_addr("127.0.0.3")}, 16) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(38050), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
IPv6 test :
socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 7
setsockopt(7, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
bind(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
getsockname(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
connect(7, {sa_family=AF_INET6, sin6_port=htons(57300), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
getsockname(7, {sa_family=AF_INET6, sin6_port=htons(60964), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
I was able to bind()/connect() a million concurrent IPv4 sockets,
instead of ~32000 before patch.
lpaa23:~# ulimit -n 1000010
lpaa23:~# ./bind --connect --num-flows=1000000 &
1000000 sockets
lpaa23:~# grep TCP /proc/net/sockstat
TCP: inuse 2000063 orphan 0 tw 47 alloc 2000157 mem 66
Check that a given source port is indeed used by many different
connections :
lpaa23:~# ss -t src :40000 | head -10
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 127.0.0.2:40000 127.0.202.33:44983
ESTAB 0 0 127.0.0.2:40000 127.2.27.240:44983
ESTAB 0 0 127.0.0.2:40000 127.2.98.5:44983
ESTAB 0 0 127.0.0.2:40000 127.0.124.196:44983
ESTAB 0 0 127.0.0.2:40000 127.2.139.38:44983
ESTAB 0 0 127.0.0.2:40000 127.1.59.80:44983
ESTAB 0 0 127.0.0.2:40000 127.3.6.228:44983
ESTAB 0 0 127.0.0.2:40000 127.0.38.53:44983
ESTAB 0 0 127.0.0.2:40000 127.1.197.10:44983
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In flow dissector if an MPLS header contains an entropy label this is
saved in the new keyid field of flow_keys. The entropy label is
then represented in the flow hash function input.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In flow dissector if a GRE header contains a keyid this is saved in the
new keyid field of flow_keys. The GRE keyid is then represented
in the flow hash function input.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In flow_dissector set the flow label in flow_keys for IPv6. This also
removes the shortcircuiting of flow dissection when a non-zero label
is present, the flow label can be considered to provide additional
entropy for a hash.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In flow_dissector set vlan_id in flow_keys when VLAN is found.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We don't need to return the IPv6 address hash as part of flow keys.
In general, using the IPv6 address hash is risky in a hash value
since the underlying use of xor provides no entropy. If someone
really needs the hash value they can get it from the full IPv6
addresses in flow keys (e.g. from flow_get_u32_src).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds full IPv6 addresses into flow_keys and uses them as
input to the flow hash function. The implementation supports either
IPv4 or IPv6 addresses in a union, and selector is used to determine
how may words to input to jhash2.
We also add flow_get_u32_dst and flow_get_u32_src functions which are
used to get a u32 representation of the source and destination
addresses. For IPv6, ipv6_addr_hash is called. These functions retain
getting the legacy values of src and dst in flow_keys.
With this patch, Ethertype and IP protocol are now included in the
flow hash input.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes flow hashing to use jhash2 over the flow_keys
structure instead just doing jhash_3words over src, dst, and ports.
This method will allow us take more input into the hashing function
so that we can include full IPv6 addresses, VLAN, flow labels etc.
without needing to resort to xor'ing which makes for a poor hash.
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch will export the supported commands by the devices
to the userspace. This will be useful to check if HardMAC
drivers can support a specific command or not.
Signed-off-by: Varka Bhadram <varkab@cdac.in>
Acked-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The naming convention is to always have the flags prefixed with
IEEE80211_HW_ so they're 'namespaced', make this flag follow it.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There are no drivers setting IEEE80211_HW_2GHZ_SHORT_SLOT_INCAPABLE
or IEEE80211_HW_2GHZ_SHORT_PREAMBLE_INCAPABLE, so any code using the
two flags is dead; it's also exceedingly unlikely that any new driver
could ever need to set these flags.
The wcn36xx code is almost certainly broken, but this preserves the
previous behaviour.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Even if the pointers are really only accessible to root and used
pretty much only by wpa_supplicant, this is still not great; even
for debugging it'd be easier to have something that's easier to
read and guaranteed to never get reused.
With the recent change to make mac80211 create an ack_skb for the
mgmt-tx path this becomes possible, only the client probe method
needs to also allocate an ack_skb, and we can store the cookie in
that skb.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
For drivers supporting TSO or similar features, but that still have
PN assignment in software, there's a need to have some memory to
store the current PN value. As mac80211 already stores this and it's
somewhat complicated to add a per-driver area to the key struct (due
to the dynamic sizing thereof) it makes sense to just move the TX PN
to the keyconf, i.e. the public part of the key struct.
As TKIP is more complicated and we won't able to offload it in this
way right now (fast-xmit is skipped for TKIP unless the HW does it
all, and our hardware needs MMIC calculation in software) I've not
moved that for now - it's possible but requires exposing a lot of
the internal TKIP state.
As an bonus side effect, we can remove a lot of code by assuming the
keyseq struct has a certain layout - with BUILD_BUG_ON to verify it.
This might also improve performance, since now TX and RX no longer
share a cacheline.
Reviewed-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Conflicts:
drivers/net/phy/amd-xgbe-phy.c
drivers/net/wireless/iwlwifi/Kconfig
include/net/mac80211.h
iwlwifi/Kconfig and mac80211.h were both trivial overlapping
changes.
The drivers/net/phy/amd-xgbe-phy.c file got removed in 'net-next' and
the bug fix that happened on the 'net' side is already integrated
into the rest of the amd-xgbe driver.
Signed-off-by: David S. Miller <davem@davemloft.net>
Linux 3.17 and earlier are explicitly engineered so that if the app
doesn't specifically request a CC module on a listener before the SYN
arrives, then the child gets the system default CC when the connection
is established. See tcp_init_congestion_control() in 3.17 or earlier,
which says "if no choice made yet assign the current value set as
default". The change ("net: tcp: assign tcp cong_ops when tcp sk is
created") altered these semantics, so that children got their parent
listener's congestion control even if the system default had changed
after the listener was created.
This commit returns to those original semantics from 3.17 and earlier,
since they are the original semantics from 2007 in 4d4d3d1e8 ("[TCP]:
Congestion control initialization."), and some Linux congestion
control workflows depend on that.
In summary, if a listener socket specifically sets TCP_CONGESTION to
"x", or the route locks the CC module to "x", then the child gets
"x". Otherwise the child gets current system default from
net.ipv4.tcp_congestion_control. That's the behavior in 3.17 and
earlier, and this commit restores that.
Fixes: 55d8694fa8 ("net: tcp: assign tcp cong_ops when tcp sk is created")
Cc: Florian Westphal <fw@strlen.de>
Cc: Daniel Borkmann <dborkman@redhat.com>
Cc: Glenn Judd <glenn.judd@morganstanley.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
more things for -next:
* disconnect TDLS stations on CSA to avoid issues
* fix a memory leak introduced in a recent commit
* switch rfkill and cfg80211 to PM ops
* in an unlikely scenario, prevent a bookkeeping
value to get corrupted leading to dropped packets
* fix a crash in VLAN assignment
* switch rfkill-gpio to more modern gpiod API
* send disconnected event to userspace with proper
local/remote indication
-----BEGIN PGP SIGNATURE-----
iQIcBAABCAAGBQJVaEyAAAoJEDBSmw7B7bqrPiIQAKOrX4g2UNtyoTWJzA7YRu+g
GEUu/CE4LQKCodCpBiEhFlhQo2WzXsHoLj5+Nr56aFAZx19VZjXWVC5JS785wYn5
r8hpOVWUUA3MVnXeL/+yz4chm0wTYN9pSpElZ4FHlUI0OkCMh2rPCTvdrbSKoGzV
MN8NEO0jVE89AgOMF8gHk5YKpJ6B4QibZuUuZpgkqdwIi5udaCcrPFFrUg/NfRpA
nTauP6blFUPOUV0sxbhS78uC3rqGQuYsnvab/QeGc9PDKk5ukrXzFdgRCVZq8224
Ge0JcPzwzWldk892oEJoc2OfGkg5HOil9HtC+S2ehBGuK0yEXOBIkO1ZgudTH1kC
0rLOPWVKRzTWE+sq+gWK/OjfaA7Dl6HFYYHRQ2dhm1XkqtAw8SwGQMDSIPJYWr4O
jp4gYpwKVjnMmsEAg7FdKWyIiTgLyI07VnIciORXDyefddYMuofXI2pJkfzUeFeH
HjCVYm2NYXDty6uneP4RC1nUbNc53FKJ5O9fW3BPMyVXD4pTjam50p9H6N7OcDN3
k3dEevWiVgvBjZPVc3HI8RaCzS/Ww1ym+MYgV97QkMfgiuE2VkiFwK+zhWn9axbc
eutkzFEdDcIACCZ74hIWqMJjsMnZm9E11Uq7tifAE0bi1Wpku1xPAnxMPnI+0eiF
Dgo2bmlQ/d1dHr3N3FC0
=KmwY
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2015-05-29' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
As we get closer to the merge window, here are a few
more things for -next:
* disconnect TDLS stations on CSA to avoid issues
* fix a memory leak introduced in a recent commit
* switch rfkill and cfg80211 to PM ops
* in an unlikely scenario, prevent a bookkeeping
value to get corrupted leading to dropped packets
* fix a crash in VLAN assignment
* switch rfkill-gpio to more modern gpiod API
* send disconnected event to userspace with proper
local/remote indication
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next, they are:
1) default CONFIG_NETFILTER_INGRESS to y for easier compile-testing of all
options.
2) Allow to bind a table to net_device. This introduces the internal
NFT_AF_NEEDS_DEV flag to perform a mandatory check for this binding.
This is required by the next patch.
3) Add the 'netdev' table family, this new table allows you to create ingress
filter basechains. This provides access to the existing nf_tables features
from ingress.
4) Kill unused argument from compat_find_calc_{match,target} in ip_tables
and ip6_tables, from Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
I'd already sent the same fix for -next, but Ben Hutchings
noted it's necessary in 4.1.
-----BEGIN PGP SIGNATURE-----
iQIcBAABCAAGBQJVZwzBAAoJEDBSmw7B7bqrH4YQAMZazIdDKxZNtprjT1WdB+qJ
v82zJJUXLzNf7XPYACmwVQpyJq+uG9GHnvIr1sAU5JNtkHgSJQdRrhhICty0Y5dS
U6dqb/5wvE3oQfPZNYYtEbtUP18j9z34nr18vWhJ3yra1ZWHjk1JJylv8Wy+TVdR
iHQH/VVs8ACcytBQa7my5642ISWUABtOwOGJ0+cWltxbBUGpzGojXskJwclQ47TW
bRqS6vXX27RCtfIKwq5BCdryM5uAyWIDwefRbsl1FEE6ergLUYTRQaaNV93kNCN2
gvitY0xlK2nYHUOodNL03bifg6nnafBkellUTszi4Xfi6dP8WjEtXvpHb2ZHOMpV
fiiYYtN3Uc6/XiW3UUufMGKmRUu2Fhl/4tIoubcmhaurbPXT1yMkoe0V6ZidKZvU
N0sLeuCNnR0T9w2Ml23qUSQhHILT6VoaeW1SqTXBAAdlOcqTXuchU6RbOFRgqJyF
C7VvKE+8+5Rjeu2orgj7bwOQN9cZh8MC1kVhTjjs/jm21CUKRt7pUSqBxmN0WihE
lkfEqohIwti6v/bHOWpB+RnhWKeV1YX4G0ZMIwmn8tP0807hd9b9ov2F8CBK4e+g
ylb4Y45WyzuR4cYGY5kndvV0xl8k7XYiFsUj1QnylFPuhaTtuZ5BtlgDq7CKMDTm
AgkH8YAhLnWvzpTbeDPT
=gYtq
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2015-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
This just has a single docbook build fix. In my confusion
I'd already sent the same fix for -next, but Ben Hutchings
noted it's necessary in 4.1.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2015-05-28
Here's a set of patches intended for 4.2. The majority of the changes
are on the 802.15.4 side of things rather than Bluetooth related:
- All sorts of cleanups & fixes to ieee802154 and related drivers
- Rework of tx power support in ieee802154 and its drivers
- Support for setting ieee802154 tx power through nl802154
- New IDs for the btusb driver
- Various cleanups & smaller fixes to btusb
- New btrtl driver for Realtec devices
- Fix suspend/resume for Realtek devices
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
A couple of enums in mac80211.h became structures recently, but the
comments didn't follow suit, leading to errors like:
Error(.//include/net/mac80211.h:367): Cannot parse enum!
Documentation/DocBook/Makefile:93: recipe for target 'Documentation/DocBook/80211.xml' failed
make[1]: *** [Documentation/DocBook/80211.xml] Error 1
Makefile:1361: recipe for target 'mandocs' failed
make: *** [mandocs] Error 2
Fix the comments comments accordingly. Added a couple of other small
comment fixes while I was there to silence other recently-added docbook
warnings.
Reported-by: Jim Davis <jim.epost@gmail.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This patch adds IV generator information to xfrm_state. This
is currently obtained from our own list of algorithm descriptions.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch adds IV generator information for each AEAD and block
cipher to xfrm_algo_desc. This will be used to access the new
AEAD interface.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
After commit 07f4c90062 ("tcp/dccp: try to not exhaust
ip_local_port_range in connect()") it is advised to have an even number
of ports described in /proc/sys/net/ipv4/ip_local_port_range
This means start/end values should have a different parity.
Let's warn sysadmins of this, so that they can update their settings
if they want to.
Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sctp_v4_map_v6 was subtly writing and reading from members
of a union in a way the clobbered data it needed to read before
it read it.
Zeroing the v6 flowinfo overwrites the v4 sin_addr with 0, meaning
that every place that calls sctp_v4_map_v6 gets ::ffff:0.0.0.0 as the
result.
Reorder things to guarantee correct behaviour no matter what the
union layout is.
This impacts user space clients that open an IPv6 SCTP socket and
receive IPv4 connections. Prior to 299ee user space would see a
sockaddr with AF_INET and a correct address, after 299ee the sockaddr
is AF_INET6, but the address is wrong.
Fixes: 299ee123e1 (sctp: Fixup v4mapped behaviour to comply with Sock API)
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for setting the current cca ed level value over
nl802154.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Reviewed-by: Varka Bhadram <varkabhadram@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
We currently always send fragments without DF bit set.
Thus, given following setup:
mtu1500 - mtu1500:1400 - mtu1400:1280 - mtu1280
A R1 R2 B
Where R1 and R2 run linux with netfilter defragmentation/conntrack
enabled, then if Host A sent a fragmented packet _with_ DF set to B, R1
will respond with icmp too big error if one of these fragments exceeded
1400 bytes.
However, if R1 receives fragment sizes 1200 and 100, it would
forward the reassembled packet without refragmenting, i.e.
R2 will send an icmp error in response to a packet that was never sent,
citing mtu that the original sender never exceeded.
The other minor issue is that a refragmentation on R1 will conceal the
MTU of R2-B since refragmentation does not set DF bit on the fragments.
This modifies ip_fragment so that we track largest fragment size seen
both for DF and non-DF packets, and set frag_max_size to the largest
value.
If the DF fragment size is larger or equal to the non-df one, we will
consider the packet a path mtu probe:
We set DF bit on the reassembled skb and also tag it with a new IPCB flag
to force refragmentation even if skb fits outdev mtu.
We will also set DF bit on each fragment in this case.
Joint work with Hannes Frederic Sowa.
Reported-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds transmission power setting support for IEEE-802.15.4
devices via nl802154.
Signed-off-by: Varka Bhadram <varkab@cdac.in>
Acked-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
If tcp ehash table is constrained to a very small number of buckets
(eg boot parameter thash_entries=128), then we can crash if spinlock
array has more entries.
While we are at it, un-inline inet_ehash_locks_alloc() and make
following changes :
- Budget 2 cache lines per cpu worth of 'spinlocks'
- Try to kmalloc() the array to avoid extra TLB pressure.
(Most servers at Google allocate 8192 bytes for this hash table)
- Get rid of various #ifdef
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As there doesn't seem to be a definition of it or any users of it.
Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org>
Acked-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This allows us to create netdev tables that contain ingress chains. Use
skb_header_pointer() as we may see shared sk_buffs at this stage.
This change provides access to the existing nf_tables features from the ingress
hook.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds the internal NFT_AF_NEEDS_DEV flag to indicate that you must
attach this table to a net_device.
This change is required by the follow up patch that introduces the new netdev
table.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When we disconnect from the AP, drivers call cfg80211_disconnect().
This doesn't know whether the disconnection was initiated locally
or by the AP though, which can cause problems with the supplicant,
for example with WPS. This issue obviously doesn't show up with any
mac80211 based driver since mac80211 doesn't call this function.
Fix this by requiring drivers to indicate whether the disconnect is
locally generated or not. I've tried to update the drivers, but may
not have gotten the values correct, and some drivers may currently
not be able to report correct values. In case of doubt I left it at
false, which is the current behaviour.
For libertas, make adjustments as indicated by Dan Williams.
Reported-by: Matthieu Mauger <matthieux.mauger@intel.com>
Tested-by: Matthieu Mauger <matthieux.mauger@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
ipv6_select_ident() returns a 32bit value in network order.
Fixes: 286c2349f6 ("ipv6: Clean up ipv6_select_ident() and ip6_fragment()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the patch
'ipv6: Only create RTF_CACHE routes after encountering pmtu exception',
we need to compensate the performance hit (bouncing dst->__refcnt).
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch keeps track of the DST_NOCACHE routes in a list and replaces its
dev with loopback during the iface down/unregister event.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch always creates RTF_CACHE clone with DST_NOCACHE
when FLOWI_FLAG_KNOWN_NH is set so that the rt6i_dst is set to
the fl6->daddr.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Tested-by: Julian Anastasov <ja@ssi.bg>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of doing the rt6->rt6i_node check whenever we need
to get the route's cookie. Refactor it into rt6_get_cookie().
It is a prep work to handle FLOWI_FLAG_KNOWN_NH and also
percpu rt6_info later.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch creates a RTF_CACHE routes only after encountering a pmtu
exception.
After ip6_rt_update_pmtu() has inserted the RTF_CACHE route to the fib6
tree, the rt->rt6i_node->fn_sernum is bumped which will fail the
ip6_dst_check() and trigger a relookup.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
When creating a RTF_CACHE route, RTF_ANYCAST is set based on rt6i_dst.
Also, rt6i_gateway is always set to the nexthop while the nexthop
could be a gateway or the rt6i_dst.addr.
After removing the rt6i_dst and rt6i_src dependency in the last patch,
we also need to stop the caller from depending on rt6i_gateway and
RTF_ANYCAST.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes the assumptions that the returned rt is always
a RTF_CACHE entry with the rt6i_dst and rt6i_src containing the
destination and source address. The dst and src can be recovered from
the calling site.
We may consider to rename (rt6i_dst, rt6i_src) to
(rt6i_key_dst, rt6i_key_src) later.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes the ipv6_select_ident() signature to return a
fragment id instead of taking a whole frag_hdr as a param to
only set the frag_hdr->identification.
It also cleans up ip6_fragment() to obtain the fragment id at the
beginning instead of using multiple "if" later to check fragment id
has been generated or not.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes the mib lock. The new locking mechanism is to protect
the mib values with the rtnl lock. Note that this isn't always necessary
if we have an interface up the most mib values are readonly (e.g.
address settings). With this behaviour we can remove locking in
hotpath like frame parsing completely. It depends on context if we need
to hold the rtnl lock or not, this makes the callbacks of
ieee802154_mlme_ops unnecessary because these callbacks hols always the
locks.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch will use atomic operations for sequence number incrementation
while MAC header generation. Upper layers like af_802154 or 6LoWPAN
could call this function in a parallel context while generating 802.15.4
MAC header before queuing into wpan interfaces transmit queue.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch removes the pib lock which is now replaced by rtnl lock. The
new interface already use the rtnl lock only. Nevertheless this patch
will fix issues while using new and old interface at the same time.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Conflicts:
drivers/net/ethernet/cadence/macb.c
drivers/net/phy/phy.c
include/linux/skbuff.h
net/ipv4/tcp.c
net/switchdev/switchdev.c
Switchdev was a case of RTNH_H_{EXTERNAL --> OFFLOAD}
renaming overlapping with net-next changes of various
sorts.
phy.c was a case of two changes, one adding a local
variable to a function whilst the second was removing
one.
tcp.c overlapped a deadlock fix with the addition of new tcp_info
statistic values.
macb.c involved the addition of two zyncq device entries.
skbuff.h involved adding back ipv4_daddr to nf_bridge_info
whilst net-next changes put two other existing members of
that struct into a union.
Signed-off-by: David S. Miller <davem@davemloft.net>
We no longer need bsocket atomic counter, as inet_csk_get_port()
calls bind_conflict() regardless of its value, after commit
2b05ad33e1 ("tcp: bind() fix autoselection to share ports")
This patch removes overhead of maintaining this counter and
double inet_csk_get_port() calls under pressure.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
Cc: Flavio Leitner <fbl@redhat.com>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 8e4d980ac2 ("tcp: fix behavior for epoll edge trigger")
we fixed a possible hang of TCP sockets under memory pressure,
by allowing sk_stream_alloc_skb() to use sk_forced_mem_schedule()
if no packet is in socket write queue.
It turns out there are other cases where we want to force memory
schedule :
tcp_fragment() & tso_fragment() need to split a big TSO packet into
two smaller ones. If we block here because of TCP memory pressure,
we can effectively block TCP socket from sending new data.
If no further ACK is coming, this hang would be definitive, and socket
has no chance to effectively reduce its memory usage.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The kernel-doc description for the drv_priv member of
struct ieee80211_txq was missing, leading to errors.
Add a suitable description to fix that.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
ip_do_nat() function was removed prior to kernel 3.4. Remove the
unnecessary function prototype as well.
Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This work as a follow-up of commit f7b3bec6f5 ("net: allow setting ecn
via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing
ECN connections. In other words, this work adds a retry with a non-ECN
setup SYN packet, as suggested from the RFC on the first timeout:
[...] A host that receives no reply to an ECN-setup SYN within the
normal SYN retransmission timeout interval MAY resend the SYN and
any subsequent SYN retransmissions with CWR and ECE cleared. [...]
Schematic client-side view when assuming the server is in tcp_ecn=2 mode,
that is, Linux default since 2009 via commit 255cac91c3 ("tcp: extend
ECN sysctl to allow server-side only ECN"):
1) Normal ECN-capable path:
SYN ECE CWR ----->
<----- SYN ACK ECE
ACK ----->
2) Path with broken middlebox, when client has fallback:
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
SYN ----->
<----- SYN ACK
ACK ----->
In case we would not have the fallback implemented, the middlebox drop
point would basically end up as:
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
SYN ECE CWR ----X crappy middlebox drops packet
(timeout, rtx)
In any case, it's rather a smaller percentage of sites where there would
occur such additional setup latency: it was found in end of 2014 that ~56%
of IPv4 and 65% of IPv6 servers of Alexa 1 million list would negotiate
ECN (aka tcp_ecn=2 default), 0.42% of these webservers will fail to connect
when trying to negotiate with ECN (tcp_ecn=1) due to timeouts, which the
fallback would mitigate with a slight latency trade-off. Recent related
paper on this topic:
Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
Gorry Fairhurst, and Richard Scheffenegger:
"Enabling Internet-Wide Deployment of Explicit Congestion Notification."
Proc. PAM 2015, New York.
http://ecn.ethz.ch/ecn-pam15.pdf
Thus, when net.ipv4.tcp_ecn=1 is being set, the patch will perform RFC3168,
section 6.1.1.1. fallback on timeout. For users explicitly not wanting this
which can be in DC use case, we add a net.ipv4.tcp_ecn_fallback knob that
allows for disabling the fallback.
tp->ecn_flags are not being cleared in tcp_ecn_clear_syn() on output, but
rather we let tcp_ecn_rcv_synack() take that over on input path in case a
SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent
ECN being negotiated eventually in that case.
Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Dave That <dave.taht@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_illinois and upcoming tcp_cdg require 64bit alignment of
icsk_ca_priv
x86 does not care, but other architectures might.
Fixes: 05cbc0db03 ("ipv4: Create probe timer for tcp PMTU as per RFC4821")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Fan Du <fan.du@intel.com>
Acked-by: Fan Du <fan.du@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch add support to nl802154 to dump all phy capabilities which is
inside the wpan_phy_supported struct. Also we introduce a new method to
dumping supported channels. The new method will offer a easier interface
and has lesser netlink traffic.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch introduce a flag property for the wpan phy structure.
The current flag settings in ieee802154_hw are accessable in mac802154
layer only which is okay for flags which indicates MAC handling which
are done by phy. For real PHY layer settings like cca mode, transmit
power, cca energy detection level.
The difference between these flags are that the MAC handling flags are
only handled in mac802154/HardMac layer e.g. on an interface up. The phy
settings are direct netlink calls from nl802154 into the driver layer
and the nl802154 need to have a chance to check if the driver supports
this handling before sending to the next layer.
We also check now on PHY flags while dumping and setting pib attributes.
In comparing with MIB attributes the 802.15.4 gives us an default value
which we assume when a transceiver implement less functionality. In case
of MIB settings the nl802154 layer doesn't need to check on the
ieee802154_hw flags then.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch adds check if the value is really changed inside pib/mib.
If a transceiver do support only one value for e.g. max_be then this
will also handle that the driver layer doesn't need to care about
handling to set one value only.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch adds support for phy supported handling for all other already
existing handling 802.15.4 functionality. We assume now a fully 802.15.4
complaint transceiver at phy allocation. If a transceiver can support
802.15.4 default values only, then the values should be overwirtten by
values the transceiver supports. If the transceiver doesn't set the
according hardware flags, we assume the 802.15.4 defaults now which
cannot be changed.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Suggested-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch introduce the wpan_phy_supported struct for wpan_phy. There
is currently no way to check if a transceiver can handle IEEE 802.15.4
complaint values. With this struct we can check before if the
transceiver supports these values before sending to driver layer.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Suggested-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
Acked-by: Varka Bhadram <varkabhadram@gmail.com>
Cc: Alan Ott <alan@signal11.us>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch change the handling of cca energy detection level from dbm to
mbm. This prepares to handle floating point cca energy detection levels
values. The old netlink 802.15.4 will convert the dbm value to mbm for
handling backward compatibility.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch change the handling of transmit power level from dbm to mbm.
This prepares to handle floating point transmit power levels values. The
old netlink 802.15.4 will convert the dbm value to mbm for handling
backward compatibility.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch change the transmit power from s8 to s32. This prepares to store a
mbm value instead dbm inside the transmit power variable. The old
interface keep the a s8 dbm value, which should be backward compatibility
when assign s8 to s32.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
When bridge netfilter re-fragments an IP packet for output, all
packets that can not be re-fragmented to their original input size
should be silently discarded.
However, current bridge netfilter output path generates an ICMP packet
with 'size exceeded MTU' message for such packets, this is a bug.
This patch refactors the ip_fragment() API to allow two separate
use cases. The bridge netfilter user case will not
send ICMP, the routing output will, as before.
Signed-off-by: Andy Zhou <azhou@nicira.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Improve readability of skip ICMP for de-fragmentation expiration logic.
This change will also make the logic easier to maintain when the
following patches in this series are applied.
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The spinlock is used to protect netns_ids which is per net,
so there is no need to use a global spinlock.
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- introduce port fdb obj and generic switchdev_port_fdb_add/del/dump()
- use switchdev_port_fdb_add/del/dump in rocker/team/bonding ndo ops.
- add support for fdb obj in switchdev_port_obj_add/del/dump()
- switch rocker to implement fdb ops via switchdev_ops
v3: updated to sync with named union changes.
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce an optimized version of sk_under_memory_pressure()
for TCP. Our intent is to use it in fast paths.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We plan to use sk_forced_wmem_schedule() in input path as well,
so make it non static and rename it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_mem_reclaim_partial() goal is to ensure each socket has
one SK_MEM_QUANTUM forward allocation. This is needed both for
performance and better handling of memory pressure situations in
follow up patches.
SK_MEM_QUANTUM is currently a page, but might be reduced to 4096 bytes
as some arches have 64KB pages.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now we allow storing more request socks per listener, we might
hit syncookie mode less often and hit following bug in our stack :
When we send a burst of syncookies, then exit this mode,
tcp_synq_no_recent_overflow() can return false if the ACK packets coming
from clients are coming three seconds after the end of syncookie
episode.
This is a way too strong requirement and conflicts with rest of
syncookie code which allows ACK to be aged up to 2 minutes.
Perfectly valid ACK packets are dropped just because clients might be
in a crowded wifi environment or on another planet.
So let's fix this, and also change tcp_synq_overflow() to not
dirty a cache line for every syncookie we send, as we are under attack.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Florian Westphal <fw@strlen.de>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a static inline with identical definitions in multiple places...
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
So far, only hashes made out of ipv6 addresses could be dissected. This
patch introduces support for dissection of full ipv6 addresses.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce dissector infrastructure which allows user to specify which
parts of skb he wants to dissect.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the definition of the function is in flow_dissector.c, it makes
sense to have the declaration in flow_dissector.h
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since these functions are defined in flow_dissector.c, move header
declarations from skbuff.h into flow_dissector.h
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 56193d1bce ("net: Add function for parsing the header length out
of linear ethernet frames") added this function declaration but it is
defined nowhere.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Seems all we want here is to avoid endless 'goto reclassify' loop.
tc_classify_compat even resets this counter when something other
than TC_ACT_RECLASSIFY is returned, so this skb-counter doesn't
break hypothetical loops induced by something other than perpetual
TC_ACT_RECLASSIFY return values.
skb_act_clone is now identical to skb_clone, so just use that.
Tested with following (bogus) filter:
tc filter add dev eth0 parent ffff: \
protocol ip u32 match u32 0 0 police rate 10Kbit burst \
64000 mtu 1500 action reclassify
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Four minor merge conflicts:
1) qca_spi.c renamed the local variable used for the SPI device
from spi_device to spi, meanwhile the spi_set_drvdata() call
got moved further up in the probe function.
2) Two changes were both adding new members to codel params
structure, and thus we had overlapping changes to the
initializer function.
3) 'net' was making a fix to sk_release_kernel() which is
completely removed in 'net-next'.
4) In net_namespace.c, the rtnl_net_fill() call for GET operations
had the command value fixed, meanwhile 'net-next' adjusted the
argument signature a bit.
This also matches example merge resolutions posted by Stephen
Rothwell over the past two days.
Signed-off-by: David S. Miller <davem@davemloft.net>
Older gcc versions (e.g. gcc version 4.4.6) don't like anonymous unions
which was causing build issues on the newly added switchdev attr/obj
structs. Fix this by using named union on structs.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Reported-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As xfrm_output_one() is the only caller of skb_dst_pop(), we should
make skb_dst_pop() localized.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The IPv4 FIB ops convert nicely to the switchdev objs and we're left with
only four switchdev ops: port get/set and port add/del. Other objs will
follow, such as FDB. So go ahead and convert IPv4 FIB over to switchdev
obj for consistency, anticipating more objs to come.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Like bridge_setlink, add switchdev wrapper to handle bridge_getlink and
call into port driver to get port attrs. For now, only BR_LEARNING and
BR_LEARNING_SYNC are returned. To add more, we'll probably want to break
away from ndo_dflt_bridge_getlink() and build the netlink skb directly in
the switchdev code.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now we can remove old wrappers for dellink.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Same change as setlink. Provide the wrapper op for SELF ndo_bridge_dellink
and call into the switchdev driver to delete afspec VLANs.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
New attr-based bridge_setlink can recurse lower devs and recover on err, so
remove old wrapper (including ndo_dflt_switchdev_port_bridge_setlink).
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
rocker: use switchdev get/set attr for bridge port flags
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
VLAN obj has flags (PVID and untagged) as well as start and end vid ranges.
The switchdev driver can optimize programing the device using the ranges.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Like switchdev attr get/set, add new switchdev obj add/del. switchdev objs
will be things like VLANs or FIB entries, so add/del fits better for
objects than get/set used for attributes.
Use same two-phase prepare-commit transaction model as in attr set.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
STP update is just a settable port attribute, so convert
switchdev_port_stp_update to an attr set.
For DSA, the prepare phase is skipped and STP updates are only done in the
commit phase. This is because currently the DSA drivers don't need to
allocate any memory for STP updates and the STP update will not fail to HW
(unless something horrible goes wrong on the MDIO bus, in which case the
prepare phase wouldn't have been able to predict anyway).
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Switch ID is just a gettable port attribute. Convert switchdev op
switchdev_parent_id_get to a switchdev attr.
Note: for sysfs and netlink interfaces, SWITCHDEV_ATTR_PORT_PARENT_ID is
called with SWITCHDEV_F_NO_RECUSE to limit switch ID user-visiblity to only
port netdevs. So when a port is stacked under bond/bridge, the user can
only query switch id via the switch ports, but not via the upper devices
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add two new swdev ops for get/set switch port attributes. Most swdev
interactions on a port are gets or sets on port attributes, so rather than
adding ops for each attribute, let's define clean get/set ops for all
attributes, and then we can have clear, consistent rules on how attributes
propagate on stacked devs.
Add the basic algorithms for get/set attr ops. Use the same recusive algo
to walk lower devs we've used for STP updates, for example. For get,
compare attr value for each lower dev and only return success if attr
values match across all lower devs. For sets, set the same attr value for
all lower devs. We'll use a two-phase prepare-commit transaction model for
sets. In the first phase, the driver(s) are asked if attr set is OK. If
all OK, the commit attr set in second phase. A driver would NACK the
prepare phase if it can't set the attr due to lack of resources or support,
within it's control. RTNL lock must be held across both phases because
we'll recurse all lower devs first in prepare phase, and then recurse all
lower devs again in commit phase. If any lower dev fails the prepare
phase, we need to abort the transaction for all lower devs.
If lower dev recusion isn't desired, allow a flag SWITCHDEV_F_NO_RECURSE to
indicate get/set only work on port (lowest) device.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Turned out that "switchdev" sticks. So just unify all related terms to use
this prefix.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Turned out that "switchdev" sticks. So just unify all related terms to use
this prefix.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only left enqueue_root() user is netem, and it looks not necessary :
qdisc_skb_cb(skb)->pkt_len is preserved after one skb_clone()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The port key has three components - user-key, speed-part, and duplex-part.
The LSBit is for the duplex-part, next 5 bits are for the speed while the
remaining 10 bits are the user defined key bits. Get these 10 bits
from the user-space (through the SysFs interface) and use it to form the
admin port-key. Allowed range for the user-key is 0 - 1023 (10 bits). If
it is not provided then use zero for the user-key-bits (default).
It can set using following example code -
# modprobe bonding mode=4
# usr_port_key=$(( RANDOM & 0x3FF ))
# echo $usr_port_key > /sys/class/net/bond0/bonding/ad_user_port_key
# echo +eth1 > /sys/class/net/bond0/bonding/slaves
...
# ip link set bond0 up
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
[jt: * fixed up style issues reported by checkpatch
* fixed up context from change in ad_actor_sys_prio patch]
Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In an AD system, the communication between actor and partner is the
business between these two entities. In the current setup anyone on the
same L2 can "guess" the LACPDU contents and then possibly send the
spoofed LACPDUs and trick the partner causing connectivity issues for
the AD system. This patch allows to use a random mac-address obscuring
it's identity making it harder for someone in the L2 is do the same thing.
This patch allows user-space to choose the mac-address for the AD-system.
This mac-address can not be NULL or a Multicast. If the mac-address is set
from user-space; kernel will honor it and will not overwrite it. In the
absence (value from user space); the logic will default to using the
masters' mac as the mac-address for the AD-system.
It can be set using example code below -
# modprobe bonding mode=4
# sys_mac_addr=$(printf '%02x:%02x:%02x:%02x:%02x:%02x' \
$(( (RANDOM & 0xFE) | 0x02 )) \
$(( RANDOM & 0xFF )) \
$(( RANDOM & 0xFF )) \
$(( RANDOM & 0xFF )) \
$(( RANDOM & 0xFF )) \
$(( RANDOM & 0xFF )))
# echo $sys_mac_addr > /sys/class/net/bond0/bonding/ad_actor_system
# echo +eth1 > /sys/class/net/bond0/bonding/slaves
...
# ip link set bond0 up
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
[jt: fixed up style issues reported by checkpatch]
Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows user to randomize the system-priority in an ad-system.
The allowed range is 1 - 0xFFFF while default value is 0xFFFF. If user
does not specify this value, the system defaults to 0xFFFF, which is
what it was before this patch.
Following example code could set the value -
# modprobe bonding mode=4
# sys_prio=$(( 1 + RANDOM + RANDOM ))
# echo $sys_prio > /sys/class/net/bond0/bonding/ad_actor_sys_prio
# echo +eth1 > /sys/class/net/bond0/bonding/slaves
...
# ip link set bond0 up
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
[jt: * fixed up style issues reported by checkpatch
* changed how the default value is set in bond_check_params(), this
makes the default consistent between what gets set for a new bond
and what the default is claimed to be in the bonding options.]
Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These functions are no longer needed and no longer used kill them.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that sk_alloc knows when a kernel socket is being allocated modify
it to not reference count the network namespace of kernel sockets.
Keep track of if a socket needs reference counting by adding a flag to
struct sock called sk_net_refcnt.
Update all of the callers of sock_create_kern to stop using
sk_change_net and sk_release_kernel as those hacks are no longer
needed, to avoid reference counting a kernel socket.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for changing how struct net is refcounted
on kernel sockets pass the knowledge that we are creating
a kernel socket from sock_create_kern through to sk_alloc.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For DCTCP or similar ECN based deployments on fabrics with shallow
buffers, hosts are responsible for a good part of the buffering.
This patch adds an optional ce_threshold to codel & fq_codel qdiscs,
so that DCTCP can have feedback from queuing in the host.
A DCTCP enabled egress port simply have a queue occupancy threshold
above which ECT packets get CE mark.
In codel language this translates to a sojourn time, so that one doesn't
have to worry about bytes or bandwidth but delays.
This makes the host an active participant in the health of the whole
network.
This also helps experimenting DCTCP in a setup without DCTCP compliant
fabric.
On following example, ce_threshold is set to 1ms, and we can see from
'ldelay xxx us' that TCP is not trying to go around the 5ms codel
target.
Queue has more capacity to absorb inelastic bursts (say from UDP
traffic), as queues are maintained to an optimal level.
lpaa23:~# ./tc -s -d qd sh dev eth1
qdisc mq 1: dev eth1 root
Sent 87910654696 bytes 58065331 pkt (dropped 0, overlimits 0 requeues 42961)
backlog 3108242b 364p requeues 42961
qdisc codel 8063: dev eth1 parent 1:1 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
Sent 7363778701 bytes 4863809 pkt (dropped 0, overlimits 0 requeues 5503)
rate 2348Mbit 193919pps backlog 255866b 46p requeues 5503
count 0 lastcount 0 ldelay 1.0ms drop_next 0us
maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 72384
qdisc codel 8064: dev eth1 parent 1:2 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
Sent 7636486190 bytes 5043942 pkt (dropped 0, overlimits 0 requeues 5186)
rate 2319Mbit 191538pps backlog 207418b 64p requeues 5186
count 0 lastcount 0 ldelay 694us drop_next 0us
maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 69873
qdisc codel 8065: dev eth1 parent 1:3 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
Sent 11569360142 bytes 7641602 pkt (dropped 0, overlimits 0 requeues 5554)
rate 3041Mbit 251096pps backlog 210446b 59p requeues 5554
count 0 lastcount 0 ldelay 889us drop_next 0us
maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 37780
...
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Glenn Judd <glenn.judd@morganstanley.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
More accurately, listen all netns that have a nsid assigned into the netns
where the netlink socket is opened.
For this purpose, a netlink socket option is added:
NETLINK_LISTEN_ALL_NSID. When this option is set on a netlink socket, this
socket will receive netlink notifications from all netns that have a nsid
assigned into the netns where the socket has been opened. The nsid is sent
to userland via an anscillary data.
With this patch, a daemon needs only one socket to listen many netns. This
is useful when the number of netns is high.
Because 0 is a valid value for a nsid, the field nsid_is_set indicates if
the field nsid is valid or not. skb->cb is initialized to 0 on skb
allocation, thus we are sure that we will never send a nsid 0 by error to
the userland.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
In a following commit, a new function will be introduced to only lookup for
a nsid (no allocation if the nsid doesn't exist). To avoid confusion, the
existing function is renamed.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
a lot of small fixes and cleanups, the bigger items are:
* proper mac80211 rate control locking, to fix some random crashes
(this required changing other locking as well)
* mac80211 "fast-xmit", a mechanism to reduce, in most cases, the
amount of code we execute while going from ndo_start_xmit() to
the driver
* this also clears the way for properly supporting S/G and checksum
and segmentation offloads
-----BEGIN PGP SIGNATURE-----
iQIcBAABCAAGBQJVSh65AAoJEDBSmw7B7bqrudwP/0iXyNQhF0mLTENrx+rdsDZS
qQhB/8wejJaOJb89Re7M+bhwri7Q6S5BM/G24vhMc01dxmqNMcdKfEV3+nlmc5C+
KeEgTI9aZiCnUt4WAd54Zwbkc9o+1kBtaFuaWDvOdQHUf0WDwEIQxjnV4+SZujV9
xl1TV5yV35hRQgrDE8ZSbtOYRmhSVoi0MEgwqAjzdN2fEPyWVeqwYULDtpOopjL2
UHQgv0E2fYVRWennHyQQ88tWBQg+EsRaG1U1/rYHhNBmAJ+f9AOxKi7ErzxYfkbM
961B+3E++pM+zUeqw6+jaMKqT5jeCCM5ugCNSG4NrIvfxDIDgecAFV9Fs2islnI4
8xd3GqyA5iqaitAWIUsaYaQfaAcwSIlpSinfQW9EUm2wuCkPyZboFP+GRd2K7sQn
FnRJSJ9PkGPdWwdDE3gunLHBHtbDS0z+R8VegIeS0qT8LamkqICiNQSyPlsTeluW
ig2kwHsDdj3k11wyelhfp/RdtsOch/brKpLSjdzPXC1BzIWhQLwmsPh9qZ83vSB9
qbLsdnM/IPQXocWB6fOhmwaGsLeRalxs2yQFM0zdJCwpaU9dzKsJrxepAXVuq31p
r0fygWTp8GVevHXzfS7fRya8xjsTRrSs6n2kH7ErOfiep13HQypAjbyLswNe4kW/
D6x8pVC3AhdGkl/9CW4m
=oUlh
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2015-05-06' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
Lots of updates for net-next for this cycle. As usual, we have
a lot of small fixes and cleanups, the bigger items are:
* proper mac80211 rate control locking, to fix some random crashes
(this required changing other locking as well)
* mac80211 "fast-xmit", a mechanism to reduce, in most cases, the
amount of code we execute while going from ndo_start_xmit() to
the driver
* this also clears the way for properly supporting S/G and checksum
and segmentation offloads
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Diagnosing problems related to Window Probes has been hard because
we lack a counter.
TCPWinProbe counts the number of ACK packets a sender has to send
at regular intervals to make sure a reverse ACK packet opening back
a window had not been lost.
TCPKeepAlive counts the number of ACK packets sent to keep TCP
flows alive (SO_KEEPALIVE)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the advent of small rto timers in datacenter TCP,
(ip route ... rto_min x), the following can happen :
1) Qdisc is full, transmit fails.
TCP sets a timer based on icsk_rto to retry the transmit, without
exponential backoff.
With low icsk_rto, and lot of sockets, all cpus are servicing timer
interrupts like crazy.
Intent of the code was to retry with a timer between 200 (TCP_RTO_MIN)
and 500ms (TCP_RESOURCE_PROBE_INTERVAL)
2) Receivers can send zero windows if they don't drain their receive queue.
TCP sends zero window probes, based on icsk_rto current value, with
exponential backoff.
With /proc/sys/net/ipv4/tcp_retries2 being 15 (or even smaller in
some cases), sender can abort in less than one or two minutes !
If receiver stops the sender, it obviously doesn't care of very tight
rto. Probability of dropping the ACK reopening the window is not
worth the risk.
Lets change the base timer to be at least 200ms (TCP_RTO_MIN) for these
events (but not normal RTO based retransmits)
A followup patch adds a new SNMP counter, as it would have helped a lot
diagnosing this issue.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The GO_CONCURRENT regulatory definition can be extended to station
interfaces requesting to IR as part of TDLS off-channel operations.
Rename the GO_CONCURRENT flag to IR_CONCURRENT and allow the added
use-case.
Change internal users of GO_CONCURRENT to use the new definition.
Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
For ciphers not supported by mac80211, the function currently
doesn't return any PN data. Fix this by extending the driver's
get_key_seq() a little more to allow moving arbitrary PN data.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Extend the function to read the TKIP IV32/IV16 to read the IV/PN for
all ciphers in order to allow drivers with full hardware crypto to
properly support this.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This patch allows a server application to get the TCP SYN headers for
its passive connections. This is useful if the server is doing
fingerprinting of clients based on SYN packet contents.
Two socket options are added: TCP_SAVE_SYN and TCP_SAVED_SYN.
The first is used on a socket to enable saving the SYN headers
for child connections. This can be set before or after the listen()
call.
The latter is used to retrieve the SYN headers for passive connections,
if the parent listener has enabled TCP_SAVE_SYN.
TCP_SAVED_SYN is read once, it frees the saved SYN headers.
The data returned in TCP_SAVED_SYN are network (IPv4/IPv6) and TCP
headers.
Original patch was written by Tom Herbert, I changed it to not hold
a full skb (and associated dst and conntracking reference).
We have used such patch for about 3 years at Google.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is just a code cleanup, make the LED trigger names const
as they're not expected to be modified by drivers.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
* a fix for an issue with hash collision handling in the
rhashtable conversion
* a merge issue - rhashtable removed default shrinking
just before mac80211 was converted, so enable it now
* remove an invalid WARN that can trigger with legitimate
userspace behaviour
* add a struct member missing from kernel-doc that caused
a lot of warnings
-----BEGIN PGP SIGNATURE-----
iQIcBAABCAAGBQJVR1FIAAoJEDBSmw7B7bqrbkQP/RPleNMLHNOLRoJva6vNVers
hiyLwG+lpV9yHCUX1I7TsvyTHddjW3ANNC1LvlKyUiNzBBsr/Yg/Qe1RVGkRTK7E
ZPTU7xutk2nSLjvbVLjTQtIAynKZM6W9vM3X16WtsZ9+25z4e8jN8BjhzP99Vs7X
IkwL7Qe2q1O/ta+/ByVR4dFWtGyJDxRjXkFs4cQ/VdL/fJalKxIa9YJrmUwIBMOU
UIuJ1lh7kkNFGky+4Y3dBA/2L+O/iS64dxe+ygLMJGf9xdGGHwyuM6oe6OKEKPi1
pF0NIWgHRGOUEJNyUKZTOwCyk1PeOr9aFFDHN4GJyE463KzZqVYdA4ajI6udwL2h
I/AjsBnM0zQcoBD8HUeJK2ZgndPxOrMZGVjs0/M3K9lFocYBALoLkt5YPD1PXOZs
9EQ8mhej3/xbwyEaaAp+HJPOr/wLZ//juxsPP8HSR1BDm8cR40rd9rETEGWnmrxt
5vpmb2vNpKekw+pKf2vwuEH6BwH02nuz4YPudTswiEzObEQAuGx+rpopQVI9bdgE
V/r3WY0cl+bDMrcSpoZ7v6FDoHHA/zPq9aAAV0JsMyYApY0nMdqwShBJq5cTSoK9
rK5I6oIbqChskpoWULUVbbjb1fLsNVI4C89KjxW8xziTEY1d2fEeeJ2tR0lYBX/5
/KOygSze3Qi0MjrdE9D2
=sf/A
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2015-05-04' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
We have only a few fixes right now:
* a fix for an issue with hash collision handling in the
rhashtable conversion
* a merge issue - rhashtable removed default shrinking
just before mac80211 was converted, so enable it now
* remove an invalid WARN that can trigger with legitimate
userspace behaviour
* add a struct member missing from kernel-doc that caused
a lot of warnings
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2015-05-04
Here's the first bluetooth-next pull request for 4.2:
- Various fixes for at86rf230 driver
- ieee802154: trace events support for rdev->ops
- HCI UART driver refactoring
- New Realtek IDs added to btusb driver
- Off-by-one fix for rtl8723b in btusb driver
- Refactoring of btbcm driver for both UART & USB use
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
With this patch, the IGMP and MLD message validation functions are moved
from the bridge code to IPv4/IPv6 multicast files. Some small
refactoring was done to enhance readibility and to iron out some
differences in behaviour between the IGMP and MLD parsing code (e.g. the
skb-cloning of MLD messages is now only done if necessary, just like the
IGMP part always did).
Finally, these IGMP and MLD message validation functions are exported so
that not only the bridge can use it but batman-adv later, too.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eliminate 90 of these warnings:
Warning(..//include/net/mac80211.h:1682): No description found for parameter 'drv_priv[0]'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Some users of flow keys (well just sch_choke now) need to pass
flow_keys in skbuff cb, and use them for exact comparisons of flows
so that skb->hash is not sufficient. In order to increase size of
the flow_keys structure, we introduce another structure for
the purpose of passing flow keys in skbuff cb. We limit this structure
to sixteen bytes, and we will technically treat this as a digest of
flow_keys struct hence its name flow_keys_digest. In the first
incaranation we just copy the flow_keys structure up to 16 bytes--
this is the same information previously passed in the cb. In the
future, we'll adapt this for larger flow_keys and could use something
like SHA-1 over the whole flow_keys to improve the quality of the
digest.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Under presence of TSO/GSO/GRO packets, codel at low rates can be quite
useless. In following example, not a single packet was ever dropped,
while average delay in codel queue is ~100 ms !
qdisc codel 0: parent 1:12 limit 16000p target 5.0ms interval 100.0ms
Sent 134376498 bytes 88797 pkt (dropped 0, overlimits 0 requeues 0)
backlog 13626b 3p requeues 0
count 0 lastcount 0 ldelay 96.9ms drop_next 0us
maxpacket 9084 ecn_mark 0 drop_overlimit 0
This comes from a confusion of what should be the minimal backlog. It is
pretty clear it is not 64KB or whatever max GSO packet ever reached the
qdisc.
codel intent was to use MTU of the device.
After the fix, we finally drop some packets, and rtt/cwnd of my single
TCP flow are meeting our expectations.
qdisc codel 0: parent 1:12 limit 16000p target 5.0ms interval 100.0ms
Sent 102798497 bytes 67912 pkt (dropped 1365, overlimits 0 requeues 0)
backlog 6056b 3p requeues 0
count 1 lastcount 1 ldelay 36.3ms drop_next 0us
maxpacket 10598 ecn_mark 0 drop_overlimit 0
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kathleen Nichols <nichols@pollere.com>
Cc: Dave Taht <dave.taht@gmail.com>
Cc: Van Jacobson <vanj@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch divides the IPv6 flow label space into two ranges:
0-7ffff is reserved for flow label manager, 80000-fffff will be
used for creating auto flow labels (per RFC6438). This only affects how
labels are set on transmit, it does not affect receive. This range split
can be disbaled by systcl.
Background:
IPv6 flow labels have been an unmitigated disappointment thus far
in the lifetime of IPv6. Support in HW devices to use them for ECMP
is lacking, and OSes don't turn them on by default. If we had these
we could get much better hashing in IPv6 networks without resorting
to DPI, possibly eliminating some of the motivations to to define new
encaps in UDP just for getting ECMP.
Unfortunately, the initial specfications of IPv6 did not clarify
how they are to be used. There has always been a vague concept that
these can be used for ECMP, flow hashing, etc. and we do now have a
good standard how to this in RFC6438. The problem is that flow labels
can be either stateful or stateless (as in RFC6438), and we are
presented with the possibility that a stateless label may collide
with a stateful one. Attempts to split the flow label space were
rejected in IETF. When we added support in Linux for RFC6438, we
could not turn on flow labels by default due to this conflict.
This patch splits the flow label space and should give us
a path to enabling auto flow labels by default for all IPv6 packets.
This is an API change so we need to consider compatibility with
existing deployment. The stateful range is chosen to be the lower
values in hopes that most uses would have chosen small numbers.
Once we resolve the stateless/stateful issue, we can proceed to
look at enabling RFC6438 flow labels by default (starting with
scaled testing).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Not used.
pedit sets TC_MUNGED when packet content was altered, but all the core
does is unset MUNGED again and then set OK2MUNGE.
And the latter isn't tested anywhere. So lets remove both
TC_MUNGED and TC_OK2MUNGE.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
_rt6i_peer is no longer needed after the last patch,
'ipv6: Stop rt6_info from using inet_peer's metrics'.
DST_METRICS_FORCE_OVERWRITE is added by
commit e5fd387ad5 ("ipv6: do not overwrite inetpeer metrics prematurely").
Since inetpeer is no longer used for metrics, this bit is also not needed.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Michal Kubeček <mkubecek@suse.cz>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet_peer is indexed by the dst address alone. However, the fib6 tree
could have multiple routing entries (rt6_info) for the same dst. For
example,
1. A /128 dst via multiple gateways.
2. A RTF_CACHE route cloned from a /128 route.
In the above cases, all of them will share the same metrics and
step on each other.
This patch will steer away from inet_peer's metrics and use
dst_cow_metrics_generic() for everything.
Change Highlights:
1. Remove rt6_cow_metrics() which currently acquires metrics from
inet_peer for DST_HOST route (i.e. /128 route).
2. Add rt6i_pmtu to take care of the pmtu update to avoid creating a
full size metrics just to override the RTAX_MTU.
3. After (2), the RTF_CACHE route can also share the metrics with its
dst.from route, by:
dst_init_metrics(&cache_rt->dst, dst_metrics_ptr(cache_rt->dst.from), true);
4. Stop creating RTF_CACHE route by cloning another RTF_CACHE route. Instead,
directly clone from rt->dst.
[ Currently, cloning from another RTF_CACHE is only possible during
rt6_do_redirect(). Also, the old clone is removed from the tree
immediately after the new clone is added. ]
In case of cloning from an older redirect RTF_CACHE, it should work as
before.
In case of cloning from an older pmtu RTF_CACHE, this patch will forget
the pmtu and re-learn it (if there is any) from the redirected route.
The _rt6i_peer and DST_METRICS_FORCE_OVERWRITE will be removed
in the next cleanup patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This code is based on commit 6bab2e19c5
("cfg80211: pass name_assign_type to rdev_add_virtual_intf()")
This will expose in sysfs whether the ifname of a IEEE-802.15.4
device is set by userspace or generated by the kernel.
We are using two types of name_assign_types
o NET_NAME_ENUM: Default interface name provided by kernel
o NET_NAME_USER: Interface name provided by user.
Signed-off-by: Varka Bhadram <varkab@cdac.in>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch adds the proper description to the mac802154 core APIs.
Signed-off-by: Varka Bhadram <varkab@cdac.in>
Acked-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
We would like that optional info provided by Congestion Control
modules using netlink can also be read using getsockopt()
This patch changes get_info() to put this information in a buffer,
instead of skb, like tcp_get_info(), so that following patch
can reuse this common infrastructure.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch tracks total number of bytes acked for a TCP socket.
This is the sum of all changes done to tp->snd_una, and allows
for precise tracking of delivered data.
RFC4898 named this : tcpEStatsAppHCThruOctetsAcked
This is a 64bit field, and can be fetched both from TCP_INFO
getsockopt() if one has a handle on a TCP socket, or from inet_diag
netlink facility (iproute2/ss patch will follow)
Note that tp->bytes_acked was placed near tp->snd_una for
best data locality and minimal performance impact.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Eric Salo <salo@google.com>
Cc: Martin Lau <kafai@fb.com>
Cc: Chris Rapier <rapier@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bonding modules currently defines four macros with
general names that pollute the global namespace:
DRV_VERSION
DRV_RELDATE
DRV_NAME
DRV_DESCRIPTION
Fixing that by defining a private bonding_priv.h
header files which includes those defines.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[ 3897.923145] BUG: unable to handle kernel NULL pointer dereference at
0000000000000080
[ 3897.931025] IP: [<ffffffffa9f27686>] reqsk_timer_handler+0x1a6/0x243
There is a race when reqsk_timer_handler() and tcp_check_req() call
inet_csk_reqsk_queue_unlink() on the same req at the same time.
Before commit fa76ce7328 ("inet: get rid of central tcp/dccp listener
timer"), listener spinlock was held and race could not happen.
To solve this bug, we change reqsk_queue_unlink() to not assume req
must be found, and we return a status, to conditionally release a
refcount on the request sock.
This also means tcp_check_req() in non fastopen case might or not
consume req refcount, so tcp_v6_hnd_req() & tcp_v4_hnd_req() have
to properly handle this.
(Same remark for dccp_check_req() and its callers)
inet_csk_reqsk_queue_drop() is now too big to be inlined, as it is
called 4 times in tcp and 3 times in dccp.
Fixes: fa76ce7328 ("inet: get rid of central tcp/dccp listener timer")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When frames time out in the reordering buffer, it is a
good indication that something went wrong and the driver
may want to know about that to take action or trigger
debug flows.
It is pointless to notify the driver about each frame that
is released. Notify each time the timer fires.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When we receive a BAR, this typically means that our peer
doesn't hear our Block-Acks or that we can't hear its
frames. Either way, it is a good indication that the link
is in a bad condition. This is why it can serve as a probe
to the driver.
Use the event_callback callback for this.
Since more events with the same data will be added in the
feature, the structure that describes the data attached to
the event is called in a generic name: ieee80211_ba_event.
This also means that from now on, the event_callback can't
sleep.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This support is essentially useless as typically networks are encrypted,
frames will be filtered by hardware, and rate scaling will be done with
the intended recipient in mind. For real monitoring of the network, the
monitor mode support should be used instead.
Removing it removes a lot of corner cases.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If drivers want to support S/G (really just gather DMA on TX) then
we can now easily support this on the fast-xmit path since it just
needs to write to the ethernet header (and already has a check for
that being possible.)
However, disallow this on the regular TX path (which has to handle
fragmentation, software crypto, etc.) by calling skb_linearize().
Also allow the related HIGHDMA since that's not interesting to the
code in mac80211 at all anyway.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In order to speed up mac80211's TX path, add the "fast-xmit" cache
that will cache the data frame 802.11 header and other data to be
able to build the frame more quickly. This cache is rebuilt when
external triggers imply changes, but a lot of the checks done per
packet today are simplified away to the check for the cache.
There's also a more detailed description in the code.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
A couple of enums in mac80211.h became structures recently, but the
comments didn't follow suit, leading to errors like:
Error(.//include/net/mac80211.h:367): Cannot parse enum!
Documentation/DocBook/Makefile:93: recipe for target 'Documentation/DocBook/80211.xml' failed
make[1]: *** [Documentation/DocBook/80211.xml] Error 1
Makefile:1361: recipe for target 'mandocs' failed
make: *** [mandocs] Error 2
Fix the comments comments accordingly. Added a couple of other small
comment fixes while I was there to silence other recently-added docbook
warnings.
Reported-by: Jim Davis <jim.epost@gmail.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Pull networking fixes from David Miller:
1) Fix verifier memory corruption and other bugs in BPF layer, from
Alexei Starovoitov.
2) Add a conservative fix for doing BPF properly in the BPF classifier
of the packet scheduler on ingress. Also from Alexei.
3) The SKB scrubber should not clear out the packet MARK and security
label, from Herbert Xu.
4) Fix oops on rmmod in stmmac driver, from Bryan O'Donoghue.
5) Pause handling is not correct in the stmmac driver because it
doesn't take into consideration the RX and TX fifo sizes. From
Vince Bridgers.
6) Failure path missing unlock in FOU driver, from Wang Cong.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits)
net: dsa: use DEVICE_ATTR_RW to declare temp1_max
netns: remove BUG_ONs from net_generic()
IB/ipoib: Fix ndo_get_iflink
sfc: Fix memcpy() with const destination compiler warning.
altera tse: Fix network-delays and -retransmissions after high throughput.
net: remove unused 'dev' argument from netif_needs_gso()
act_mirred: Fix bogus header when redirecting from VLAN
inet_diag: fix access to tcp cc information
tcp: tcp_get_info() should fetch socket fields once
net: dsa: mv88e6xxx: Add missing initialization in mv88e6xxx_set_port_state()
skbuff: Do not scrub skb mark within the same name space
Revert "net: Reset secmark when scrubbing packet"
bpf: fix two bugs in verification logic when accessing 'ctx' pointer
bpf: fix bpf helpers to use skb->mac_header relative offsets
stmmac: Configure Flow Control to work correctly based on rxfifo size
stmmac: Enable unicast pause frame detect in GMAC Register 6
stmmac: Read tx-fifo-depth and rx-fifo-depth from the devicetree
stmmac: Add defines and documentation for enabling flow control
stmmac: Add properties for transmit and receive fifo sizes
stmmac: fix oops on rmmod after assigning ip addr
...
This inline has ~500 callsites.
On 04/14/2015 08:37 PM, David Miller wrote:
> That BUG_ON() was added 7 years ago, and I don't remember it ever
> triggering or helping us diagnose something, so just remove it and
> keep the function inlined.
On x86 allyesconfig build:
text data bss dec hex filename
82447071 22255384 20627456 125329911 77861f7 vmlinux4
82441375 22255384 20627456 125324215 7784bb7 vmlinux5prime
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: David S. Miller <davem@davemloft.net>
CC: Jan Engelhardt <jengelh@medozas.de>
CC: Jiri Pirko <jpirko@redhat.com>
CC: linux-kernel@vger.kernel.org
CC: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Two different problems are fixed here :
1) inet_sk_diag_fill() might be called without socket lock held.
icsk->icsk_ca_ops can change under us and module be unloaded.
-> Access to freed memory.
Fix this using rcu_read_lock() to prevent module unload.
2) Some TCP Congestion Control modules provide information
but again this is not safe against icsk->icsk_ca_ops
change and nla_put() errors were ignored. Some sockets
could not get the additional info if skb was almost full.
Fix this by returning a status from get_info() handlers and
using rcu protection as well.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull second vfs update from Al Viro:
"Now that net-next went in... Here's the next big chunk - killing
->aio_read() and ->aio_write().
There'll be one more pile today (direct_IO changes and
generic_write_checks() cleanups/fixes), but I'd prefer to keep that
one separate"
* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
->aio_read and ->aio_write removed
pcm: another weird API abuse
infinibad: weird APIs switched to ->write_iter()
kill do_sync_read/do_sync_write
fuse: use iov_iter_get_pages() for non-splice path
fuse: switch to ->read_iter/->write_iter
switch drivers/char/mem.c to ->read_iter/->write_iter
make new_sync_{read,write}() static
coredump: accept any write method
switch /dev/loop to vfs_iter_write()
serial2002: switch to __vfs_read/__vfs_write
ashmem: use __vfs_read()
export __vfs_read()
autofs: switch to __vfs_write()
new helper: __vfs_write()
switch hugetlbfs to ->read_iter()
coda: switch to ->read_iter/->write_iter
ncpfs: switch to ->read_iter/->write_iter
net/9p: remove (now-)unused helpers
p9_client_attach(): set fid->uid correctly
...
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
A final pull request, I know it's very late but this time I think it's worth a
bit of rush.
The following patchset contains Netfilter/nf_tables updates for net-next, more
specifically concatenation support and dynamic stateful expression
instantiation.
This also comes with a couple of small patches. One to fix the ebtables.h
userspace header and another to get rid of an obsolete example file in tree
that describes a nf_tables expression.
This time, I decided to paste the original descriptions. This will result in a
rather large commit description, but I think these bytes to keep.
Patrick McHardy says:
====================
netfilter: nf_tables: concatenation support
The following patches add support for concatenations, which allow multi
dimensional exact matches in O(1).
The basic idea is to split the data registers, currently consisting of
4 registers of 16 bytes each, into smaller units, 16 registers of 4
bytes each, and making sure each register store always leaves the
full 32 bit in a well defined state, meaning smaller stores will
zero the remaining bits.
Based on that, we can load multiple adjacent registers with different
values, thereby building a concatenated bigger value, and use that
value for set lookups.
Sets are changed to use variable sized extensions for their key and
data values, removing the fixed limit of 16 bytes while saving memory
if less space is needed.
As a side effect, these patches will allow some nice optimizations in
the future, like using jhash2 in nft_hash, removing the masking in
nft_cmp_fast, optimized data comparison using 32 bit word size etc.
These are not done so far however.
The patches are split up as follows:
* the first five patches add length validation to register loads and
stores to make sure we stay within bounds and prepare the validation
functions for the new addressing mode
* the next patches prepare for changing to 32 bit addressing by
introducing a struct nft_regs, which holds the verdict register as
well as the data registers. The verdict members are moved to a new
struct nft_verdict to allow to pull struct nft_data out of the stack.
* the next patches contain preparatory conversions of expressions and
sets to use 32 bit addressing
* the next patch introduces so far unused register conversion helpers
for parsing and dumping register numbers over netlink
* following is the real conversion to 32 bit addressing, consisting of
replacing struct nft_data in struct nft_regs by an array of u32s and
actually translating and validating the new register numbers.
* the final two patches add support for variable sized data items and
variable sized keys / data in set elements
The patches have been verified to work correctly with nft binaries using
both old and new addressing.
====================
Patrick McHardy says:
====================
netfilter: nf_tables: dynamic stateful expression instantiation
The following patches are the grand finale of my nf_tables set work,
using all the building blocks put in place by the previous patches
to support something like iptables hashlimit, but a lot more powerful.
Sets are extended to allow attaching expressions to set elements.
The dynset expression dynamically instantiates these expressions
based on a template when creating new set elements and evaluates
them for all new or updated set members.
In combination with concatenations this effectively creates state
tables for arbitrary combinations of keys, using the existing
expression types to maintain that state. Regular set GC takes care
of purging expired states.
We currently support two different stateful expressions, counter
and limit. Using limit as a template we can express the functionality
of hashlimit, but completely unrestricted in the combination of keys.
Using counter we can perform accounting for arbitrary flows.
The following examples from patch 5/5 show some possibilities.
Userspace syntax is still WIP, especially the listing of state
tables will most likely be seperated from normal set listings
and use a more structured format:
1. Limit the rate of new SSH connections per host, similar to iptables
hashlimit:
flow ip saddr timeout 60s \
limit 10/second \
accept
2. Account network traffic between each set of /24 networks:
flow ip saddr & 255.255.255.0 . ip daddr & 255.255.255.0 \
counter
3. Account traffic to each host per user:
flow skuid . ip daddr \
counter
4. Account traffic for each combination of source address and TCP flags:
flow ip saddr . tcp flags \
counter
The resulting set content after a Xmas-scan look like this:
{
192.168.122.1 . fin | psh | urg : counter packets 1001 bytes 40040,
192.168.122.1 . ack : counter packets 74 bytes 3848,
192.168.122.1 . psh | ack : counter packets 35 bytes 3144
}
In the future the "expressions attached to elements" will be extended
to also support user created non-stateful expressions to allow to
efficiently select beween a set of parameter sets, f.i. a set of log
statements with different prefixes based on the interface, which currently
require one rule each. This will most likely have to wait until the next
kernel version though.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Al Viro says:
====================
netdev-related stuff in vfs.git
There are several commits sitting in vfs.git that probably ought to go in
via net-next.git. First of all, there's merge with vfs.git#iocb - that's
Christoph's aio rework, which has triggered conflicts with the ->sendmsg()
and ->recvmsg() patches a while ago. It's not so much Christoph's stuff
that ought to be in net-next, as (pretty simple) conflict resolution on merge.
The next chunk is switch to {compat_,}import_iovec/import_single_range - new
safer primitives for initializing iov_iter. The primitives themselves come
from vfs/git#iov_iter (and they are used quite a lot in vfs part of queue),
conversion of net/socket.c syscalls belongs in net-next, IMO. Next there's
afs and rxrpc stuff from dhowells. And then there's sanitizing kernel_sendmsg
et.al. + missing inlined helper for "how much data is left in msg->msg_iter" -
this stuff is used in e.g. cifs stuff, but it belongs in net-next.
That pile is pullable from
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-davem
I'll post the individual patches in there in followups; could you take a look
and tell if everything in there is OK with you?
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Using a timer wheel for timewait sockets was nice ~15 years ago when
memory was expensive and machines had a single processor.
This does not scale, code is ugly and source of huge latencies
(Typically 30 ms have been seen, cpus spinning on death_lock spinlock.)
We can afford to use an extra 64 bytes per timewait sock and spread
timewait load to all cpus to have better behavior.
Tested:
On following test, /proc/sys/net/ipv4/tcp_tw_recycle is set to 1
on the target (lpaa24)
Before patch :
lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
419594
lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
437171
While test is running, we can observe 25 or even 33 ms latencies.
lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20601ms
rtt min/avg/max/mdev = 0.020/0.217/25.771/1.535 ms, pipe 2
lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20702ms
rtt min/avg/max/mdev = 0.019/0.183/33.761/1.441 ms, pipe 2
After patch :
About 90% increase of throughput :
lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
810442
lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
800992
And latencies are kept to minimal values during this load, even
if network utilization is 90% higher :
lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 19991ms
rtt min/avg/max/mdev = 0.023/0.064/0.360/0.042 ms
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a flag to mark stateful expressions.
This is used for dynamic expression instanstiation to limit the usable
expressions. Strictly speaking only the dynset expression can not be
used in order to avoid recursion, but since dynamically instantiating
non-stateful expressions will simply create an identical copy, which
behaves no differently than the original, this limits to expressions
where it actually makes sense to dynamically instantiate them.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Preparation to attach expressions to set elements: add a set extension
type to hold an expression and dump the expression information with the
set element.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add helper functions for initializing, cloning, dumping and destroying
a single expression that is not part of a rule.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch changes sets to support variable sized set element keys / data
up to 64 bytes each by using variable sized set extensions. This allows
to use concatenations with bigger data items suchs as IPv6 addresses.
As a side effect, small keys/data now don't require the full 16 bytes
of struct nft_data anymore but just the space they need.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add a size argument to nft_data_init() and pass in the available space.
This will be used by the following patches to support variable sized
set element data.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Switch the nf_tables registers from 128 bit addressing to 32 bit
addressing to support so called concatenations, where multiple values
can be concatenated over multiple registers for O(1) exact matches of
multiple dimensions using sets.
The old register values are mapped to areas of 128 bits for compatibility.
When dumping register numbers, values are expressed using the old values
if they refer to the beginning of a 128 bit area for compatibility.
To support concatenations, register loads of less than a full 32 bit
value need to be padded. This mainly affects the payload and exthdr
expressions, which both unconditionally zero the last word before
copying the data.
Userspace fully passes the testsuite using both old and new register
addressing.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add helper functions to parse and dump register values in netlink attributes.
These helpers will later be changed to take care of translation between the
old 128 bit and the new 32 bit register numbers.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Simple conversion to use u32 pointers to the beginning of the data
area to keep follow up patches smaller.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Only needlessly complicates things due to requiring specific argument
types. Use memcmp directly.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Replace the array of registers passed to expressions by a struct nft_regs,
containing the verdict as a seperate member, which aliases to the
NFT_REG_VERDICT register.
This is needed to seperate the verdict from the data registers completely,
so their size can be changed.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Change nft_validate_input_register() to not only validate the input
register number, but also the length of the load, and rename it to
nft_validate_register_load() to reflect that change.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
All users of nft_validate_register_store() first invoke
nft_validate_output_register(). There is in fact no use for using it
on its own, so simplify the code by folding the functionality into
nft_validate_register_store() and kill it.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The existing name is ambiguous, data is loaded as well when we read from
a register. Rename to nft_validate_register_store() for clarity and
consistency with the upcoming patch to introduce its counterpart,
nft_validate_register_load().
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
For values spanning multiple registers, we need to validate that enough
space is available from the destination register onwards. Add a len
argument to nft_validate_data_load() and consolidate the existing length
validations in preparation of that.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* new mac80211 internal software queue to allow drivers to have
shorter hardware queues and pull on-demand
* use rhashtable for mac80211 station table
* minstrel rate control debug improvements and some refactoring
* fix noisy message about TX power reduction
* fix continuous message printing and activity if CRDA doesn't respond
* fix VHT-related capabilities with "iw connect" or "iwconfig ..."
* fix Kconfig for cfg80211 wireless extensions compatibility
-----BEGIN PGP SIGNATURE-----
iQIcBAABCAAGBQJVJ7CPAAoJEDBSmw7B7bqr8+IQAKCAbUyd6PFRT5tcz9kW5GCW
/ibb+n1e14yWKgNEe1gddUGKG/L3HGCBXNkCYzR2M8mlL7dLPqspaBcGHS4dx8F4
D0AuikqvtXIxfAXi0zmU2uo7rH7u2X34R2LtS8AlKByD+jmFvxMiPPvxNFgzJu/7
63UQm73p2pnu/KdXLW1OQEcpZtZJ9+N/uBiq9zbVdX3A8T84ME0oMyy+EAQqCZdK
CcsTXHCnAgmmXWJlu1JRdopr1bd38mSGB70eXduFtPqDdmtQRnoaCQ9e+tJDA4j4
svEw0yDmsc4WG1EKLKKCRd3uFOZsng+lcXrHfpm5wlSPpCOItfQ9BzT3x1u6Y5JU
Z1WMOMkkEce+95U7/RLoXwC/2RS3XelUXTde4cGIRMvO5drOrU58P0gdn3J+yKbv
6v+2GGKy/39tdXUOxIl3EZT/huIl+h1UNO8C2hyaEwdXK+X1zl31/u6kk1Ns18Wr
YPEJixxHx0zR8jaZgDC7OlWLuqn4Ay+Ls9yCyIesdHzKpizJKqn83PntYnpJmxoA
9hlIyRDWnqH44KxzB85ni1C2Qudec3mcCWIWV7M+UoSC1Cgs/LxDzH7kRejR2ZIl
vRhg5pqyr53L0h2lq5DO4Cj4UzbXb7YioKJRxjyKloNOlRrCZtK/VEsHbdsKEcIp
d/wHj1AyFZeQfuhk8Qqr
=mtuo
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2015-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
There isn't much left, but we have
* new mac80211 internal software queue to allow drivers to have
shorter hardware queues and pull on-demand
* use rhashtable for mac80211 station table
* minstrel rate control debug improvements and some refactoring
* fix noisy message about TX power reduction
* fix continuous message printing and activity if CRDA doesn't respond
* fix VHT-related capabilities with "iw connect" or "iwconfig ..."
* fix Kconfig for cfg80211 wireless extensions compatibility
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2015-04-09
We've had enough new patches during the past week (especially from
Marcel) that it'd be good to still get these queued for 4.1.
The majority of the changes are from Marcel with lots of cleanup &
refactoring patches for the HCI UART driver. Marcel also split out some
Broadcom & Intel vendor specific functionality into two new btintel &
btbcm modules.
In addition to the HCI driver changes there's the completion of our
local OOB data interface for pairing, added support for requesting
remote LE features when connecting, as well as a couple of minor fixes
for mac802154.
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next tree.
They are:
* nf_tables set timeout infrastructure from Patrick Mchardy.
1) Add support for set timeout support.
2) Add support for set element timeouts using the new set extension
infrastructure.
4) Add garbage collection helper functions to get rid of stale elements.
Elements are accumulated in a batch that are asynchronously released
via RCU when the batch is full.
5) Add garbage collection synchronization helpers. This introduces a new
element busy bit to address concurrent access from the netlink API and the
garbage collector.
5) Add timeout support for the nft_hash set implementation. The garbage
collector peridically checks for stale elements from the workqueue.
* iptables/nftables cgroup fixes:
6) Ignore non full-socket objects from the input path, otherwise cgroup
match may crash, from Daniel Borkmann.
7) Fix cgroup in nf_tables.
8) Save some cycles from xt_socket by skipping packet header parsing when
skb->sk is already set because of early demux. Also from Daniel.
* br_netfilter updates from Florian Westphal.
9) Save frag_max_size and restore it from the forward path too.
10) Use a per-cpu area to restore the original source MAC address when traffic
is DNAT'ed.
11) Add helper functions to access physical devices.
12) Use these new physdev helper function from xt_physdev.
13) Add another nf_bridge_info_get() helper function to fetch the br_netfilter
state information.
14) Annotate original layer 2 protocol number in nf_bridge info, instead of
using kludgy flags.
15) Also annotate the pkttype mangling when the packet travels back and forth
from the IP to the bridge layer, instead of using a flag.
* More nf_tables set enhancement from Patrick:
16) Fix possible usage of set variant that doesn't support timeouts.
17) Avoid spurious "set is full" errors from Netlink API when there are pending
stale elements scheduled to be released.
18) Restrict loop checks to set maps.
19) Add support for dynamic set updates from the packet path.
20) Add support to store optional user data (eg. comments) per set element.
BTW, I have also pulled net-next into nf-next to anticipate the conflict
resolution between your okfn() signature changes and Florian's br_netfilter
updates.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Netlink attribute for the power is s8. But for the driver level
operations we are collection power level value into integer.
It has to be change to s8 from int.
Signed-off-by: Varka Bhadram <varkab@cdac.in>
Acked-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
When establishing a Bluetooth LE connection, read the remote used
features mask to determine which features are supported. This was
not really needed with Bluetooth 4.0, but since Bluetooth 4.1 and
also 4.2 have introduced new optional features, this becomes more
important.
This works the same as with BR/EDR where the connection enters the
BT_CONFIG stage and hci_connect_cfm call is delayed until the remote
features have been retrieved. Only after successfully receiving the
remote features, the connection enters the BT_CONNECTED state.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
trivial conflict in net/socket.c and non-trivial one in crypto -
that one had evaded aio_complete() removal.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This is the NFC pull request for 4.1.
This is a shorter one than usual, as the Intel Field Peak NFC
driver could not make it in time.
We have:
- A new driver for NXP NCI based chipsets, like e.g. the NPC100 or
the PN7150. It currently only supports an i2c physical layer, but
could easily be extended to work on top of e.g. SPI.
This driver also includes support for user space triggered firmware
updates.
- A few minor st21nfc[ab] fixes, cleanups, and comments improvements.
- A pn533 error return fix.
- A few NFC related logs formatting cleanups.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVJV81AAoJEIqAPN1PVmxKM3sP/R6fKXMxtVxXsiPzBMk+SpBI
onNXbCx27rp1lHTFznE16JAaaSfcQEwSQmYx7xa6KXqRYVlDfRfC+5R+rPGrCjfH
kV/eLBNnYpmPw0ViVT7dsWK0b4me0k5pr9ki9mze3YxvuMbA5vdv0tvuRFz5IRF/
hl+WI5pntGuRtnIyBKIauAMylylUYVvCBGhmHnveiX0Dp8noJLBSV0wvdzujm51S
+Uio/jHlUEV2lotrQBOfWNtEkwonXSwzZWSzimBCyEGLAwTx4lGXHmQftOtPd3zE
sOT7Gw77ZCsulHoHiJyC0KpDSDS0NYVrtTI5BiuGgAivGi0YZw2XD3CYRIdy0m2Z
lMoodYdiCgsxmrU6I6Rd/7DQBxD0Vhc+CGAyk41f7AwU4Fwq105kpSLupTtU+NzT
ZpdvSXeirU8sxt+3WDOgv8esyYGZVVD8GuBbofMZQZ0vLq+FcDpYAW4w3LKvvi6X
C+WN8f7SCI0kqpd4leyl6EG3SoQKFyPWobu0IlV520R9b76iBcyqooTIpvVa4Yk7
az6fKhi9gK/T6FHW78y6fnkczd47JKrC924m+g6P/hhTD5zQ956ferp0uFTjkPtF
8kNgRT7OIRi1JO783cnQx1uA61axC3GX1HFzsD9paVkTwRNtJ3eFsxtcYt7Nfv8D
WGNWRQp5LmBsD/SqFBfk
=MzOJ
-----END PGP SIGNATURE-----
Merge tag 'nfc-next-4.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next
Samuel Ortiz says:
====================
NFC: 4.1 pull request
This is the NFC pull request for 4.1.
This is a shorter one than usual, as the Intel Field Peak NFC
driver could not make it in time.
We have:
- A new driver for NXP NCI based chipsets, like e.g. the NPC100 or
the PN7150. It currently only supports an i2c physical layer, but
could easily be extended to work on top of e.g. SPI.
This driver also includes support for user space triggered firmware
updates.
- A few minor st21nfc[ab] fixes, cleanups, and comments improvements.
- A pn533 error return fix.
- A few NFC related logs formatting cleanups.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an userdata set extension and allow the user to attach arbitrary
data to set elements. This is intended to hold TLV encoded data like
comments or DNS annotations that have no meaning to the kernel.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add a new "dynset" expression for dynamic set updates.
A new set op ->update() is added which, for non existant elements,
invokes an initialization callback and inserts the new element.
For both new or existing elements the extenstion pointer is returned
to the caller to optionally perform timer updates or other actions.
Element removal is not supported so far, however that seems to be a
rather exotic need and can be added later on.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently a set binding is assumed to be related to a lookup and, in
case of maps, a data load.
In order to use bindings for set updates, the loop detection checks
must be restricted to map operations only. Add a flags member to the
binding struct to hold the set "action" flags such as NFT_SET_MAP,
and perform loop detection based on these.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use atomic operations for the element count to avoid races with async
updates.
To properly handle the transactional semantics during netlink updates,
deleted but not yet committed elements are accounted for seperately and
are treated as being already removed. This means for the duration of
a netlink transaction, the limit might be exceeded by the amount of
elements deleted. Set implementations must be prepared to handle this.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fast Open has been using an experimental option with a magic number
(RFC6994). This patch makes the client by default use the RFC7413
option (34) to get and send Fast Open cookies. This patch makes
the client solicit cookies from a given server first with the
RFC7413 option. If that fails to elicit a cookie, then it tries
the RFC6994 experimental option. If that also fails, it uses the
RFC7413 option on all subsequent connect attempts. If the server
returns a Fast Open cookie then the client caches the form of the
option that successfully elicited a cookie, and uses that form on
later connects when it presents that cookie.
The idea is to gradually obsolete the use of experimental options as
the servers and clients upgrade, while keeping the interoperability
meanwhile.
Signed-off-by: Daniel Lee <Longinus00@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fast Open has been using the experimental option with a magic number
(RFC6994) to request and grant Fast Open cookies. This patch enables
the server to support the official IANA option 34 in RFC7413 in
addition.
The change has passed all existing Fast Open tests with both
old and new options at Google.
Signed-off-by: Daniel Lee <Longinus00@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since Bluetooth 4.1 there are two additional values for SSP OOB data,
namely C-256 and R-256. This patch updates the EIR definitions to take
into account both the 192 and 256 bit variants of C and R.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
That was we can make sure the output path of ipv4/ipv6 operate on
the UDP socket rather than whatever random thing happens to be in
skb->sk.
Based upon a patch by Jiri Pirko.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
On the output paths in particular, we have to sometimes deal with two
socket contexts. First, and usually skb->sk, is the local socket that
generated the frame.
And second, is potentially the socket used to control a tunneling
socket, such as one the encapsulates using UDP.
We do not want to disassociate skb->sk when encapsulating in order
to fix this, because that would break socket memory accounting.
The most extreme case where this can cause huge problems is an
AF_PACKET socket transmitting over a vxlan device. We hit code
paths doing checks that assume they are dealing with an ipv4
socket, but are actually operating upon the AF_PACKET one.
Signed-off-by: David S. Miller <davem@davemloft.net>
The hci_recv_stream_fragment function should have never been introduced
in the first place. The Bluetooth core does not need to know anything
about the HCI transport protocol.
With all transport protocol specific detailed moved back into the
drivers where they belong (mainly generic USB and UART drivers), this
function can now be removed.
This reduces the size of hci_dev structure and also removes an exported
symbol from the Bluetooth core module.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
The data pointer provided to hci_recv_stream_fragment function should
have been marked const. The function has no business in modifying the
original data. So fix this now.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Johan Hedberg says:
====================
pull request: bluetooth-next 2015-04-04
Here's what's probably the last bluetooth-next pull request for 4.1:
- Fixes for LE advertising data & advertising parameters
- Fix for race condition with HCI_RESET flag
- New BNEPGETSUPPFEAT ioctl, needed for certification
- New HCI request callback type to get the resulting skb
- Cleanups to use BIT() macro wherever possible
- Consolidate Broadcom device entries in the btusb HCI driver
- Check for valid flags in CMTP, HIDP & BNEP
- Disallow local privacy & OOB data combo to prevent a potential race
- Expose SMP & ECDH selftest results through debugfs
- Expose current Device ID info through debugfs
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
As the next patch will require the IE splitting utility functions
in cfg80211, move them there from mac80211.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Conflicts:
drivers/net/ethernet/mellanox/mlx4/cmd.c
net/core/fib_rules.c
net/ipv4/fib_frontend.c
The fib_rules.c and fib_frontend.c conflicts were locking adjustments
in 'net' overlapping addition and removal of code in 'net-next'.
The mlx4 conflict was a bug fix in 'net' happening in the same
place a constant was being replaced with a more suitable macro.
Signed-off-by: David S. Miller <davem@davemloft.net>
We should not consult skb->sk for output decisions in xmit recursion
levels > 0 in the stack. Otherwise local socket settings could influence
the result of e.g. tunnel encapsulation process.
ipv6 does not conform with this in three places:
1) ip6_fragment: we do consult ipv6_npinfo for frag_size
2) sk_mc_loop in ipv6 uses skb->sk and checks if we should
loop the packet back to the local socket
3) ip6_skb_dst_mtu could query the settings from the user socket and
force a wrong MTU
Furthermore:
In sk_mc_loop we could potentially land in WARN_ON(1) if we use a
PF_PACKET socket ontop of an IPv6-backed vxlan device.
Reuse xmit_recursion as we are currently only interested in protecting
tunnel devices.
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to specification etsi 102 622 chapter 4.4 pipes
identifier is 7 bits long giving a 127 possible pipes value.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
According to etsi 102 622 chapter 11.2.2.4 EVT_TRANSACTION,
the nfc_evt_transaction parameters can be 0 up to 255 byte long.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
According to specification etsi 102 622 chapter 4.4 pipes identifier
is 7 bits long giving a 127 possible pipes value.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
That way we don't have to reinstantiate another nf_hook_state
on the stack of the nf_reinject() path.
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't use dev->iflink anymore.
CC: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't use dev->iflink anymore.
CC: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that there's a HCI request API available where the callback receives
the resulting skb, we can convert the local OOB data reading to use this
new API. This patch does the necessary update in mgmt.c (which also
requires moving the callback higher up since it's now a static function)
and removes the custom calls from hci_event.c that are no-longer
necessary.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The hci_req_pending() function has no users anymore, so simply remove
it.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Now that the synchronous HCI requests use the new API and a new private
variable the recv_evt member of hci_dev is no-longer needed. This patch
removes it.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Now that there's an API in place that allows passing the resulting skb
to the request callback we can conveniently convert the hci_req_sync and
related functions to use it. Since we still need to get the skb from the
async callback into the sleeping _sync() function the patch adds another
req_skb variable to hci_dev where the sync request state is tracked.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch adds a second possible callback for HCI requests where the
callback will receive the full skb of the last successfully completed
HCI command. This API is useful for cases where we want to use a request
to read some data and the existing hci_event.c handlers do not store it
e.g. in the hci_dev struct.
The reason the patch is a bit bigger than just adding the new API is
because the hci_req_cmd_complete() functions required some refactoring
to enable it: now hci_req_cmd_complete() is simply used to request the
callback pointers if any, and the actual calling of them happens from a
single place at the end of hci_event_packet(). The reason for this is
that we need to pass the original skb (without any skb_pull, etc
modifications done to it) and it's simplest to keep track of it within
the hci_event_packet() function.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This allows drivers to request per-vif and per-sta-tid queues from which
they can pull frames. This makes it easier to keep the hardware queues
short, and to improve fairness between clients and vifs.
The task of scheduling packet transmission is left up to the driver -
queueing is controlled by mac80211. Drivers can only dequeue packets by
calling ieee80211_tx_dequeue. This makes it possible to add active queue
management later without changing drivers using this code.
This can also be used as a starting point to implement A-MSDU
aggregation in a way that does not add artificially induced latency.
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
[resolved minor context conflict, minor changes, endian annotations]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add support for element timeouts to nft_hash. The lookup and walking
functions are changed to ignore timed out elements, a periodic garbage
collection task cleans out expired entries.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
GC is expected to happen asynchrously to the netlink interface. In the
netlink path, both insertion and removal of elements consist of two
steps, insertion followed by activation or deactivation followed by
removal, during which the element must not be freed by GC.
The synchronization helpers use an unused bit in the genmask field to
atomically mark an element as "busy", meaning it is either currently
being handled through the netlink API or by GC.
Elements being processed by GC will never survive, netlink will simply
ignore them. Elements being currently processed through netlink will be
skipped by GC and reprocessed during the next run.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add helpers for GC batch destruction: since element destruction needs
a RCU grace period for all set implementations, add some helper functions
for asynchronous batch destruction. Elements are collected in a batch
structure, which is asynchronously released using RCU once its full.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add API support for set element timeouts. Elements can have a individual
timeout value specified, overriding the sets' default.
Two new extension types are used for timeouts - the timeout value and
the expiration time. The timeout value only exists if it differs from
the default value.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add set timeout support to the netlink API. Sets with timeout support
enabled can have a default timeout value and garbage collection interval
specified.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
of small fixes, cleanups and internal features we have:
* VHT support for TDLS and IBSS (conditional on drivers though)
* first TX performance improvements (the biggest will come later)
* many suspend/resume (race) fixes
* name_assign_type support from Tom Gundersen
-----BEGIN PGP SIGNATURE-----
iQIcBAABCAAGBQJVGWlNAAoJEDBSmw7B7bqr5s8P/R9+Q6y4Ixice9dDOJYynl/d
dMEUEfCBWUyDaQD1bNQIED2mc0sM5+Ax8OVIVx9fdrLGPxaISBqDJKE1USoTNZzm
q+U3dM4Q45SfQSsgaY4FtxTlPWPtUKsNMXY/CxLR+IqVeA3+30rX+hv1f3SAqBj0
68IwW/uUIHb71IZ+hz2Mwudt4JVW8KRg9VlZ0UY6EEvC4m5QD2YkwQQo/hJ2WF+/
wAJbku02L/Vy4J7E6hNcKYWXokht4fVYphjl/1ZDd/+8L8SUv9mC88n1Jzxa428p
1PmbtwzbpOrtTcC2BCPDA04IyfMc7k9DlLKw/h2KLPbHZXheD9eVmo/Am5vz+uH6
926f+FK339SzoJnZ5wBBDiZ8W8TLYNc8ImxtcxjnrtGfr1CKiuh23P1CWyOlKJCO
BYFJqkCOqWOHYnN0embaj7JqM/LmQI5ZoBZHZhD2KQXIXpTsjjIMPfJvip5D+tsV
+iXIlQwdeK6rbjxdonBgn7n57XIeSVMAYeyDCbzIShfibjHbUZPn+RsZCtv8RWv8
EaZu8PerU5ZDKwdX940+lWrtf/TMDJBYQpAIBRuiZK4DTNWCt3BrDlvb1FXGgA+X
vQJnr32vjJ/pLDxNLHQlkKWC4I/CYtG47OgcJN9AQXrig1zApd+C29zy3aqch3ea
wxV9dFfheTqZFjtZfSsH
=O/cf
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2015-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
Lots of updates for net-next; along with the usual flurry
of small fixes, cleanups and internal features we have:
* VHT support for TDLS and IBSS (conditional on drivers though)
* first TX performance improvements (the biggest will come later)
* many suspend/resume (race) fixes
* name_assign_type support from Tom Gundersen
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Those are counterparts to nla_put_in_addr and nla_put_in6_addr.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IP addresses are often stored in netlink attributes. Add generic functions
to do that.
For nla_put_in_addr, it would be nicer to pass struct in_addr but this is
not used universally throughout the kernel, in way too many places __be32 is
used to store IPv4 address.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In many places, the a6 field is typecasted to struct in6_addr. As the
fields are in union anyway, just add in6_addr type to the union and
get rid of the typecasting.
Modifying the uapi header is okay, the union has still the same size.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In many places, the a6 field is typecasted to struct in6_addr. As the
fields are in union anyway, just add in6_addr type to the union and get rid
of the typecasting.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2015-03-27
Here's another set of Bluetooth & 802.15.4 patches for 4.1:
- New API to control LE advertising data (i.e. peripheral role)
- mac802154 & at86rf230 cleanups
- Support for toggling quirks from debugfs (useful for testing)
- Memory leak fix for LE scanning
- Extra version info reading support for Broadcom controllers
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to shrink the size of bt_skb_cb, this patch moves the HCI
request related variables into their own req_ctrl struct. Additionall
the L2CAP and HCI request structs are placed inside the same union since
they will never be used at the same time for the same skb.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
We're getting very close to the maximum possible size of bt_skb_cb. To
prepare to shrink the struct with the help of a union this patch moves
all L2CAP related variables into the l2cap_ctrl struct. To later add
other 'ctrl' structs the L2CAP one is renamed simple 'l2cap' instead
of 'control'.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Indicating just the peer's capability is fairly pointless
if the local device doesn't support it. Make the variable
track both combined, and remove the 'local support' check
in the TX path.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This will expose in /sys whether the ifname of a device is set by
userspace or generated by the kernel. The latter kind (wlanX, etc)
is not deterministic, so userspace needs to rename these devices
to names that are guaranteed to stay the same between reboots. The
former, however should never be renamed, so userspace needs to be
able to reliably tell the difference.
Similar functionality was introduced for the rtnetlink core in
commit 5517750f05 ("net: rtnetlink - make create_link take name_assign_type")
Signed-off-by: Tom Gundersen <teg@jklm.no>
Cc: Kalle Valo <kvalo@qca.qualcomm.com>
Cc: Brett Rudley <brudley@broadcom.com>
Cc: Arend van Spriel <arend@broadcom.com>
Cc: Franky (Zhenhui) Lin <frankyl@broadcom.com>
Cc: Hante Meuleman <meuleman@broadcom.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
[reformat changelog to fit 72 cols]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Seems Broadcom TDLS peers (Nexus 5, Xperia Z3) refuse to allow TDLS
connection when channel-switching is supported but the regulatory
classes IE is missing from the setup request.
Add a chandef to reg-class translation function to cfg80211 and use it
to add the required IE during setup. For now add only the current
regulatory class as supported - it is enough to resolve the
compatibility issue.
Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This can allow the driver to take action based on the reason
of the deauth.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This can allow the driver to take action based on the
success / failure of the association.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This can allow the driver to take action based on the
success / failure of the authentication.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We will be able to add more events, such as MLME events and
others. The low level driver may be interested in knowing
about these events to dump firmware data upon failures, or
to change parameters in case connection attempts fail etc...
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Provide callbacks for ndo_fdb_add, ndo_fdb_del, and ndo_fdb_dump.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next tree.
Basically, nf_tables updates to add the set extension infrastructure and finish
the transaction for sets from Patrick McHardy. More specifically, they are:
1) Move netns to basechain and use recently added possible_net_t, from
Patrick McHardy.
2) Use LOGLEVEL_<FOO> from nf_log infrastructure, from Joe Perches.
3) Restore nf_log_trace that was accidentally removed during conflict
resolution.
4) nft_queue does not depend on NETFILTER_XTABLES, starting from here
all patches from Patrick McHardy.
5) Use raw_smp_processor_id() in nft_meta.
Then, several patches to prepare ground for the new set extension
infrastructure:
6) Pass object length to the hash callback in rhashtable as needed by
the new set extension infrastructure.
7) Cleanup patch to restore struct nft_hash as wrapper for struct
rhashtable
8) Another small source code readability cleanup for nft_hash.
9) Convert nft_hash to rhashtable callbacks.
And finally...
10) Add the new set extension infrastructure.
11) Convert the nft_hash and nft_rbtree sets to use it.
12) Batch set element release to avoid several RCU grace period in a row
and add new function nft_set_elem_destroy() to consolidate set element
release.
13) Return the set extension data area from nft_lookup.
14) Refactor existing transaction code to add some helper functions
and document it.
15) Complete the set transaction support, using similar approach to what we
already use, to activate/deactivate elements in an atomic fashion.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 1fb6f159fd ("tcp: add tcp_conn_request"),
tcp_syn_flood_action() is no longer used from IPv6.
We can make it static, by moving it above tcp_conn_request()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Set elements are the last object type not supporting transaction support.
Implement similar to the existing rule transactions:
The global transaction counter keeps track of two generations, current
and next. Each element contains a bitmask specifying in which generations
it is inactive.
New elements start out as inactive in the current generation and active
in the next. On commit, the previous next generation becomes the current
generation and the element becomes active. The bitmask is then cleared
to indicate that the element is active in all future generations. If the
transaction is aborted, the element is removed from the set before it
becomes active.
When removing an element, it gets marked as inactive in the next generation.
On commit the next generation becomes active and the therefor the element
inactive. It is then taken out of then set and released. On abort, the
element is marked as active for the next generation again.
Lookups ignore elements not active in the current generation.
The current set types (hash/rbtree) both use a field in the extension area
to store the generation mask. This (currently) does not require any
additional memory since we have some free space in there.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add some helper functions for building the genmask as preparation for
set transactions.
Also add a little documentation how this stuff actually works.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Return the extension area from the ->lookup() function to allow to
consolidate common actions.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
With the conversion to set extensions, it is now possible to consolidate
the different set element destruction functions.
The set implementations' ->remove() functions are changed to only take
the element out of their internal data structures. Elements will be freed
in a batched fashion after the global transaction's completion RCU grace
period.
This reduces the amount of grace periods required for nft_hash from N
to zero additional ones, additionally this guarantees that the set
elements' extensions of all implementations can be used under RCU
protection.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
A simple forward for firmware download (i.e. sending a new firmware
to the NFC adapter) from the NFC subsystem to the drivers.
This feature is required to update the firmware of NXP-NCI NFC
controllers but can be used by any NCI driver.
This feature has been present in the HCI subsystem since 9a695d.
Signed-off-by: Clément Perrochaud <clement.perrochaud@effinnov.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
This patch adds macro definitions for possible advertising instance
flags that can be passed to the "Add Advertising" command.
Signed-off-by: Arman Uguray <armansito@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
struct kiocb now is a generic I/O container, so move it to fs.h.
Also do a #include diet for aio.h while we're at it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
As namespaces are sometimes used with overlapping ip address ranges,
we should also use the namespace as input to the hash to select the ip
fragmentation counter bucket.
Cc: Eric Dumazet <edumazet@google.com>
Cc: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
As namespaces are sometimes used with overlapping ip address ranges,
we should also use the namespace as input to the hash to select the ip
fragmentation counter bucket.
Cc: Eric Dumazet <edumazet@google.com>
Cc: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The set implementations' private struct will only contain the elements
needed to maintain the search structure, all other elements are moved
to the set extensions.
Element allocation and initialization is performed centrally by
nf_tables_api instead of by the different set implementations'
->insert() functions. A new "elemsize" member in the set ops specifies
the amount of memory to reserve for internal usage. Destruction
will also be moved out of the set implementations by a following patch.
Except for element allocation, the patch is a simple conversion to
using data from the extension area.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add simple set extension infrastructure for maintaining variable sized
and optional per element data.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Move the declaration for external variables to sctp.h file avoiding
to repeatedly declare them with extern keyword.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The network namespace is only needed for base chains to get at the
gencursor. Also convert to possible_net_t.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
With request socks convergence, we no longer need
different lookup methods. A request socket can
use generic lookup function.
Add const qualifier to 2nd tcp_v[46]_md5_lookup() parameter.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since request and established sockets now have same base,
there is no need to pass two pointers to tcp_v4_md5_hash_skb()
or tcp_v6_md5_hash_skb()
Also add a const qualifier to their struct tcp_md5sig_key argument.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/netfilter/nf_tables_core.c
The nf_tables_core.c conflict was resolved using a conflict resolution
from Stephen Rothwell as a guide.
Signed-off-by: David S. Miller <davem@davemloft.net>
This is specified by RFC 7217.
Cc: Erik Kline <ek@google.com>
Cc: Fernando Gont <fgont@si6networks.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: YOSHIFUJI Hideaki/吉藤英明 <hideaki.yoshifuji@miraclelinux.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a DAD conflict is detected, we want to retry privacy stable address
generation up to idgen_retries (= 3) times with a delay of idgen_delay
(= 1 second). Add the logic to addrconf_dad_failure.
By design, we don't clean up dad failed permanent addresses.
Cc: Erik Kline <ek@google.com>
Cc: Fernando Gont <fgont@si6networks.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: YOSHIFUJI Hideaki/吉藤英明 <hideaki.yoshifuji@miraclelinux.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Erik Kline <ek@google.com>
Cc: Fernando Gont <fgont@si6networks.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: YOSHIFUJI Hideaki/吉藤英明 <hideaki.yoshifuji@miraclelinux.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements support for the timeout parameter of the
Add Advertising command.
Signed-off-by: Arman Uguray <armansito@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch introduces a new data structure to represent advertising
instances that were added using the "Add Advertising" mgmt command.
Initially an hci_dev structure will support only one of these instances
at a time, so the current instance is simply stored as a direct member
of hci_dev.
Signed-off-by: Arman Uguray <armansito@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch introduces the HCI_ADVERTISING_INSTANCE setting, which is set
when an at least one advertising instance has been added using the
"Add Advertising" mgmt command. This patch also adds a macro definition
for the EIR_APPEARANCE field type.
Signed-off-by: Arman Uguray <armansito@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch adds definitions for the Add Advertising and Remove
Advertising MGMT commands and events.
Signed-off-by: Arman Uguray <armansito@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
tcp_v4_err() can restrict lookups to ehash table, and not to listeners.
Note this patch creates the infrastructure, but this means that ICMP
messages for request sockets are ignored until complete conversion.
New tcp_req_err() helper is exported so that we can use it in IPv6
in following patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a low hanging fruit, as we'll get rid of syn_wait_lock eventually.
We hold syn_wait_lock for such small sections, that it makes no sense to use
a read/write lock. A spin lock is simply faster.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is not needed, and req->sk_listener points to the listener anyway.
request_sock argument can be const.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree,
they are:
1) Fix missing initialization of tuple structure in nfnetlink_cthelper
to avoid mismatches when looking up to attach userspace helpers to
flows, from Ian Wilson.
2) Fix potential crash in nft_hash when we hit -EAGAIN in
nft_hash_walk(), from Herbert Xu.
3) We don't need to indicate the hook information to update the
basechain default policy in nf_tables.
4) Restore tracing over nfnetlink_log due to recent rework to
accomodate logging infrastructure into nf_tables.
5) Fix wrong IP6T_INV_PROTO check in xt_TPROXY.
6) Set IP6T_F_PROTO flag in nft_compat so we can use SYNPROXY6 and
REJECT6 from xt over nftables.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
We send unicast neighbor (ARP or NDP) solicitations ucast_probes
times in PROBE state. Zhu Yanjun reported that some implementation
does not reply against them and the entry will become FAILED, which
is undesirable.
We had been dealt with such nodes by sending multicast probes mcast_
solicit times after unicast probes in PROBE state. In 2003, I made
a change not to send them to improve compatibility with IPv6 NDP.
Let's introduce per-protocol per-interface sysctl knob "mcast_
reprobe" to configure the number of multicast (re)solicitation for
reconfirmation in PROBE state. The default is 0, since we have
been doing so for 10+ years.
Reported-by: Zhu Yanjun <Yanjun.Zhu@windriver.com>
CC: Ulf Samuelsson <ulf.samuelsson@ericsson.com>
Signed-off-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Suggested-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This work extends the "classic" BPF programmable tc action by extending
its scope also to native eBPF code!
Together with commit e2e9b6541d ("cls_bpf: add initial eBPF support
for programmable classifiers") this adds the facility to implement fully
flexible classifier and actions for tc that can be implemented in a C
subset in user space, "safely" loaded into the kernel, and being run in
native speed when JITed.
Also, since eBPF maps can be shared between eBPF programs, it offers the
possibility that cls_bpf and act_bpf can share data 1) between themselves
and 2) between user space applications. That means that, f.e. customized
runtime statistics can be collected in user space, but also more importantly
classifier and action behaviour could be altered based on map input from
the user space application.
For the remaining details on the workflow and integration, see the cls_bpf
commit e2e9b6541d. Preliminary iproute2 part can be found under [1].
[1] http://git.breakpoint.cc/cgit/dborkman/iproute2.git/log/?h=ebpf-act
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/emulex/benet/be_main.c
net/core/sysctl_net_core.c
net/ipv4/inet_diag.c
The be_main.c conflict resolution was really tricky. The conflict
hunks generated by GIT were very unhelpful, to say the least. It
split functions in half and moved them around, when the real actual
conflict only existed solely inside of one function, that being
be_map_pci_bars().
So instead, to resolve this, I checked out be_main.c from the top
of net-next, then I applied the be_main.c changes from 'net' since
the last time I merged. And this worked beautifully.
The inet_diag.c and sysctl_net_core.c conflicts were simple
overlapping changes, and were easily to resolve.
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_ack_backlog & sk_max_ack_backlog were 16bit fields, meaning
listen() backlog was limited to 65535.
It is time to increase the width to allow much bigger backlog,
if admins change /proc/sys/net/core/somaxconn &
/proc/sys/net/ipv4/tcp_max_syn_backlog default values.
Tested:
echo 5000000 >/proc/sys/net/core/somaxconn
echo 5000000 >/proc/sys/net/ipv4/tcp_max_syn_backlog
Ran a SYNFLOOD test against a listener using listen(fd, 5000000)
myhost~# grep request_sock_TCP /proc/slabinfo
request_sock_TCP 4185642 4411940 304 13 1 : tunables 54 27 8 : slabdata 339380 339380 0
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
One of the major issue for TCP is the SYNACK rtx handling,
done by inet_csk_reqsk_queue_prune(), fired by the keepalive
timer of a TCP_LISTEN socket.
This function runs for awful long times, with socket lock held,
meaning that other cpus needing this lock have to spin for hundred of ms.
SYNACK are sent in huge bursts, likely to cause severe drops anyway.
This model was OK 15 years ago when memory was very tight.
We now can afford to have a timer per request sock.
Timer invocations no longer need to lock the listener,
and can be run from all cpus in parallel.
With following patch increasing somaxconn width to 32 bits,
I tested a listener with more than 4 million active request sockets,
and a steady SYNFLOOD of ~200,000 SYN per second.
Host was sending ~830,000 SYNACK per second.
This is ~100 times more what we could achieve before this patch.
Later, we will get rid of the listener hash and use ehash instead.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When request sock are put in ehash table, the whole notion
of having a previous request to update dl_next is pointless.
Also, following patch will get rid of big purge timer,
so we want to delete a request sock without holding listener lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2015-03-19
This wont the last 4.1 bluetooth-next pull request, but we've piled up
enough patches in less than a week that I wanted to save you from a
single huge "last-minute" pull somewhere closer to the merge window.
The main changes are:
- Simultaneous LE & BR/EDR discovery support for HW that can do it
- Complete LE OOB pairing support
- More fine-grained mgmt-command access control (normal user can now do
harmless read-only operations).
- Added RF power amplifier support in cc2520 ieee802154 driver
- Some cleanups/fixes in ieee802154 code
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Since fab4085 ("netfilter: log: nf_log_packet() as real unified
interface"), the loginfo structure that is passed to nf_log_packet() is
used to explicitly indicate the logger type you want to use.
This is a problem for people tracing rules through nfnetlink_log since
packets are always routed to the NF_LOG_TYPE logger after the
aforementioned patch.
We can fix this by removing the trace loginfo structures, but that still
changes the log level from 4 to 5 for tracing messages and there may be
someone relying on this outthere. So let's just introduce a new
nf_log_trace() function that restores the former behaviour.
Reported-by: Markus Kötter <koetter@rrzn.uni-hannover.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
in favor of their inner __ ones, which doesn't grab rtnl.
As these functions need to operate on a locked socket, we can't be
grabbing rtnl by then. It's too late and doing so causes reversed
locking.
So this patch:
- move rtnl handling to callers instead while already fixing some
reversed locking situations, like on vxlan and ipvs code.
- renames __ ones to not have the __ mark:
__ip_mc_{join,leave}_group -> ip_mc_{join,leave}_group
__ipv6_sock_mc_{join,drop} -> ipv6_sock_mc_{join,drop}
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to be able to use sk_ehashfn() for request socks,
we need to initialize their IPv6/IPv4 addresses.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We now always call __inet_hash_nolisten(), no need to pass it
as an argument.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can now use inet_hash() and __inet_hash() instead of private
functions.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Intent is to converge IPv4 & IPv6 inet_hash functions to
factorize code.
IPv4 sockets initialize sk_rcv_saddr and sk_v6_daddr
in this patch, thanks to new sk_daddr_set() and sk_rcv_saddr_set()
helpers.
__inet6_hash can now use sk_ehashfn() instead of a private
inet6_sk_ehashfn() and will simply use __inet_hash() in a
following patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Goal is to unify IPv4/IPv6 inet_hash handling, and use common helpers
for all kind of sockets (full sockets, timewait and request sockets)
inet_sk_ehashfn() becomes sk_ehashfn() but still only copes with IPv4
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
const qualifiers ease code review by making clear
which objects are not written in a function.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While testing last patch series, I found req sock refcounting was wrong.
We must set skc_refcnt to 1 for all request socks added in hashes,
but also on request sockets created by FastOpen or syncookies.
It is tricky because we need to defer this initialization so that
future RCU lookups do not try to take a refcount on a not yet
fully initialized request socket.
Also get rid of ireq_refcnt alias.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 13854e5a60 ("inet: add proper refcounting to request sock")
Signed-off-by: David S. Miller <davem@davemloft.net>
Once we'll be able to lookup request sockets in ehash table,
we'll need to get access to listener which created this request.
This avoid doing a lookup to find the listener, which benefits
for a more solid SO_REUSEPORT, and is needed once we no
longer queue request sock into a listener private queue.
Note that 'struct tcp_request_sock'->listener could be reduced
to a single bit, as TFO listener should match req->rsk_listener.
TFO will no longer need to hold a reference on the listener.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet_reqsk_alloc() is becoming fat and should not be inlined.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
listener socket can be used to set net pointer, and will
be later used to hold a reference on listener.
Add a const qualifier to first argument (struct request_sock_ops *),
and factorize all write_pnet(&ireq->ireq_net, sock_net(sk));
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_oow_rate_limited() is hardly used in fast path, there is
no point inlining it.
Signed-of-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This big helper is called once from tcp_conn_request(), there is no
point having it in an include. Compiler will inline it anyway.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On 64bit arches, we can save 8 bytes in inet_request_sock
by moving ir_mark to fill a hole.
While we are at it, inet_request_mark() can get a const qualifier
for listener socket.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The mgmt.c file should be reserved purely for HCI_CHANNEL_CONTROL. The
mgmt_control() function in it is already completely generic and has a
single user in hci_sock.c. This patch moves the function there and
renames it a bit more appropriately to hci_mgmt_cmd() (as it's a command
dispatcher).
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
In order to make the mgmt command handling more generic we can't have a
direct call to mgmt_init_hdev() from mgmt_control(). This patch adds a
new callback to struct hci_mgmt_chan. And sets it to point to the
mgmt_init_hdev() function for the HCI_CHANNEL_CONTROL instance.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
We'll need to have access to which HCI channel a socket is bound to, in
order to manage pending mgmt commands in clean way. This patch adds a
helper for the purpose.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Some controllers allow both LE scan and BR/EDR inquiry to run at
the same time, while others allow only one, LE SCAN or BR/EDR
inquiry at given time.
Since this is specific to each controller, add a new quirk setting
that allows drivers to tell the core wether given controller can
do both LE scan and BR/EDR inquiry at same time.
Signed-off-by: Jakub Pawlowski <jpawlowski@google.com>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
When a different user requests a new set of local out-of-band data, then
inform all previous users that the data has been updated. To limit the
scope of users, the updates are limited to previous users. If a user has
never requested out-of-band data, it will also not see the update.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Steffen Klassert says:
====================
pull request (net): ipsec 2015-03-16
1) Fix the network header offset in _decode_session6
when multiple IPv6 extension headers are present.
From Hajime Tazaki.
2) Fix an interfamily tunnel crash. We set outer mode
protocol too early and may dispatch to the wrong
address family. Move the setting of the outer mode
protocol behind the last accessing of the inner mode
to fix the crash.
3) Most callers of xfrm_lookup() expect that dst_orig
is released on error. But xfrm_lookup_route() may
need dst_orig to handle certain error cases. So
introduce a flag that tells what should be done in
case of error. From Huaibin Wang.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
reqsk_put() is the generic function that should be used
to release a refcount (and automatically call reqsk_free())
reqsk_free() might be called if refcount is known to be 0
or undefined.
refcnt is set to one in inet_csk_reqsk_queue_add()
As request socks are not yet in global ehash table,
I added temporary debugging checks in reqsk_put() and reqsk_free()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have many places where we want to check if a socket is
not a timewait or request socket. Use a helper to avoid
hard coding this.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The LE Secure Connections Confirmation Value and LE Secure Connections
Random Value contants are required for the out-of-band data and so
just define them.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
This will allow mac80211 drivers to call cfg80211 APIs with
the right handle.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The HCI_CONN_REMOTE_OOB connection flag is used to indicate if the
pairing initiator has provided out-of-band data. However since that
value is no longer used in any decision making, just remove it.
It is actually unclear what purpose the OOB data present field from
the HCI IO Capability Response event serves in the first place. If
either side provided out-of-band data, then that data will be used
for pairing.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
As discussed at netconf, introduce swdev_ops as first step to move switchdev
ops from ndo to swdev. This will keep switchdev from cluttering up ndo ops
space.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds support for the simplest possible version of Read Local OOB
Extended Data management command. It includes all mandatory fields,
but none of the actual pairing related ones.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
The OOB data requires to include LE Bluetooth Device Address and LE Role
and so add the type constants for these fields.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
This adds support for the simplest possible version of Read Advertising
Features management command. It allows basic testing of the interface.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
The flags for the management command table used manual encoding of
bits in the form of (1 << n). It is however preferred to use BIT(n)
macro instead.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Changes to the global configuration updates like settings, class of
device, name etc. can be received by every user. They are allowed to
read them in the first place so provide the updates via events as
well. Otherwise untrusted users start polling for updates and that
is not a desired behavior.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Check the required trust level of each management command with the trust
level of the management socket. If it does not match up, then return the
newly introduced permission denied error.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Some management commands are safe to be accessed from any user without
special permissions. First step for allowing access to any of these
commands from untrusted application is to mark them accordingly.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
The management interface will need access to the socket flags and so
provide a helper function for checking them.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
With the introduction of trusted socket flag for control and monitor
channels, it is now possible to use a single function for sending
packets to these sockets. And with that consolidate the handling.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Providing a global trusted flag for management control sockets provides
an easy way for identifying sockets and imposing restriction on it. For
now all management sockets are trusted since they require CAP_NET_ADMIN.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
The Read Extended Contoller Index List command can be used for
retrieving the complete list of local available controllers. This
included configured, unconfigured and also AMP controllers.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
This introduces support for using Extended Index Added and Extended
Index Removed events. These events contain the controller type and
also the hardware bus information from the driver.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
For sending Index Added, Index Removed, Unconfigured Index Added and
Unconfigured Index Removed managment events the new helper functions
allows taking into account if these events are enabled for a certain
management socket or not.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
The hci_send_to_flagged_channel helper function can be used to send
packets to all channels that have a certain HCI socket flag set.
This is especially useful for managment events that are limited to
sockets that have first enabled certain functionality. This allows
for filtering of events without confusing existing users.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
To filter out certain actions for certain HCI sockets introcuce a flags
field that allows to configure specific settings on individual sockets.
Since the hci_pinfo structure is private in hci_sock.c, provide helper
functions for setting and clearing a given flag.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Johan Hedberg says:
====================
Here's another set of Bluetooth & ieee802154 patches intended for 4.1:
- Added support for QCA ROME chipset family in the btusb driver
- at86rf230 driver fixes & cleanups
- ieee802154 cleanups
- Refactoring of Bluetooth mgmt API to allow new users
- New setting for static Bluetooth address exposed to user space
- Refactoring of hci_dev flags to remove limit of 32
- Remove unnecessary fast-connectable setting usage restrictions
- Fix behavior to be consistent when trying to pair already paired device
- Service discovery corner-case fixes
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
With the extension of hdev->dev_flags utilizing a bitmap now, the space
is no longer restricted. Merge the hdev->dbg_flags into hdev->dev_flags
to save space on 64-bit architectures. On 32-bit architectures no size
reduction happens.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
commit dfd8645ea1 wrongly assumes that VXLAN_VDI_MASK includes
eight lower order reserved bits of VNI field that are using for remote
checksum offload.
Right now, when VNI number greater then 0xffff, vxlan_udp_encap_recv()
will always return with 'bad_flag' error, reducing the usable vni range
from 0..16777215 to 0..65535. Also, it doesn't really check whether RCO
bits processed or not.
Fix it by adding new VNI mask which has all 32 bits of VNI field:
24 bits for id and 8 bits for other usage.
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The hdev->dev_flags field has outgrown itself on 32-bit systems. So
instead of hacking around it, switch to using DECLARE_BITMAP.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Instead of manually coding test_and_set_bit on hdev->dev_flags all the
time, use hci_dev_test_and_set_flag helper macro.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Instead of manually coding test_and_clear_bit on hdev->dev_flags all the
time, use hci_dev_test_and_clear_flag helper macro.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Instead of manually coding test_and_change_bit on hdev->dev_flags all the
time, use hci_dev_test_and_change_flag helper macro.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Instead of manually coding change_bit on hdev->dev_flags all the time,
use hci_dev_change_flag helper macro.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Instead of manually coding clear_bit on hdev->dev_flags all the time,
use hci_dev_clear_flag helper macro.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Instead of manually coding set_bit on hdev->dev_flags all the time,
use hci_dev_set_flag helper macro.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Instead of manually coding test_bit on hdev->dev_flags all the time,
use hci_dev_test_flag helper macro.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
The patch adds a second advertising setting that allows switching of the
controller into connectable mode independent of the global connectable
setting.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Now that all of the operations are safe on a single hash table
accross network namespaces, allocate a single global hash table
and update the code to use it.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before inserting request socks into general hash table,
fill their socket family.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sock_edemux() & sock_gen_put() should be ready to cope with request socks.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When request socks will be in ehash, they'll need to be refcounted.
This patch adds rsk_refcnt/ireq_refcnt macros, and adds
reqsk_put() function, but nothing yet use them.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to identify request sock when they'll be visible in
global ehash table.
ireq_state is an alias to req.__req_common.skc_state.
Its value is set to TCP_NEW_SYN_RECV
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP_SYN_RECV state is currently used by fast open sockets.
Initial TCP requests (the pseudo sockets created when a SYN is received)
are not yet associated to a state. They are attached to their parent,
and the parent is in TCP_LISTEN state.
This commit adds TCP_NEW_SYN_RECV state, so that we can convert
TCP stack to a different schem gradually.
This state is not exported to user space.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I forgot to update dccp_v6_conn_request() & cookie_v6_check().
They both need to set ireq->ireq_net and ireq->ir_cookie
Lets clear ireq->ir_cookie in inet_reqsk_alloc()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 33cf7c90fe ("net: add real socket cookies")
Signed-off-by: David S. Miller <davem@davemloft.net>
Having to say
> #ifdef CONFIG_NET_NS
> struct net *net;
> #endif
in structures is a little bit wordy and a little bit error prone.
Instead it is possible to say:
> typedef struct {
> #ifdef CONFIG_NET_NS
> struct net *net;
> #endif
> } possible_net_t;
And then in a header say:
> possible_net_t net;
Which is cleaner and easier to use and easier to test, as the
possible_net_t is always there no matter what the compile options.
Further this allows read_pnet and write_pnet to be functions in all
cases which is better at catching typos.
This change adds possible_net_t, updates the definitions of read_pnet
and write_pnet, updates optional struct net * variables that
write_pnet uses on to have the type possible_net_t, and finally fixes
up the b0rked users of read_pnet and write_pnet.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
hold_net and release_net were an idea that turned out to be useless.
The code has been disabled since 2008. Kill the code it is long past due.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Flags are used in the return path rather than the return patch.
Fixes: af33c1adae ("vxlan: Eliminate dependency on UDP socket in transmit path")
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A long standing problem in netlink socket dumps is the use
of kernel socket addresses as cookies.
1) It is a security concern.
2) Sockets can be reused quite quickly, so there is
no guarantee a cookie is used once and identify
a flow.
3) request sock, establish sock, and timewait socks
for a given flow have different cookies.
Part of our effort to bring better TCP statistics requires
to switch to a different allocator.
In this patch, I chose to use a per network namespace 64bit generator,
and to use it only in the case a socket needs to be dumped to netlink.
(This might be refined later if needed)
Note that I tried to carry cookies from request sock, to establish sock,
then timewait sockets.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Eric Salo <salo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is meant to collapse local and main into one by converting
tb_data from an array to a pointer. Doing this allows us to point the
local table into the main while maintaining the same variables in the
table.
As such the tb_data was converted from an array to a pointer, and a new
array called data is added in order to still provide an object for tb_data
to point to.
In order to track the origin of the fib aliases a tb_id value was added in
a hole that existed on 64b systems. Using this we can also reverse the
merge in the event that custom FIB rules are enabled.
With this patch I am seeing an improvement of 20ns to 30ns for routing
lookups as long as custom rules are not enabled, with custom rules enabled
we fall back to split tables and the original behavior.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To make the behavior predictable when attempting to pair with a device
for which we already have a Link Key or Long Term Key, this patch adds a
new 'Already Paired' error which gets sent in such a scenario.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
To maximize the usability of the Fast Connectable feature we should make
it possible to set (or unset) it at any given moment. This means
removing the dependency on the 'connectable' setting as well as the
'powered' setting. The former makes also sense since page scan may get
enabled through add_device even if 'connectable' is false. To keep the
setting available over power cycles its flag also needs to be removed
from the flags that are cleared upon HCI_Reset.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net-next
The following batch contains a couple of fixes to address some fallout
from the previous pull request, they are:
1) Address link problems in the bridge code after e5de75b. Fix it by
using rcu hook to address to avoid ifdef pollution and hard
dependency between bridge and br_netfilter.
2) Address sparse warnings in the netfilter reject code, patch from
Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
make C=1 CF=-D__CHECK_ENDIAN__ shows following:
net/bridge/netfilter/nft_reject_bridge.c:65:50: warning: incorrect type in argument 3 (different base types)
net/bridge/netfilter/nft_reject_bridge.c:65:50: expected restricted __be16 [usertype] protocol [..]
net/bridge/netfilter/nft_reject_bridge.c:102:37: warning: cast from restricted __be16
net/bridge/netfilter/nft_reject_bridge.c:102:37: warning: incorrect type in argument 1 (different base types) [..]
net/bridge/netfilter/nft_reject_bridge.c:121:50: warning: incorrect type in argument 3 (different base types) [..]
net/bridge/netfilter/nft_reject_bridge.c:168:52: warning: incorrect type in argument 3 (different base types) [..]
net/bridge/netfilter/nft_reject_bridge.c:233:52: warning: incorrect type in argument 3 (different base types) [..]
Caused by two (harmless) errors:
1. htons() instead of ntohs()
2. __be16 for protocol in nf_reject_ipXhdr_put API, use u8 instead.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pass in the netlink flags (NLM_F_*) into switchdev driver for IPv4 FIB add op
to allow driver to 1) optimize hardware updates, 2) handle ip route prepend
and append commands correctly.
Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Suggested-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using of_find_device_by_node() restricts the search to platform_device that
match the specified device_node pointer. This is not even remotely true for
network devices backed by a pci_device for instance.
of_find_net_device_by_node() allows us to do a more thorough lookup to find the
struct net_device corresponding to a particular device_node pointer.
For symetry with the non-OF code path, we hold the net_device pointer in
dsa_probe() just like what dev_to_net_dev() does when we call this
function.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/cadence/macb.c
Overlapping changes in macb driver, mostly fixes and cleanups
in 'net' overlapping with the integration of at91_ether into
macb in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
After my change to neigh_hh_init to obtain the protocol from the
neigh_table there are no more users of protocol in struct dst_ops.
Remove the protocol field from dst_ops and all of it's initializers.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next
tree. Basically, improvements for the packet rejection infrastructure,
deprecation of CLUSTERIP, cleanups for nf_tables and some untangling for
br_netfilter. More specifically they are:
1) Send packet to reset flow if checksum is valid, from Florian Westphal.
2) Fix nf_tables reject bridge from the input chain, also from Florian.
3) Deprecate the CLUSTERIP target, the cluster match supersedes it in
functionality and it's known to have problems.
4) A couple of cleanups for nf_tables rule tracing infrastructure, from
Patrick McHardy.
5) Another cleanup to place transaction declarations at the bottom of
nf_tables.h, also from Patrick.
6) Consolidate Kconfig dependencies wrt. NF_TABLES.
7) Limit table names to 32 bytes in nf_tables.
8) mac header copying in bridge netfilter is already required when
calling ip_fragment(), from Florian Westphal.
9) move nf_bridge_update_protocol() to br_netfilter.c, also from
Florian.
10) Small refactor in br_netfilter in the transmission path, again from
Florian.
11) Move br_nf_pre_routing_finish_bridge_slow() to br_netfilter.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Kernel automatically creates a tp for each
(kind, protocol, priority) tuple, which has handle 0,
when we add a new filter, but it still is left there
after we remove our own, unless we don't specify the
handle (literally means all the filters under
the tuple). For example this one is left:
# tc filter show dev eth0
filter parent 8001: protocol arp pref 49152 basic
The user-space is hard to clean up these for kernel
because filters like u32 are organized in a complex way.
So kernel is responsible to remove it after all filters
are gone. Each type of filter has its own way to
store the filters, so each type has to provide its
way to check if all filters are gone.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim<jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove a little bit of unnecessary work when transmitting a packet with
neigh_packet_xmit. Use the neighbour table index not the address family
as a parameter.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As specified in 802.1Qau spec. Add this optional attribute to the
DCB netlink layer. To allow for application to use the new attribute,
NIC drivers should implement and register the callbacks ieee_getqcn,
ieee_setqcn and ieee_getqcnstats.
The QCN attribute holds a set of parameters for management, and
a set of statistics to provide informative data on Congestion-Control
defined by this spec.
Signed-off-by: Shani Michaeli <shanim@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When building without CONFIG_NET_SWITCHDEV,
netdev_switch_fib_ipv4_abort is defined in the header file. It must
be static inline to avoid build failure at link time.
Fixes: 8e05fd7166 ("fib: hook IPv4 fib for hardware offload")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As per RFC4821 7.3. Selecting Probe Size, a probe timer should
be armed once probing has converged. Once this timer expired,
probing again to take advantage of any path PMTU change. The
recommended probing interval is 10 minutes per RFC1981. Probing
interval could be sysctled by sysctl_tcp_probe_interval.
Eric Dumazet suggested to implement pseudo timer based on 32bits
jiffies tcp_time_stamp instead of using classic timer for such
rare event.
Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current probe_size is chosen by doubling mss_cache,
the probing process will end shortly with a sub-optimal
mss size, and the link mtu will not be taken full
advantage of, in return, this will make user to tweak
tcp_base_mss with care.
Use binary search to choose probe_size in a fine
granularity manner, an optimal mss will be found
to boost performance as its maxmium.
In addition, introduce a sysctl_tcp_probe_threshold
to control when probing will stop in respect to
the width of search range.
Test env:
Docker instance with vxlan encapuslation(82599EB)
iperf -c 10.0.0.24 -t 60
before this patch:
1.26 Gbits/sec
After this patch: increase 26%
1.59 Gbits/sec
Signed-off-by: Fan Du <fan.du@intel.com>
Acked-by: John Heffner <johnwheffner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Quotes from RFC4821 7.2. Selecting Initial Values
It is RECOMMENDED that search_low be initially set to an MTU size
that is likely to work over a very wide range of environments. Given
today's technologies, a value of 1024 bytes is probably safe enough.
The initial value for search_low SHOULD be configurable.
Moreover, set a small value will introduce extra time for the search
to converge. So set the initial probe base mss size to 1024 Bytes.
Signed-off-by: Fan Du <fan.du@intel.com>
Acked-by: John Heffner <johnwheffner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Other users users of the neighbour table use neigh->output as the method
to decided when and which link-layer header to place on a packet.
DECnet has been using neigh->output to decide which DECnet headers to
place on a packet depending which neighbour the packet is destined for.
The DECnet usage isn't totally wrong but it can run into problems if the
neighbour output function is run for a second time as the teql driver
and the bridge netfilter code can do.
Therefore to avoid pathologic problems later down the line and make the
neighbour code easier to understand by refactoring the decnet output
code to only use a neighbour method to add a link layer header to a
packet.
This is done by moving the neigbhour operations lookup from
dn_to_neigh_output to dn_neigh_output_packet.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to completely generalize the mgmt command handling we need to
move away command-specific information from mgmt_control() into the
actual command table. This patch adds a new 'flags' field to the handler
entries which can now contain the following command specific
information:
- Command takes variable length parameters
- Command doesn't target any specific HCI device
- Command can be sent when the HCI device is unconfigured
After this the mgmt_control() function is completely generic and can
potentially be reused by new HCI channels.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch converts the existing mgmt code to use the newly introduced
generic API for registering HCI channels with mgmt-like semantics.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch adds an API for registering HCI channels with mgmt-like
semantics. For now the only user will be HCI_CHANNEL_CONTROL, but e.g.
6lowpan is intended to use this as well in the future.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Currently it is not possible to determine if the static address is used
by the controller. It is also not possible to determine if using a
static on a dual-mode controller with disabled BR/EDR is possible or
not.
To address this issue, introduce a new setting called static-address. If
support for this setting is signaled that means that the kernel supports
using static addresses. And if used on dual-mode controllers with BR/EDR
disabled it means that a configured static address can be used.
In addition utilize the same setting for the list of current active
settings that indicates if a static address is configured and if that
address will be actually used.
With this in mind the existing Set Static Address management command
has been extended to return the current settings. That way the caller
of that command can easily determine if the programmed address will
be used or if extra steps are required.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Call into the switchdev driver any time an IPv4 fib entry is
added/modified/deleted from the kernel's FIB. The switchdev driver may or
may not install the route to the offload device. In the case where the
driver tries to install the route and something goes wrong (device's routing
table is full, etc), then all of the offloaded routes will be flushed from the
device, route forwarding falls back to the kernel, and no more routes are
offloading.
We can refine this logic later. For now, use the simplist model of offloading
routes up to the point of failure, and then on failure, undo everything and
mark IPv4 offloading disabled.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If something goes wrong with IPv4 FIB offload, mark entire net offload
disabled. This is brute force policy to basically shut down IPv4 FIB offload
permanently if there is a problem offloading any route to an external device.
We can refine the policy in the future, to handle failures on a per-device or
per-route basis, but for now, this policy is per-net.
What we're trying to avoid is an inconsistent split between the kernel's FIB
and the offload device's FIB. We don't want the device to fwd a pkt
inconsitent with what the kernel would do. An example of a split is if device
has 10.0.0.0/16 and kernel has 10.0.0.0/16 and 10.0.0.0/24, the device wouldn't
see the longest prefix 10.0.0.0/24 and potentially forward pkts incorrectly.
Limited capacity or limited capability are two ways a route may fail to install
to the offload device. We'll not differentiate between failures at this time,
and treat any failure as fatal and mark the net as fib_offload_disabled.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Keep switchdev FIB offload model simple for now and don't allow custom ip
rules.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add IPv4 fib ndo wrapper funcs and stub them out for now.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to support the new DSA device driver model, a dsa_switch should
be able to advertise the type of tagging protocol supported by the
underlying switch device. This also removes constraints on how tagging
can be stacked to each other.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The transaction related definitions are squeezed in between the rule
and expression definitions, which are closely related and should be
next to each other. The transaction definitions actually don't belong
into that file at all since it defines the global objects and API and
transactions are internal to nf_tables_api, but for now simply move
them to a seperate section.
Similar, the chain types are in between a set of registration functions,
they belong to the chain section.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
xt_cluster supersedes ipt_CLUSTERIP since it can be also used in
gateway configurations (not only from the backend side).
ipt_CLUSTER is also known to leak the netdev that it uses on
device removal, which requires a rather large fix to workaround
the problem: http://patchwork.ozlabs.org/patch/358629/
So let's deprecate this so we can probably kill code this in the
future.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch fixes service discovery behaviour, when provided uuid filter
is empty and HCI_QUIRK_STRICT_DUPLICATE_FILTER is set. Before this
patch, empty uuid filter was unable to trigger scan restart, and that
caused inconsistent behaviour in applications.
Example: two DBus clients call BlueZ, one to find all devices with
service abcd, second to find all devices with rssi smaller than -90.
Sum of those filters, that is passed to mgmt_service_scan is empty
filter, with no rssi or uuids set.
That caused kernel not to restart scan when quirk was set.
That was inconsistent with what happen when there's only one of those
two filters set (scan is restarted and reports devices).
To fix that, new variable hdev->discovery.result_filtering was
introduced. It can indicate that filtered scan is running, no matter
what uuid or rssi filter is set.
Signed-off-by: Jakub Pawlowski <jpawlowski@google.com>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
The fib_table was wrapped in several places with an
rcu_read_lock/rcu_read_unlock however after looking over the code I found
several spots where the tables were being accessed as just standard
pointers without any protections. This change fixes that so that all of
the proper protections are in place when accessing the table to take RCU
replacement or removal of the table into account.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The NFT_USERDATA_MAXLEN is defined to 256, however we only have a u8
to store its size. Introduce a struct nft_userdata which contains a
length field and indicate its presence using a single bit in the rule.
The length field of struct nft_userdata is also a u8, however we don't
store zero sized data, so the actual length is udata->len + 1.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Some device drivers offload part of aggregation including AddBA/DelBA
negotiations to firmware. In such scenario, the PMF configuration of
the station needs to be provided to driver to enable encryption of
AddBA/DelBA action frames.
Signed-off-by: SenthilKumar Jegadeesan <sjegadee@qti.qualcomm.com>
[fix commit log, documentation]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Sometimes the driver might want to modify private data in interfaces
that are down. One possible use-case is cleaning up interface state
after HW recovery. Some interfaces that were up before the recovery took
place might be down now, but they might still be "dirty".
Introduce a new iterate_interfaces() API and a new ACTIVE iterator flag.
This way the internal implementation of the both active and inactive
APIs remains the same.
Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This sysctl gives two benefits. By defaulting the table size to 0
mpls even when compiled in and enabled defaults to not forwarding
any packets. This prevents unpleasant surprises for users.
The other benefit is that as mpls labels are allocated locally a dense
table a small dense label table may be used which saves memory and
is extremely simple and efficient to implement.
This sysctl allows userspace to choose the restrictions on the label
table size userspace applications need to cope with.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change adds a new Kconfig option MPLS_ROUTING.
The core of this change is the code to look at an mpls packet received
from another machine. Look that packet up in a routing table and
forward the packet on.
Support of MPLS over ATM is not considered or attempted here. This
implemntation follows RFC3032 and implements the MPLS shim header that
can pass over essentially any network.
What RFC3021 refers to as the as the Incoming Label Map (ILM) I call
net->mpls.platform_label[]. What RFC3031 refers to as the Next Label
Hop Forwarding Entry (NHLFE) I call mpls_route. Though calling it the
label fordwarding information base (lfib) might also be valid.
Further the implemntation forwards packets as described in RFC3032.
There is no need and given the original motivation for MPLS a strong
discincentive to have a flexible label forwarding path. In essence
the logic is the topmost label is read, looked up, removed, and
replaced by 0 or more new lables and the sent out the specified
interface to it's next hop.
Quite a few optional features are not implemented here. Among them
are generation of ICMP errors when the TTL is exceeded or the packet
is larger than the next hop MTU (those conditions are detected and the
packets are dropped instead of generating an icmp error). The traffic
class field is always set to 0. The implementation focuses on IP over
MPLS and does not handle egress of other kinds of protocols.
Instead of implementing coordination with the neighbour table and
sorting out how to input next hops in a different address family (for
which there is value). I was lazy and implemented a next hop mac
address instead. The code is simpler and there are flavor of MPLS
such as MPLS-TP where neither an IPv4 nor an IPv6 next hop is
appropriate so a next hop by mac address would need to be implemented
at some point.
Two new definitions AF_MPLS and PF_MPLS are exposed to userspace.
Decoding the mpls header must be done by first byeswapping a 32bit bit
endian word into the local cpu endian and then bit shifting to extract
the pieces. There is no C bit-field that can represent a wire format
mpls header on a little endian machine as the low bits of the 20bit
label wind up in the wrong half of third byte. Therefore internally
everything is deal with in cpu native byte order except when writing
to and reading from a packet.
For management simplicity if a label is configured to forward out
an interface that is down the packet is dropped early. Similarly
if an network interface is removed rt_dev is updated to NULL
(so no reference is preserved) and any packets for that label
are dropped. Keeping the label entries in the kernel allows
the kernel label table to function as the definitive source
of which labels are allocated and which are not.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For MPLS I am building the code so that either the neighbour mac
address can be specified or we can have a next hop in ipv4 or ipv6.
The kind of next hop we have is indicated by the neighbour table
pointer. A neighbour table pointer of NULL is a link layer address.
A non-NULL neighbour table pointer indicates which neighbour table and
thus which address family the next hop address is in that we need to
look up.
The code either sends a packet directly or looks up the appropriate
neighbour table entry and sends the packet.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While looking at the mpls code I found myself writing yet another
version of neigh_lookup_noref. We currently have __ipv4_lookup_noref
and __ipv6_lookup_noref.
So to make my work a little easier and to make it a smidge easier to
verify/maintain the mpls code in the future I stopped and wrote
___neigh_lookup_noref. Then I rewote __ipv4_lookup_noref and
__ipv6_lookup_noref in terms of this new function. I tested my new
version by verifying that the same code is generated in
ip_finish_output2 and ip6_finish_output2 where these functions are
inlined.
To get to ___neigh_lookup_noref I added a new neighbour cache table
function key_eq. So that the static size of the key would be
available.
I also added __neigh_lookup_noref for people who want to to lookup
a neighbour table entry quickly but don't know which neibhgour table
they are going to look up.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/rocker/rocker.c
The rocker commit was two overlapping changes, one to rename
the ->vport member to ->pport, and another making the bitmask
expression use '1ULL' instead of plain '1'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Before the ax25 stack calls dev_queue_xmit it always calls
ax25_type_trans which sets skb->protocol to ETH_P_AX25.
Which means that by looking at the protocol type it is possible to
detect IP packets that have not been munged by the ax25 stack in
ndo_start_xmit and call a function to munge them.
Rename ax25_neigh_xmit to ax25_ip_xmit and tweak the return type and
value to be appropriate for an ndo_start_xmit function.
Update all of the ax25 devices to test the protocol type for ETH_P_IP
and return ax25_ip_xmit as the first thing they do. This preserves
the existing semantics of IP packet processing, but the timing will be
a little different as the IP packets now pass through the qdisc layer
before reaching the ax25 ip packet processing.
Remove the now unnecessary ax25 neighbour table operations.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Beacon's timestamp, device system time associated with this beacon and
DTIM count parameters are not updated in the associated vif context
if the latest beacon's content is identical to the previously received.
It make sense to update these changing parameters on every beacon so the
driver can get most updated values. This may be necessary, for example,
to avoid either beacons' drift effect or device time stamp overrun.
IMPORTANT: Three sync_* parameters - sync_ts, sync_device_ts and
sync_dtim_count would possibly be out of sync by the time the driver will
use them. The synchronized view is currently guaranteed only in certain
callbacks.
Signed-off-by: Alexander Bondar <alexander.bondar@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This modifies cfg80211_vendor_event_alloc() with an additional argument
struct wireless_dev *wdev. __cfg80211_alloc_event_skb() is modified to
take in *wdev argument, if wdev != NULL, both the NL80211_ATTR_IFINDEX
and wdev identifier are added to the vendor event.
These changes make it easier for drivers to add ifindex indication in
vendor events cleanly.
This also updates all existing users of cfg80211_vendor_event_alloc()
and __cfg80211_alloc_event_skb() in the kernel tree.
Signed-off-by: Ahmad Kholaif <akholaif@qca.qualcomm.com>
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
802.11ad adds new a network type (PBSS) and changes the capability
field interpretation for the DMG (60G) band.
The same 2 bits that were interpreted as "ESS" and "IBSS" before are
re-used as a 2-bit field with 3 valid values (and 1 reserved). Valid
values are: "IBSS", "PBSS" (new) and "AP".
In order to get the BSS struct for the new PBSS networks, change the
cfg80211_get_bss() function to take a new enum ieee80211_bss_type
argument with the valid network types, as "capa_mask" and "capa_val"
no longer work correctly (the search must be band-aware now.)
The remaining bits in "capa_mask" and "capa_val" are used only for
privacy matching so replace those two with a privacy enum as well.
Signed-off-by: Dedy Lansky <dlansky@codeaurora.org>
[rewrite commit log, tiny fixes]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
tcp resets are never emitted if the packet that triggers the
reject/reset has an invalid checksum.
For icmp error responses there was no such check.
It allows to distinguish icmp response generated via
iptables -I INPUT -p udp --dport 42 -j REJECT
and those emitted by network stack (won't respond if csum is invalid,
REJECT does).
Arguably its possible to avoid this by using conntrack and only
using REJECT with -m conntrack NEW/RELATED.
However, this doesn't work when connection tracking is not in use
or when using nf_conntrack_checksum=0.
Furthermore, sending errors in response to invalid csums doesn't make
much sense so just add similar test as in nf_send_reset.
Validate csum if needed and only send the response if it is ok.
Reference: http://bugzilla.redhat.com/show_bug.cgi?id=1169829
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
- Add protocol to neigh_tbl so that dst->ops->protocol is not needed
- Acquire the device from neigh->dev
This results in a neigh_hh_init that will cache the samve values
regardless of the packets flowing through it.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are no more callers so kill this function.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The only caller is now is ax25_neigh_construct so move
neigh_compat_output into ax25_ip.c make it static and rename it
ax25_neigh_output.
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
AX25 already has it's own private arp cache operations to isolate
it's abuse of dev_rebuild_header to transmit packets. Add a function
ax25_neigh_construct that will allow all of the ax25 devices to
force using these operations, so that the generic arp code does
not need to.
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The only user is in ax25_ip.c so stop exporting these functions.
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-hams@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
A small batch with accumulated updates in nf-next, mostly IPVS updates,
they are:
1) Add 64-bits stats counters to IPVS, from Julian Anastasov.
2) Move NETFILTER_XT_MATCH_ADDRTYPE out of NETFILTER_ADVANCED as docker
seem to require this, from Anton Blanchard.
3) Use boolean instead of numeric value in set_match_v*(), from
coccinelle via Fengguang Wu.
4) Allows rescheduling of new connections in IPVS when port reuse is
detected, from Marcelo Ricardo Leitner.
5) Add missing bits to support arptables extensions from nft_compat,
from Arturo Borrero.
Patrick is preparing a large batch to enhance the set infrastructure,
named expressions among other things, that should follow up soon after
this batch.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2015-03-02
Here's the first bluetooth-next pull request targeting the 4.1 kernel:
- ieee802154/6lowpan cleanups
- SCO routing to host interface support for the btmrvl driver
- AMP code cleanups
- Fixes to AMP HCI init sequence
- Refactoring of the HCI callback mechanism
- Added shutdown routine for Intel controllers in the btusb driver
- New config option to enable/disable Bluetooth debugfs information
- Fix for early data reception on L2CAP fixed channels
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
After TIPC doesn't depend on iocb argument in its internal
implementations of sendmsg() and recvmsg() hooks defined in proto
structure, no any user is using iocb argument in them at all now.
Then we can drop the redundant iocb argument completely from kinds of
implementations of both sendmsg() and recvmsg() in the entire
networking stack.
Cc: Christoph Hellwig <hch@lst.de>
Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 977750076d ("af_packet: add interframe drop cmsg (v6)")
unionized skb->mark and skb->dropcount in order to allow recording
of the socket drop count while maintaining struct sk_buff size.
skb->dropcount was introduced since there was no available room
in skb->cb[] in packet sockets. However, its introduction led to
the inability to export skb->mark, or any other aliased field to
userspace if so desired.
Moving the dropcount metric to skb->cb[] eliminates this problem
at the expense of 4 bytes less in skb->cb[] for protocol families
using it.
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As part of an effort to move skb->dropcount to skb->cb[], use
a common function in order to set dropcount in struct sk_buff.
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As part of an effort to move skb->dropcount to skb->cb[] use a common
macro in protocol families using skb->cb[] for ancillary data to
validate available room in skb->cb[].
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert boolean fields incoming and req_start to bit fields and move
force_active in order save space in bt_skb_cb in an effort to use
a portion of skb->cb[] for storing skb->dropcount.
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
struct hci_req_ctrl is never used outside of struct bt_skb_cb;
Inlining it frees 8 bytes on a 64 bit system in skb->cb[] allowing
the addition of more ancillary data.
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These checked wrappers are necessary for the next patch, which
will use them to avoid sending out partial scan results.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There isn't any advantage to having it as a list and by making it an hlist
we make the fib_alias more compatible with the list_info in terms of the
type of list used.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joining multicast group on ethernet level via "ip maddr" command would
not work if we have an Ethernet switch that does igmp snooping since
the switch would not replicate multicast packets on ports that did not
have IGMP reports for the multicast addresses.
Linux vxlan interfaces created via "ip link add vxlan" have the group option
that enables then to do the required join.
By extending ip address command with option "autojoin" we can get similar
functionality for openvswitch vxlan interfaces as well as other tunneling
mechanisms that need to receive multicast traffic. The kernel code is
structured similar to how the vxlan driver does a group join / leave.
example:
ip address add 224.1.1.10/24 dev eth5 autojoin
ip address del 224.1.1.10/24 dev eth5
Signed-off-by: Madhu Challa <challa@noironetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based on the igmp v4 changes from Eric Dumazet.
959d10f6bbf6("igmp: add __ip_mc_{join|leave}_group()")
These changes are needed to perform igmp v6 join/leave while
RTNL is held.
Make ipv6_sock_mc_join and ipv6_sock_mc_drop wrappers around
__ipv6_sock_mc_join and __ipv6_sock_mc_drop to avoid
proliferation of work queues.
Signed-off-by: Madhu Challa <challa@noironetworks.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the unlikely event that skb_get_hash is unable to deduce a hash
in udp_flow_src_port we use a consistent random value instead.
This is specified in GRE/UDP draft section 3.2.1:
https://tools.ietf.org/html/draft-ietf-tsvwg-gre-in-udp-encap-04
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 'master' parameter of the New CSRK event was recently renamed to
'type', with the old values kept for backwards compatibility as
unauthenticated local/remote keys. This patch updates the code to take
into account the two new (authenticated) values and ensures they get
used based on the security level of the connection that the respective
keys get distributed over.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
To avoid race conditions when using the ds->ports[] array,
we need to check if the accessed port has been initialized.
Introduce and use helper function dsa_is_port_initialized
for that purpose and use it where needed.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to support bridging offloads in DSA switch drivers, select
NET_SWITCHDEV to get access to the port_stp_update and parent_get_id
NDOs that we are required to implement.
To facilitate the integratation at the DSA driver level, we implement 3
types of operations:
- port_join_bridge
- port_leave_bridge
- port_stp_update
DSA will resolve which switch ports that are currently bridge port
members as some Switch hardware/drivers need to know about that to limit
the register programming to just the relevant registers (especially for
slow MDIO buses).
We also take care of setting the correct STP state when slave network
devices are brought up/down while being bridge members.
Finally, when a port is leaving the bridge, we make sure we set in
BR_STATE_FORWARDING state, otherwise the bridge layer would leave it
disabled as a result of having left the bridge.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, when TCP/SCTP port reusing happens, IPVS will find the old
entry and use it for the new one, behaving like a forced persistence.
But if you consider a cluster with a heavy load of small connections,
such reuse will happen often and may lead to a not optimal load
balancing and might prevent a new node from getting a fair load.
This patch introduces a new sysctl, conn_reuse_mode, that allows
controlling how to proceed when port reuse is detected. The default
value will allow rescheduling of new connections only if the old entry
was in TIME_WAIT state for TCP or CLOSED for SCTP.
Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
The Churn Detection machines detect the situation where a port is operable,
but the Actor and Partner have not attached the link to an Aggregator and
brought the link into operation within a bound time period. Under normal
operation of the LACP, agreement between Actor and Partner should be reached
very rapidly. Continued failure to reach agreement can be symptomatic of
device failure.
Actor-churn-detection state-machine
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
===================================
BEGIN=True + PortEnable=False
|
v
+------------------------+ ActorPort.Sync=True +------------------+
| ACTOR_CHURN_MONITOR | ---------------------> | NO_ACTOR_CHURN |
|========================| |==================|
| ActorChurn=False | ActorPort.Sync=False | ActorChurn=False |
| ActorChurn.Timer=Start | <--------------------- | |
+------------------------+ +------------------+
| ^
| |
ActorChurn.Timer=Expired |
| ActorPort.Sync=True
| |
| +-----------------+ |
| | ACTOR_CHURN | |
| |=================| |
+--------------> | ActorChurn=True | ------------+
| |
+-----------------+
Similar for the Partner-churn-detection.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The cfpkt_iterate() function can return -EPROTO on error, but the
function is a u16 so the negative value gets truncated to a positive
unsigned short. This causes a static checker warning.
The only caller which might care is cffrml_receive(), when it's checking
the frame checksum. I modified cffrml_receive() so that it never says
-EPROTO is a valid checksum.
Also this isn't ever going to be inlined so I removed the "inline".
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The hci_send_to_control() can be made more general purpose with a small
change of passing the desired HCI channel as a parameter to it. This
allows using it for the monitor channel as well as e.g. 6lowpan in the
future.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch moves all the disconn_cfm callbacks to be based on the hci_cb
list. This means making l2cap_disconn_cfm private to l2cap_core.c and
sco_conn_cb private to sco.c respectively. Since the hci_conn type
filtering isn't done any more on the wrapper level the callbacks
themselves need to check that they were passed a relevant type of
connection.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch moves all the connect_cfm callbacks to be based on the hci_cb
list. This means making l2cap_connect_cfm private to l2cap_core.c and
sco_connect_cb private to sco.c respectively. Since the hci_conn type
filtering isn't done any more on the wrapper level the callbacks
themselves need to check that they were passed a relevant type of
connection.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
There's no reason to have the custom hci_proto_auth/encrypt_cfm helpers
when the hci_cb list works equally well. This patch adds L2CAP to the
hci_cb list and makes l2cap_security_cfm a private function of
l2cap_core.c.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
We'll soon need to be able to sleep inside the loops that iterate the
hci_cb list, so neither a spinlock, rwlock or rcu are usable. This patch
changes the lock to a mutex which permits sleeping while holding the
lock.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Pull networking updates from David Miller:
1) Missing netlink attribute validation in nft_lookup, from Patrick
McHardy.
2) Restrict ipv6 partial checksum handling to UDP, since that's the
only case it works for. From Vlad Yasevich.
3) Clear out silly device table sentinal macros used by SSB and BCMA
drivers. From Joe Perches.
4) Make sure the remote checksum code never creates a situation where
the remote checksum is applied yet the tunneling metadata describing
the remote checksum transformation is still present. Otherwise an
external entity might see this and apply the checksum again. From
Tom Herbert.
5) Use msecs_to_jiffies() where applicable, from Nicholas Mc Guire.
6) Don't explicitly initialize timer struct fields, use setup_timer()
and mod_timer() instead. From Vaishali Thakkar.
7) Don't invoke tg3_halt() without the tp->lock held, from Jun'ichi
Nomura.
8) Missing __percpu annotation in ipvlan driver, from Eric Dumazet.
9) Don't potentially perform skb_get() on shared skbs, also from Eric
Dumazet.
10) Fix COW'ing of metrics for non-DST_HOST routes in ipv6, from Martin
KaFai Lau.
11) Fix merge resolution error between the iov_iter changes in vhost and
some bug fixes that occurred at the same time. From Jason Wang.
12) If rtnl_configure_link() fails we have to perform a call to
->dellink() before unregistering the device. From WANG Cong.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (39 commits)
net: dsa: Set valid phy interface type
rtnetlink: call ->dellink on failure when ->newlink exists
com20020-pci: add support for eae single card
vhost_net: fix wrong iter offset when setting number of buffers
net: spelling fixes
net/core: Fix warning while make xmldocs caused by dev.c
net: phy: micrel: disable NAND-tree for KSZ8021, KSZ8031, KSZ8051, KSZ8081
ipv6: fix ipv6_cow_metrics for non DST_HOST case
openvswitch: Fix key serialization.
r8152: restore hw settings
hso: fix rx parsing logic when skb allocation fails
tcp: make sure skb is not shared before using skb_get()
bridge: netfilter: Move sysctl-specific error code inside #ifdef
ipv6: fix possible deadlock in ip6_fl_purge / ip6_fl_gc
ipvlan: add a missing __percpu pcpu_stats
tg3: Hold tp->lock before calling tg3_halt() from tg3_init_one()
bgmac: fix device initialization on Northstar SoCs (condition typo)
qlcnic: Delete existing multicast MAC list before adding new
net/mlx5_core: Fix configuration of log_uar_page_sz
sunvnet: don't change gso data on clones
...
This callback allows a vendor to send the vendor specific commands
before cloing the hci interface.
Signed-off-by: Tedd Ho-Jeong An <tedd.an@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch cleanups the ieee802154_le64_to_be64 function. This patch
removes an unnecessary temporary variable.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch cleanups the ieee802154_be64_to_le64 function. This patch
removes an unnecessary temporary variable.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Move memcg_socket_limit_enabled decrement to tcp_destroy_cgroup (called
from memcg_destroy_kmem -> mem_cgroup_sockets_destroy) and zap a bunch of
wrapper functions.
Although this patch moves static keys decrement from __mem_cgroup_free to
mem_cgroup_css_free, it does not introduce any functional changes, because
the keys are incremented on setting the limit (tcp or kmem), which can
only happen after successful mem_cgroup_css_online.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dst_orig should be released on error. Function like __xfrm_route_forward()
expects that behavior.
Since a recent commit, xfrm_lookup() may also be called by xfrm_lookup_route(),
which expects the opposite.
Let's introduce a new flag (XFRM_LOOKUP_KEEP_DST_REF) to tell what should be
done in case of error.
Fixes: f92ee61982d("xfrm: Generate blackhole routes only from route lookup functions")
Signed-off-by: huaibin Wang <huaibin.wang@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Pull security layer updates from James Morris:
"Highlights:
- Smack adds secmark support for Netfilter
- /proc/keys is now mandatory if CONFIG_KEYS=y
- TPM gets its own device class
- Added TPM 2.0 support
- Smack file hook rework (all Smack users should review this!)"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (64 commits)
cipso: don't use IPCB() to locate the CIPSO IP option
SELinux: fix error code in policydb_init()
selinux: add security in-core xattr support for pstore and debugfs
selinux: quiet the filesystem labeling behavior message
selinux: Remove unused function avc_sidcmp()
ima: /proc/keys is now mandatory
Smack: Repair netfilter dependency
X.509: silence asn1 compiler debug output
X.509: shut up about included cert for silent build
KEYS: Make /proc/keys unconditional if CONFIG_KEYS=y
MAINTAINERS: email update
tpm/tpm_tis: Add missing ifdef CONFIG_ACPI for pnp_acpi_device
smack: fix possible use after frees in task_security() callers
smack: Add missing logging in bidirectional UDS connect check
Smack: secmark support for netfilter
Smack: Rework file hooks
tpm: fix format string error in tpm-chip.c
char/tpm/tpm_crb: fix build error
smack: Fix a bidirectional UDS connect check typo
smack: introduce a special case for tmpfs in smack_d_instantiate()
...
Change remote checksum handling to set checksum partial as default
behavior. Added an iflink parameter to configure not using
checksum partial (calling csum_partial to update checksum).
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remote checksum offload processing is currently the same for both
the GRO and non-GRO path. When the remote checksum offload option
is encountered, the checksum field referred to is modified in
the packet. So in the GRO case, the packet is modified in the
GRO path and then the operation is skipped when the packet goes
through the normal path based on skb->remcsum_offload. There is
a problem in that the packet may be modified in the GRO path, but
then forwarded off host still containing the remote checksum option.
A remote host will again perform RCO but now the checksum verification
will fail since GRO RCO already modified the checksum.
To fix this, we ensure that GRO restores a packet to it's original
state before returning. In this model, when GRO processes a remote
checksum option it still changes the checksum per the algorithm
but on return from lower layer processing the checksum is restored
to its original value.
In this patch we add define gro_remcsum structure which is passed
to skb_gro_remcsum_process to save offset and delta for the checksum
being changed. After lower layer processing, skb_gro_remcsum_cleanup
is called to restore the checksum before returning from GRO.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using the IPCB() macro to get the IPv4 options is convenient, but
unfortunately NetLabel often needs to examine the CIPSO option outside
of the scope of the IP layer in the stack. While historically IPCB()
worked above the IP layer, due to the inclusion of the inet_skb_param
struct at the head of the {tcp,udp}_skb_cb structs, recent commit
971f10ec ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
reordered the tcp_skb_cb struct and invalidated this IPCB() trick.
This patch fixes the problem by creating a new function,
cipso_v4_optptr(), which locates the CIPSO option inside the IP header
without calling IPCB(). Unfortunately, this isn't as fast as a simple
lookup so some additional tweaks were made to limit the use of this
new function.
Cc: <stable@vger.kernel.org> # 3.18
Reported-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Paul Moore <pmoore@redhat.com>
Tested-by: Casey Schaufler <casey@schaufler-ca.com>
Packetization Layer Path MTU Discovery works separately beside
Path MTU Discovery at IP level, different net namespace has
various requirements on which one to chose, e.g., a virutalized
container instance would require TCP PMTU to probe an usable
effective mtu for underlying tunnel, while the host would
employ classical ICMP based PMTU to function.
Hence making TCP PMTU mechanism per net namespace to decouple
two functionality. Furthermore the probe base MSS should also
be configured separately for each namespace.
Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make __ipv6_select_ident() static as it isn't used outside
the file.
Fixes: 0508c07f5e (ipv6: Select fragment id during UFO segmentation if not set.)
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When queuing work to send the NETDEV_BONDING_INFO netdev event, it's
possible that when the work is executed, the pointer to the slave
becomes invalid. This can happen if between queuing the event and the
execution of the work, the net-device was un-ensvaled and re-enslaved.
Fix that by queuing a work with the data of the slave instead of the
slave structure.
Fixes: 69e6113343 ('net/bonding: Notify state change on slaves')
Reported-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPVS stats are limited to 2^(32-10) conns/s and packets/s,
2^(32-5) bytes/s. It is time to use 64 bits:
* Change all conn/packet kernel counters to 64-bit and update
them in u64_stats_update_{begin,end} section
* In kernel use struct ip_vs_kstats instead of the user-space
struct ip_vs_stats_user and use new func ip_vs_export_stats_user
to export it to sockopt users to preserve compatibility with
32-bit values
* Rename cpu counters "ustats" to "cnt"
* To netlink users provide additionally 64-bit stats:
IPVS_SVC_ATTR_STATS64 and IPVS_DEST_ATTR_STATS64. Old stats
remain for old binaries.
* We can use ip_vs_copy_stats in ip_vs_stats_percpu_show
Thanks to Chris Caputo for providing initial patch for ip_vs_est.c
Signed-off-by: Chris Caputo <ccaputo@alt.net>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Receive Flow Steering is a nice solution but suffers from
hash collisions when a mix of connected and unconnected traffic
is received on the host, when flow hash table is populated.
Also, clearing flow in inet_release() makes RFS not very good
for short lived flows, as many packets can follow close().
(FIN , ACK packets, ...)
This patch extends the information stored into global hash table
to not only include cpu number, but upper part of the hash value.
I use a 32bit value, and dynamically split it in two parts.
For host with less than 64 possible cpus, this gives 6 bits for the
cpu number, and 26 (32-6) bits for the upper part of the hash.
Since hash bucket selection use low order bits of the hash, we have
a full hash match, if /proc/sys/net/core/rps_sock_flow_entries is big
enough.
If the hash found in flow table does not match, we fallback to RPS (if
it is enabled for the rxqueue).
This means that a packet for an non connected flow can avoid the
IPI through a unrelated/victim CPU.
This also means we no longer have to clear the table at socket
close time, and this helps short lived flows performance.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the SYN_RECV state, where the TCP connection is represented by
tcp_request_sock, we now rate-limit SYNACKs in response to a client's
retransmitted SYNs: we do not send a SYNACK in response to client SYN
if it has been less than sysctl_tcp_invalid_ratelimit (default 500ms)
since we last sent a SYNACK in response to a client's retransmitted
SYN.
This allows the vast majority of legitimate client connections to
proceed unimpeded, even for the most aggressive platforms, iOS and
MacOS, which actually retransmit SYNs 1-second intervals for several
times in a row. They use SYN RTO timeouts following the progression:
1,1,1,1,1,2,4,8,16,32.
Reported-by: Avery Fay <avery@mixpanel.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Helpers for mitigating ACK loops by rate-limiting dupacks sent in
response to incoming out-of-window packets.
This patch includes:
- rate-limiting logic
- sysctl to control how often we allow dupacks to out-of-window packets
- SNMP counter for cases where we rate-limited our dupack sending
The rate-limiting logic in this patch decides to not send dupacks in
response to out-of-window segments if (a) they are SYNs or pure ACKs
and (b) the remote endpoint is sending them faster than the configured
rate limit.
We rate-limit our responses rather than blocking them entirely or
resetting the connection, because legitimate connections can rely on
dupacks in response to some out-of-window segments. For example, zero
window probes are typically sent with a sequence number that is below
the current window, and ZWPs thus expect to thus elicit a dupack in
response.
We allow dupacks in response to TCP segments with data, because these
may be spurious retransmissions for which the remote endpoint wants to
receive DSACKs. This is safe because segments with data can't
realistically be part of ACK loops, which by their nature consist of
each side sending pure/data-less ACKs to each other.
The dupack interval is controlled by a new sysctl knob,
tcp_invalid_ratelimit, given in milliseconds, in case an administrator
needs to dial this upward in the face of a high-rate DoS attack. The
name and units are chosen to be analogous to the existing analogous
knob for ICMP, icmp_ratelimit.
The default value for tcp_invalid_ratelimit is 500ms, which allows at
most one such dupack per 500ms. This is chosen to be 2x faster than
the 1-second minimum RTO interval allowed by RFC 6298 (section 2, rule
2.4). We allow the extra 2x factor because network delay variations
can cause packets sent at 1 second intervals to be compressed and
arrive much closer.
Reported-by: Avery Fay <avery@mixpanel.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the second NFC pull request for 3.20.
It brings:
- NCI NFCEE (NFC Execution Environment, typically an embedded or
external secure element) discovery and enabling/disabling support.
In order to communicate with an NFCEE, we also added NCI's logical
connections support to the NCI stack.
- HCI over NCI protocol support. Some secure elements only understand
HCI and thus we need to send them HCI frames when they're part of
an NCI chipset.
- NFC_EVT_TRANSACTION userspace API addition. Whenever an application
running on a secure element needs to notify its host counterpart,
we send an NFC_EVENT_SE_TRANSACTION event to userspace through the
NFC netlink socket.
- Secure element and HCI transaction event support for the st21nfcb
chipset.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJU06ozAAoJEIqAPN1PVmxKdtkP/2CeQ0+xJnJWgyh4xjkVxfPb
wPYrz3cHeHdnVX36mnIdRoQZRjxh/Ef/vgK8RsXohcldzHA9D8GjxRpj0GrkVGQ9
xoDkiHsFDdsZHeezSrlnXNnObvZrkeMsmnUDJxfwLtcX1nzFKzBgW2GaDiNjUazC
r5Ip4YIfoCaVLq5MSqYRqJ6uhplXd/ePdtPoo54L/E81y4ptQi8pUsQBy4K3GrFn
FbXoemcymhi9AVBGl+OlppZ93t6bdOV1D5tcOwpytAbfa3jp2oUnJPZay7EoWruR
aj8l5v3P+SSphAJwaGSbny0L83tTS7sq+N4QG+VxxdMkw2YjpmxgN3AGguNBtPGG
SUThpZV6QhrkAOnwzP2BJnh68WSU1eJrGvk8zUWW0SanA8k2jaHO/VzXCA3/ZkC4
IFcXl4Jr6R3iUxDFgS/6MB5E9EGedzFTacsWoHVyZPtaDAQp4WVdAIRQ4QKgJCLw
1yRmCeVunTsrs6O0aenU1xHinPlnx+tkuiNh4H64APQYTz54WwXH1/S04OL1SkqI
mTzUvFgWqJvSIOuU19g0HBw+xZ/9vvX8LLkm9kSTbe7tcmFH5CI7ZCN/hNDHvMSd
sjNYOyag/SAHJ01tTKcSPAgWO49bHWPodI04zP0lK0gMLjKvRknVMdTgIDfu49HN
2da9E23iBmNNgG79iVHN
=8caD
-----END PGP SIGNATURE-----
Merge tag 'nfc-next-3.20-2' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next
NFC: 3.20 second pull request
This is the second NFC pull request for 3.20.
It brings:
- NCI NFCEE (NFC Execution Environment, typically an embedded or
external secure element) discovery and enabling/disabling support.
In order to communicate with an NFCEE, we also added NCI's logical
connections support to the NCI stack.
- HCI over NCI protocol support. Some secure elements only understand
HCI and thus we need to send them HCI frames when they're part of
an NCI chipset.
- NFC_EVT_TRANSACTION userspace API addition. Whenever an application
running on a secure element needs to notify its host counterpart,
we send an NFC_EVENT_SE_TRANSACTION event to userspace through the
NFC netlink socket.
- Secure element and HCI transaction event support for the st21nfcb
chipset.
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC 4429 ("Optimistic DAD") states that optimistic addresses
should be treated as deprecated addresses. From section 2.1:
Unless noted otherwise, components of the IPv6 protocol stack
should treat addresses in the Optimistic state equivalently to
those in the Deprecated state, indicating that the address is
available for use but should not be used if another suitable
address is available.
Optimistic addresses are indeed avoided when other addresses are
available (i.e. at source address selection time), but they have
not heretofore been available for things like explicit bind() and
sendmsg() with struct in6_pktinfo, etc.
This change makes optimistic addresses treated more like
deprecated addresses than tentative ones.
Signed-off-by: Erik Kline <ek@google.com>
Acked-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/vxlan.c
drivers/vhost/net.c
include/linux/if_vlan.h
net/core/dev.c
The net/core/dev.c conflict was the overlap of one commit marking an
existing function static whilst another was adding a new function.
In the include/linux/if_vlan.h case, the type used for a local
variable was changed in 'net', whereas the function got rewritten
to fix a stacked vlan bug in 'net-next'.
In drivers/vhost/net.c, Al Viro's iov_iter conversions in 'net-next'
overlapped with an endainness fix for VHOST 1.0 in 'net'.
In drivers/net/vxlan.c, vxlan_find_vni() added a 'flags' parameter
in 'net-next' whereas in 'net' there was a bug fix to pass in the
correct network namespace pointer in calls to this function.
Signed-off-by: David S. Miller <davem@davemloft.net>
include/net/ipv6.h:713:22: warning: incorrect type in assignment (different base types)
include/net/ipv6.h:713:22: expected restricted __be32 [usertype] hash
include/net/ipv6.h:713:22: got unsigned int
include/net/ipv6.h:719:25: warning: restricted __be32 degrades to integer
include/net/ipv6.h:719:22: warning: invalid assignment: ^=
include/net/ipv6.h:719:22: left side has type restricted __be32
include/net/ipv6.h:719:22: right side has type unsigned int
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(struct flow_keys)->n_proto is in network order, use
proper type for this.
Fixes following sparse errors :
net/core/flow_dissector.c:139:39: warning: incorrect type in assignment (different base types)
net/core/flow_dissector.c:139:39: expected unsigned short [unsigned] [usertype] n_proto
net/core/flow_dissector.c:139:39: got restricted __be16 [assigned] [usertype] proto
net/core/flow_dissector.c:237:23: warning: incorrect type in assignment (different base types)
net/core/flow_dissector.c:237:23: expected unsigned short [unsigned] [usertype] n_proto
net/core/flow_dissector.c:237:23: got restricted __be16 [assigned] [usertype] proto
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: e0f31d8498 ("flow_keys: Record IP layer protocol in skb_flow_dissect()")
Signed-off-by: David S. Miller <davem@davemloft.net>
When we added pacing to TCP, we decided to let sch_fq take care
of actual pacing.
All TCP had to do was to compute sk->pacing_rate using simple formula:
sk->pacing_rate = 2 * cwnd * mss / rtt
It works well for senders (bulk flows), but not very well for receivers
or even RPC :
cwnd on the receiver can be less than 10, rtt can be around 100ms, so we
can end up pacing ACK packets, slowing down the sender.
Really, only the sender should pace, according to its own logic.
Instead of adding a new bit in skb, or call yet another flow
dissection, we tweak skb->truesize to a small value (2), and
we instruct sch_fq to use new helper and not pace pure ack.
Note this also helps TCP small queue, as ack packets present
in qdisc/NIC do not prevent sending a data packet (RPC workload)
This helps to reduce tx completion overhead, ack packets can use regular
sock_wfree() instead of tcp_wfree() which is a bit more expensive.
This has no impact in the case packets are sent to loopback interface,
as we do not coalesce ack packets (were we would detect skb->truesize
lie)
In case netem (with a delay) is used, skb_orphan_partial() also sets
skb->truesize to 1.
This patch is a combination of two patches we used for about one year at
Google.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use notifier chain to dispatch an event upon a change in slave state.
Event is dispatched with slave specific info.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move slave state changes to a helper function, this is a pre-step for adding
functionality of dispatching an event when this helper is called.
This commit doesn't add new functionality.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* revert a patch that caused a regression with mesh userspace (Bob)
* fix a number of suspend/resume related races
(from Emmanuel, Luca and myself - we'll look at backporting later)
* add software implementations for new ciphers (Jouni)
* add a new ACPI ID for Broadcom's rfkill (Mika)
* allow using netns FD for wireless (Vadim)
* some other cleanups (various)
-----BEGIN PGP SIGNATURE-----
iQIcBAABCAAGBQJU0NEaAAoJEDBSmw7B7bqr2DAP/3nbk6WB2Lz4Vwi8fh9C4y6X
4ZzN2v7NHaimC+Wpxg3wP2wCdX2VG3ZWwF3yLm6qgVHGnE35RFLxURen1eqNeW77
Yf5gvZ066nCGEQ8l8J6YK9vrLX4qp5c4lyE00bbxpZTA4Qq71SgTg+rmGC1be8uX
vacfaLwfDWffuOOnjBAPfanj7f4AQaUEY2uN1WkBFC7iEeOtPcWVkHAFVIeGjJfQ
vfgQJcwOjgWjYbwZdQQi7Aj+k1Yzda4pg1yEWn3CkZ8zyeCYGs1gk2ovoBgPgXBP
yT9ypIdFsN242VJvy7nkFnCKA8mhKyltMQ1Xjs0Q9lAxWdaq9U+iqt5cUn4jxVIG
T9Vi3PCbx/nVOqcfR81dBTQ3uDU5AyosPsmh2YTxi5lpRBrsjNY2FtaKE0sm2Om3
wRiSPOdPrXBeEnU0KssI9e6euXgS4JQV78Naq85OWZDd2yZ1fT5U2fi8y4drRGlz
rucbbobdVhQch5L4FStPz1uW5pNuJrhekXeZIE8MruTNg2A2oBAK3ApO7hxn68sE
RnbAnkxVLwgedC9042JF5eiS1PDIU46w4e782j/+XskVKMEakqd23iJycx3tmgHZ
cxDi/qKZ2RCE74YsT61o/9ErSVPvCfNPL3+CVov918jidQQg2WHfKm/jFlIxHIxk
4wBP7p2VGlENMgw/R8GJ
=wOaR
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2015-02-03' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Last round of updates for net-next:
* revert a patch that caused a regression with mesh userspace (Bob)
* fix a number of suspend/resume related races
(from Emmanuel, Luca and myself - we'll look at backporting later)
* add software implementations for new ciphers (Jouni)
* add a new ACPI ID for Broadcom's rfkill (Mika)
* allow using netns FD for wireless (Vadim)
* some other cleanups (various)
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2015-02-03
Here's what's likely the last bluetooth-next pull request for 3.20.
Notable changes include:
- xHCI workaround + a new id for the ath3k driver
- Several new ids for the btusb driver
- Support for new Intel Bluetooth controllers
- Minor cleanups to ieee802154 code
- Nested sleep warning fix in socket accept() code path
- Fixes for Out of Band pairing handling
- Support for LE scan restarting for HCI_QUIRK_STRICT_DUPLICATE_FILTER
- Improvements to data we expose through debugfs
- Proper handling of Hardware Error HCI events
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
conn_info is currently allocated only after nfcee_discovery_ntf
which is not generic enough for logical connection other than
NFCEE. The corresponding conn_info is now created in
nci_core_conn_create_rsp().
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
For consistency sake change nci_core_conn_create_rsp structure
credits field to credits_cnt.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The current implementation limits nci_core_conn_create_req()
to only manage NCI_DESTINATION_NFCEE.
Add new parameters to nci_core_conn_create() to support all
destination types described in the NCI specification.
Because there are some parameters with variable size dynamic
buffer allocation is needed.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The NCI_STATIC_RF_CONN_ID logical connection is the most used
connection. Keeping it directly accessible in the nci_dev
structure will simplify and optimize the access.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
If the IPv6 fragment id has not been set and we perform
fragmentation due to UFO, select a new fragment id.
We now consider a fragment id of 0 as unset and if id selection
process returns 0 (after all the pertrubations), we set it to
0x80000000, thus giving us ample space not to create collisions
with the next packet we may have to fragment.
When doing UFO integrity checking, we also select the
fragment id if it has not be set yet. This is stored into
the skb_shinfo() thus allowing UFO to function correclty.
This patch also removes duplicate fragment id generation code
and moves ipv6_select_ident() into the header as it may be
used during GSO.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
That takes care of the majority of ->sendmsg() instances - most of them
via memcpy_to_msg() or assorted getfrag() callbacks. One place where we
still keep memcpy_fromiovecend() is tipc - there we potentially read the
same data over and over; separate patch, that...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
patch is actually smaller than it seems to be - most of it is unindenting
the inner loop body in tcp_sendmsg() itself...
the bit in tcp_input.c is going to get reverted very soon - that's what
memcpy_from_msg() will become, but not in this commit; let's keep it
reasonably contained...
There's one potentially subtle change here: in case of short copy from
userland, mainline tcp_send_syn_data() discards the skb it has allocated
and falls back to normal path, where we'll send as much as possible after
rereading the same data again. This patch trims SYN+data skb instead -
that way we don't need to copy from the same place twice.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains Netfilter/IPVS fixes for your net tree,
they are:
1) Validate hooks for nf_tables NAT expressions, otherwise users can
crash the kernel when using them from the wrong hook. We already
got one user trapped on this when configuring masquerading.
2) Fix a BUG splat in nf_tables with CONFIG_DEBUG_PREEMPT=y. Reported
by Andreas Schultz.
3) Avoid unnecessary reroute of traffic in the local input path
in IPVS that triggers a crash in in xfrm. Reported by Florian
Wiessner and fixes by Julian Anastasov.
4) Fix memory and module refcount leak from the error path of
nf_tables_newchain().
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit is very similar to
commit 1c32c5ad6f
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue Mar 1 02:36:47 2011 +0000
inet: Add ip_make_skb and ip_finish_skb
It adds IPv6 version of the helpers ip6_make_skb and ip6_finish_skb.
The job of ip6_make_skb is to collect messages into an ipv6 packet
and poplulate ipv6 eader. The job of ip6_finish_skb is to transmit
the generated skb. Together they replicated the job of
ip6_push_pending_frames() while also provide the capability to be
called independently. This will be needed to add lockless UDP sendmsg
support.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tx timestamps are looped onto the error queue on top of an skb. This
mechanism leaks packet headers to processes unless the no-payload
options SOF_TIMESTAMPING_OPT_TSONLY is set.
Add a sysctl that optionally drops looped timestamp with data. This
only affects processes without CAP_NET_RAW.
The policy is checked when timestamps are generated in the stack.
It is possible for timestamps with data to be reported after the
sysctl is set, if these were queued internally earlier.
No vulnerability is immediately known that exploits knowledge
gleaned from packet headers, but it may still be preferable to allow
administrators to lock down this path at the cost of possible
breakage of legacy applications.
Signed-off-by: Willem de Bruijn <willemb@google.com>
----
Changes
(v1 -> v2)
- test socket CAP_NET_RAW instead of capable(CAP_NET_RAW)
(rfc -> v1)
- document the sysctl in Documentation/sysctl/net.txt
- fix access control race: read .._OPT_TSONLY only once,
use same value for permission check and skb generation.
Signed-off-by: David S. Miller <davem@davemloft.net>
The NFCC sends an NCI_OP_RF_NFCEE_ACTION_NTF notification
to the host (DH) to let it know that for example an RF
transaction with a payment reader is done.
For now the notification handler is empty.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
NFC_EVT_TRANSACTION is sent through netlink in order for a
specific application running on a secure element to notify
userspace of an event. Typically the secure element application
counterpart on the host could interpret that event and act
upon it.
Forwarded information contains:
- SE host generating the event
- Application IDentifier doing the operation
- Applications parameters
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
According to the NCI specification, one can use HCI over NCI
to talk with specific NFCEE. The HCI network is viewed as one
logical NFCEE.
This is needed to support secure element running HCI only
firmwares embedded on an NCI capable chipset, like e.g. the
st21nfcb.
There is some duplication between this piece of code and the
HCI core code, but the latter would need to be abstracted even
more to be able to use NCI as a logical transport for HCP packets.
Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>