This bug was detected by kmemleak:
unreferenced object 0xffff8804269cc3c0 (size 64):
comm "criu", pid 1042, jiffies 4294907360 (age 13.713s)
hex dump (first 32 bytes):
a0 32 cc 2c 04 88 ff ff 00 00 00 00 00 00 00 00 .2.,............
00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de ................
backtrace:
[<ffffffff8184dffa>] kmemleak_alloc+0x4a/0xa0
[<ffffffff8124720f>] kmem_cache_alloc_trace+0x10f/0x280
[<ffffffffa02864cc>] __netlink_diag_dump+0x26c/0x290 [netlink_diag]
v2: don't remove a reference on a rhashtable_iter structure to
release it from netlink_diag_dump_done
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Fixes: ad20207432 ("netlink: Use rhashtable walk interface in diag dump")
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a tracepoint for working out where local aborts happen. Each
tracepoint call is labelled with a 3-letter code so that they can be
distinguished - and the DATA sequence number is added too where available.
rxrpc_kernel_abort_call() also takes a 3-letter code so that AFS can
indicate the circumstances when it aborts a call.
Signed-off-by: David Howells <dhowells@redhat.com>
rxrpc_set_call_completion() returns bool, not int, so the ret variable
should match this.
rxrpc_call_completed() and __rxrpc_call_completed() should return the value
of rxrpc_set_call_completion().
Signed-off-by: David Howells <dhowells@redhat.com>
rxrpc calls shouldn't hold refs on the sock struct. This was done so that
the socket wouldn't go away whilst the call was in progress, such that the
call could reach the socket's queues.
However, we can mark the socket as requiring an RCU release and rely on the
RCU read lock.
To make this work, we do:
(1) rxrpc_release_call() removes the call's call user ID. This is now
only called from socket operations and not from the call processor:
rxrpc_accept_call() / rxrpc_kernel_accept_call()
rxrpc_reject_call() / rxrpc_kernel_reject_call()
rxrpc_kernel_end_call()
rxrpc_release_calls_on_socket()
rxrpc_recvmsg()
Though it is also called in the cleanup path of
rxrpc_accept_incoming_call() before we assign a user ID.
(2) Pass the socket pointer into rxrpc_release_call() rather than getting
it from the call so that we can get rid of uninitialised calls.
(3) Fix call processor queueing to pass a ref to the work queue and to
release that ref at the end of the processor function (or to pass it
back to the work queue if we have to requeue).
(4) Skip out of the call processor function asap if the call is complete
and don't requeue it if the call is complete.
(5) Clean up the call immediately that the refcount reaches 0 rather than
trying to defer it. Actual deallocation is deferred to RCU, however.
(6) Don't hold socket refs for allocated calls.
(7) Use the RCU read lock when queueing a message on a socket and treat
the call's socket pointer according to RCU rules and check it for
NULL.
We also need to use the RCU read lock when viewing a call through
procfs.
(8) Transmit the final ACK/ABORT to a client call in rxrpc_release_call()
if this hasn't been done yet so that we can then disconnect the call.
Once the call is disconnected, it won't have any access to the
connection struct and the UDP socket for the call work processor to be
able to send the ACK. Terminal retransmission will be handled by the
connection processor.
(9) Release all calls immediately on the closing of a socket rather than
trying to defer this. Incomplete calls will be aborted.
The call refcount model is much simplified. Refs are held on the call by:
(1) A socket's user ID tree.
(2) A socket's incoming call secureq and acceptq.
(3) A kernel service that has a call in progress.
(4) A queued call work processor. We have to take care to put any call
that we failed to queue.
(5) sk_buffs on a socket's receive queue. A future patch will get rid of
this.
Whilst we're at it, we can do:
(1) Get rid of the RXRPC_CALL_EV_RELEASE event. Release is now done
entirely from the socket routines and never from the call's processor.
(2) Get rid of the RXRPC_CALL_DEAD state. Calls now end in the
RXRPC_CALL_COMPLETE state.
(3) Get rid of the rxrpc_call::destroyer work item. Calls are now torn
down when their refcount reaches 0 and then handed over to RCU for
final cleanup.
(4) Get rid of the rxrpc_call::deadspan timer. Calls are cleaned up
immediately they're finished with and don't hang around.
Post-completion retransmission is handled by the connection processor
once the call is disconnected.
(5) Get rid of the dead call expiry setting as there's no longer a timer
to set.
(6) rxrpc_destroy_all_calls() can just check that the call list is empty.
Signed-off-by: David Howells <dhowells@redhat.com>
Use rxrpc_is_service_call() rather than rxrpc_conn_is_service() if the call
is available just in case call->conn is NULL.
Signed-off-by: David Howells <dhowells@redhat.com>
Pass the connection pointer to rxrpc_post_packet_to_call() as the call
might get disconnected whilst we're looking at it, but the connection
pointer determined by rxrpc_data_read() is guaranteed by RCU for the
duration of the call.
Signed-off-by: David Howells <dhowells@redhat.com>
Cache the security index in the rxrpc_call struct so that we can get at it
even when the call has been disconnected and the connection pointer
cleared.
Signed-off-by: David Howells <dhowells@redhat.com>
Use call->peer rather than call->conn->params.peer to avoid the possibility
of call->conn being NULL and, whilst we're at it, check it for NULL before we
access it.
Signed-off-by: David Howells <dhowells@redhat.com>
Improve the call tracking tracepoint by showing more differentiation
between some of the put and get events, including:
(1) Getting and putting refs for the socket call user ID tree.
(2) Getting and putting refs for queueing and failing to queue the call
processor work item.
Note that these aren't necessarily used in this patch, but will be taken
advantage of in future patches.
An enum is added for the event subtype numbers rather than coding them
directly as decimal numbers and a table of 3-letter strings is provided
rather than a sequence of ?: operators.
Signed-off-by: David Howells <dhowells@redhat.com>
In general, when DAD detected IPv6 duplicate address, ifp->state
will be set to INET6_IFADDR_STATE_ERRDAD and DAD is stopped by a
delayed work, the call tree should be like this:
ndisc_recv_ns
-> addrconf_dad_failure <- missing ifp put
-> addrconf_mod_dad_work
-> schedule addrconf_dad_work()
-> addrconf_dad_stop() <- missing ifp hold before call it
addrconf_dad_failure() called with ifp refcont holding but not put.
addrconf_dad_work() call addrconf_dad_stop() without extra holding
refcount. This will not cause any issue normally.
But the race between addrconf_dad_failure() and addrconf_dad_work()
may cause ifp refcount leak and netdevice can not be unregister,
dmesg show the following messages:
IPv6: eth0: IPv6 duplicate address fe80::XX:XXXX:XXXX:XX detected!
...
unregister_netdevice: waiting for eth0 to become free. Usage count = 1
Cc: stable@vger.kernel.org
Fixes: c15b1ccadb ("ipv6: move DAD and addrconf_verify processing
to workqueue")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When deleting an IP address from an interface, there is a clean-up of
routes which refer to this local address. However, there was no check to
see that the VRF matched. This meant that deletion wasn't confined to
the VRF it should have been.
To solve this, a new field has been added to fib_info to hold a table
id. When removing fib entries corresponding to a local ip address, this
table id is also used in the comparison.
The table id is populated when the fib_info is created. This was already
done in some places, but not in ip_rt_ioctl(). This has now been fixed.
Fixes: 021dd3b8a1 ("net: Add routes to the table associated with the device")
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Mark Tomlinson <mark.tomlinson@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIVAwUAV8yHP/Sw1s6N8H32AQL7cQ//S3MP4x690tV49UocezdAV8PuV3HpM5kt
a+43v8RqjTHaZQxOwQaG71OpAM2gH1z7QPSq6jsLy0PEmaPiP/MZOAWqUudBAyqo
eVsEMEVlw0f7PogyKqr06uPoVo21IdLNX9E89CqaGqFuDfsKlj1bKQHH4/248Arr
i4ztzid/Mj98fHvzqMdr631c06GvLozU/5X6xE7hzkGkqVmRtjIB6qETGqerwx/p
GJlcTZVFw2EviS6/Ft/t26xgVsOg1ogzXjWLUufnnJ1GpRaucqMfHwYD0WqQV3sB
Bu6WRx2JgXRPBi5m7gWymkgT0pUNRiDFuWN6qdlJbHJgKGuVojF6tnh2go2bj9Cq
q/GLbi8Y810v64293i1vdz6yyM1PzDG648+6z8vbTpsLI7cHDq5csPYMHRIh34IM
FQmSZblKIuALD8BXqW1lXrqVKU0WEFVI9WjcRk9OSIanqPrQyP8xOAJVp03uIGe/
uxkheJBy7XvghwKWZNZ1y0A0g6NxY0gNMhVsmW77VbNfOsKOpEcpSCvMwjgMOvDi
npynisM8tf2cx973SZYXjvvl2LSMHjiVQ/tf0iTMYmBjtU7ft5i8p3w290yAD+JQ
JKqpKq3TXVty5MMEXl5bStIr/HkgPk7e6v1sH9WIREu9m9gROANGe/9Mr4VvweHk
jEuFHy2EZNE=
=3Don
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-rewrite-20160904-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Split output code from sendmsg code
Here's a set of small patches that split the packet transmission code from
the sendmsg code and simply rearrange the new file to make it more
logically laid out ready for being rewritten. An enum is also moved out of
the header file to there as it's only used there. This needs to be applied
on top of the just-posted fixes patch set.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIVAwUAV8yHKPSw1s6N8H32AQKbWA/+MEjzVzIOpp4edvAj5DzdO5pls3NvhUAg
2sP15E9nOL9JHrOD5snCVSRd9ROEGRE0/S9WXUjeb2VKz7C2pmTixo+MjiJRMzAR
TZZlE2Ydx0h+A7WiywKoTY4g7VICL+UC+4XcheMzTLNS2mzqKb2GGOm3BwvGaFeT
RKRVIlwIHziShaIs7K7ZkfKxGaDIL/9x344uPfFHaDKb33aOBnTuY6HlFf5Yu2qm
pzh+R7cBYJMvhEqd71ESYPSbSGnjBN5zRzjZDSvXeI/30k9Ee6mFastv4fdG3Mrk
0WOLxml9yLTlJnfXeuN0T9B0C/ur4oD4hKEDREXVTxcTEXRq/VxSJ3cYxeM3DlAz
U795lcveiYajRv7F73jcfNuaEENQg5HsZuaFs+CJgVxQpsqs9IOpEl9CNrVsvjmf
9crgamUj34ehZ5lgsV/Qbm7OFk16dmQ59ImGClQFgsIU6hEWCP0VgwKXWTwP7YZD
Ucp1zWp/XTAtbrRNdkye/Z0WE5QOtoUWzftPrf95TP7LFewp0DGZ+miBV/atSHbZ
bADXR0SOtax8u9bfs1HYTadwHk1LJYuVXNYn/KN06c9q1WKuKH/qg0JAjv/KUnJm
Nnx0TFh4kUg999+3CtxqmupiqkD5SCckCN2ggifgDCmQhEwr1CYrGKWpZvNMq+S6
nVNTUr7PYSI=
=RRYk
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-rewrite-20160904-1' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Small fixes
Here's a set of small fix patches:
(1) Fix some uninitialised variables.
(2) Set the client call state before making it live by attaching it to the
conn struct.
(3) Randomise the epoch and starting client conn ID values, and don't
change the epoch when the client conn ID rolls round.
(4) Replace deprecated create_singlethread_workqueue() calls.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Neither the failure or success paths of ping_v6_sendmsg release
the dst it acquires. This leads to a flood of warnings from
"net/core/dst.c:288 dst_release" on older kernels that
don't have 8bf4ada2e2 backported.
That patch optimistically hoped this had been fixed post 3.10, but
it seems at least one case wasn't, where I've seen this triggered
a lot from machines doing unprivileged icmp sockets.
Cc: Martin Lau <kafai@fb.com>
Signed-off-by: Dave Jones <davej@codemonkey.org.uk>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next
tree. Most relevant updates are the removal of per-conntrack timers to
use a workqueue/garbage collection approach instead from Florian
Westphal, the hash and numgen expression for nf_tables from Laura
Garcia, updates on nf_tables hash set to honor the NLM_F_EXCL flag,
removal of ip_conntrack sysctl and many other incremental updates on our
Netfilter codebase.
More specifically, they are:
1) Retrieve only 4 bytes to fetch ports in case of non-linear skb
transport area in dccp, sctp, tcp, udp and udplite protocol
conntrackers, from Gao Feng.
2) Missing whitespace on error message in physdev match, from Hangbin Liu.
3) Skip redundant IPv4 checksum calculation in nf_dup_ipv4, from Liping Zhang.
4) Add nf_ct_expires() helper function and use it, from Florian Westphal.
5) Replace opencoded nf_ct_kill() call in IPVS conntrack support, also
from Florian.
6) Rename nf_tables set implementation to nft_set_{name}.c
7) Introduce the hash expression to allow arbitrary hashing of selector
concatenations, from Laura Garcia Liebana.
8) Remove ip_conntrack sysctl backward compatibility code, this code has
been around for long time already, and we have two interfaces to do
this already: nf_conntrack sysctl and ctnetlink.
9) Use nf_conntrack_get_ht() helper function whenever possible, instead
of opencoding fetch of hashtable pointer and size, patch from Liping Zhang.
10) Add quota expression for nf_tables.
11) Add number generator expression for nf_tables, this supports
incremental and random generators that can be combined with maps,
very useful for load balancing purpose, again from Laura Garcia Liebana.
12) Fix a typo in a debug message in FTP conntrack helper, from Colin Ian King.
13) Introduce a nft_chain_parse_hook() helper function to parse chain hook
configuration, this is used by a follow up patch to perform better chain
update validation.
14) Add rhashtable_lookup_get_insert_key() to rhashtable and use it from the
nft_set_hash implementation to honor the NLM_F_EXCL flag.
15) Missing nulls check in nf_conntrack from nf_conntrack_tuple_taken(),
patch from Florian Westphal.
16) Don't use the DYING bit to know if the conntrack event has been already
delivered, instead a state variable to track event re-delivery
states, also from Florian.
17) Remove the per-conntrack timer, use the workqueue approach that was
discussed during the NFWS, from Florian Westphal.
18) Use the netlink conntrack table dump path to kill stale entries,
again from Florian.
19) Add a garbage collector to get rid of stale conntracks, from
Florian.
20) Reschedule garbage collector if eviction rate is high.
21) Get rid of the __nf_ct_kill_acct() helper.
22) Use ARPHRD_ETHER instead of hardcoded 1 from ARP logger.
23) Make nf_log_set() interface assertive on unsupported families.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Rearrange net/rxrpc/sendmsg.c to be in a more logical order. This makes it
easier to follow and eliminates forward declarations.
Signed-off-by: David Howells <dhowells@redhat.com>
It seems the local epoch should only be changed on boot, so remove the code
that changes it for client connections.
Signed-off-by: David Howells <dhowells@redhat.com>
Create a random epoch value rather than a time-based one on startup and set
the top bit to indicate that this is the case.
Also create a random starting client connection ID value. This will be
incremented from here as new client connections are created.
Signed-off-by: David Howells <dhowells@redhat.com>
Right now we use the 'readlock' both for protecting some of the af_unix
IO path and for making the bind be single-threaded.
The two are independent, but using the same lock makes for a nasty
deadlock due to ordering with regards to filesystem locking. The bind
locking would want to nest outside the VSF pathname locking, but the IO
locking wants to nest inside some of those same locks.
We tried to fix this earlier with commit c845acb324 ("af_unix: Fix
splice-bind deadlock") which moved the readlock inside the vfs locks,
but that caused problems with overlayfs that will then call back into
filesystem routines that take the lock in the wrong order anyway.
Splitting the locks means that we can go back to having the bind lock be
the outermost lock, and we don't have any deadlocks with lock ordering.
Acked-by: Rainer Weikusat <rweikusat@cyberadapt.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit c845acb324.
It turns out that it just replaces one deadlock with another one: we can
still get the wrong lock ordering with the readlock due to overlayfs
calling back into the filesystem layer and still taking the vfs locks
after the readlock.
The proper solution ends up being to just split the readlock into two
pieces: the bind lock (taken *outside* the vfs locks) and the IO lock
(taken *inside* the filesystem locks). The two locks are independent
anyway.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following few steps will crash kernel -
(a) Create bonding master
> modprobe bonding miimon=50
(b) Create macvlan bridge on eth2
> ip link add link eth2 dev mvl0 address aa:0:0:0:0:01 \
type macvlan
(c) Now try adding eth2 into the bond
> echo +eth2 > /sys/class/net/bond0/bonding/slaves
<crash>
Bonding does lots of things before checking if the device enslaved is
busy or not.
In this case when the notifier call-chain sends notifications, the
bond_netdev_event() assumes that the rx_handler /rx_handler_data is
registered while the bond_enslave() hasn't progressed far enough to
register rx_handler for the new slave.
This patch adds a rx_handler check that can be performed right at the
beginning of the enslave code to avoid getting into this situation.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We never read or change netns id in hardirq context,
the only place we read netns id in softirq context
is in vxlan_xmit(). So, it should be enough to just
disable BH.
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netns id should be already allocated each time we change
netns, that is, in dev_change_net_namespace() (more precisely
in rtnl_fill_ifinfo()). It is safe to just call peernet2id() here.
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an error occurs during conntrack template creation as part of
actions validation, we need to free the template. Previously we've been
using nf_ct_put() to do this, but nf_ct_tmpl_free() is more appropriate.
Signed-off-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We must set the client call state to RXRPC_CALL_CLIENT_SEND_REQUEST before
attaching the call to the connection struct, not after, as it's liable to
receive errors and conn aborts as soon as the assignment is made - and
these will cause its state to be changed outside of the initiating thread's
control.
Signed-off-by: David Howells <dhowells@redhat.com>
Because of the risk of an excessive number of NACK messages and
retransissions, receivers have until now abstained from sending
broadcast NACKS directly upon detection of a packet sequence number
gap. We have instead relied on such gaps being detected by link
protocol STATE message exchange, something that by necessity delays
such detection and subsequent retransmissions.
With the introduction of unicast NACK transmission and rate control
of retransmissions we can now remove this limitation. We now allow
receiving nodes to send NACKS immediately, while coordinating the
permission to do so among the nodes in order to avoid NACK storms.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As cluster sizes grow, so does the amount of identical or overlapping
broadcast NACKs generated by the packet receivers. This often leads to
'NACK crunches' resulting in huge numbers of redundant retransmissions
of the same packet ranges.
In this commit, we introduce rate control of broadcast retransmissions,
so that a retransmitted range cannot be retransmitted again until after
at least 10 ms. This reduces the frequency of duplicate, redundant
retransmissions by an order of magnitude, while having a significant
positive impact on overall throughput and scalability.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we send broadcasts in clusters of more 70-80 nodes, we sometimes
see the broadcast link resetting because of an excessive number of
retransmissions. This is caused by a combination of two factors:
1) A 'NACK crunch", where loss of broadcast packets is discovered
and NACK'ed by several nodes simultaneously, leading to multiple
redundant broadcast retransmissions.
2) The fact that the NACKS as such also are sent as broadcast, leading
to excessive load and packet loss on the transmitting switch/bridge.
This commit deals with the latter problem, by moving sending of
broadcast nacks from the dedicated BCAST_PROTOCOL/NACK message type
to regular unicast LINK_PROTOCOL/STATE messages. We allocate 10 unused
bits in word 8 of the said message for this purpose, and introduce a
new capability bit, TIPC_BCAST_STATE_NACK in order to keep the change
backwards compatible.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the following uninitialised variable warning:
../net/rxrpc/call_event.c: In function 'rxrpc_process_call':
../net/rxrpc/call_event.c:879:58: warning: 'error' may be used uninitialized in this function [-Wmaybe-uninitialized]
_debug("post net error %d", error);
^
Signed-off-by: David Howells <dhowells@redhat.com>
gcc -Wmaybe-initialized correctly points out a newly introduced bug
through which we can end up calling rxrpc_queue_call() for a dead
connection:
net/rxrpc/call_object.c: In function 'rxrpc_mark_call_released':
net/rxrpc/call_object.c:600:5: error: 'sched' may be used uninitialized in this function [-Werror=maybe-uninitialized]
This sets the 'sched' variable to zero to restore the previous
behavior.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: f5c17aaeb2 ("rxrpc: Calls should only have one terminal state")
Signed-off-by: David Howells <dhowells@redhat.com>
Tunnel deletion is delayed by both a workqueue (l2tp_tunnel_delete -> wq
-> l2tp_tunnel_del_work) and RCU (sk_destruct -> RCU ->
l2tp_tunnel_destruct).
By the time l2tp_tunnel_destruct() runs to destroy the tunnel and finish
destroying the socket, the private data reserved via the net_generic
mechanism has already been freed, but l2tp_tunnel_destruct() actually
uses this data.
Make sure tunnel deletion for the netns has completed before returning
from l2tp_exit_net() by first flushing the tunnel removal workqueue, and
then waiting for RCU callbacks to complete.
Fixes: 167eb17e0b ("l2tp: create tunnel sockets in the right namespace")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 8eb30be035 ("ipv6: Create ip6_tnl_xmit") unsets
flowi6_proto in ip4ip6_tnl_xmit() and ip6ip6_tnl_xmit().
Since xfrm_selector_match() relies on this info, IPv6 packets
sent by an ip6tunnel cannot be properly selected by their
protocols after removing it. This patch puts flowi6_proto back.
Cc: stable@vger.kernel.org
Fixes: 8eb30be035 ("ipv6: Create ip6_tnl_xmit")
Signed-off-by: Eli Cooper <elicooper@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a per-port flag to control the unknown multicast flood, similar to the
unknown unicast flood flag and break a few long lines in the netlink flag
exports.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the unicast flag and introduce an exact pkt_type. That would help us
for the upcoming per-port multicast flood flag and also slightly reduce the
tests in the input fast path.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The original codes depend on that the function parameters are evaluated from
left to right. But the parameter's evaluation order is not defined in C
standard actually.
When flow_keys_have_l4(&keys) is invoked before ___skb_get_hash(skb, &keys,
hashrnd) with some compilers or environment, the keys passed to
flow_keys_have_l4 is not initialized.
Fixes: 6db61d79c1 ("flow_dissector: Ignore flow dissector return value from ___skb_get_hash")
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fdb dumps spanning multiple skb's currently restart from the first
interface again for every skb. This results in unnecessary
iterations on the already visited interfaces and their fdb
entries. In large scale setups, we have seen this to slow
down fdb dumps considerably. On a system with 30k macs we
see fdb dumps spanning across more than 300 skbs.
To fix the problem, this patch replaces the existing single fdb
marker with three markers: netdev hash entries, netdevs and fdb
index to continue where we left off instead of restarting from the
first netdev. This is consistent with link dumps.
In the process of fixing the performance issue, this patch also
re-implements fix done by
commit 472681d57a ("net: ndo_fdb_dump should report -EMSGSIZE to rtnl_fdb_dump")
(with an internal fix from Wilson Kok) in the following ways:
- change ndo_fdb_dump handlers to return error code instead
of the last fdb index
- use cb->args strictly for dump frag markers and not error codes.
This is consistent with other dump functions.
Below results were taken on a system with 1000 netdevs
and 35085 fdb entries:
before patch:
$time bridge fdb show | wc -l
15065
real 1m11.791s
user 0m0.070s
sys 1m8.395s
(existing code does not return all macs)
after patch:
$time bridge fdb show | wc -l
35085
real 0m2.017s
user 0m0.113s
sys 0m1.942s
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't expose skbs to in-kernel users, such as the AFS filesystem, but
instead provide a notification hook the indicates that a call needs
attention and another that indicates that there's a new call to be
collected.
This makes the following possibilities more achievable:
(1) Call refcounting can be made simpler if skbs don't hold refs to calls.
(2) skbs referring to non-data events will be able to be freed much sooner
rather than being queued for AFS to pick up as rxrpc_kernel_recv_data
will be able to consult the call state.
(3) We can shortcut the receive phase when a call is remotely aborted
because we don't have to go through all the packets to get to the one
cancelling the operation.
(4) It makes it easier to do encryption/decryption directly between AFS's
buffers and sk_buffs.
(5) Encryption/decryption can more easily be done in the AFS's thread
contexts - usually that of the userspace process that issued a syscall
- rather than in one of rxrpc's background threads on a workqueue.
(6) AFS will be able to wait synchronously on a call inside AF_RXRPC.
To make this work, the following interface function has been added:
int rxrpc_kernel_recv_data(
struct socket *sock, struct rxrpc_call *call,
void *buffer, size_t bufsize, size_t *_offset,
bool want_more, u32 *_abort_code);
This is the recvmsg equivalent. It allows the caller to find out about the
state of a specific call and to transfer received data into a buffer
piecemeal.
afs_extract_data() and rxrpc_kernel_recv_data() now do all the extraction
logic between them. They don't wait synchronously yet because the socket
lock needs to be dealt with.
Five interface functions have been removed:
rxrpc_kernel_is_data_last()
rxrpc_kernel_get_abort_code()
rxrpc_kernel_get_error_number()
rxrpc_kernel_free_skb()
rxrpc_kernel_data_consumed()
As a temporary hack, sk_buffs going to an in-kernel call are queued on the
rxrpc_call struct (->knlrecv_queue) rather than being handed over to the
in-kernel user. To process the queue internally, a temporary function,
temp_deliver_data() has been added. This will be replaced with common code
between the rxrpc_recvmsg() path and the kernel_rxrpc_recv_data() path in a
future patch.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuchung noticed that on the first TFO server data packet sent after
the (TFO) handshake, the server echoed the TCP timestamp value in the
SYN/data instead of the timestamp value in the final ACK of the
handshake. This problem did not happen on regular opens.
The tcp_replace_ts_recent() logic that decides whether to remember an
incoming TS value needs tp->rcv_wup to hold the latest receive
sequence number that we have ACKed (latest tp->rcv_nxt we have
ACKed). This commit fixes this issue by ensuring that a TFO server
properly updates tp->rcv_wup to match tp->rcv_nxt at the time it sends
a SYN/ACK for the SYN/data.
Reported-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Fixes: 168a8f5805 ("tcp: TCP Fast Open Server - main code path")
Signed-off-by: David S. Miller <davem@davemloft.net>
pskb_may_pull may fail due to various reasons (e.g. alloc failure), but the
skb isn't changed/dropped and processing continues so we shouldn't
increment tx_dropped.
CC: Kyeyoon Park <kyeyoonp@codeaurora.org>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
CC: Stephen Hemminger <stephen@networkplumber.org>
CC: bridge@lists.linux-foundation.org
Fixes: 958501163d ("bridge: Add support for IEEE 802.11 Proxy ARP")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All changes are notified, but the initial state was missing.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 'default' value was not advertised.
Fixes: f3a1bfb11c ("rtnl/ipv6: use netconf msg to advertise forwarding status")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
return at end of function is useless.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In a dual bearer configuration, if the second tipc link becomes
active while the first link still has pending nametable "bulk"
updates, it randomly leads to reset of the second link.
When a link is established, the function named_distribute(),
fills the skb based on node mtu (allows room for TUNNEL_PROTOCOL)
with NAME_DISTRIBUTOR message for each PUBLICATION.
However, the function named_distribute() allocates the buffer by
increasing the node mtu by INT_H_SIZE (to insert NAME_DISTRIBUTOR).
This consumes the space allocated for TUNNEL_PROTOCOL.
When establishing the second link, the link shall tunnel all the
messages in the first link queue including the "bulk" update.
As size of the NAME_DISTRIBUTOR messages while tunnelling, exceeds
the link mtu the transmission fails (-EMSGSIZE).
Thus, the synch point based on the message count of the tunnel
packets is never reached leading to link timeout.
In this commit, we adjust the size of name distributor message so that
they can be tunnelled.
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry reported a double free on kcm socket, which could
be easily reproduced by:
#include <unistd.h>
#include <sys/syscall.h>
int main()
{
int fd = syscall(SYS_socket, 0x29ul, 0x5ul, 0x0ul, 0, 0, 0);
syscall(SYS_ioctl, fd, 0x89e2ul, 0x20a98000ul, 0, 0, 0);
return 0;
}
This is because on the error path, after we install
the new socket file, we call sock_release() to clean
up the socket, which leaves the fd pointing to a freed
socket. Fix this by calling sys_close() on that fd
directly.
Fixes: ab7ac4eb98 ("kcm: Kernel Connection Multiplexor module")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add SWITCHDEV_OBJ_ID_PORT_MDB support to the DSA layer.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit bc8c20acae ("bridge: multicast: treat igmpv3 report with
INCLUDE and no sources as a leave") seems to have accidentally reverted
commit 47cc84ce0c ("bridge: fix parsing of MLDv2 reports"). This
commit brings back a change to br_ip6_multicast_mld2_report() where
parsing of MLDv2 reports stops when the first group is successfully
added to the MDB cache.
Fixes: bc8c20acae ("bridge: multicast: treat igmpv3 report with INCLUDE and no sources as a leave")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As reported by Lennert the MPLS GSO code is failing to properly segment
large packets. There are a couple of problems:
1. the inner protocol is not set so the gso segment functions for inner
protocol layers are not getting run, and
2 MPLS labels for packets that use the "native" (non-OVS) MPLS code
are not properly accounted for in mpls_gso_segment.
The MPLS GSO code was added for OVS. It is re-using skb_mac_gso_segment
to call the gso segment functions for the higher layer protocols. That
means skb_mac_gso_segment is called twice -- once with the network
protocol set to MPLS and again with the network protocol set to the
inner protocol.
This patch sets the inner skb protocol addressing item 1 above and sets
the network_header and inner_network_header to mark where the MPLS labels
start and end. The MPLS code in OVS is also updated to set the two
network markers.
>From there the MPLS GSO code uses the difference between the network
header and the inner network header to know the size of the MPLS header
that was pushed. It then pulls the MPLS header, resets the mac_len and
protocol for the inner protocol and then calls skb_mac_gso_segment
to segment the skb.
Afterward the inner protocol segmentation is done the skb protocol
is set to mpls for each segment and the network and mac headers
restored.
Reported-by: Lennert Buytenhek <buytenh@wantstofly.org>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Today mpls iptunnel lwtunnel_output redirect expects the tunnel
output function to handle fragmentation. This is ok but can be
avoided if we did not do the mpls output redirect too early.
ie we could wait until ip fragmentation is done and then call
mpls output for each ip fragment.
To make this work we will need,
1) the lwtunnel state to carry encap headroom
2) and do the redirect to the encap output handler on the ip fragment
(essentially do the output redirect after fragmentation)
This patch adds tunnel headroom in lwtstate to make sure we
account for tunnel data in mtu calculations during fragmentation
and adds new xmit redirect handler to redirect to lwtunnel xmit func
after ip fragmentation.
This includes IPV6 and some mtu fixes and testing from David Ahern.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 145dd5f9c8 ("net: flush the softnet backlog in process
context"), we can easily batch calls to flush_all_backlogs() for all
devices processed in rollback_registered_many()
Tested:
Before patch, on an idle host.
modprobe dummy numdummies=10000
perf stat -e context-switches -a rmmod dummy
Performance counter stats for 'system wide':
1,211,798 context-switches
1.302137465 seconds time elapsed
After patch:
perf stat -e context-switches -a rmmod dummy
Performance counter stats for 'system wide':
225,523 context-switches
0.721623566 seconds time elapsed
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree,
they are:
1) Allow nf_tables reject expression from input, forward and output hooks,
since only there the routing information is available, otherwise we crash.
2) Fix unsafe list iteration when flushing timeout and accouting objects.
3) Fix refcount leak on timeout policy parsing failure.
4) Unlink timeout object for unconfirmed conntracks too
5) Missing validation of pkttype mangling from bridge family.
6) Fix refcount leak on ebtables on second lookup for the specific
bridge match extension, this patch from Sabrina Dubroca.
7) Remove unnecessary ip_hdr() in nf_tables_netdev family.
Patches from 1-5 and 7 from Liping Zhang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
* revert a recent wext patch, which Ben Hutchings noticed was
wrong, and it turns out not to be necessary for any driver
* fix an infinite loop that can occur under certain conditions
in mac80211's TDLS code (depending on regulatory information)
* add a cfg80211_get_station() static inline when cfg80211 isn't
built, to allow other modules to not have to depend on it for it
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJXxSR4AAoJEGt7eEactAAdl3EP/AiUwqrYqbLnnFy6C7obFS3p
eBBMxQAZbT+q+fFlZvqRrt5tPdkYriPLhm/0sAzuapnyS+Q6seNJ/vPoo91uC1jU
ZI/j97v9NwUtRLfNCq+0Jwvs7ma0U1VEcPV9wDdV5JgnKk0Z1CUIcsErYr1+v0YQ
EpRwxczhzJNTULW36UP7RvVQpxwIGldPhxSZ0t1uHWaYTFliaTlnJUAk0ql44Lmm
WLvoMSjFgX99P11ToCe81MPEzF2IXILvxPwtNZmn5tldEN2xknKEoEmmbN65fYDf
OIJIJ3s1CijQvnkgXtU0RWWCMnyOoJjsLckgSDdy0euhbS5xRIfxBN2n+kqaI9WV
a/aIvWNNhvAy2vNdWUJk0FrVBnDjlTtG1afIEAgJyP7uxTQqepQfyaRENLtH+kKe
lWbOITUZztyagGIn8Bv1pDrrqwO+fSjiEsVEVAQMMmNKBpUWf8urhDQmabCLYGDB
Nxh2e3wjv5ZQ+55uJIGRDCcPIrddh86FVtQBqTID+86r4a1RwPaWfhzFZYVj84fg
504UzwtYlw1ITUhGbdMwribLVkwtBMVuEvpPrh6avwzS8wAH4upkhp4GGl/tfd/Z
De0LqpxCKbDiI+VmmDo8FD4nx4wu4nTYIaecLjNoUSXhbjbPyI6V8/hIXqKiQUA3
ObkKlGicZJmhMa4zna0q
=dFzv
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2016-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Three little fixes:
* revert a recent wext patch, which Ben Hutchings noticed was
wrong, and it turns out not to be necessary for any driver
* fix an infinite loop that can occur under certain conditions
in mac80211's TDLS code (depending on regulatory information)
* add a cfg80211_get_station() static inline when cfg80211 isn't
built, to allow other modules to not have to depend on it for it
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Highlights include:
Stable patches:
- Fix a refcount leak in nfs_callback_up_net
- Fix an Oopsable condition when the flexfile pNFS driver connection to
the DS fails
- Fix an Oopsable condition in NFSv4.1 server callback races
- Ensure pNFS clients stop doing I/O to the DS if their lease has expired,
as required by the NFSv4.1 protocol
Bugfixes:
- Fix potential looping in the NFSv4.x migration code
- Patch series to close callback races for OPEN, LAYOUTGET and LAYOUTRETURN
- Silence WARN_ON when NFSv4.1 over RDMA is in use
- Fix a LAYOUTCOMMIT race in the pNFS/blocks client
- Fix pNFS timeout issues when the DS fails
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJXxbnyAAoJEGcL54qWCgDykWoP/jqgBBR/cSaOtx+5m39wlf0P
pTdQkgcpWnhBS90tKZtC6zfJ2DFVt8sUNVn9+mVzT4Q7TgEcAmENQ//s0igxHLbl
bkXPvULydvD05Db8m1xmq2snj72tWbpg3CaA7nfx6yiP63k237QxhyNZVkmEQDur
ynU8dPzmxRaSTQdVgatdS0zqx8sF47OFnXVxkV0ssBKORGsWj3yKDcs293NZNFAM
Ztkih5oW1mm+BtWUQVNrjRnfZFG+PxAxWv090JM6wABDRbDHwSaKmwmI0kWRKXoH
DHrj4i/Wzws65Fg5AyVPSRkF8YvHSVsLnw/FlwKKZFsrWjU6WtLdLSzgzwQ47x98
tQk/YGgNyiiD1cAcw+l0d3Ct1SO4AptNuisdJK0cn3iCdsbh6Y0eW6yRRtQY6jQI
8qOyMTT8fp9ooEQK+nMNOhJVVlsG0hbvWAt/uiiBdPhjAfVB0UFRuua/vNKUO7yv
hJkDY9i7EkMXKACf5BCpBuvYdU7rwqp43K9x34029A5vFTKOhJZS4hnAIocDd/WF
Hw7yqHdpkvI5RgFbBV5tmfZPyS65k8AzzTtT1QHKlH0qEtN2iMaXsXM9EzK5bKfW
85Cc6yzRk7NzDZKmZFs/T8zCYdzet48sCY7wVyOQjL0aIkIDNNcZhex+C1GuD1dp
Ld0H5f9eZdwv/OAqJ8tm
=U+XK
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.8-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
Stable patches:
- Fix a refcount leak in nfs_callback_up_net
- Fix an Oopsable condition when the flexfile pNFS driver connection
to the DS fails
- Fix an Oopsable condition in NFSv4.1 server callback races
- Ensure pNFS clients stop doing I/O to the DS if their lease has
expired, as required by the NFSv4.1 protocol
Bugfixes:
- Fix potential looping in the NFSv4.x migration code
- Patch series to close callback races for OPEN, LAYOUTGET and
LAYOUTRETURN
- Silence WARN_ON when NFSv4.1 over RDMA is in use
- Fix a LAYOUTCOMMIT race in the pNFS/blocks client
- Fix pNFS timeout issues when the DS fails"
* tag 'nfs-for-4.8-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4.x: Fix a refcount leak in nfs_callback_up_net
NFS4: Avoid migration loops
pNFS/flexfiles: Fix an Oopsable condition when connection to the DS fails
NFSv4.1: Remove obsolete and incorrrect assignment in nfs4_callback_sequence
NFSv4.1: Close callback races for OPEN, LAYOUTGET and LAYOUTRETURN
NFSv4.1: Defer bumping the slot sequence number until we free the slot
NFSv4.1: Delay callback processing when there are referring triples
NFSv4.1: Fix Oopsable condition in server callback races
SUNRPC: Silence WARN_ON when NFSv4.1 over RDMA is in use
pnfs/blocklayout: update last_write_offset atomically with extents
pNFS: The client must not do I/O to the DS if it's lease has expired
pNFS: Handle NFS4ERR_OLD_STATEID correctly in LAYOUTSTAT calls
pNFS/flexfiles: Set reasonable default retrans values for the data channel
NFS: Allow the mount option retrans=0
pNFS/flexfiles: Fix layoutstat periodic reporting
Pass struct socket * to more rxrpc kernel interface functions. They should
be starting from this rather than the socket pointer in the rxrpc_call
struct if they need to access the socket.
I have left:
rxrpc_kernel_is_data_last()
rxrpc_kernel_get_abort_code()
rxrpc_kernel_get_error_number()
rxrpc_kernel_free_skb()
rxrpc_kernel_data_consumed()
unmodified as they're all about to be removed (and, in any case, don't
touch the socket).
Signed-off-by: David Howells <dhowells@redhat.com>
Provide a function so that kernel users, such as AFS, can ask for the peer
address of a call:
void rxrpc_kernel_get_peer(struct rxrpc_call *call,
struct sockaddr_rxrpc *_srx);
In the future the kernel service won't get sk_buffs to look inside.
Further, this allows us to hide any canonicalisation inside AF_RXRPC for
when IPv6 support is added.
Also propagate this through to afs_find_server() and issue a warning if we
can't handle the address family yet.
Signed-off-by: David Howells <dhowells@redhat.com>
Condense the terminal states of a call state machine to a single state,
plus a separate completion type value. The value is then set, along with
error and abort code values, only when the call is transitioned to the
completion state.
Helpers are provided to simplify this.
Signed-off-by: David Howells <dhowells@redhat.com>
The call pointer in a channel on a connection will be NULL if there's no
active call on that channel. rxrpc_abort_calls() needs to check for this
before trying to take the call's state_lock.
Signed-off-by: David Howells <dhowells@redhat.com>
The nf_log_set is an interface function, so it should do the strict sanity
check of parameters. Convert the return value of nf_log_set as int instead
of void. When the pf is invalid, return -EOPNOTSUPP.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There is one macro ARPHRD_ETHER which defines the ethernet proto for ARP,
so we could use it instead of the literal number '1'.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
After timer removal this just calls nf_ct_delete so remove the __ prefix
version and make nf_ct_kill a shorthand for nf_ct_delete.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If we evicted a large fraction of the scanned conntrack entries re-schedule
the next gc cycle for immediate execution.
This triggers during tests where load is high, then drops to zero and
many connections will be in TW/CLOSE state with < 30 second timeouts.
Without this change it will take several minutes until conntrack count
comes back to normal.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Conntrack gc worker to evict stale entries.
GC happens once every 5 seconds, but we only scan at most 1/64th of the
table (and not more than 8k) buckets to avoid hogging cpu.
This means that a complete scan of the table will take several minutes
of wall-clock time.
Considering that the gc run will never have to evict any entries
during normal operation because those will happen from packet path
this should be fine.
We only need gc to make sure userspace (conntrack event listeners)
eventually learn of the timeout, and for resource reclaim in case the
system becomes idle.
We do not disable BH and cond_resched for every bucket so this should
not introduce noticeable latencies either.
A followup patch will add a small change to speed up GC for the extreme
case where most entries are timed out on an otherwise idle system.
v2: Use cond_resched_rcu_qs & add comment wrt. missing restart on
nulls value change in gc worker, suggested by Eric Dumazet.
v3: don't call cancel_delayed_work_sync twice (again, Eric).
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When dumping we already have to look at the entire table, so we might
as well toss those entries whose timeout value is in the past.
We also look at every entry during resize operations.
However, eviction there is not as simple because we hold the
global resize lock so we can't evict without adding a 'expired' list
to drop from later. Considering that resizes are very rare it doesn't
seem worth doing it.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
With stats enabled this eats 80 bytes on x86_64 per nf_conn entry, as
Eric Dumazet pointed out during netfilter workshop 2016.
Eric also says: "Another reason was the fact that Thomas was about to
change max timer range [..]" (500462a9de, 'timers: Switch to
a non-cascading wheel').
Remove the timer and use a 32bit jiffies value containing timestamp until
entry is valid.
During conntrack lookup, even before doing tuple comparision, check
the timeout value and evict the entry in case it is too old.
The dying bit is used as a synchronization point to avoid races where
multiple cpus try to evict the same entry.
Because lookup is always lockless, we need to bump the refcnt once
when we evict, else we could try to evict already-dead entry that
is being recycled.
This is the standard/expected way when conntrack entries are destroyed.
Followup patches will introduce garbage colliction via work queue
and further places where we can reap obsoleted entries (e.g. during
netlink dumps), this is needed to avoid expired conntracks from hanging
around for too long when lookup rate is low after a busy period.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The reliable event delivery mode currently (ab)uses the DYING bit to
detect which entries on the dying list have to be skipped when
re-delivering events from the eache worker in reliable event mode.
Currently when we delete the conntrack from main table we only set this
bit if we could also deliver the netlink destroy event to userspace.
If we fail we move it to the dying list, the ecache worker will
reattempt event delivery for all confirmed conntracks on the dying list
that do not have the DYING bit set.
Once timer is gone, we can no longer use if (del_timer()) to detect
when we 'stole' the reference count owned by the timer/hash entry, so
we need some other way to avoid racing with other cpu.
Pablo suggested to add a marker in the ecache extension that skips
entries that have been unhashed from main table but are still waiting
for the last reference count to be dropped (e.g. because one skb waiting
on nfqueue verdict still holds a reference).
We do this by adding a tristate.
If we fail to deliver the destroy event, make a note of this in the
eache extension. The worker can then skip all entries that are in
a different state. Either they never delivered a destroy event,
e.g. because the netlink backend was not loaded, or redelivery took
place already.
Once the conntrack timer is removed we will now be able to replace
del_timer() test with test_and_set_bit(DYING, &ct->status) to avoid
racing with other cpu that tries to evict the same conntrack.
Because DYING will then be set right before we report the destroy event
we can no longer skip event reporting when dying bit is set.
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In case nf_conntrack_tuple_taken did not find a conflicting entry
check that all entries in this hash slot were tested and restart
in case an entry was moved to another chain.
Reported-by: Eric Dumazet <edumazet@google.com>
Fixes: ea781f197d ("netfilter: nf_conntrack: use SLAB_DESTROY_BY_RCU and get rid of call_rcu()")
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We have already use skb_header_pointer to get the ip header pointer,
so there's no need to use ip_hdr again. Moreover, in NETDEV INGRESS
hook, ip header maybe not linear, so use ip_hdr is not appropriate,
remove it.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Stop downgrading TDLS chandef when reaching the AP BW. The AP provides
the necessary regulatory protection in this case.
This fixes https://bugzilla.kernel.org/show_bug.cgi?id=153961, which
reported an infinite loop here.
Reported-by: Kamil Toman <kamil.toman@gmail.com>
Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The addition of VLAN support caused a possible use of uninitialized
data if we encounter a zero TCA_FLOWER_KEY_ETH_TYPE key, as pointed
out by "gcc -Wmaybe-uninitialized":
net/sched/cls_flower.c: In function 'fl_change':
net/sched/cls_flower.c:366:22: error: 'ethertype' may be used uninitialized in this function [-Werror=maybe-uninitialized]
This changes the code to only set the ethertype field if it
was nonzero, as before the patch.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 9399ae9a6c ("net_sched: flower: Add vlan support")
Cc: Hadar Hen Zion <hadarh@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When TCP operates in lossy environments (between 1 and 10 % packet
losses), many SACK blocks can be exchanged, and I noticed we could
drop them on busy senders, if these SACK blocks have to be queued
into the socket backlog.
While the main cause is the poor performance of RACK/SACK processing,
we can try to avoid these drops of valuable information that can lead to
spurious timeouts and retransmits.
Cause of the drops is the skb->truesize overestimation caused by :
- drivers allocating ~2048 (or more) bytes as a fragment to hold an
Ethernet frame.
- various pskb_may_pull() calls bringing the headers into skb->head
might have pulled all the frame content, but skb->truesize could
not be lowered, as the stack has no idea of each fragment truesize.
The backlog drops are also more visible on bidirectional flows, since
their sk_rmem_alloc can be quite big.
Let's add some room for the backlog, as only the socket owner
can selectively take action to lower memory needs, like collapsing
receive queues or partial ofo pruning.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
kcm and strparser need to work with any type of stream socket not just
TCP. Eliminate references to TCP and call generic proto_ops functions of
read_sock and peek_len. Also in strp_init check if the socket support
the proto_ops read_sock and peek_len.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In inet_stream_ops we set read_sock to tcp_read_sock and peek_len to
tcp_peek_len (which is just a stub function that calls tcp_inq).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using replicast a UDP bearer can have an arbitrary amount of
remote ip addresses associated with it. This means we cannot simply
add all remote ip addresses to an existing bearer data message as it
might fill the message, leaving us with a truncated message that we
can't safely resume. To handle this we introduce the new netlink
command TIPC_NL_UDP_GET_REMOTEIP. This command is intended to be
called when the bearer data message has the
TIPC_NLA_UDP_MULTI_REMOTEIP flag set, indicating there are more than
one remote ip (replicast).
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add UDP bearer options to netlink bearer get message. This is used by
the tipc user space tool to display UDP options.
The UDP bearer information is passed using either a sockaddr_in or
sockaddr_in6 structs. This means the user space receiver should
intermediately store the retrieved data in a large enough struct
(sockaddr_strage) before casting to the proper IP version type.
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Automatically learn UDP remote IP addresses of communicating peers by
looking at the source IP address of incoming TIPC link configuration
messages (neighbor discovery).
This makes configuration slightly easier and removes the problematic
scenario where a node receives directly addressed neighbor discovery
messages sent using replicast which the node cannot "reply" to using
mutlicast, leaving the link FSM in a limbo state.
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces UDP replicast. A concept where we emulate
multicast by sending multiple unicast messages to configured peers.
The purpose of replicast is mainly to be able to use TIPC in cloud
environments where IP multicast is disabled. Using replicas to unicast
multicast messages is costly as we have to copy each skb and send the
copies individually.
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a function to check if a tipc UDP media address is a multicast
address or not. This is a purely cosmetic change.
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Split the UDP send function into two. One callback that prepares the
skb and one transmit function that sends the skb. This will come in
handy in later patches, when we introduce UDP replicast.
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Split the UDP netlink parse function so that it only parses one
netlink attribute at the time. This makes the parse function more
generic and allow future UDP API functions to use it for parsing.
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth 2016-08-25
Here are a couple of important Bluetooth fixes for the 4.8 kernel:
- Memory leak fix for HCI requests
- Fix sk_filter handling with L2CAP
- Fix sock_recvmsg behavior when MSG_TRUNC is not set
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
switchdev_port_fwd_mark_set() is used to set the 'offload_fwd_mark' of
port netdevs so that packets being flooded by the device won't be
flooded twice.
It works by assigning a unique identifier (the ifindex of the first
bridge port) to bridge ports sharing the same parent ID. This prevents
packets from being flooded twice by the same switch, but will flood
packets through bridge ports belonging to a different switch.
This method is problematic when stacked devices are taken into account,
such as VLANs. In such cases, a physical port netdev can have upper
devices being members in two different bridges, thus requiring two
different 'offload_fwd_mark's to be configured on the port netdev, which
is impossible.
The main problem is that packet and netdev marking is performed at the
physical netdev level, whereas flooding occurs between bridge ports,
which are not necessarily port netdevs.
Instead, packet and netdev marking should really be done in the bridge
driver with the switch driver only telling it which packets it already
forwarded. The bridge driver will mark such packets using the mark
assigned to the ingress bridge port and will prevent the packet from
being forwarded through any bridge port sharing the same mark (i.e.
having the same parent ID).
Remove the current switchdev 'offload_fwd_mark' implementation and
instead implement the proposed method. In addition, make rocker - the
sole user of the mark - use the proposed method.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
switchdev_port_same_parent_id() currently expects port netdevs, but we
need it to support stacked devices in the next patch, so drop the
NO_RECURSE flag.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently in process_backlog(), the process_queue dequeuing is
performed with local IRQ disabled, to protect against
flush_backlog(), which runs in hard IRQ context.
This patch moves the flush operation to a work queue and runs the
callback with bottom half disabled to protect the process_queue
against dequeuing.
Since process_queue is now always manipulated in bottom half context,
the irq disable/enable pair around the dequeue operation are removed.
To keep the flush time as low as possible, the flush
works are scheduled on all online cpu simultaneously, using the
high priority work-queue and statically allocated, per cpu,
work structs.
Overall this change increases the time required to destroy a device
to improve slightly the packets reinjection performances.
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When I added support to export the vlan entry flags via xstats I forgot to
add support for the pvid since it is manually matched, so check if the
entry matches the vlan_group's pvid and set the flag appropriately.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nft_dump_register() should only be used with registers, not with
immediates.
Fixes: cb1b69b0b1 ("netfilter: nf_tables: add hash expression")
Fixes: 91dbc6be0a62("netfilter: nf_tables: add number generator expression")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If the NLM_F_EXCL flag is set, then new elements that clash with an
existing one return EEXIST. In case you try to add an element whose
data area differs from what we have, then this returns EBUSY. If no
flag is specified at all, then this returns success to userspace.
This patch also update the set insert operation so we can fetch the
existing element that clashes with the one you want to add, we need
this to make sure the element data doesn't differ.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>