Pablo Neira Ayuso says:
====================
The following patchset contains Netfilter/IPVS updates for
your net-next tree, they are:
* Better performance in nfnetlink_queue by avoiding copy from the
packet to netlink message, from Eric Dumazet.
* Remove unnecessary locking in the exit path of ebt_ulog, from Gao Feng.
* Use new function ipv6_iface_scope_id in nf_ct_ipv6, from Hannes Frederic Sowa.
* A couple of sparse fixes for IPVS, from Julian Anastasov.
* Use xor hashing in nfnetlink_queue, as suggested by Eric Dumazet, from
myself.
* Allow to dump expectations per master conntrack via ctnetlink, from myself.
* A couple of cleanups to use PTR_RET in module init path, from Silviu-Mihai
Popescu.
* Remove nf_conntrack module a bit faster if netns are in use, from
Vladimir Davydov.
* Use checksum_partial in ip6t_NPT, from YOSHIFUJI Hideaki.
* Sparse fix for nf_conntrack, from Stephen Hemminger.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When neither IFF_BRIDGE nor IFF_BRIDGE_PORT is set,
and afspec == NULL but protinfo != NULL, we run into
"if (err == 0) br_ifinfo_notify(RTM_NEWLINK, p);" with
random value in ret.
Thanks to Sergei for pointing out the error in commit comments.
Signed-off-by: Hong Zhiguo <honkiko@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for registering DSA switches using Device Tree
bindings. Note that we support programming the switch routing table even
though no in-tree user seems to require it. I tested this on Armada 370
with a Marvell 88E6172 (not supported by mainline yet).
Signed-off-by: Florian Fainelli <florian@openwrt.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hello!
After patch 1 got accepted to net-next I will also send a patch to
netfilter-devel to make the corresponding changes to the netfilter
reassembly logic.
Thanks,
Hannes
-- >8 --
[PATCH 2/2] ipv6: implement RFC3168 5.3 (ecn protection) for ipv6 fragmentation handling
This patch also ensures that INET_ECN_CE is propagated if one fragment
had the codepoint set.
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jesper Dangaard Brouer <jbrouer@redhat.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch just moves some code arround to make the ip4_frag_ecn_table
and IPFRAG_ECN_* constants accessible from the other reassembly engines. I
also renamed ip4_frag_ecn_table to ip_frag_ecn_table.
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jesper Dangaard Brouer <jbrouer@redhat.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a dev_addr_genid for IPv6. The goal is to use it, combined with
dev_base_seq to check if a change occurs during a netlink dump.
If a change is detected, the flag NLM_F_DUMP_INTR is set in the first message
after the dump was interrupted.
Note that only dump of unicast addresses is checked (multicast and anycast are
not checked).
Reported-by: Junwei Zhang <junwei.zhang@6wind.com>
Reported-by: Hongjun Li <hongjun.li@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch takes benefit of dev_addr_genid and dev_base_seq to check if a change
occurs during a netlink dump. If a change is detected, the flag NLM_F_DUMP_INTR
is set in the first message after the dump was interrupted.
Note that seq and prev_seq must be reset between each family in rtnl_dump_all()
because they are specific to each family.
Reported-by: Junwei Zhang <junwei.zhang@6wind.com>
Reported-by: Hongjun Li <hongjun.li@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a cut and paste typo. We call ->ref() a second time instead
of ->deref().
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull to get the thermal netlink multicast group name fix, otherwise
the assertion added in net-next to netlink to detect that kind of bug
makes systems unbootable for some folks.
Signed-off-by: David S. Miller <davem@davemloft.net>
Otherwise build fails with CONFIG_DECNET && !CONFIG_DECNET_ROUTER
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
A long standing problem with TSO is the fact that tcp_tso_should_defer()
rearms the deferred timer, while it should not.
Current code leads to following bad bursty behavior :
20:11:24.484333 IP A > B: . 297161:316921(19760) ack 1 win 119
20:11:24.484337 IP B > A: . ack 263721 win 1117
20:11:24.485086 IP B > A: . ack 265241 win 1117
20:11:24.485925 IP B > A: . ack 266761 win 1117
20:11:24.486759 IP B > A: . ack 268281 win 1117
20:11:24.487594 IP B > A: . ack 269801 win 1117
20:11:24.488430 IP B > A: . ack 271321 win 1117
20:11:24.489267 IP B > A: . ack 272841 win 1117
20:11:24.490104 IP B > A: . ack 274361 win 1117
20:11:24.490939 IP B > A: . ack 275881 win 1117
20:11:24.491775 IP B > A: . ack 277401 win 1117
20:11:24.491784 IP A > B: . 316921:332881(15960) ack 1 win 119
20:11:24.492620 IP B > A: . ack 278921 win 1117
20:11:24.493448 IP B > A: . ack 280441 win 1117
20:11:24.494286 IP B > A: . ack 281961 win 1117
20:11:24.495122 IP B > A: . ack 283481 win 1117
20:11:24.495958 IP B > A: . ack 285001 win 1117
20:11:24.496791 IP B > A: . ack 286521 win 1117
20:11:24.497628 IP B > A: . ack 288041 win 1117
20:11:24.498459 IP B > A: . ack 289561 win 1117
20:11:24.499296 IP B > A: . ack 291081 win 1117
20:11:24.500133 IP B > A: . ack 292601 win 1117
20:11:24.500970 IP B > A: . ack 294121 win 1117
20:11:24.501388 IP B > A: . ack 295641 win 1117
20:11:24.501398 IP A > B: . 332881:351881(19000) ack 1 win 119
While the expected behavior is more like :
20:19:49.259620 IP A > B: . 197601:202161(4560) ack 1 win 119
20:19:49.260446 IP B > A: . ack 154281 win 1212
20:19:49.261282 IP B > A: . ack 155801 win 1212
20:19:49.262125 IP B > A: . ack 157321 win 1212
20:19:49.262136 IP A > B: . 202161:206721(4560) ack 1 win 119
20:19:49.262958 IP B > A: . ack 158841 win 1212
20:19:49.263795 IP B > A: . ack 160361 win 1212
20:19:49.264628 IP B > A: . ack 161881 win 1212
20:19:49.264637 IP A > B: . 206721:211281(4560) ack 1 win 119
20:19:49.265465 IP B > A: . ack 163401 win 1212
20:19:49.265886 IP B > A: . ack 164921 win 1212
20:19:49.266722 IP B > A: . ack 166441 win 1212
20:19:49.266732 IP A > B: . 211281:215841(4560) ack 1 win 119
20:19:49.267559 IP B > A: . ack 167961 win 1212
20:19:49.268394 IP B > A: . ack 169481 win 1212
20:19:49.269232 IP B > A: . ack 171001 win 1212
20:19:49.269241 IP A > B: . 215841:221161(5320) ack 1 win 119
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With decnet converted, we can finally get rid of rta_buf and its
computations around it. It also gets rid of the minimal header
length verification since all message handlers do that explicitly
anyway.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
decnet is the only subsystem left that is relying on the global
netlink attribute buffer rta_buf. It's horrible design and we
want to get rid of it.
This converts all of decnet to do implicit attribute parsing. It
also gets rid of the error prone struct dn_kern_rta.
Yes, the fib_magic() stuff is not pretty.
It's compiled tested but I need someone with appropriate hardware
to test the patch since I don't have access to it.
Cc: linux-decnet-user@lists.sourceforge.net
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to GRE tunnel, UDP tunnel should take care of IP header ID
too.
Cc: Pravin B Shelar <pshelar@nicira.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to the previous discussion [1] on netdev list, DaveM insists
we should increase the IP header ID for each segmented packets.
This patch fixes it.
Cc: Pravin B Shelar <pshelar@nicira.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
1. http://marc.info/?t=136384172700001&r=1&w=2
Signed-off-by: David S. Miller <davem@davemloft.net>
The netlink_diag can be built as a module, just like it's done in
unix sockets.
The core dumping message carries the basic info about netlink sockets:
family, type and protocol, portis, dst_group, dst_portid, state.
Groups can be received as an optional parameter NETLINK_DIAG_GROUPS.
Netlink sockets cab be filtered by protocols.
The socket inode number and cookie is reserved for future per-socket info
retrieving. The per-protocol filtering is also reserved for future by
requiring the sdiag_protocol to be zero.
The file /proc/net/netlink doesn't provide enough information for
dumping netlink sockets. It doesn't provide dst_group, dst_portid,
groups above 32.
v2: fix NETLINK_DIAG_MAX. Now it's equal to the last constant.
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move a few declarations in a header.
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The cgroup code has been surrounded by ifdef CONFIG_NET_CLS_CGROUP
and CONFIG_NETPRIO_CGROUP.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements F-RTO (foward RTO recovery):
When the first retransmission after timeout is acknowledged, F-RTO
sends new data instead of old data. If the next ACK acknowledges
some never-retransmitted data, then the timeout was spurious and the
congestion state is reverted. Otherwise if the next ACK selectively
acknowledges the new data, then the timeout was genuine and the
loss recovery continues. This idea applies to recurring timeouts
as well. While F-RTO sends different data during timeout recovery,
it does not (and should not) change the congestion control.
The implementaion follows the three steps of SACK enhanced algorithm
(section 3) in RFC5682. Step 1 is in tcp_enter_loss(). Step 2 and
3 are in tcp_process_loss(). The basic version is not supported
because SACK enhanced version also works for non-SACK connections.
The new implementation is functionally in parity with the old F-RTO
implementation except the one case where it increases undo events:
In addition to the RFC algorithm, a spurious timeout may be detected
without sending data in step 2, as long as the SACK confirms not
all the original data are dropped. When this happens, the sender
will undo the cwnd and perhaps enter fast recovery instead. This
additional check increases the F-RTO undo events by 5x compared
to the prior implementation on Google Web servers, since the sender
often does not have new data to send for HTTP.
Note F-RTO may detect spurious timeout before Eifel with timestamps
does so.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Consolidate all of TCP CA_Loss state processing in
tcp_fastretrans_alert() into a new function called tcp_process_loss().
This is to prepare the new F-RTO implementation in the next patch.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch series refactor the F-RTO feature (RFC4138/5682).
This is to simplify the loss recovery processing. Existing F-RTO
was developed during the experimental stage (RFC4138) and has
many experimental features. It takes a separate code path from
the traditional timeout processing by overloading CA_Disorder
instead of using CA_Loss state. This complicates CA_Disorder state
handling because it's also used for handling dubious ACKs and undos.
While the algorithm in the RFC does not change the congestion control,
the implementation intercepts congestion control in various places
(e.g., frto_cwnd in tcp_ack()).
The new code implements newer F-RTO RFC5682 using CA_Loss processing
path. F-RTO becomes a small extension in the timeout processing
and interfaces with congestion control and Eifel undo modules.
It lets congestion control (module) determines how many to send
independently. F-RTO only chooses what to send in order to detect
spurious retranmission. If timeout is found spurious it invokes
existing Eifel undo algorithms like DSACK or TCP timestamp based
detection.
The first patch removes all F-RTO code except the sysctl_tcp_frto is
left for the new implementation. Since CA_EVENT_FRTO is removed, TCP
westwood now computes ssthresh on regular timeout CA_EVENT_LOSS event.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously, if you did an "ifconfig down" or similar on one core, and
the kernel had CONFIG_XFRM enabled, every core would be interrupted to
check its percpu flow list for items that could be garbage collected.
With this change, we generate a mask of cores that actually have any
percpu items, and only interrupt those cores. When we are trying to
isolate a set of cpus from interrupts, this is important to do.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is very useful to do dynamic truncation of packets. In particular,
we're interested to push the necessary header bytes to the user space and
cut off user payload that should probably not be transferred for some reasons
(e.g. privacy, speed, or others). With the ancillary extension PAY_OFFSET,
we can load it into the accumulator, and return it. E.g. in bpfc syntax ...
ld #poff ; { 0x20, 0, 0, 0xfffff034 },
ret a ; { 0x16, 0, 0, 0x00000000 },
... as a filter will accomplish this without having to do a big hackery in
a BPF filter itself. Follow-up JIT implementations are welcome.
Thanks to Eric Dumazet for suggesting and discussing this during the
Netfilter Workshop in Copenhagen.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__skb_get_poff() returns the offset to the payload as far as it could
be dissected. The main user is currently BPF, so that we can dynamically
truncate packets without needing to push actual payload to the user
space and instead can analyze headers only.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This makes sure that release_sock is called for all error conditions in
irda_getsockopt.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Brad Spengler <spender@grsecurity.net>
Cc: stable@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
In skb_flow_dissect(), we perform a dissection of a skbuff. Since we're
doing the work here anyway, also store thoff for a later usage, e.g. in
the BPF filter.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we postpone unhashing of l2tp sessions until the structure is freed, we
risk:
1. further packets arriving and getting queued while the pseudowire is being
closed down
2. the recv path hitting "scheduling while atomic" errors in the case that
recv drops the last reference to a session and calls l2tp_session_free
while in atomic context
As such, l2tp sessions should be unhashed from l2tp_core data structures early
in the teardown process prior to calling pseudowire close. For pseudowires
like l2tp_ppp which have multiple shutdown codepaths, provide an unhash hook.
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
l2tp's u64_stats writers were incorrectly synchronised, making it possible to
deadlock a 64bit machine running a 32bit kernel simply by sending the l2tp
code netlink commands while passing data through l2tp sessions.
Previous discussion on netdev determined that alternative solutions such as
spinlock writer synchronisation or per-cpu data would bring unjustified
overhead, given that most users interested in high volume traffic will likely
be running 64bit kernels on 64bit hardware.
As such, this patch replaces l2tp's use of u64_stats with atomic_long_t,
thereby avoiding the deadlock.
Ref:
http://marc.info/?l=linux-netdev&m=134029167910731&w=2http://marc.info/?l=linux-netdev&m=134079868111131&w=2
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If userspace deletes a ppp pseudowire using the netlink API, either by
directly deleting the session or by deleting the tunnel that contains the
session, we need to tear down the corresponding pppox channel.
Rather than trying to manage two pppox unbind codepaths, switch the netlink
and l2tp_core session_close handlers to close via. the l2tp_ppp socket
.release handler.
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add calls to l2tp_session_queue_purge as a part of l2tp_tunnel_closeall
and l2tp_session_delete. Pseudowire implementations which are deleted only
via. l2tp_core l2tp_session_delete calls can dispense with their own code for
flushing the reorder queue.
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If an l2tp session is deleted, it is necessary to delete skbs in-flight
on the session's reorder queue before taking it down.
Rather than having each pseudowire implementation reaching into the
l2tp_session struct to handle this itself, provide a function in l2tp_core to
purge the session queue.
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is valid for an existing struct sock object to have a NULL sk_socket
pointer, so don't BUG_ON in l2tp_tunnel_del_work if that should occur.
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When looking up the tunnel socket in struct l2tp_tunnel, hold a reference
whether the socket was created by the kernel or by userspace.
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a user deletes a tunnel using netlink, all the sessions in the tunnel
should also be deleted. Since running sessions will pin the tunnel socket
with the references they hold, have the l2tp_tunnel_delete close all sessions
in a tunnel before finally closing the tunnel socket.
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
l2tp_core hooks UDP's .destroy handler to gain advance warning of a tunnel
socket being closed from userspace. We need to do the same thing for
IP-encapsulation sockets.
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
l2tp_core internally uses l2tp_tunnel_closeall to close all sessions in a
tunnel when a UDP-encapsulation socket is destroyed. We need to do something
similar for IP-encapsulation sockets.
Export l2tp_tunnel_closeall as a GPL symbol to enable l2tp_ip and l2tp_ip6 to
call it from their .destroy handlers.
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
L2TP sessions hold a reference to the tunnel socket to prevent it going away
while sessions are still active. However, since tunnel destruction is handled
by the sock sk_destruct callback there is a catch-22: a tunnel with sessions
cannot be deleted since each session holds a reference to the tunnel socket.
If userspace closes a managed tunnel socket, or dies, the tunnel will persist
and it will be neccessary to individually delete the sessions using netlink
commands. This is ugly.
To prevent this occuring, this patch leverages the udp encapsulation socket
destroy callback to gain early notification when the tunnel socket is closed.
This allows us to safely close the sessions running in the tunnel, dropping
the tunnel socket references in the process. The tunnel socket is then
destroyed as normal, and the tunnel resources deallocated in sk_destruct.
While we're at it, ensure that l2tp_tunnel_closeall correctly drops session
references to allow the sessions to be deleted rather than leaking.
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Users of udp encapsulation currently have an encap_rcv callback which they can
use to hook into the udp receive path.
In situations where a encapsulation user allocates resources associated with a
udp encap socket, it may be convenient to be able to also hook the proto
.destroy operation. For example, if an encap user holds a reference to the
udp socket, the destroy hook might be used to relinquish this reference.
This patch adds a socket destroy hook into udp, which is set and enabled
in the same way as the existing encap_rcv hook.
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Trigger BUG_ON if a group name is longer than GENL_NAMSIZ.
Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
The following patchset contains 7 Netfilter/IPVS fixes for 3.9-rc, they are:
* Restrict IPv6 stateless NPT targets to the mangle table. Many users are
complaining that this target does not work in the nat table, which is the
wrong table for it, from Florian Westphal.
* Fix possible use before initialization in the netns init path of several
conntrack protocol trackers (introduced recently while improving conntrack
netns support), from Gao Feng.
* Fix incorrect initialization of copy_range in nfnetlink_queue, spotted
by Eric Dumazet during the NFWS2013, patch from myself.
* Fix wrong calculation of next SCTP chunk in IPVS, from Julian Anastasov.
* Remove rcu_read_lock section in IPVS while calling ipv4_update_pmtu
not required anymore after change introduced in 3.7, again from Julian.
* Fix SYN looping in IPVS state sync if the backup is used a real server
in DR/TUN modes, this required a new /proc entry to disable the director
function when acting as backup, also from Julian.
* Remove leftover IP_NF_QUEUE Kconfig after ip_queue removal, noted by
Paul Bolle.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Kconfig symbol IP_NF_QUEUE is unused since commit
d16cf20e2f ("netfilter: remove ip_queue
support"). Let's remove it too.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Changes:
v3->v2: rebase (no other changes)
passes selftest
v2->v1: read f->num_members only once
fix bug: test rollover mode + flag
Minimize packet drop in a fanout group. If one socket is full,
roll over packets to another from the group. Maintain flow
affinity during normal load using an rxhash fanout policy, while
dispersing unexpected traffic storms that hit a single cpu, such
as spoofed-source DoS flows. Rollover breaks affinity for flows
arriving at saturated sockets during those conditions.
The patch adds a fanout policy ROLLOVER that rotates between sockets,
filling each socket before moving to the next. It also adds a fanout
flag ROLLOVER. If passed along with any other fanout policy, the
primary policy is applied until the chosen socket is full. Then,
rollover selects another socket, to delay packet drop until the
entire system is saturated.
Probing sockets is not free. Selecting the last used socket, as
rollover does, is a greedy approach that maximizes chance of
success, at the cost of extreme load imbalance. In practice, with
sufficiently long queues to absorb bursts, sockets are drained in
parallel and load balance looks uniform in `top`.
To avoid contention, scales counters with number of sockets and
accesses them lockfree. Values are bounds checked to ensure
correctness.
Tested using an application with 9 threads pinned to CPUs, one socket
per thread and sufficient busywork per packet operation to limits each
thread to handling 32 Kpps. When sent 500 Kpps single UDP stream
packets, a FANOUT_CPU setup processes 32 Kpps in total without this
patch, 270 Kpps with the patch. Tested with read() and with a packet
ring (V1).
Also, passes psock_fanout.c unit test added to selftests.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix ARM BPF JIT handling of negative 'k' values, from Chen Gang.
2) Insufficient space reserved for bridge netlink values, fix from
Stephen Hemminger.
3) Some dst_neigh_lookup*() callers don't interpret error pointer
correctly, fix from Zhouyi Zhou.
4) Fix transport match in SCTP active_path loops, from Xugeng Zhang.
5) Fix qeth driver handling of multi-order SKB frags, from Frank
Blaschka.
6) fec driver is missing napi_disable() call, resulting in crashes on
unload, from Georg Hofmann.
7) Don't try to handle PMTU events on a listening socket, fix from Eric
Dumazet.
8) Fix timestamp location calculations in IP option processing, from
David Ward.
9) FIB_TABLE_HASHSZ setting is not controlled by the correct kconfig
tests, from Denis V Lunev.
10) Fix TX descriptor push handling in SFC driver, from Ben Hutchings.
11) Fix isdn/hisax and tulip/de4x5 kconfig dependencies, from Arnd
Bergmann.
12) bnx2x statistics don't handle 4GB rollover correctly, fix from
Maciej Żenczykowski.
13) Openvswitch bug fixes for vport del/new error reporting, missing
genlmsg_end() call in netlink processing, and mis-parsing of
LLC/SNAP ethernet types. From Rich Lane.
14) SKB pfmemalloc state should only be propagated from the head page of
a compound page, fix from Pavel Emelyanov.
15) Fix link handling in tg3 driver for 5715 chips when autonegotation
is disabled. From Nithin Sujir.
16) Fix inverted test of cpdma_check_free_tx_desc return value in
davinci_emac driver, from Mugunthan V N.
17) vlan_depth is incorrectly calculated in skb_network_protocol(), from
Li RongQing.
18) Fix probing of Gobi 1K devices in qmi_wwan driver, and fix NCM
device mode backwards compat in cdc_ncm driver. From Bjørn Mork.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits)
inet: limit length of fragment queue hash table bucket lists
qeth: Fix scatter-gather regression
qeth: Fix invalid router settings handling
qeth: delay feature trace
tcp: dont handle MTU reduction on LISTEN socket
bnx2x: fix occasional statistics off-by-4GB error
vhost/net: fix heads usage of ubuf_info
bridge: Add support for setting BR_ROOT_BLOCK flag.
bnx2x: add missing napi deletion in error path
drivers: net: ethernet: ti: davinci_emac: fix usage of cpdma_check_free_tx_desc()
ethernet/tulip: DE4x5 needs VIRT_TO_BUS
isdn: hisax: netjet requires VIRT_TO_BUS
net: cdc_ncm, cdc_mbim: allow user to prefer NCM for backwards compatibility
rtnetlink: Mask the rta_type when range checking
Revert "ip_gre: make ipgre_tunnel_xmit() not parse network header as IP unconditionally"
Fix dst_neigh_lookup/dst_neigh_lookup_skb return value handling bug
smsc75xx: configuration help incorrectly mentions smsc95xx
net: fec: fix missing napi_disable call
net: fec: restart the FEC when PHY speed changes
skb: Propagate pfmemalloc on skb from head page only
...
The patch introduces nf_conntrack_cleanup_net_list(), which cleanups
nf_conntrack for a list of netns and calls synchronize_net() only once
for them all. This should reduce netns destruction time.
I've measured cleanup time for 1k dummy net ns. Here are the results:
<without the patch>
# modprobe nf_conntrack
# time modprobe -r nf_conntrack
real 0m10.337s
user 0m0.000s
sys 0m0.376s
<with the patch>
# modprobe nf_conntrack
# time modprobe -r nf_conntrack
real 0m5.661s
user 0m0.000s
sys 0m0.216s
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Include header file to pickup prototype of nf_nat_seq_adjust_hook
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nfqnl_build_packet_message() actually copy the packet
inside the netlink message, while it can instead use
zero copy.
Make sure the skb 'copy' is the last component of the
cooked netlink message, as we cant add anything after it.
Patch cooked in Copenhagen at Netfilter Workshop ;)
Still to be addressed in separate patches :
-GRO/GSO packets are segmented in nf_queue()
and checksummed in nfqnl_build_packet_message().
Proper support for GSO/GRO packets (no segmentation,
and no checksumming) needs application cooperation, if we
want no regressions.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>