Geert reported issues regarding checksum complete and UDP.
The logic introduced in commit 7e3cead517
("net: Save software checksum complete") is not correct.
This patch:
1) Restores code in __skb_checksum_complete_header except for setting
CHECKSUM_UNNECESSARY. This function may be calculating checksum on
something less than skb->len.
2) Adds saving checksum to __skb_checksum_complete. The full packet
checksum 0..skb->len is calculated without adding in pseudo header.
This value is saved in skb->csum and then the pseudo header is added
to that to derive the checksum for validation.
3) In both __skb_checksum_complete_header and __skb_checksum_complete,
set skb->csum_valid to whether checksum of zero was computed. This
allows skb_csum_unnecessary to return true without changing to
CHECKSUM_UNNECESSARY which was done previously.
4) Copy new csum related bits in __copy_skb_header.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov.
2) Multiqueue support in xen-netback and xen-netfront, from Andrew J
Benniston.
3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn
Mork.
4) BPF now has a "random" opcode, from Chema Gonzalez.
5) Add more BPF documentation and improve test framework, from Daniel
Borkmann.
6) Support TCP fastopen over ipv6, from Daniel Lee.
7) Add software TSO helper functions and use them to support software
TSO in mvneta and mv643xx_eth drivers. From Ezequiel Garcia.
8) Support software TSO in fec driver too, from Nimrod Andy.
9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli.
10) Handle broadcasts more gracefully over macvlan when there are large
numbers of interfaces configured, from Herbert Xu.
11) Allow more control over fwmark used for non-socket based responses,
from Lorenzo Colitti.
12) Do TCP congestion window limiting based upon measurements, from Neal
Cardwell.
13) Support busy polling in SCTP, from Neal Horman.
14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru.
15) Bridge promisc mode handling improvements from Vlad Yasevich.
16) Don't use inetpeer entries to implement ID generation any more, it
performs poorly, from Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits)
rtnetlink: fix userspace API breakage for iproute2 < v3.9.0
tcp: fixing TLP's FIN recovery
net: fec: Add software TSO support
net: fec: Add Scatter/gather support
net: fec: Increase buffer descriptor entry number
net: fec: Factorize feature setting
net: fec: Enable IP header hardware checksum
net: fec: Factorize the .xmit transmit function
bridge: fix compile error when compiling without IPv6 support
bridge: fix smatch warning / potential null pointer dereference
via-rhine: fix full-duplex with autoneg disable
bnx2x: Enlarge the dorq threshold for VFs
bnx2x: Check for UNDI in uncommon branch
bnx2x: Fix 1G-baseT link
bnx2x: Fix link for KR with swapped polarity lane
sctp: Fix sk_ack_backlog wrap-around problem
net/core: Add VF link state control policy
net/fsl: xgmac_mdio is dependent on OF_MDIO
net/fsl: Make xgmac_mdio read error message useful
net_sched: drr: warn when qdisc is not work conserving
...
When running RHEL6 userspace on a current upstream kernel, "ip link"
fails to show VF information.
The reason is a kernel<->userspace API change introduced by commit
88c5b5ce5c ("rtnetlink: Call nlmsg_parse() with correct header length"),
after which the kernel does not see iproute2's IFLA_EXT_MASK attribute
in the netlink request.
iproute2 adjusted for the API change in its commit 63338dca4513
("libnetlink: Use ifinfomsg instead of rtgenmsg in rtnl_wilddump_req_filter").
The problem has been noticed before:
http://marc.info/?l=linux-netdev&m=136692296022182&w=2
(Subject: Re: getting VF link info seems to be broken in 3.9-rc8)
We can do better than tell those with old userspace to upgrade. We can
recognize the old iproute2 in the kernel by checking the netlink message
length. Even when including the IFLA_EXT_MASK attribute, its netlink
message is shorter than struct ifinfomsg.
With this patch "ip link" shows VF information in both old and new
iproute2 versions.
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/core/rtnetlink.c
net/core/skbuff.c
Both conflicts were very simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 1d8faf48c7 (net/core: Add VF link state control) added VF link state
control to the netlink VF nested structure, but failed to add a proper entry
for the new structure into the VF policy table. Add the missing entry so
the table and the actual data copied into the netlink nested struct are in
sync.
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In skb_checksum complete, if we need to compute the checksum for the
packet (via skb_checksum) save the result as CHECKSUM_COMPLETE.
Subsequent checksum verification can use this.
Also, added csum_complete_sw flag to distinguish between software and
hardware generated checksum complete, we should always be able to trust
the software computation.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are several instances where a pskb_copy or __pskb_copy is
immediately followed by an skb_clone.
Add a couple of new functions to allow the copy skb to be allocated
from the fclone cache and thus speed up subsequent skb_clone calls.
Cc: Alexander Smirnov <alex.bluesman.smirnov@gmail.com>
Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
Cc: Marek Lindner <mareklindner@neomailbox.ch>
Cc: Simon Wunderlich <sw@simonwunderlich.de>
Cc: Antonio Quartulli <antonio@meshcoding.com>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Gustavo Padovan <gustavo@padovan.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Arvid Brodin <arvid.brodin@alten.se>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: Lauro Ramos Venancio <lauro.venancio@openbossa.org>
Cc: Aloisio Almeida Jr <aloisio.almeida@openbossa.org>
Cc: Samuel Ortiz <sameo@linux.intel.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Allan Stephens <allan.stephens@windriver.com>
Cc: Andrew Hendry <andrew.hendry@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Reviewed-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fix compiler warning on 32-bit architectures:
net/core/filter.c: In function '__sk_run_filter':
net/core/filter.c:540:22: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
net/core/filter.c:550:22: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
net/core/filter.c:560:22: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a kernel BUG_ON in skb_segment. It is hit when
testing two VMs on openvswitch with one VM acting as VXLAN gateway.
During VXLAN packet GSO, skb_segment is called with skb->data
pointing to inner TCP payload. skb_segment calls skb_network_protocol
to retrieve the inner protocol. skb_network_protocol actually expects
skb->data to point to MAC and it calls pskb_may_pull with ETH_HLEN.
This ends up pulling in ETH_HLEN data from header tail. As a result,
pskb_trim logic is skipped and BUG_ON is hit later.
Move skb_push in front of skb_network_protocol so that skb->data
lines up properly.
kernel BUG at net/core/skbuff.c:2999!
Call Trace:
[<ffffffff816ac412>] tcp_gso_segment+0x122/0x410
[<ffffffff816bc74c>] inet_gso_segment+0x13c/0x390
[<ffffffff8164b39b>] skb_mac_gso_segment+0x9b/0x170
[<ffffffff816b3658>] skb_udp_tunnel_segment+0xd8/0x390
[<ffffffff816b3c00>] udp4_ufo_fragment+0x120/0x140
[<ffffffff816bc74c>] inet_gso_segment+0x13c/0x390
[<ffffffff8109d742>] ? default_wake_function+0x12/0x20
[<ffffffff8164b39b>] skb_mac_gso_segment+0x9b/0x170
[<ffffffff8164b4d0>] __skb_gso_segment+0x60/0xc0
[<ffffffff8164b6b3>] dev_hard_start_xmit+0x183/0x550
[<ffffffff8166c91e>] sch_direct_xmit+0xfe/0x1d0
[<ffffffff8164bc94>] __dev_queue_xmit+0x214/0x4f0
[<ffffffff8164bf90>] dev_queue_xmit+0x10/0x20
[<ffffffff81687edb>] ip_finish_output+0x66b/0x890
[<ffffffff81688a58>] ip_output+0x58/0x90
[<ffffffff816c628f>] ? fib_table_lookup+0x29f/0x350
[<ffffffff816881c9>] ip_local_out_sk+0x39/0x50
[<ffffffff816cbfad>] iptunnel_xmit+0x10d/0x130
[<ffffffffa0212200>] vxlan_xmit_skb+0x1d0/0x330 [vxlan]
[<ffffffffa02a3919>] vxlan_tnl_send+0x129/0x1a0 [openvswitch]
[<ffffffffa02a2cd6>] ovs_vport_send+0x26/0xa0 [openvswitch]
[<ffffffffa029931e>] do_output+0x2e/0x50 [openvswitch]
Signed-off-by: Wei-Chun Chao <weichunc@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The macro 'A' used in internal BPF interpreter:
#define A regs[insn->a_reg]
was easily confused with the name of classic BPF register 'A', since
'A' would mean two different things depending on context.
This patch is trying to clean up the naming and clarify its usage in the
following way:
- A and X are names of two classic BPF registers
- BPF_REG_A denotes internal BPF register R0 used to map classic register A
in internal BPF programs generated from classic
- BPF_REG_X denotes internal BPF register R7 used to map classic register X
in internal BPF programs generated from classic
- internal BPF instruction format:
struct sock_filter_int {
__u8 code; /* opcode */
__u8 dst_reg:4; /* dest register */
__u8 src_reg:4; /* source register */
__s16 off; /* signed offset */
__s32 imm; /* signed immediate constant */
};
- BPF_X/BPF_K is 1 bit used to encode source operand of instruction
In classic:
BPF_X - means use register X as source operand
BPF_K - means use 32-bit immediate as source operand
In internal:
BPF_X - means use 'src_reg' register as source operand
BPF_K - means use 32-bit immediate as source operand
Suggested-by: Chema Gonzalez <chema@google.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Chema Gonzalez <chema@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull cgroup updates from Tejun Heo:
"A lot of activities on cgroup side. Heavy restructuring including
locking simplification took place to improve the code base and enable
implementation of the unified hierarchy, which currently exists behind
a __DEVEL__ mount option. The core support is mostly complete but
individual controllers need further work. To explain the design and
rationales of the the unified hierarchy
Documentation/cgroups/unified-hierarchy.txt
is added.
Another notable change is css (cgroup_subsys_state - what each
controller uses to identify and interact with a cgroup) iteration
update. This is part of continuing updates on css object lifetime and
visibility. cgroup started with reference count draining on removal
way back and is now reaching a point where csses behave and are
iterated like normal refcnted objects albeit with some complexities to
allow distinguishing the state where they're being deleted. The css
iteration update isn't taken advantage of yet but is planned to be
used to simplify memcg significantly"
* 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits)
cgroup: disallow disabled controllers on the default hierarchy
cgroup: don't destroy the default root
cgroup: disallow debug controller on the default hierarchy
cgroup: clean up MAINTAINERS entries
cgroup: implement css_tryget()
device_cgroup: use css_has_online_children() instead of has_children()
cgroup: convert cgroup_has_live_children() into css_has_online_children()
cgroup: use CSS_ONLINE instead of CGRP_DEAD
cgroup: iterate cgroup_subsys_states directly
cgroup: introduce CSS_RELEASED and reduce css iteration fallback window
cgroup: move cgroup->serial_nr into cgroup_subsys_state
cgroup: link all cgroup_subsys_states in their sibling lists
cgroup: move cgroup->sibling and ->children into cgroup_subsys_state
cgroup: remove cgroup->parent
device_cgroup: remove direct access to cgroup->children
memcg: update memcg_has_children() to use css_next_child()
memcg: remove tasks/children test from mem_cgroup_force_empty()
cgroup: remove css_parent()
cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css
cgroup: use cgroup->self.refcnt for cgroup refcnting
...
unregister_netdevice_many() API is error prone and we had too
many bugs because of dangling LIST_HEAD on stacks.
See commit f87e6f4793 ("net: dont leave active on stack LIST_HEAD")
In fact, instead of making sure no caller leaves an active list_head,
just force a list_del() in the callee. No one seems to need to access
the list after unregister_netdevice_many()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that 3.15 is released, this merges the 'next' branch into 'master',
bringing us to the normal situation where my 'master' branch is the
merge window.
* accumulated work in next: (6809 commits)
ufs: sb mutex merge + mutex_destroy
powerpc: update comments for generic idle conversion
cris: update comments for generic idle conversion
idle: remove cpu_idle() forward declarations
nbd: zero from and len fields in NBD_CMD_DISCONNECT.
mm: convert some level-less printks to pr_*
MAINTAINERS: adi-buildroot-devel is moderated
MAINTAINERS: add linux-api for review of API/ABI changes
mm/kmemleak-test.c: use pr_fmt for logging
fs/dlm/debug_fs.c: replace seq_printf by seq_puts
fs/dlm/lockspace.c: convert simple_str to kstr
fs/dlm/config.c: convert simple_str to kstr
mm: mark remap_file_pages() syscall as deprecated
mm: memcontrol: remove unnecessary memcg argument from soft limit functions
mm: memcontrol: clean up memcg zoneinfo lookup
mm/memblock.c: call kmemleak directly from memblock_(alloc|free)
mm/mempool.c: update the kmemleak stack trace for mempool allocations
lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations
mm: introduce kmemleak_update_trace()
mm/kmemleak.c: use %u to print ->checksum
...
Conflicts:
drivers/net/xen-netback/netback.c
net/core/filter.c
A filter bug fix overlapped some cleanups and a conversion
over to some new insn generation macros.
A xen-netback bug fix overlapped the addition of multi-queue
support.
Signed-off-by: David S. Miller <davem@davemloft.net>
BPF classic->internal converter broke SKF_AD_PKTTYPE extension, since
pkt_type_offset() was failing to find skb->pkt_type field which is defined as:
__u8 pkt_type:3,
fclone:2,
ipvs_property:1,
peeked:1,
nf_trace:1;
Fix it by searching for 3 most significant bits and shift them by 5 at run-time
Fixes: bd4cf0ed33 ("net: filter: rework/optimize internal BPF interpreter's instruction set")
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Tested-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If an MPLS packet requires segmentation then use mpls_features
to determine if the software implementation should be used.
As no driver advertises MPLS GSO segmentation this will always be
the case.
I had not noticed that this was necessary before as software MPLS GSO
segmentation was already being used in my test environment. I believe that
the reason for that is the skbs in question always had fragments and the
driver I used does not advertise NETIF_F_FRAGLIST (which seems to be the
case for most drivers). Thus software segmentation was activated by
skb_gso_ok().
This introduces the overhead of an extra call to skb_network_protocol()
in the case where where CONFIG_NET_MPLS_GSO is set and
skb->ip_summed == CHECKSUM_NONE.
Thanks to Jesse Gross for prompting me to investigate this.
Signed-off-by: Simon Horman <horms@verge.net.au>
Acked-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is available since v3.15-rc5.
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When creating a GSO packet segment we may need to set more than
one checksum in the packet (for instance a TCP checksum and
UDP checksum for VXLAN encapsulation). To be efficient, we want
to do checksum calculation for any part of the packet at most once.
This patch adds csum_start offset to skb_gso_cb. This tracks the
starting offset for skb->csum which is initially set in skb_segment.
When a protocol needs to compute a transport checksum it calls
gso_make_checksum which computes the checksum value from the start
of transport header to csum_start and then adds in skb->csum to get
the full checksum. skb->csum and csum_start are then updated to reflect
the checksum of the resultant packet starting from the transport header.
This patch also adds a flag to skbuff, encap_hdr_csum, which is set
in *gso_segment fucntions to indicate that a tunnel protocol needs
checksum calculation
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
include/net/inetpeer.h
net/ipv6/output_core.c
Changes in net were fixing bugs in code removed in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
When we jump to free_pcpu on failure in alloc_netdev_mqs()
rx and tx queues are not yet allocated, so no need to free them.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is possible that ->newlink() fails before registering
the device, in this case we should just free it, it's
safe to call free_netdev().
Fixes: commit 0e0eee2465 (net: correct error path in rtnl_newlink())
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ben Hutchings says:
====================
Pull request: Fixes for new ethtool RSS commands
This addresses several problems I previously identified with the new
ETHTOOL_{G,S}RSSH commands:
1. Missing validation of reserved parameters
2. Vague documentation
3. Use of unnamed magic number
4. No consolidation with existing driver operations
I don't currently have access to suitable network hardware, but have
tested these changes with a dummy driver that can support various
combinations of operations and sizes, together with (a) Debian's ethtool
3.13 (b) ethtool 3.14 with the submitted patch to use ETHTOOL_{G,S}RSSH
and minor adjustment for fixes 1 and 3.
v2: Update RSS operations in vmxnet3 too
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
ETHTOOL_{G,S}RXFHINDIR and ETHTOOL_{G,S}RSSH should work for drivers
regardless of whether they expose the hash key, unless you try to
set a hash key for a driver that doesn't expose it.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
__sk_prepare_filter() was reworked in commit bd4cf0ed3 (net: filter:
rework/optimize internal BPF interpreter's instruction set) so that it should
have uncharged memory once things went wrong. However that work isn't complete.
Error is handled only in __sk_migrate_filter() while memory can still leak in
the error path right after sk_chk_filter().
Fixes: bd4cf0ed33 ("net: filter: rework/optimize internal BPF interpreter's instruction set")
Signed-off-by: Leon Yu <chianglungyu@gmail.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Tested-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ideally, we would need to generate IP ID using a per destination IP
generator.
linux kernels used inet_peer cache for this purpose, but this had a huge
cost on servers disabling MTU discovery.
1) each inet_peer struct consumes 192 bytes
2) inetpeer cache uses a binary tree of inet_peer structs,
with a nominal size of ~66000 elements under load.
3) lookups in this tree are hitting a lot of cache lines, as tree depth
is about 20.
4) If server deals with many tcp flows, we have a high probability of
not finding the inet_peer, allocating a fresh one, inserting it in
the tree with same initial ip_id_count, (cf secure_ip_id())
5) We garbage collect inet_peer aggressively.
IP ID generation do not have to be 'perfect'
Goal is trying to avoid duplicates in a short period of time,
so that reassembly units have a chance to complete reassembly of
fragments belonging to one message before receiving other fragments
with a recycled ID.
We simply use an array of generators, and a Jenkin hash using the dst IP
as a key.
ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it
belongs (it is only used from this file)
secure_ip_id() and secure_ipv6_id() no longer are needed.
Rename ip_select_ident_more() to ip_select_ident_segs() to avoid
unnecessary decrement/increment of the number of segments.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change provides a function to be used in order to break the
ndo_set_rx_mode call into a set of address add and remove calls. The code
is based on the implementation of dev_uc_sync/dev_mc_sync. Since they
essentially do the same thing but with only one dev I simply named my
functions __dev_uc_sync/__dev_mc_sync.
I also implemented an unsync version of the functions as well to allow for
cleanup on close.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 9739eef13c ("net: filter: make BPF conversion more readable")
started to introduce helper macros similar to BPF_STMT()/BPF_JUMP()
macros from classic BPF.
However, quite some statements in the filter conversion functions
remained in the old style which gives a mixture of block macros and
non block macros in the code. This patch makes the block macros itself
more readable by using explicit member initialization, and converts
the remaining ones where possible to remain in a more consistent state.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch finally allows us to get rid of the BPF_S_* enum.
Currently, the code performs unnecessary encode and decode
workarounds in seccomp and filter migration itself when a filter
is being attached in order to overcome BPF_S_* encoding which
is not used anymore by the new interpreter resp. JIT compilers.
Keeping it around would mean that also in future we would need
to extend and maintain this enum and related encoders/decoders.
We can get rid of all that and save us these operations during
filter attaching. Naturally, also JIT compilers need to be updated
by this.
Before JIT conversion is being done, each compiler checks if A
is being loaded at startup to obtain information if it needs to
emit instructions to clear A first. Since BPF extensions are a
subset of BPF_LD | BPF_{W,H,B} | BPF_ABS variants, case statements
for extensions can be removed at that point. To ease and minimalize
code changes in the classic JITs, we have introduced bpf_anc_helper().
Tested with test_bpf on x86_64 (JIT, int), s390x (JIT, int),
arm (JIT, int), i368 (int), ppc64 (JIT, int); for sparc we
unfortunately didn't have access, but changes are analogous to
the rest.
Joint work with Alexei Starovoitov.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mircea Gherzan <mgherzan@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Acked-by: Chema Gonzalez <chemag@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After 1e785f48d2 ("net: Start with correct mac_len in
skb_network_protocol") skb->mac_len is used as a start of the
calculation in skb_network_protocol() but that is not always correct. If
skb->protocol == 8021Q/AD, usually the vlan header is already inserted
in the skb (i.e. vlan reorder hdr == 0). Usually when the packet enters
dev_hard_xmit it has mac_len == 0 so we take 2 bytes from the
destination mac address (skb->data + VLAN_HLEN) as a type in
skb_network_protocol() and return vlan_depth == 4. In the case where TSO is
off, then the mac_len is set but it's == 18 (ETH_HLEN + VLAN_HLEN), so
skb_network_protocol() returns a type from inside the packet and
offset == 22. Also make vlan_depth unsigned as suggested before.
As suggested by Eric Dumazet, move the while() loop in the if() so we
can avoid additional testing in fast path.
Here are few netperf tests + debug printk's to illustrate:
cat netperf.tso-on.reorder-on.bugged
- Vlan -> device (reorder on, default, this case is okay)
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
192.168.3.1 () port 0 AF_INET
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 16384 16384 10.00 7111.54
[ 81.605435] skb->len 65226 skb->gso_size 1448 skb->proto 0x800
skb->mac_len 0 vlan_depth 0 type 0x800
- Vlan -> device (reorder off, bad)
cat netperf.tso-on.reorder-off.bugged
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
192.168.3.1 () port 0 AF_INET
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 16384 16384 10.00 241.35
[ 204.578332] skb->len 1518 skb->gso_size 0 skb->proto 0x8100
skb->mac_len 0 vlan_depth 4 type 0x5301
0x5301 are the last two bytes of the destination mac.
And if we stop TSO, we may get even the following:
[ 83.343156] skb->len 2966 skb->gso_size 1448 skb->proto 0x8100
skb->mac_len 18 vlan_depth 22 type 0xb84
Because mac_len already accounts for VLAN_HLEN.
After the fix:
cat netperf.tso-on.reorder-off.fixed
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
192.168.3.1 () port 0 AF_INET
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 16384 16384 10.01 5001.46
[ 81.888489] skb->len 65230 skb->gso_size 1448 skb->proto 0x8100
skb->mac_len 0 vlan_depth 18 type 0x800
CC: Vlad Yasevich <vyasevic@redhat.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Daniel Borkman <dborkman@redhat.com>
CC: David S. Miller <davem@davemloft.net>
Fixes:1e785f48d29a ("net: Start with correct mac_len in
skb_network_protocol")
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/bonding/bond_alb.c
drivers/net/ethernet/altera/altera_msgdma.c
drivers/net/ethernet/altera/altera_sgdma.c
net/ipv6/xfrm6_output.c
Several cases of overlapping changes.
The xfrm6_output.c has a bug fix which overlaps the renaming
of skb->local_df to skb->ignore_df.
In the Altera TSE driver cases, the register access cleanups
in net-next overlapped with bug fixes done in net.
Similarly a bug fix to send ALB packets in the bonding driver using
the right source address overlaps with cleanups in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
The sk_unattached_filter_create() API is used by BPF filters that
are not directly attached or related to sockets, and are used in
team, ptp, xt_bpf, cls_bpf, etc. As such all users do their own
internal managment of obtaining filter blocks and thus already
have them in kernel memory and set up before calling into
sk_unattached_filter_create(). As a result, due to __user annotation
in sock_fprog, sparse triggers false positives (incorrect type in
assignment [different address space]) when filters are set up before
passing them to sk_unattached_filter_create(). Therefore, let
sk_unattached_filter_create() API use sock_fprog_kern to overcome
this issue.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lets get rid of this macro. After commit 5bcfedf06f ("net: filter:
simplify label names from jump-table"), labels have become more
readable due to omission of BPF_ prefix but at the same time more
generic, so that things like `git grep -n` would not find them. As
a middle path, lets get rid of the DL macro as it's not strictly
needed and would otherwise just hide the full name.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define separate fields in the sock structure for configuring disabling
checksums in both TX and RX-- sk_no_check_tx and sk_no_check_rx.
The SO_NO_CHECK socket option only affects sk_no_check_tx. Also,
removed UDP_CSUM_* defines since they are no longer necessary.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
o min_tx_rate puts lower limit on the VF bandwidth. VF is guaranteed
to have a bandwidth of at least this value.
max_tx_rate puts cap on the VF bandwidth. VF can have a bandwidth
of up to this value.
o A new handler set_vf_rate for attr IFLA_VF_RATE has been introduced
which takes 4 arguments:
netdev, VF number, min_tx_rate, max_tx_rate
o ndo_set_vf_rate replaces ndo_set_vf_tx_rate handler.
o Drivers that currently implement ndo_set_vf_tx_rate should now call
ndo_set_vf_rate instead and reject attempt to set a minimum bandwidth
greater than 0 for IFLA_VF_TX_RATE when IFLA_VF_RATE is not yet
implemented by driver.
o If user enters only one of either min_tx_rate or max_tx_rate, then,
userland should read back the other value from driver and set both
for IFLA_VF_RATE.
Drivers that have not yet implemented IFLA_VF_RATE should always
return min_tx_rate as 0 when read from ip tool.
o If both IFLA_VF_TX_RATE and IFLA_VF_RATE options are specified, then
IFLA_VF_RATE should override.
o Idea is to have consistent display of rate values to user.
o Usage example: -
./ip link set p4p1 vf 0 rate 900
./ip link show p4p1
32: p4p1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
DEFAULT qlen 1000
link/ether 00:0e:1e:08:b0:f0 brd ff:ff:ff:ff:ff:ff
vf 0 MAC 3e:a0:ca:bd:ae:5a, tx rate 900 (Mbps), max_tx_rate 900Mbps
vf 1 MAC f6:c6:7c:3f:3d:6c
vf 2 MAC 56:32:43:98:d7:71
vf 3 MAC d6:be:c3:b5:85:ff
vf 4 MAC ee:a9:9a:1e:19:14
vf 5 MAC 4a:d0:4c:07:52:18
vf 6 MAC 3a:76:44:93:62:f9
vf 7 MAC 82:e9:e7:e3:15:1a
./ip link set p4p1 vf 0 max_tx_rate 300 min_tx_rate 200
./ip link show p4p1
32: p4p1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
DEFAULT qlen 1000
link/ether 00:0e:1e:08:b0:f0 brd ff:ff:ff:ff:ff:ff
vf 0 MAC 3e:a0:ca:bd:ae:5a, tx rate 300 (Mbps), max_tx_rate 300Mbps,
min_tx_rate 200Mbps
vf 1 MAC f6:c6:7c:3f:3d:6c
vf 2 MAC 56:32:43:98:d7:71
vf 3 MAC d6:be:c3:b5:85:ff
vf 4 MAC ee:a9:9a:1e:19:14
vf 5 MAC 4a:d0:4c:07:52:18
vf 6 MAC 3a:76:44:93:62:f9
vf 7 MAC 82:e9:e7:e3:15:1a
./ip link set p4p1 vf 0 max_tx_rate 600 rate 300
./ip link show p4p1
32: p4p1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
DEFAULT qlen 1000
link/ether 00:0e:1e:08:b0:f brd ff:ff:ff:ff:ff:ff
vf 0 MAC 3e:a0:ca:bd:ae:5, tx rate 600 (Mbps), max_tx_rate 600Mbps,
min_tx_rate 200Mbps
vf 1 MAC f6:c6:7c:3f:3d:6c
vf 2 MAC 56:32:43:98:d7:71
vf 3 MAC d6:be:c3:b5:85:ff
vf 4 MAC ee:a9:9a:1e:19:14
vf 5 MAC 4a:d0:4c:07:52:18
vf 6 MAC 3a:76:44:93:62:f9
vf 7 MAC 82:e9:e7:e3:15:1a
Signed-off-by: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Although the implementation probably needs a lot of work, this initial API
allows to implement software TSO in mvneta and mv643xx_eth drivers in a not
so intrusive way.
Signed-off-by: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kernel API for classic BPF socket filters is:
sk_unattached_filter_create() - validate classic BPF, convert, JIT
SK_RUN_FILTER() - run it
sk_unattached_filter_destroy() - destroy socket filter
Cleanup internal BPF kernel API as following:
sk_filter_select_runtime() - final step of internal BPF creation.
Try to JIT internal BPF program, if JIT is not available select interpreter
SK_RUN_FILTER() - run it
sk_filter_free() - free internal BPF program
Disallow direct calls to BPF interpreter. Execution of the BPF program should
be done with SK_RUN_FILTER() macro.
Example of internal BPF create, run, destroy:
struct sk_filter *fp;
fp = kzalloc(sk_filter_size(prog_len), GFP_KERNEL);
memcpy(fp->insni, prog, prog_len * sizeof(fp->insni[0]));
fp->len = prog_len;
sk_filter_select_runtime(fp);
SK_RUN_FILTER(fp, ctx);
sk_filter_free(fp);
Sockets, seccomp, testsuite, tracing are using different ways to populate
sk_filter, so first steps of program creation are not common.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This would be a no-op, so there is no reason to request it.
This also allows conversion of the current implementations of
ethtool_ops::{get,set}_rxfh_indir to ethtool_ops::{get,set}_rxfh
with no change other than their parameters.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
We usually allocate special values of u32 fields starting from the top
down, so also change the value to 0xffffffff. As these operations
haven't been included in a stable release yet, it's not too late to
change.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
We must return -EFAULT immediately rather than continuing into
the loop.
Similarly, we may as well return -EINVAL directly.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Prior to commit fbd929f2dc
bonding: support QinQ for bond arp interval
the arp monitoring code allowed for proper detection of devices
stacked on top of vlans. Since the above commit, the
code can still detect a device stacked on top of single
vlan, but not a device stacked on top of Q-in-Q configuration.
The search will only set the inner vlan tag if the route
device is the vlan device. However, this is not always the
case, as it is possible to extend the stacked configuration.
With this patch it is possible to provision devices on
top Q-in-Q vlan configuration that should be used as
a source of ARP monitoring information.
For example:
ip link add link bond0 vlan10 type vlan proto 802.1q id 10
ip link add link vlan10 vlan100 type vlan proto 802.1q id 100
ip link add link vlan100 type macvlan
Note: This patch limites the number of stacked VLANs to 2,
just like before. The original, however had another issue
in that if we had more then 2 levels of VLANs, we would end
up generating incorrectly tagged traffic. This is no longer
possible.
Fixes: fbd929f2dc (bonding: support QinQ for bond arp interval)
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@redhat.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: Ding Tianhong <dingtianhong@huawei.com>
CC: Patric McHardy <kaber@trash.net>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit dc8eaaa006.
vlan: Fix lockdep warning when vlan dev handle notification
Instead we use the new new API to find the lock subclass of
our vlan device. This way we can support configurations where
vlans are interspersed with other devices:
bond -> vlan -> macvlan -> vlan
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Multiple devices in the kernel can be stacked/nested and they
need to know their nesting level for the purposes of lockdep.
This patch provides a generic function that determines a nesting
level of a particular device by its type (ex: vlan, macvlan, etc).
We only care about nesting of the same type of devices.
For example:
eth0 <- vlan0.10 <- macvlan0 <- vlan1.20
The nesting level of vlan1.20 would be 1, since there is another vlan
in the stack under it.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Starting from linux-3.13, GRO attempts to build full size skbs.
Problem is the commit assumed one particular field in skb->cb[]
was clean, but it is not the case on some stacked devices.
Timo reported a crash in case traffic is decrypted before
reaching a GRE device.
Fix this by initializing NAPI_GRO_CB(skb)->last at the right place,
this also removes one conditional.
Thanks a lot to Timo for providing full reports and bisecting this.
Fixes: 8a29111c7c ("net: gro: allow to build full sized skb")
Bisected-by: Timo Teras <timo.teras@iki.fi>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Timo Teräs <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
cgroup in general is moving towards using cgroup_subsys_state as the
fundamental structural component and css_parent() was introduced to
convert from using cgroup->parent to css->parent. It was quite some
time ago and we're moving forward with making css more prominent.
This patch drops the trivial wrapper css_parent() and let the users
dereference css->parent. While at it, explicitly mark fields of css
which are public and immutable.
v2: New usage from device_cgroup.c converted.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: "David S. Miller" <davem@davemloft.net>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Maps all internal BPF instructions into x86_64 instructions.
This patch replaces original BPF x64 JIT with internal BPF x64 JIT.
sysctl net.core.bpf_jit_enable is reused as on/off switch.
Performance:
1. old BPF JIT and internal BPF JIT generate equivalent x86_64 code.
No performance difference is observed for filters that were JIT-able before
Example assembler code for BPF filter "tcpdump port 22"
original BPF -> old JIT: original BPF -> internal BPF -> new JIT:
0: push %rbp 0: push %rbp
1: mov %rsp,%rbp 1: mov %rsp,%rbp
4: sub $0x60,%rsp 4: sub $0x228,%rsp
8: mov %rbx,-0x8(%rbp) b: mov %rbx,-0x228(%rbp) // prologue
12: mov %r13,-0x220(%rbp)
19: mov %r14,-0x218(%rbp)
20: mov %r15,-0x210(%rbp)
27: xor %eax,%eax // clear A
c: xor %ebx,%ebx 29: xor %r13,%r13 // clear X
e: mov 0x68(%rdi),%r9d 2c: mov 0x68(%rdi),%r9d
12: sub 0x6c(%rdi),%r9d 30: sub 0x6c(%rdi),%r9d
16: mov 0xd8(%rdi),%r8 34: mov 0xd8(%rdi),%r10
3b: mov %rdi,%rbx
1d: mov $0xc,%esi 3e: mov $0xc,%esi
22: callq 0xffffffffe1021e15 43: callq 0xffffffffe102bd75
27: cmp $0x86dd,%eax 48: cmp $0x86dd,%rax
2c: jne 0x0000000000000069 4f: jne 0x000000000000009a
2e: mov $0x14,%esi 51: mov $0x14,%esi
33: callq 0xffffffffe1021e31 56: callq 0xffffffffe102bd91
38: cmp $0x84,%eax 5b: cmp $0x84,%rax
3d: je 0x0000000000000049 62: je 0x0000000000000074
3f: cmp $0x6,%eax 64: cmp $0x6,%rax
42: je 0x0000000000000049 68: je 0x0000000000000074
44: cmp $0x11,%eax 6a: cmp $0x11,%rax
47: jne 0x00000000000000c6 6e: jne 0x0000000000000117
49: mov $0x36,%esi 74: mov $0x36,%esi
4e: callq 0xffffffffe1021e15 79: callq 0xffffffffe102bd75
53: cmp $0x16,%eax 7e: cmp $0x16,%rax
56: je 0x00000000000000bf 82: je 0x0000000000000110
58: mov $0x38,%esi 88: mov $0x38,%esi
5d: callq 0xffffffffe1021e15 8d: callq 0xffffffffe102bd75
62: cmp $0x16,%eax 92: cmp $0x16,%rax
65: je 0x00000000000000bf 96: je 0x0000000000000110
67: jmp 0x00000000000000c6 98: jmp 0x0000000000000117
69: cmp $0x800,%eax 9a: cmp $0x800,%rax
6e: jne 0x00000000000000c6 a1: jne 0x0000000000000117
70: mov $0x17,%esi a3: mov $0x17,%esi
75: callq 0xffffffffe1021e31 a8: callq 0xffffffffe102bd91
7a: cmp $0x84,%eax ad: cmp $0x84,%rax
7f: je 0x000000000000008b b4: je 0x00000000000000c2
81: cmp $0x6,%eax b6: cmp $0x6,%rax
84: je 0x000000000000008b ba: je 0x00000000000000c2
86: cmp $0x11,%eax bc: cmp $0x11,%rax
89: jne 0x00000000000000c6 c0: jne 0x0000000000000117
8b: mov $0x14,%esi c2: mov $0x14,%esi
90: callq 0xffffffffe1021e15 c7: callq 0xffffffffe102bd75
95: test $0x1fff,%ax cc: test $0x1fff,%rax
99: jne 0x00000000000000c6 d3: jne 0x0000000000000117
d5: mov %rax,%r14
9b: mov $0xe,%esi d8: mov $0xe,%esi
a0: callq 0xffffffffe1021e44 dd: callq 0xffffffffe102bd91 // MSH
e2: and $0xf,%eax
e5: shl $0x2,%eax
e8: mov %rax,%r13
eb: mov %r14,%rax
ee: mov %r13,%rsi
a5: lea 0xe(%rbx),%esi f1: add $0xe,%esi
a8: callq 0xffffffffe1021e0d f4: callq 0xffffffffe102bd6d
ad: cmp $0x16,%eax f9: cmp $0x16,%rax
b0: je 0x00000000000000bf fd: je 0x0000000000000110
ff: mov %r13,%rsi
b2: lea 0x10(%rbx),%esi 102: add $0x10,%esi
b5: callq 0xffffffffe1021e0d 105: callq 0xffffffffe102bd6d
ba: cmp $0x16,%eax 10a: cmp $0x16,%rax
bd: jne 0x00000000000000c6 10e: jne 0x0000000000000117
bf: mov $0xffff,%eax 110: mov $0xffff,%eax
c4: jmp 0x00000000000000c8 115: jmp 0x000000000000011c
c6: xor %eax,%eax 117: mov $0x0,%eax
c8: mov -0x8(%rbp),%rbx 11c: mov -0x228(%rbp),%rbx // epilogue
cc: leaveq 123: mov -0x220(%rbp),%r13
cd: retq 12a: mov -0x218(%rbp),%r14
131: mov -0x210(%rbp),%r15
138: leaveq
139: retq
On fully cached SKBs both JITed functions take 12 nsec to execute.
BPF interpreter executes the program in 30 nsec.
The difference in generated assembler is due to the following:
Old BPF imlements LDX_MSH instruction via sk_load_byte_msh() helper function
inside bpf_jit.S.
New JIT removes the helper and does it explicitly, so ldx_msh cost
is the same for both JITs, but generated code looks longer.
New JIT has 4 registers to save, so prologue/epilogue are larger,
but the cost is within noise on x64.
Old JIT checks whether first insn clears A and if not emits 'xor %eax,%eax'.
New JIT clears %rax unconditionally.
2. old BPF JIT doesn't support ANC_NLATTR, ANC_PAY_OFFSET, ANC_RANDOM
extensions. New JIT supports all BPF extensions.
Performance of such filters improves 2-4 times depending on a filter.
The longer the filter the higher performance gain.
Synthetic benchmarks with many ancillary loads see 20x speedup
which seems to be the maximum gain from JIT
Notes:
. net.core.bpf_jit_enable=2 + tools/net/bpf_jit_disasm is still functional
and can be used to see generated assembler
. there are two jit_compile() functions and code flow for classic filters is:
sk_attach_filter() - load classic BPF
bpf_jit_compile() - try to JIT from classic BPF
sk_convert_filter() - convert classic to internal
bpf_int_jit_compile() - JIT from internal BPF
seccomp and tracing filters will just call bpf_int_jit_compile()
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: Cong Wang <cwang@twopensource.com>
commit 50624c934d (net: Delay default_device_exit_batch until no
devices are unregistering) introduced rtnl_lock_unregistering() for
default_device_exit_batch(). Same race could happen we when rmmod a driver
which calls rtnl_link_unregister() as we call dev->destructor without rtnl
lock.
For long term, I think we should clean up the mess of netdev_run_todo()
and net namespce exit code.
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>