Commit Graph

785583 Commits

Author SHA1 Message Date
Daniel Borkmann
09d62154f6 tools, perf: add and use optimized ring_buffer_{read_head, write_tail} helpers
Currently, on x86-64, perf uses LFENCE and MFENCE (rmb() and mb(),
respectively) when processing events from the perf ring buffer which
is unnecessarily expensive as we can do more lightweight in particular
given this is critical fast-path in perf.

According to Peter rmb()/mb() were added back then via a94d342b9c
("tools/perf: Add required memory barriers") at a time where kernel
still supported chips that needed it, but nowadays support for these
has been ditched completely, therefore we can fix them up as well.

While for x86-64, replacing rmb() and mb() with smp_*() variants would
result in just a compiler barrier for the former and LOCK + ADD for
the latter (__sync_synchronize() uses slower MFENCE by the way), Peter
suggested we can use smp_{load_acquire,store_release}() instead for
architectures where its implementation doesn't resolve in slower smp_mb().
Thus, e.g. in x86-64 we would be able to avoid CPU barrier entirely due
to TSO. For architectures where the latter needs to use smp_mb() e.g.
on arm, we stick to cheaper smp_rmb() variant for fetching the head.

This work adds helpers ring_buffer_read_head() and ring_buffer_write_tail()
for tools infrastructure that either switches to smp_load_acquire() for
architectures where it is cheaper or uses READ_ONCE() + smp_rmb() barrier
for those where it's not in order to fetch the data_head from the perf
control page, and it uses smp_store_release() to write the data_tail.
Latter is smp_mb() + WRITE_ONCE() combination or a cheaper variant if
architecture allows for it. Those that rely on smp_rmb() and smp_mb() can
further improve performance in a follow up step by implementing the two
under tools/arch/*/include/asm/barrier.h such that they don't have to
fallback to rmb() and mb() in tools/include/asm/barrier.h.

Switch perf to use ring_buffer_read_head() and ring_buffer_write_tail()
so it can make use of the optimizations. Later, we convert libbpf as
well to use the same helpers.

Side note [0]: the topic has been raised of whether one could simply use
the C11 gcc builtins [1] for the smp_load_acquire() and smp_store_release()
instead:

  __atomic_load_n(ptr, __ATOMIC_ACQUIRE);
  __atomic_store_n(ptr, val, __ATOMIC_RELEASE);

Kernel and (presumably) tooling shipped along with the kernel has a
minimum requirement of being able to build with gcc-4.6 and the latter
does not have C11 builtins. While generally the C11 memory models don't
align with the kernel's, the C11 load-acquire and store-release alone
/could/ suffice, however. Issue is that this is implementation dependent
on how the load-acquire and store-release is done by the compiler and
the mapping of supported compilers must align to be compatible with the
kernel's implementation, and thus needs to be verified/tracked on a
case by case basis whether they match (unless an architecture uses them
also from kernel side). The implementations for smp_load_acquire() and
smp_store_release() in this patch have been adapted from the kernel side
ones to have a concrete and compatible mapping in place.

  [0] http://patchwork.ozlabs.org/patch/985422/
  [1] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-19 13:43:08 -07:00
Anders Roxell
78de35460a selftests/bpf: add missing executables to .gitignore
Fixes: 371e4fcc9d ("selftests/bpf: cgroup local storage-based network counters")
Fixes: 370920c47b ("selftests/bpf: Test libbpf_{prog,attach}_type_by_name")
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-19 13:31:27 -07:00
Alexei Starovoitov
43ed375ff2 Merge branch 'queue_stack_maps'
Mauricio Vasquez says:

====================
In some applications this is needed have a pool of free elements, for
example the list of free L4 ports in a SNAT.  None of the current maps allow
to do it as it is not possible to get any element without having they key
it is associated to, even if it were possible, the lack of locking mecanishms in
eBPF would do it almost impossible to be implemented without data races.

This patchset implements two new kind of eBPF maps: queue and stack.
Those maps provide to eBPF programs the peek, push and pop operations, and for
userspace applications a new bpf_map_lookup_and_delete_elem() is added.

Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>

v2 -> v3:
 - Remove "almost dead code" in syscall.c
 - Remove unnecessary copy_from_user in bpf_map_lookup_and_delete_elem
 - Rebase

v1 -> v2:
 - Put ARG_PTR_TO_UNINIT_MAP_VALUE logic into a separated patch
 - Fix missing __this_cpu_dec & preempt_enable calls in kernel/bpf/syscall.c

RFC v4 -> v1:
 - Remove roundup to power of 2 in memory allocation
 - Remove count and use a free slot to check if queue/stack is empty
 - Use if + assigment for wrapping indexes
 - Fix some minor style issues
 - Squash two patches together

RFC v3 -> RFC v4:
 - Revert renaming of kernel/bpf/stackmap.c
 - Remove restriction on value size
 - Remove len arguments from peek/pop helpers
 - Add new ARG_PTR_TO_UNINIT_MAP_VALUE

RFC v2 -> RFC v3:
 - Return elements by value instead that by reference
 - Implement queue/stack base on array and head + tail indexes
 - Rename stack trace related files to avoid confusion and conflicts

RFC v1 -> RFC v2:
 - Create two separate maps instead of single one + flags
 - Implement bpf_map_lookup_and_delete syscall
 - Support peek operation
 - Define replacement policy through flags in the update() method
 - Add eBPF side tests
====================

Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-19 13:24:32 -07:00
Mauricio Vasquez B
43b987d23d selftests/bpf: add test cases for queue and stack maps
test_maps:
Tests that queue/stack maps are behaving correctly even in corner cases

test_progs:
Tests new ebpf helpers

Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-19 13:24:31 -07:00
Mauricio Vasquez B
da4e1b15f6 Sync uapi/bpf.h to tools/include
Sync both files.

Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-19 13:24:31 -07:00
Mauricio Vasquez B
bd513cd08f bpf: add MAP_LOOKUP_AND_DELETE_ELEM syscall
The previous patch implemented a bpf queue/stack maps that
provided the peek/pop/push functions.  There is not a direct
relationship between those functions and the current maps
syscalls, hence a new MAP_LOOKUP_AND_DELETE_ELEM syscall is added,
this is mapped to the pop operation in the queue/stack maps
and it is still to implement in other kind of maps.

Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-19 13:24:31 -07:00
Mauricio Vasquez B
f1a2e44a3a bpf: add queue and stack maps
Queue/stack maps implement a FIFO/LIFO data storage for ebpf programs.
These maps support peek, pop and push operations that are exposed to eBPF
programs through the new bpf_map[peek/pop/push] helpers.  Those operations
are exposed to userspace applications through the already existing
syscalls in the following way:

BPF_MAP_LOOKUP_ELEM            -> peek
BPF_MAP_LOOKUP_AND_DELETE_ELEM -> pop
BPF_MAP_UPDATE_ELEM            -> push

Queue/stack maps are implemented using a buffer, tail and head indexes,
hence BPF_F_NO_PREALLOC is not supported.

As opposite to other maps, queue and stack do not use RCU for protecting
maps values, the bpf_map[peek/pop] have a ARG_PTR_TO_UNINIT_MAP_VALUE
argument that is a pointer to a memory zone where to save the value of a
map.  Basically the same as ARG_PTR_TO_UNINIT_MEM, but the size has not
be passed as an extra argument.

Our main motivation for implementing queue/stack maps was to keep track
of a pool of elements, like network ports in a SNAT, however we forsee
other use cases, like for exampling saving last N kernel events in a map
and then analysing from userspace.

Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-19 13:24:31 -07:00
Mauricio Vasquez B
2ea864c58f bpf/verifier: add ARG_PTR_TO_UNINIT_MAP_VALUE
ARG_PTR_TO_UNINIT_MAP_VALUE argument is a pointer to a memory zone
used to save the value of a map.  Basically the same as
ARG_PTR_TO_UNINIT_MEM, but the size has not be passed as an extra
argument.

This will be used in the following patch that implements some new
helpers that receive a pointer to be filled with a map value.

Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-19 13:24:31 -07:00
Mauricio Vasquez B
c9d29f4658 bpf/syscall: allow key to be null in map functions
This commit adds the required logic to allow key being NULL
in case the key_size of the map is 0.

A new __bpf_copy_key function helper only copies the key from
userpsace when key_size != 0, otherwise it enforces that key must be
null.

Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-19 13:24:31 -07:00
Mauricio Vasquez B
144991602e bpf: rename stack trace map operations
In the following patches queue and stack maps (FIFO and LIFO
datastructures) will be implemented.  In order to avoid confusion and
a possible name clash rename stack_map_ops to stack_trace_map_ops

Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-19 13:24:30 -07:00
Jakub Kicinski
3ddeac6705 tools: bpftool: use 4 context mode for the NFP disasm
The nfp driver is currently always JITing the BPF for 4 context/thread
mode of the NFP flow processors.  Tell this to the disassembler,
otherwise some registers may be incorrectly decoded.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-10-18 22:16:02 +02:00
Peng Hao
1bd70d2eba selftests/bpf: fix file resource leak in load_kallsyms
FILE pointer variable f is opened but never closed.

Signed-off-by: Peng Hao <peng.hao2@zte.com.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-10-18 22:13:30 +02:00
Nicolas Dichtel
b55cbc8d9b bpf: fix doc of bpf_skb_adjust_room() in uapi
len_diff is signed.

Fixes: fa15601ab3 ("bpf: add documentation for eBPF helpers (33-41)")
CC: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-17 21:45:50 -07:00
Daniel Borkmann
44d520eb17 Merge branch 'bpf-sk-msg-peek'
John Fastabend says:

====================
This adds support for the MSG_PEEK flag when redirecting into
an ingress psock sk_msg queue.

The first patch adds some base support to the helpers, then the
feature, and finally we add an option for the test suite to do
a duplicate MSG_PEEK call on every recv to test the feature.

With duplicate MSG_PEEK call all tests continue to PASS.
====================

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-10-17 02:30:33 +02:00
John Fastabend
753fb2ee09 bpf: sockmap, add msg_peek tests to test_sockmap
Add tests that do a MSG_PEEK recv followed by a regular receive to
test flag support.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-10-17 02:30:32 +02:00
John Fastabend
02c558b2d5 bpf: sockmap, support for msg_peek in sk_msg with redirect ingress
This adds support for the MSG_PEEK flag when doing redirect to ingress
and receiving on the sk_msg psock queue. Previously the flag was
being ignored which could confuse applications if they expected the
flag to work as normal.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-10-17 02:30:32 +02:00
John Fastabend
8734a162c1 bpf: skmsg, improve sk_msg_used_element to work in cork context
Currently sk_msg_used_element is only called in zerocopy context where
cork is not possible and if this case happens we fallback to copy
mode. However the helper is more useful if it works in all contexts.

This patch resolved the case where if end == head indicating a full
or empty ring the helper always reports an empty ring. To fix this
add a test for the full ring case to avoid reporting a full ring
has 0 elements. This additional functionality will be used in the
next patches from recvmsg context where end = head with a full ring
is a valid case.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-10-17 02:30:31 +02:00
John Fastabend
3f4c3127d3 bpf: sockmap, fix skmsg recvmsg handler to track size correctly
When converting sockmap to new skmsg generic data structures we missed
that the recvmsg handler did not correctly use sg.size and instead was
using individual elements length. The result is if a sock is closed
with outstanding data we omit the call to sk_mem_uncharge() and can
get the warning below.

[   66.728282] WARNING: CPU: 6 PID: 5783 at net/core/stream.c:206 sk_stream_kill_queues+0x1fa/0x210

To fix this correct the redirect handler to xfer the size along with
the scatterlist and also decrement the size from the recvmsg handler.
Now when a sock is closed the remaining 'size' will be decremented
with sk_mem_uncharge().

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-10-17 02:29:15 +02:00
Alexei Starovoitov
9032c10e94 Merge branch 'nfp-improve-bpf-offload'
Jakub Kicinski says:

====================
this set adds check to make sure offload behaviour is correct.
First when atomic counters are used, we must make sure the map
does not already contain data we did not prepare for holding
atomics.

Second patch double checks vNIC capabilities for program offload
in case program is shared by multiple vNICs with different
constraints.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-16 15:23:59 -07:00
Jakub Kicinski
44b6fed0c1 nfp: bpf: double check vNIC capabilities after object sharing
Program translation stage checks that program can be offloaded to
the netdev which was passed during the load (bpf_attr->prog_ifindex).
After program sharing was introduced, however, the netdev on which
program is loaded can theoretically be different, and therefore
we should recheck the program size and max stack size at load time.

This was found by code inspection, AFAIK today all vNICs have
identical caps.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-16 15:23:58 -07:00
Jakub Kicinski
527db74b71 nfp: bpf: protect against mis-initializing atomic counters
Atomic operations on the NFP are currently always in big endian.
The driver keeps track of regions of memory storing atomic values
and byte swaps them accordingly.  There are corner cases where
the map values may be initialized before the driver knows they
are used as atomic counters.  This can happen either when the
datapath is performing the update and the stack contents are
unknown or when map is updated before the program which will
use it for atomic values is loaded.

To avoid situation where user initializes the value to 0 1 2 3
and then after loading a program which uses the word as an atomic
counter starts reading 3 2 1 0 - only allow atomic counters to be
initialized to endian-neutral values.

For updates from the datapath the stack information may not be
as precise, so just allow initializing such values to 0.

Example code which would break:
struct bpf_map_def SEC("maps") rxcnt = {
       .type = BPF_MAP_TYPE_HASH,
       .key_size = sizeof(__u32),
       .value_size = sizeof(__u64),
       .max_entries = 1,
};

int xdp_prog1()
{
      	__u64 nonzeroval = 3;
	__u32 key = 0;
	__u64 *value;

	value = bpf_map_lookup_elem(&rxcnt, &key);
	if (!value)
		bpf_map_update_elem(&rxcnt, &key, &nonzeroval, BPF_ANY);
	else
		__sync_fetch_and_add(value, 1);

	return XDP_PASS;
}

$ offload bpftool map dump
key: 00 00 00 00 value: 00 00 00 03 00 00 00 00

should be:

$ offload bpftool map dump
key: 00 00 00 00 value: 03 00 00 00 00 00 00 00

Reported-by: David Beckett <david.beckett@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-16 15:23:58 -07:00
Andrey Ignatov
ab9e084821 libbpf: Per-symbol visibility for DSO
Make global symbols in libbpf DSO hidden by default with
-fvisibility=hidden and export symbols that are part of ABI explicitly
with __attribute__((visibility("default"))).

This is common practice that should prevent from accidentally exporting
a symbol, that is not supposed to be a part of ABI what, in turn,
improves both libbpf developer- and user-experiences. See [1] for more
details.

Export control becomes more important since more and more projects use
libbpf.

The patch doesn't export a bunch of netlink related functions since as
agreed in [2] they'll be reworked. That doesn't break bpftool since
bpftool links libbpf statically.

[1] https://www.akkadia.org/drepper/dsohowto.pdf (2.2 Export Control)
[2] https://www.mail-archive.com/netdev@vger.kernel.org/msg251434.html

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-16 15:16:47 -07:00
Daniel Borkmann
421f4292f4 bpf, tls: add tls header to tools infrastructure
Andrey reported a build error for the BPF kselftest suite when compiled on
a machine which does not have tls related header bits installed natively:

  test_sockmap.c:120:23: fatal error: linux/tls.h: No such file or directory
   #include <linux/tls.h>
                         ^
  compilation terminated.

Fix it by adding the header to the tools include infrastructure and add
definitions such as SOL_TLS that could potentially be missing.

Fixes: e9dd904708 ("bpf: add tls support for testing in test_sockmap")
Reported-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-16 08:10:01 -07:00
David S. Miller
2c59f06cc0 Merge branch 'net-Kernel-side-filtering-for-route-dumps'
David Ahern says:

====================
net: Kernel side filtering for route dumps

Implement kernel side filtering of route dumps by protocol (e.g., which
routing daemon installed the route), route type (e.g., unicast), table
id and nexthop device.

iproute2 has been doing this filtering in userspace for years; pushing
the filters to the kernel side reduces the amount of data the kernel
sends and reduces wasted cycles on both sides processing unwanted data.
These initial options provide a huge improvement for efficiently
examining routes on large scale systems.

v2
- better handling of requests for a specific table. Rather than walking
  the hash of all tables, lookup the specific table and dump it
- refactor mr_rtm_dumproute moving the loop over the table into a
  helper that can be invoked directly
- add hook to return NLM_F_DUMP_FILTERED in DONE message to ensure
  it is returned even when the dump returns nothing
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16 00:14:18 -07:00
David Ahern
e4e92fb160 net/ipv4: Bail early if user only wants prefix entries
Unlike IPv6, IPv4 does not have routes marked with RTF_PREFIX_RT. If the
flag is set in the dump request, just return.

In the process of this change, move the CLONE check to use the new
filter flags.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16 00:14:08 -07:00
David Ahern
08e814c9e8 net/ipv6: Bail early if user only wants cloned entries
Similar to IPv4, IPv6 fib no longer contains cloned routes. If a user
requests a route dump for only cloned entries, no sense walking the FIB
and returning everything.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16 00:14:07 -07:00
David Ahern
196cfebf89 net/mpls: Handle kernel side filtering of route dumps
Update the dump request parsing in MPLS for the non-INET case to
enable kernel side filtering. If INET is disabled the only filters
that make sense for MPLS are protocol and nexthop device.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16 00:14:07 -07:00
David Ahern
effe679266 net: Enable kernel side filtering of route dumps
Update parsing of route dump request to enable kernel side filtering.
Allow filtering results by protocol (e.g., which routing daemon installed
the route), route type (e.g., unicast), table id and nexthop device. These
amount to the low hanging fruit, yet a huge improvement, for dumping
routes.

ip_valid_fib_dump_req is called with RTNL held, so __dev_get_by_index can
be used to look up the device index without taking a reference. From
there filter->dev is only used during dump loops with the lock still held.

Set NLM_F_DUMP_FILTERED in the answer_flags so the user knows the results
have been filtered should no entries be returned.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16 00:14:07 -07:00
David Ahern
cb167893f4 net: Plumb support for filtering ipv4 and ipv6 multicast route dumps
Implement kernel side filtering of routes by egress device index and
table id. If the table id is given in the filter, lookup table and
call mr_table_dump directly for it.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16 00:13:39 -07:00
David Ahern
e1cedae1ba ipmr: Refactor mr_rtm_dumproute
Move per-table loops from mr_rtm_dumproute to mr_table_dump and export
mr_table_dump for dumps by specific table id.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16 00:13:12 -07:00
David Ahern
bae9a78b17 net/mpls: Plumb support for filtering route dumps
Implement kernel side filtering of routes by egress device index and
protocol. MPLS uses only a single table and route type.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16 00:13:12 -07:00
David Ahern
13e38901d4 net/ipv6: Plumb support for filtering route dumps
Implement kernel side filtering of routes by table id, egress device
index, protocol, and route type. If the table id is given in the filter,
lookup the table and call fib6_dump_table directly for it.

Move the existing route flags check for prefix only routes to the new
filter.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16 00:13:12 -07:00
David Ahern
18a8021a7b net/ipv4: Plumb support for filtering route dumps
Implement kernel side filtering of routes by table id, egress device index,
protocol and route type. If the table id is given in the filter, lookup the
table and call fib_table_dump directly for it.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16 00:13:12 -07:00
David Ahern
4724676d55 net: Add struct for fib dump filter
Add struct fib_dump_filter for options on limiting which routes are
returned in a dump request. The current list is table id, protocol,
route type, rtm_flags and nexthop device index. struct net is needed
to lookup the net_device from the index.

Declare the filter for each route dump handler and plumb the new
arguments from dump handlers to ip_valid_fib_dump_req.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16 00:13:12 -07:00
David Ahern
22e6c58b8c netlink: Add answer_flags to netlink_callback
With dump filtering we need a way to ensure the NLM_F_DUMP_FILTERED
flag is set on a message back to the user if the data returned is
influenced by some input attributes. Normally this can be done as
messages are added to the skb, but if the filter results in no data
being returned, the user could be confused as to why.

This patch adds answer_flags to the netlink_callback allowing dump
handlers to set the NLM_F_DUMP_FILTERED at a minimum in the
NLMSG_DONE message ensuring the flag gets back to the user.

The netlink_callback space is initialized to 0 via a memset in
__netlink_dump_start, so init of the new answer_flags is covered.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-16 00:13:12 -07:00
David S. Miller
e85679511e Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-10-16

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Convert BPF sockmap and kTLS to both use a new sk_msg API and enable
   sk_msg BPF integration for the latter, from Daniel and John.

2) Enable BPF syscall side to indicate for maps that they do not support
   a map lookup operation as opposed to just missing key, from Prashant.

3) Add bpftool map create command which after map creation pins the
   map into bpf fs for further processing, from Jakub.

4) Add bpftool support for attaching programs to maps allowing sock_map
   and sock_hash to be used from bpftool, from John.

5) Improve syscall BPF map update/delete path for map-in-map types to
   wait a RCU grace period for pending references to complete, from Daniel.

6) Couple of follow-up fixes for the BPF socket lookup to get it
   enabled also when IPv6 is compiled as a module, from Joe.

7) Fix a generic-XDP bug to handle the case when the Ethernet header
   was mangled and thus update skb's protocol and data, from Jesper.

8) Add a missing BTF header length check between header copies from
   user space, from Wenwen.

9) Minor fixups in libbpf to use __u32 instead u32 types and include
   proper perf_event.h uapi header instead of perf internal one, from Yonghong.

10) Allow to pass user-defined flags through EXTRA_CFLAGS and EXTRA_LDFLAGS
    to bpftool's build, from Jiri.

11) BPF kselftest tweaks to add LWTUNNEL to config fragment and to install
    with_addr.sh script from flow dissector selftest, from Anders.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15 23:21:07 -07:00
Heiner Kallweit
c45d715065 net: phy: merge phy_start_aneg and phy_start_aneg_priv
After commit 9f2959b6b5 ("net: phy: improve handling delayed work")
the sync parameter isn't needed any longer in phy_start_aneg_priv().
This allows to merge phy_start_aneg() and phy_start_aneg_priv().

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15 23:13:07 -07:00
Haiyang Zhang
0054795561 hv_netvsc: fix vf serial matching with pci slot info
The VF device's serial number is saved as a string in PCI slot's
kobj name, not the slot->number. This patch corrects the netvsc
driver, so the VF device can be successfully paired with synthetic
NIC.

Fixes: 00d7ddba11 ("hv_netvsc: pair VF based on serial number")
Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15 22:58:11 -07:00
David S. Miller
b13949678b Merge branch 'tcp-second-round-for-EDT-conversion'
Eric Dumazet says:

====================
tcp: second round for EDT conversion

First round of EDT patches left TCP stack in a non optimal state.

- High speed flows suffered from loss of performance, addressed
  by the first patch of this series.

- Second patch brings pacing to the current state of networking,
  since we now reach ~100 Gbit on a single TCP flow.

- Third patch implements a mitigation for scheduling delays,
  like the one we did in sch_fq in the past.

- Fourth patch removes one special case in sch_fq for ACK packets.

- Fifth patch removes a serious perfomance cost for TCP internal
  pacing. We should setup the high resolution timer only if
  really needed.

- Sixth patch fixes a typo in BBR.

- Last patch is one minor change in cdg congestion control.

Neal Cardwell also has a patch series fixing BBR after
EDT adoption.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15 22:56:42 -07:00
Eric Dumazet
825e1c523d tcp: cdg: use tcp high resolution clock cache
We store in tcp socket a cache of most recent high resolution
clock, there is no need to call local_clock() again, since
this cache is good enough.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15 22:56:42 -07:00
Neal Cardwell
97ec3eb33d tcp_bbr: fix typo in bbr_pacing_margin_percent
There was a typo in this parameter name.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15 22:56:42 -07:00
Eric Dumazet
864e5c0907 tcp: optimize tcp internal pacing
When TCP implements its own pacing (when no fq packet scheduler is used),
it is arming high resolution timer after a packet is sent.

But in many cases (like TCP_RR kind of workloads), this high resolution
timer expires before the application attempts to write the following
packet. This overhead also happens when the flow is ACK clocked and
cwnd limited instead of being limited by the pacing rate.

This leads to extra overhead (high number of IRQ)

Now tcp_wstamp_ns is reserved for the pacing timer only
(after commit "tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh"),
we can setup the timer only when a packet is about to be sent,
and if tcp_wstamp_ns is in the future.

This leads to a ~10% performance increase in TCP_RR workloads.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15 22:56:42 -07:00
Eric Dumazet
7baf33bdac net_sched: sch_fq: no longer use skb_is_tcp_pure_ack()
With the new EDT model, sch_fq no longer has to special
case TCP pure acks, since their skb->tstamp will allow them
being sent without pacing delay.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15 22:56:42 -07:00
Eric Dumazet
a7a2563064 tcp: mitigate scheduling jitter in EDT pacing model
In commit fefa569a9d ("net_sched: sch_fq: account for schedule/timers
drifts") we added a mitigation for scheduling jitter in fq packet scheduler.

This patch does the same in TCP stack, now it is using EDT model.

Note that this mitigation is valid for both external (fq packet scheduler)
or internal TCP pacing.

This uses the same strategy than the above commit, allowing
a time credit of half the packet currently sent.

Consider following case :

An skb is sent, after an idle period of 300 usec.
The air-time (skb->len/pacing_rate) is 500 usec
Instead of setting the pacing timer to now+500 usec,
it will use now+min(500/2, 300) -> now+250usec

This is like having a token bucket with a depth of half
an skb.

Tested:

tc qdisc replace dev eth0 root pfifo_fast

Before
netperf -P0 -H remote -- -q 1000000000 # 8000Mbit
540000 262144 262144    10.00    7710.43

After :
netperf -P0 -H remote -- -q 1000000000 # 8000 Mbit
540000 262144 262144    10.00    7999.75   # Much closer to 8000Mbit target

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15 22:56:42 -07:00
Eric Dumazet
76a9ebe811 net: extend sk_pacing_rate to unsigned long
sk_pacing_rate has beed introduced as a u32 field in 2013,
effectively limiting per flow pacing to 34Gbit.

We believe it is time to allow TCP to pace high speed flows
on 64bit hosts, as we now can reach 100Gbit on one TCP flow.

This patch adds no cost for 32bit kernels.

The tcpi_pacing_rate and tcpi_max_pacing_rate were already
exported as 64bit, so iproute2/ss command require no changes.

Unfortunately the SO_MAX_PACING_RATE socket option will stay
32bit and we will need to add a new option to let applications
control high pacing rates.

State      Recv-Q Send-Q Local Address:Port             Peer Address:Port
ESTAB      0      1787144  10.246.9.76:49992             10.246.9.77:36741
                 timer:(on,003ms,0) ino:91863 sk:2 <->
 skmem:(r0,rb540000,t66440,tb2363904,f605944,w1822984,o0,bl0,d0)
 ts sack bbr wscale:8,8 rto:201 rtt:0.057/0.006 mss:1448
 rcvmss:536 advmss:1448
 cwnd:138 ssthresh:178 bytes_acked:256699822585 segs_out:177279177
 segs_in:3916318 data_segs_out:177279175
 bbr:(bw:31276.8Mbps,mrtt:0,pacing_gain:1.25,cwnd_gain:2)
 send 28045.5Mbps lastrcv:73333
 pacing_rate 38705.0Mbps delivery_rate 22997.6Mbps
 busy:73333ms unacked:135 retrans:0/157 rcv_space:14480
 notsent:2085120 minrtt:0.013

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15 22:56:42 -07:00
Eric Dumazet
5f6188a800 tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh
In EDT design, I made the mistake of using tcp_wstamp_ns
to store the last tcp_clock_ns() sample and to store the
pacing virtual timer.

This causes major regressions at high speed flows.

Introduce tcp_clock_cache to store last tcp_clock_ns().
This is needed because some arches have slow high-resolution
kernel time service.

tcp_wstamp_ns is only updated when a packet is sent.

Note that we can remove tcp_mstamp in the future since
tcp_mstamp is essentially tcp_clock_cache/1000, so the
apparent socket size increase is temporary.

Fixes: 9799ccb0e9 ("tcp: add tcp_wstamp_ns socket field")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15 22:56:41 -07:00
Li RongQing
1a3aea2534 net: bridge: fix a possible memory leak in __vlan_add
After per-port vlan stats, vlan stats should be released
when fail to add vlan

Fixes: 9163a0fc1f ("net: bridge: add support for per-port vlan stats")
CC: bridge@lists.linux-foundation.org
cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15 22:53:52 -07:00
David Howells
bc0e7cf433 rxrpc: Add /proc/net/rxrpc/peers to display peer list
Add /proc/net/rxrpc/peers to display the list of peers currently active.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15 22:52:58 -07:00
Wei Yongjun
d275444cc3 fore200e: fix missing unlock on error in bsq_audit()
Add the missing unlock before return from function bsq_audit()
in the error handling case.

Fixes: 1d9d8be917 ("fore200e: check for dma mapping failures")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15 22:48:35 -07:00
David S. Miller
65f2247d61 Merge branch 'bnxt_en-Add-support-for-new-57500-chips'
Michael Chan says:

====================
bnxt_en: Add support for new 57500 chips.

This patch-set is larger than normal because I wanted a complete series
to add basic support for the new 57500 chips.  The new chips have the
following main differences compared to legacy chips:

1. Requires the PF driver to allocate DMA context memory as a backing
store.
2. New NQ (notification queue) for interrupt events.
3. One or more CP rings can be associated with an NQ.
4. 64-bit doorbells.

Most other structures and firmware APIs are compatible with legacy
devices with some exceptions.  For example, ring groups are no longer
used and RSS table format has changed.

The patch-set includes the usual firmware spec. update, some refactoring
and restructuring, and adding the new code to add basic support for the
new class of devices.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15 22:44:33 -07:00