Commit Graph

29949 Commits

Author SHA1 Message Date
Jiong Wang
c342dc109a bpf: refactor "check_reg_arg" to eliminate code redundancy
There are a few "regs[regno]" here are there across "check_reg_arg", this
patch factor it out into a simple "reg" pointer. The intention is to
simplify code indentation and make the later patches in this set look
cleaner.

Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-12 17:06:33 -07:00
Jiong Wang
55e7f3b5ac bpf: factor out reg and stack slot propagation into "propagate_liveness_reg"
After code refactor in previous patches, the propagation logic inside the
for loop in "propagate_liveness" becomes clear that they are good enough to
be factored out into a common function "propagate_liveness_reg".

Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-12 17:06:33 -07:00
Jiong Wang
3f8cafa413 bpf: refactor propagate_liveness to eliminate code redundance
Access to reg states were not factored out, the consequence is long code
for dereferencing them which made the indentation not good for reading.

This patch factor out these code so the core code in the loop could be
easier to follow.

Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-12 17:06:33 -07:00
Jiong Wang
1b04aee7e2 bpf: refactor propagate_liveness to eliminate duplicated for loop
Propagation for register and stack slot are finished in separate for loop,
while they are perfect to be put into a single loop.

This could also let them share some common variables in later patches.

Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-12 17:06:33 -07:00
Andrey Ignatov
51356ac89b bpf: Fix distinct pointer types warning for ARCH=i386
Fix a new warning reported by kbuild for make ARCH=i386:

   In file included from kernel/bpf/cgroup.c:11:0:
   kernel/bpf/cgroup.c: In function '__cgroup_bpf_run_filter_sysctl':
   include/linux/kernel.h:827:29: warning: comparison of distinct pointer types lacks a cast
      (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
                                ^
   include/linux/kernel.h:841:4: note: in expansion of macro '__typecheck'
      (__typecheck(x, y) && __no_side_effects(x, y))
       ^~~~~~~~~~~
   include/linux/kernel.h:851:24: note: in expansion of macro '__safe_cmp'
     __builtin_choose_expr(__safe_cmp(x, y), \
                           ^~~~~~~~~~
   include/linux/kernel.h:860:19: note: in expansion of macro '__careful_cmp'
    #define min(x, y) __careful_cmp(x, y, <)
                      ^~~~~~~~~~~~~
>> kernel/bpf/cgroup.c:837:17: note: in expansion of macro 'min'
      ctx.new_len = min(PAGE_SIZE, *pcount);
                    ^~~

Fixes: 4e63acdff8 ("bpf: Introduce bpf_sysctl_{get,set}_new_value helpers")
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-12 16:11:55 -07:00
Andrey Ignatov
d7a4cb9b67 bpf: Introduce bpf_strtol and bpf_strtoul helpers
Add bpf_strtol and bpf_strtoul to convert a string to long and unsigned
long correspondingly. It's similar to user space strtol(3) and
strtoul(3) with a few changes to the API:

* instead of NUL-terminated C string the helpers expect buffer and
  buffer length;

* resulting long or unsigned long is returned in a separate
  result-argument;

* return value is used to indicate success or failure, on success number
  of consumed bytes is returned that can be used to identify position to
  read next if the buffer is expected to contain multiple integers;

* instead of *base* argument, *flags* is used that provides base in 5
  LSB, other bits are reserved for future use;

* number of supported bases is limited.

Documentation for the new helpers is provided in bpf.h UAPI.

The helpers are made available to BPF_PROG_TYPE_CGROUP_SYSCTL programs to
be able to convert string input to e.g. "ulongvec" output.

E.g. "net/ipv4/tcp_mem" consists of three ulong integers. They can be
parsed by calling to bpf_strtoul three times.

Implementation notes:

Implementation includes "../../lib/kstrtox.h" to reuse integer parsing
functions. It's done exactly same way as fs/proc/base.c already does.

Unfortunately existing kstrtoX function can't be used directly since
they fail if any invalid character is present right after integer in the
string. Existing simple_strtoX functions can't be used either since
they're obsolete and don't handle overflow properly.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-12 13:54:59 -07:00
Andrey Ignatov
57c3bb725a bpf: Introduce ARG_PTR_TO_{INT,LONG} arg types
Currently the way to pass result from BPF helper to BPF program is to
provide memory area defined by pointer and size: func(void *, size_t).

It works great for generic use-case, but for simple types, such as int,
it's overkill and consumes two arguments when it could use just one.

Introduce new argument types ARG_PTR_TO_INT and ARG_PTR_TO_LONG to be
able to pass result from helper to program via pointer to int and long
correspondingly: func(int *) or func(long *).

New argument types are similar to ARG_PTR_TO_MEM with the following
differences:
* they don't require corresponding ARG_CONST_SIZE argument, predefined
  access sizes are used instead (32bit for int, 64bit for long);
* it's possible to use more than one such an argument in a helper;
* provided pointers have to be aligned.

It's easy to introduce similar ARG_PTR_TO_CHAR and ARG_PTR_TO_SHORT
argument types. It's not done due to lack of use-case though.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-12 13:54:59 -07:00
Andrey Ignatov
e1550bfe0d bpf: Add file_pos field to bpf_sysctl ctx
Add file_pos field to bpf_sysctl context to read and write sysctl file
position at which sysctl is being accessed (read or written).

The field can be used to e.g. override whole sysctl value on write to
sysctl even when sys_write is called by user space with file_pos > 0. Or
BPF program may reject such accesses.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-12 13:54:58 -07:00
Andrey Ignatov
4e63acdff8 bpf: Introduce bpf_sysctl_{get,set}_new_value helpers
Add helpers to work with new value being written to sysctl by user
space.

bpf_sysctl_get_new_value() copies value being written to sysctl into
provided buffer.

bpf_sysctl_set_new_value() overrides new value being written by user
space with a one from provided buffer. Buffer should contain string
representation of the value, similar to what can be seen in /proc/sys/.

Both helpers can be used only on sysctl write.

File position matters and can be managed by an interface that will be
introduced separately. E.g. if user space calls sys_write to a file in
/proc/sys/ at file position = X, where X > 0, then the value set by
bpf_sysctl_set_new_value() will be written starting from X. If program
wants to override whole value with specified buffer, file position has
to be set to zero.

Documentation for the new helpers is provided in bpf.h UAPI.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-12 13:54:58 -07:00
Andrey Ignatov
1d11b3016c bpf: Introduce bpf_sysctl_get_current_value helper
Add bpf_sysctl_get_current_value() helper to copy current sysctl value
into provided by BPF_PROG_TYPE_CGROUP_SYSCTL program buffer.

It provides same string as user space can see by reading corresponding
file in /proc/sys/, including new line, etc.

Documentation for the new helper is provided in bpf.h UAPI.

Since current value is kept in ctl_table->data in a parsed form,
ctl_table->proc_handler() with write=0 is called to read that data and
convert it to a string. Such a string can later be parsed by a program
using helpers that will be introduced separately.

Unfortunately it's not trivial to provide API to access parsed data due to
variety of data representations (string, intvec, uintvec, ulongvec,
custom structures, even NULL, etc). Instead it's assumed that user know
how to handle specific sysctl they're interested in and appropriate
helpers can be used.

Since ctl_table->proc_handler() expects __user buffer, conversion to
__user happens for kernel allocated one where the value is stored.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-12 13:54:58 -07:00
Andrey Ignatov
808649fb78 bpf: Introduce bpf_sysctl_get_name helper
Add bpf_sysctl_get_name() helper to copy sysctl name (/proc/sys/ entry)
into provided by BPF_PROG_TYPE_CGROUP_SYSCTL program buffer.

By default full name (w/o /proc/sys/) is copied, e.g. "net/ipv4/tcp_mem".

If BPF_F_SYSCTL_BASE_NAME flag is set, only base name will be copied,
e.g. "tcp_mem".

Documentation for the new helper is provided in bpf.h UAPI.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-12 13:54:58 -07:00
Andrey Ignatov
7b146cebe3 bpf: Sysctl hook
Containerized applications may run as root and it may create problems
for whole host. Specifically such applications may change a sysctl and
affect applications in other containers.

Furthermore in existing infrastructure it may not be possible to just
completely disable writing to sysctl, instead such a process should be
gradual with ability to log what sysctl are being changed by a
container, investigate, limit the set of writable sysctl to currently
used ones (so that new ones can not be changed) and eventually reduce
this set to zero.

The patch introduces new program type BPF_PROG_TYPE_CGROUP_SYSCTL and
attach type BPF_CGROUP_SYSCTL to solve these problems on cgroup basis.

New program type has access to following minimal context:
	struct bpf_sysctl {
		__u32	write;
	};

Where @write indicates whether sysctl is being read (= 0) or written (=
1).

Helpers to access sysctl name and value will be introduced separately.

BPF_CGROUP_SYSCTL attach point is added to sysctl code right before
passing control to ctl_table->proc_handler so that BPF program can
either allow or deny access to sysctl.

Suggested-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-12 13:54:58 -07:00
Andrey Ignatov
b1cd609d9b bpf: Add base proto function for cgroup-bpf programs
Currently kernel/bpf/cgroup.c contains only one program type and one
proto function cgroup_dev_func_proto(). It'd be useful to have base
proto function that can be reused for new cgroup-bpf program types
coming soon.

Introduce cgroup_base_func_proto().

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-12 13:54:58 -07:00
David S. Miller
bb23581b9b Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2019-04-12

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Improve BPF verifier scalability for large programs through two
   optimizations: i) remove verifier states that are not useful in pruning,
   ii) stop walking parentage chain once first LIVE_READ is seen. Combined
   gives approx 20x speedup. Increase limits for accepting large programs
   under root, and add various stress tests, from Alexei.

2) Implement global data support in BPF. This enables static global variables
   for .data, .rodata and .bss sections to be properly handled which allows
   for more natural program development. This also opens up the possibility
   to optimize program workflow by compiling ELFs only once and later only
   rewriting section data before reload, from Daniel and with test cases and
   libbpf refactoring from Joe.

3) Add config option to generate BTF type info for vmlinux as part of the
   kernel build process. DWARF debug info is converted via pahole to BTF.
   Latter relies on libbpf and makes use of BTF deduplication algorithm which
   results in 100x savings compared to DWARF data. Resulting .BTF section is
   typically about 2MB in size, from Andrii.

4) Add BPF verifier support for stack access with variable offset from
   helpers and add various test cases along with it, from Andrey.

5) Extend bpf_skb_adjust_room() growth BPF helper to mark inner MAC header
   so that L2 encapsulation can be used for tc tunnels, from Alan.

6) Add support for input __sk_buff context in BPF_PROG_TEST_RUN so that
   users can define a subset of allowed __sk_buff fields that get fed into
   the test program, from Stanislav.

7) Add bpf fs multi-dimensional array tests for BTF test suite and fix up
   various UBSAN warnings in bpftool, from Yonghong.

8) Generate a pkg-config file for libbpf, from Luca.

9) Dump program's BTF id in bpftool, from Prashant.

10) libbpf fix to use smaller BPF log buffer size for AF_XDP's XDP
    program, from Magnus.

11) kallsyms related fixes for the case when symbols are not present in
    BPF selftests and samples, from Daniel
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 17:00:05 -07:00
Stanislav Fomichev
b0b9395d86 bpf: support input __sk_buff context in BPF_PROG_TEST_RUN
Add new set of arguments to bpf_attr for BPF_PROG_TEST_RUN:
* ctx_in/ctx_size_in - input context
* ctx_out/ctx_size_out - output context

The intended use case is to pass some meta data to the test runs that
operate on skb (this has being brought up on recent LPC).

For programs that use bpf_prog_test_run_skb, support __sk_buff input and
output. Initially, from input __sk_buff, copy _only_ cb and priority into
skb, all other non-zero fields are prohibited (with EINVAL).
If the user has set ctx_out/ctx_size_out, copy the potentially modified
__sk_buff back to the userspace.

We require all fields of input __sk_buff except the ones we explicitly
support to be set to zero. The expectation is that in the future we might
add support for more fields and we want to fail explicitly if the user
runs the program on the kernel where we don't yet support them.

The API is intentionally vague (i.e. we don't explicitly add __sk_buff
to bpf_attr, but ctx_in) to potentially let other test_run types use
this interface in the future (this can be xdp_md for xdp types for
example).

v4:
  * don't copy more than allowed in bpf_ctx_init [Martin]

v3:
  * handle case where ctx_in is NULL, but ctx_out is not [Martin]
  * convert size==0 checks to ptr==NULL checks and add some extra ptr
    checks [Martin]

v2:
  * Addressed comments from Martin Lau

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-11 10:21:40 +02:00
Daniel Borkmann
2824ecb701 bpf: allow for key-less BTF in array map
Given we'll be reusing BPF array maps for global data/bss/rodata
sections, we need a way to associate BTF DataSec type as its map
value type. In usual cases we have this ugly BPF_ANNOTATE_KV_PAIR()
macro hack e.g. via 38d5d3b3d5 ("bpf: Introduce BPF_ANNOTATE_KV_PAIR")
to get initial map to type association going. While more use cases
for it are discouraged, this also won't work for global data since
the use of array map is a BPF loader detail and therefore unknown
at compilation time. For array maps with just a single entry we make
an exception in terms of BTF in that key type is declared optional
if value type is of DataSec type. The latter LLVM is guaranteed to
emit and it also aligns with how we regard global data maps as just
a plain buffer area reusing existing map facilities for allowing
things like introspection with existing tools.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-09 17:05:46 -07:00
Daniel Borkmann
1dc9285184 bpf: kernel side support for BTF Var and DataSec
This work adds kernel-side verification, logging and seq_show dumping
of BTF Var and DataSec kinds which are emitted with latest LLVM. The
following constraints apply:

BTF Var must have:

- Its kind_flag is 0
- Its vlen is 0
- Must point to a valid type
- Type must not resolve to a forward type
- Size of underlying type must be > 0
- Must have a valid name
- Can only be a source type, not sink or intermediate one
- Name may include dots (e.g. in case of static variables
  inside functions)
- Cannot be a member of a struct/union
- Linkage so far can either only be static or global/allocated

BTF DataSec must have:

- Its kind_flag is 0
- Its vlen cannot be 0
- Its size cannot be 0
- Must have a valid name
- Can only be a source type, not sink or intermediate one
- Name may include dots (e.g. to represent .bss, .data, .rodata etc)
- Cannot be a member of a struct/union
- Inner btf_var_secinfo array with {type,offset,size} triple
  must be sorted by offset in ascending order
- Type must always point to BTF Var
- BTF resolved size of Var must be <= size provided by triple
- DataSec size must be >= sum of triple sizes (thus holes
  are allowed)

btf_var_resolve(), btf_ptr_resolve() and btf_modifier_resolve()
are on a high level quite similar but each come with slight,
subtle differences. They could potentially be a bit refactored
in future which hasn't been done here to ease review.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-09 17:05:46 -07:00
Daniel Borkmann
3e0ddc4f3f bpf: allow . char as part of the object name
Trivial addition to allow '.' aside from '_' as "special" characters
in the object name. Used to allow for substrings in maps from loader
side such as ".bss", ".data", ".rodata", but could also be useful for
other purposes.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-09 17:05:46 -07:00
Daniel Borkmann
87df15de44 bpf: add syscall side map freeze support
This patch adds a new BPF_MAP_FREEZE command which allows to
"freeze" the map globally as read-only / immutable from syscall
side.

Map permission handling has been refactored into map_get_sys_perms()
and drops FMODE_CAN_WRITE in case of locked map. Main use case is
to allow for setting up .rodata sections from the BPF ELF which
are loaded into the kernel, meaning BPF loader first allocates
map, sets up map value by copying .rodata section into it and once
complete, it calls BPF_MAP_FREEZE on the map fd to prevent further
modifications.

Right now BPF_MAP_FREEZE only takes map fd as argument while remaining
bpf_attr members are required to be zero. I didn't add write-only
locking here as counterpart since I don't have a concrete use-case
for it on my side, and I think it makes probably more sense to wait
once there is actually one. In that case bpf_attr can be extended
as usual with a flag field and/or others where flag 0 means that
we lock the map read-only hence this doesn't prevent to add further
extensions to BPF_MAP_FREEZE upon need.

A map creation flag like BPF_F_WRONCE was not considered for couple
of reasons: i) in case of a generic implementation, a map can consist
of more than just one element, thus there could be multiple map
updates needed to set the map into a state where it can then be
made immutable, ii) WRONCE indicates exact one-time write before
it is then set immutable. A generic implementation would set a bit
atomically on map update entry (if unset), indicating that every
subsequent update from then onwards will need to bail out there.
However, map updates can fail, so upon failure that flag would need
to be unset again and the update attempt would need to be repeated
for it to be eventually made immutable. While this can be made
race-free, this approach feels less clean and in combination with
reason i), it's not generic enough. A dedicated BPF_MAP_FREEZE
command directly sets the flag and caller has the guarantee that
map is immutable from syscall side upon successful return for any
future syscall invocations that would alter the map state, which
is also more intuitive from an API point of view. A command name
such as BPF_MAP_LOCK has been avoided as it's too close with BPF
map spin locks (which already has BPF_F_LOCK flag). BPF_MAP_FREEZE
is so far only enabled for privileged users.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-09 17:05:46 -07:00
Daniel Borkmann
591fe9888d bpf: add program side {rd, wr}only support for maps
This work adds two new map creation flags BPF_F_RDONLY_PROG
and BPF_F_WRONLY_PROG in order to allow for read-only or
write-only BPF maps from a BPF program side.

Today we have BPF_F_RDONLY and BPF_F_WRONLY, but this only
applies to system call side, meaning the BPF program has full
read/write access to the map as usual while bpf(2) calls with
map fd can either only read or write into the map depending
on the flags. BPF_F_RDONLY_PROG and BPF_F_WRONLY_PROG allows
for the exact opposite such that verifier is going to reject
program loads if write into a read-only map or a read into a
write-only map is detected. For read-only map case also some
helpers are forbidden for programs that would alter the map
state such as map deletion, update, etc. As opposed to the two
BPF_F_RDONLY / BPF_F_WRONLY flags, BPF_F_RDONLY_PROG as well
as BPF_F_WRONLY_PROG really do correspond to the map lifetime.

We've enabled this generic map extension to various non-special
maps holding normal user data: array, hash, lru, lpm, local
storage, queue and stack. Further generic map types could be
followed up in future depending on use-case. Main use case
here is to forbid writes into .rodata map values from verifier
side.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-09 17:05:46 -07:00
Daniel Borkmann
be70bcd53d bpf: do not retain flags that are not tied to map lifetime
Both BPF_F_WRONLY / BPF_F_RDONLY flags are tied to the map file
descriptor, but not to the map object itself! Meaning, at map
creation time BPF_F_RDONLY can be set to make the map read-only
from syscall side, but this holds only for the returned fd, so
any other fd either retrieved via bpf file system or via map id
for the very same underlying map object can have read-write access
instead.

Given that, keeping the two flags around in the map_flags attribute
and exposing them to user space upon map dump is misleading and
may lead to false conclusions. Since these two flags are not
tied to the map object lets also not store them as map property.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-09 17:05:46 -07:00
Daniel Borkmann
d8eca5bbb2 bpf: implement lookup-free direct value access for maps
This generic extension to BPF maps allows for directly loading
an address residing inside a BPF map value as a single BPF
ldimm64 instruction!

The idea is similar to what BPF_PSEUDO_MAP_FD does today, which
is a special src_reg flag for ldimm64 instruction that indicates
that inside the first part of the double insns's imm field is a
file descriptor which the verifier then replaces as a full 64bit
address of the map into both imm parts. For the newly added
BPF_PSEUDO_MAP_VALUE src_reg flag, the idea is the following:
the first part of the double insns's imm field is again a file
descriptor corresponding to the map, and the second part of the
imm field is an offset into the value. The verifier will then
replace both imm parts with an address that points into the BPF
map value at the given value offset for maps that support this
operation. Currently supported is array map with single entry.
It is possible to support more than just single map element by
reusing both 16bit off fields of the insns as a map index, so
full array map lookup could be expressed that way. It hasn't
been implemented here due to lack of concrete use case, but
could easily be done so in future in a compatible way, since
both off fields right now have to be 0 and would correctly
denote a map index 0.

The BPF_PSEUDO_MAP_VALUE is a distinct flag as otherwise with
BPF_PSEUDO_MAP_FD we could not differ offset 0 between load of
map pointer versus load of map's value at offset 0, and changing
BPF_PSEUDO_MAP_FD's encoding into off by one to differ between
regular map pointer and map value pointer would add unnecessary
complexity and increases barrier for debugability thus less
suitable. Using the second part of the imm field as an offset
into the value does /not/ come with limitations since maximum
possible value size is in u32 universe anyway.

This optimization allows for efficiently retrieving an address
to a map value memory area without having to issue a helper call
which needs to prepare registers according to calling convention,
etc, without needing the extra NULL test, and without having to
add the offset in an additional instruction to the value base
pointer. The verifier then treats the destination register as
PTR_TO_MAP_VALUE with constant reg->off from the user passed
offset from the second imm field, and guarantees that this is
within bounds of the map value. Any subsequent operations are
normally treated as typical map value handling without anything
extra needed from verification side.

The two map operations for direct value access have been added to
array map for now. In future other types could be supported as
well depending on the use case. The main use case for this commit
is to allow for BPF loader support for global variables that
reside in .data/.rodata/.bss sections such that we can directly
load the address of them with minimal additional infrastructure
required. Loader support has been added in subsequent commits for
libbpf library.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-09 17:05:46 -07:00
David S. Miller
310655b07a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-04-08 23:39:36 -07:00
Linus Torvalds
f654f0fc0b Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "14 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  kernel/sysctl.c: fix out-of-bounds access when setting file-max
  mm/util.c: fix strndup_user() comment
  sh: fix multiple function definition build errors
  MAINTAINERS: add maintainer and replacing reviewer ARM/NUVOTON NPCM
  MAINTAINERS: fix bad pattern in ARM/NUVOTON NPCM
  mm: writeback: use exact memcg dirty counts
  psi: clarify the units used in pressure files
  mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd()
  hugetlbfs: fix memory leak for resv_map
  mm: fix vm_fault_t cast in VM_FAULT_GET_HINDEX()
  lib/lzo: fix bugs for very short or empty input
  include/linux/bitrev.h: fix constant bitrev
  kmemleak: powerpc: skip scanning holes in the .bss section
  lib/string.c: implement a basic bcmp
2019-04-05 17:08:55 -10:00
Will Deacon
9002b21465 kernel/sysctl.c: fix out-of-bounds access when setting file-max
Commit 32a5ad9c22 ("sysctl: handle overflow for file-max") hooked up
min/max values for the file-max sysctl parameter via the .extra1 and
.extra2 fields in the corresponding struct ctl_table entry.

Unfortunately, the minimum value points at the global 'zero' variable,
which is an int.  This results in a KASAN splat when accessed as a long
by proc_doulongvec_minmax on 64-bit architectures:

  | BUG: KASAN: global-out-of-bounds in __do_proc_doulongvec_minmax+0x5d8/0x6a0
  | Read of size 8 at addr ffff2000133d1c20 by task systemd/1
  |
  | CPU: 0 PID: 1 Comm: systemd Not tainted 5.1.0-rc3-00012-g40b114779944 #2
  | Hardware name: linux,dummy-virt (DT)
  | Call trace:
  |  dump_backtrace+0x0/0x228
  |  show_stack+0x14/0x20
  |  dump_stack+0xe8/0x124
  |  print_address_description+0x60/0x258
  |  kasan_report+0x140/0x1a0
  |  __asan_report_load8_noabort+0x18/0x20
  |  __do_proc_doulongvec_minmax+0x5d8/0x6a0
  |  proc_doulongvec_minmax+0x4c/0x78
  |  proc_sys_call_handler.isra.19+0x144/0x1d8
  |  proc_sys_write+0x34/0x58
  |  __vfs_write+0x54/0xe8
  |  vfs_write+0x124/0x3c0
  |  ksys_write+0xbc/0x168
  |  __arm64_sys_write+0x68/0x98
  |  el0_svc_common+0x100/0x258
  |  el0_svc_handler+0x48/0xc0
  |  el0_svc+0x8/0xc
  |
  | The buggy address belongs to the variable:
  |  zero+0x0/0x40
  |
  | Memory state around the buggy address:
  |  ffff2000133d1b00: 00 00 00 00 00 00 00 00 fa fa fa fa 04 fa fa fa
  |  ffff2000133d1b80: fa fa fa fa 04 fa fa fa fa fa fa fa 04 fa fa fa
  | >ffff2000133d1c00: fa fa fa fa 04 fa fa fa fa fa fa fa 00 00 00 00
  |                                ^
  |  ffff2000133d1c80: fa fa fa fa 00 fa fa fa fa fa fa fa 00 00 00 00
  |  ffff2000133d1d00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Fix the splat by introducing a unsigned long 'zero_ul' and using that
instead.

Link: http://lkml.kernel.org/r/20190403153409.17307-1-will.deacon@arm.com
Fixes: 32a5ad9c22 ("sysctl: handle overflow for file-max")
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Christian Brauner <christian@brauner.io>
Cc: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-05 16:02:31 -10:00
Linus Torvalds
970b766cfd Andy Lutomirski approached me to tell me that the syscall_get_arguments()
implementation in x86 was horrible and gcc  certainly gets it wrong. He
 said that since the tracepoints only pass in 0 and 6 for i and n repectively,
 it should be optimized for that case. Inspecting the kernel, I discovered
 that all users pass in 0 for i and only one file passing in something other
 than 6 for the number of arguments. That code happens to be my own code used
 for the special syscall tracing. That can easily be converted to just
 using 0 and 6 as well, and only copying what is needed. Which is probably
 the faster path anyway for that case.
 
 Along the way, a couple of real fixes came from this as the
 syscall_get_arguments() function was incorrect for csky and riscv.
 
 x86 has been optimized to for the new interface that removes the variable
 number of arguments, but the other architectures could still use some
 loving and take more advantage of the simpler interface.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXKdi7RQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qjtiAQDaZbFaSgEbs99jjuAPDSZ0li8dyUOC
 3KS5TyuLw+fEaAD/QZnKjplVFAfA5FxrABZ0ioIKDON4nLyESEb+xCv0gA4=
 =dTuo
 -----END PGP SIGNATURE-----

Merge tag 'trace-5.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull syscall-get-arguments cleanup and fixes from Steven Rostedt:
 "Andy Lutomirski approached me to tell me that the
  syscall_get_arguments() implementation in x86 was horrible and gcc
  certainly gets it wrong.

  He said that since the tracepoints only pass in 0 and 6 for i and n
  repectively, it should be optimized for that case. Inspecting the
  kernel, I discovered that all users pass in 0 for i and only one file
  passing in something other than 6 for the number of arguments. That
  code happens to be my own code used for the special syscall tracing.

  That can easily be converted to just using 0 and 6 as well, and only
  copying what is needed. Which is probably the faster path anyway for
  that case.

  Along the way, a couple of real fixes came from this as the
  syscall_get_arguments() function was incorrect for csky and riscv.

  x86 has been optimized to for the new interface that removes the
  variable number of arguments, but the other architectures could still
  use some loving and take more advantage of the simpler interface"

* tag 'trace-5.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  syscalls: Remove start and number from syscall_set_arguments() args
  syscalls: Remove start and number from syscall_get_arguments() args
  csky: Fix syscall_get_arguments() and syscall_set_arguments()
  riscv: Fix syscall_get_arguments() and syscall_set_arguments()
  tracing/syscalls: Pass in hardcoded 6 into syscall_get_arguments()
  ptrace: Remove maxargs from task_current_syscall()
2019-04-05 13:15:57 -10:00
David S. Miller
f83f715195 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Minor comment merge conflict in mlx5.

Staging driver has a fixup due to the skb->xmit_more changes
in 'net-next', but was removed in 'net'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-05 14:14:19 -07:00
Andrey Ignatov
1fbd20f8b7 bpf: Add missed newline in verifier verbose log
check_stack_access() that prints verbose log is used in
adjust_ptr_min_max_vals() that prints its own verbose log and now they
stick together, e.g.:

  variable stack access var_off=(0xfffffffffffffff0; 0x4) off=-16
  size=1R2 stack pointer arithmetic goes out of range, prohibited for
  !root

Add missing newline so that log is more readable:
  variable stack access var_off=(0xfffffffffffffff0; 0x4) off=-16 size=1
  R2 stack pointer arithmetic goes out of range, prohibited for !root

Fixes: f1174f77b5 ("bpf/verifier: rework value tracking")
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-05 16:50:08 +02:00
Andrey Ignatov
107c26a70c bpf: Sanity check max value for var_off stack access
As discussed in [1] max value of variable offset has to be checked for
overflow on stack access otherwise verifier would accept code like this:

  0: (b7) r2 = 6
  1: (b7) r3 = 28
  2: (7a) *(u64 *)(r10 -16) = 0
  3: (7a) *(u64 *)(r10 -8) = 0
  4: (79) r4 = *(u64 *)(r1 +168)
  5: (c5) if r4 s< 0x0 goto pc+4
   R1=ctx(id=0,off=0,imm=0) R2=inv6 R3=inv28
   R4=inv(id=0,umax_value=9223372036854775807,var_off=(0x0;
   0x7fffffffffffffff)) R10=fp0,call_-1 fp-8=mmmmmmmm fp-16=mmmmmmmm
  6: (17) r4 -= 16
  7: (0f) r4 += r10
  8: (b7) r5 = 8
  9: (85) call bpf_getsockopt#57
  10: (b7) r0 = 0
  11: (95) exit

, where R4 obviosly has unbounded max value.

Fix it by checking that reg->smax_value is inside (-BPF_MAX_VAR_OFF;
BPF_MAX_VAR_OFF) range.

reg->smax_value is used instead of reg->umax_value because stack
pointers are calculated using negative offset from fp. This is opposite
to e.g. map access where offset must be non-negative and where
umax_value is used.

Also dedicated verbose logs are added for both min and max bound check
failures to have diagnostics consistent with variable offset handling in
check_map_access().

[1] https://marc.info/?l=linux-netdev&m=155433357510597&w=2

Fixes: 2011fccfb6 ("bpf: Support variable offset stack access from helpers")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-05 16:50:08 +02:00
Andrey Ignatov
088ec26d9c bpf: Reject indirect var_off stack access in unpriv mode
Proper support of indirect stack access with variable offset in
unprivileged mode (!root) requires corresponding support in Spectre
masking for stack ALU in retrieve_ptr_limit().

There are no use-case for variable offset in unprivileged mode though so
make verifier reject such accesses for simplicity.

Pointer arithmetics is one (and only?) way to cause variable offset and
it's already rejected in unpriv mode so that verifier won't even get to
helper function whose argument contains variable offset, e.g.:

  0: (7a) *(u64 *)(r10 -16) = 0
  1: (7a) *(u64 *)(r10 -8) = 0
  2: (61) r2 = *(u32 *)(r1 +0)
  3: (57) r2 &= 4
  4: (17) r2 -= 16
  5: (0f) r2 += r10
  variable stack access var_off=(0xfffffffffffffff0; 0x4) off=-16 size=1R2
  stack pointer arithmetic goes out of range, prohibited for !root

Still it looks like a good idea to reject variable offset indirect stack
access for unprivileged mode in check_stack_boundary() explicitly.

Fixes: 2011fccfb6 ("bpf: Support variable offset stack access from helpers")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-05 16:50:07 +02:00
Andrey Ignatov
f2bcd05ec7 bpf: Reject indirect var_off stack access in raw mode
It's hard to guarantee that whole memory is marked as initialized on
helper return if uninitialized stack is accessed with variable offset
since specific bounds are unknown to verifier. This may cause
uninitialized stack leaking.

Reject such an access in check_stack_boundary to prevent possible
leaking.

There are no known use-cases for indirect uninitialized stack access
with variable offset so it shouldn't break anything.

Fixes: 2011fccfb6 ("bpf: Support variable offset stack access from helpers")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-05 16:50:07 +02:00
Steven Rostedt (Red Hat)
b35f549df1 syscalls: Remove start and number from syscall_get_arguments() args
At Linux Plumbers, Andy Lutomirski approached me and pointed out that the
function call syscall_get_arguments() implemented in x86 was horribly
written and not optimized for the standard case of passing in 0 and 6 for
the starting index and the number of system calls to get. When looking at
all the users of this function, I discovered that all instances pass in only
0 and 6 for these arguments. Instead of having this function handle
different cases that are never used, simply rewrite it to return the first 6
arguments of a system call.

This should help out the performance of tracing system calls by ptrace,
ftrace and perf.

Link: http://lkml.kernel.org/r/20161107213233.754809394@goodmis.org

Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Dave Martin <dave.martin@arm.com>
Cc: "Dmitry V. Levin" <ldv@altlinux.org>
Cc: x86@kernel.org
Cc: linux-snps-arc@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-c6x-dev@linux-c6x.org
Cc: uclinux-h8-devel@lists.sourceforge.jp
Cc: linux-hexagon@vger.kernel.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-mips@vger.kernel.org
Cc: nios2-dev@lists.rocketboards.org
Cc: openrisc@lists.librecores.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-riscv@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Cc: linux-um@lists.infradead.org
Cc: linux-xtensa@linux-xtensa.org
Cc: linux-arch@vger.kernel.org
Acked-by: Paul Burton <paul.burton@mips.com> # MIPS parts
Acked-by: Max Filippov <jcmvbkbc@gmail.com> # For xtensa changes
Acked-by: Will Deacon <will.deacon@arm.com> # For the arm64 bits
Reviewed-by: Thomas Gleixner <tglx@linutronix.de> # for x86
Reviewed-by: Dmitry V. Levin <ldv@altlinux.org>
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-04-05 09:26:43 -04:00
Linus Torvalds
0548740e53 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Several hash table refcount fixes in batman-adv, from Sven
    Eckelmann.

 2) Use after free in bpf_evict_inode(), from Daniel Borkmann.

 3) Fix mdio bus registration in ixgbe, from Ivan Vecera.

 4) Unbounded loop in __skb_try_recv_datagram(), from Paolo Abeni.

 5) ila rhashtable corruption fix from Herbert Xu.

 6) Don't allow upper-devices to be added to vrf devices, from Sabrina
    Dubroca.

 7) Add qmi_wwan device ID for Olicard 600, from Bjørn Mork.

 8) Don't leave skb->next poisoned in __netif_receive_skb_list_ptype,
    from Alexander Lobakin.

 9) Missing IDR checks in mlx5 driver, from Aditya Pakki.

10) Fix false connection termination in ktls, from Jakub Kicinski.

11) Work around some ASPM issues with r8169 by disabling rx interrupt
    coalescing on certain chips. From Heiner Kallweit.

12) Properly use per-cpu qstat values on NOLOCK qdiscs, from Paolo
    Abeni.

13) Fully initialize sockaddr_in structures in SCTP, from Xin Long.

14) Various BPF flow dissector fixes from Stanislav Fomichev.

15) Divide by zero in act_sample, from Davide Caratti.

16) Fix bridging multicast regression introduced by rhashtable
    conversion, from Nikolay Aleksandrov.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits)
  ibmvnic: Fix completion structure initialization
  ipv6: sit: reset ip header pointer in ipip6_rcv
  net: bridge: always clear mcast matching struct on reports and leaves
  libcxgb: fix incorrect ppmax calculation
  vlan: conditional inclusion of FCoE hooks to match netdevice.h and bnx2x
  sch_cake: Make sure we can write the IP header before changing DSCP bits
  sch_cake: Use tc_skb_protocol() helper for getting packet protocol
  tcp: Ensure DCTCP reacts to losses
  net/sched: act_sample: fix divide by zero in the traffic path
  net: thunderx: fix NULL pointer dereference in nicvf_open/nicvf_stop
  net: hns: Fix sparse: some warnings in HNS drivers
  net: hns: Fix WARNING when remove HNS driver with SMMU enabled
  net: hns: fix ICMP6 neighbor solicitation messages discard problem
  net: hns: Fix probabilistic memory overwrite when HNS driver initialized
  net: hns: Use NAPI_POLL_WEIGHT for hns driver
  net: hns: fix KASAN: use-after-free in hns_nic_net_xmit_hw()
  flow_dissector: rst'ify documentation
  ipv6: Fix dangling pointer when ipv6 fragment
  net-gro: Fix GRO flush when receiving a GSO packet.
  flow_dissector: document BPF flow dissector environment
  ...
2019-04-04 18:07:12 -10:00
Steven Rostedt (Red Hat)
d08e411397 tracing/syscalls: Pass in hardcoded 6 into syscall_get_arguments()
The only users that calls syscall_get_arguments() with a variable and not a
hard coded '6' is ftrace_syscall_enter(). syscall_get_arguments() can be
optimized by removing a variable input, and always grabbing 6 arguments
regardless of what the system call actually uses.

Change ftrace_syscall_enter() to pass the 6 args into a local stack array
and copy the necessary arguments into the trace event as needed.

This is needed to remove two parameters from syscall_get_arguments().

Link: http://lkml.kernel.org/r/20161107213233.627583542@goodmis.org

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-04-04 09:17:52 -04:00
Alexei Starovoitov
7a9f5c65ab bpf: increase verifier log limit
The existing 16Mbyte verifier log limit is not enough for log_level=2
even for small programs. Increase it to 1Gbyte.
Note it's not a kernel memory limit.
It's an amount of memory user space provides to store
the verifier log. The kernel populates it 1k at a time.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-04 01:27:38 +02:00
Alexei Starovoitov
c04c0d2b96 bpf: increase complexity limit and maximum program size
Large verifier speed improvements allow to increase
verifier complexity limit.
Now regardless of the program composition and its size it takes
little time for the verifier to hit insn_processed limit.
On typical x86 machine non-debug kernel processes 1M instructions
in 1/10 of a second.
(before these speed improvements specially crafted programs
could be hitting multi-second verification times)
Full kasan kernel with debug takes ~1 second for the same 1M insns.
Hence bump the BPF_COMPLEXITY_LIMIT_INSNS limit to 1M.
Also increase the number of instructions per program
from 4k to internal BPF_COMPLEXITY_LIMIT_INSNS limit.
4k limit was confusing to users, since small programs with hundreds
of insns could be hitting BPF_COMPLEXITY_LIMIT_INSNS limit.
Sometimes adding more insns and bpf_trace_printk debug statements
would make the verifier accept the program while removing
code would make the verifier reject it.
Some user space application started to add #define MAX_FOO to
their programs and do:
  MAX_FOO=100;
again:
  compile with MAX_FOO;
  try to load;
  if (fails_to_load) { reduce MAX_FOO; goto again; }
to be able to fit maximum amount of processing into single program.
Other users artificially split their single program into a set of programs
and use all 32 iterations of tail_calls to increase compute limits.
And the most advanced folks used unlimited tc-bpf filter list
to execute many bpf programs.
Essentially the users managed to workaround 4k insn limit.
This patch removes the limit for root programs from uapi.
BPF_COMPLEXITY_LIMIT_INSNS is the kernel internal limit
and success to load the program no longer depends on program size,
but on 'smartness' of the verifier only.
The verifier will continue to get smarter with every kernel release.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-04 01:27:38 +02:00
Alexei Starovoitov
4f73379ec5 bpf: verbose jump offset overflow check
Larger programs may trigger 16-bit jump offset overflow check
during instruction patching. Make this error verbose otherwise
users cannot decipher error code without printks in the verifier.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-04 01:27:38 +02:00
Alexei Starovoitov
71dde681a8 bpf: convert temp arrays to kvcalloc
Temporary arrays used during program verification need to be vmalloc-ed
to support large bpf programs.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-04 01:27:38 +02:00
Alexei Starovoitov
25af32dad8 bpf: improve verification speed by not remarking live_read
With large verifier speed improvement brought by the previous patch
mark_reg_read() becomes the hottest function during verification.
On a typical program it consumes 40% of cpu.
mark_reg_read() walks parentage chain of registers to mark parents as LIVE_READ.
Once the register is marked there is no need to remark it again in the future.
Hence stop walking the chain once first LIVE_READ is seen.
This optimization drops mark_reg_read() time from 40% of cpu to <1%
and overall 2x improvement of verification speed.
For some programs the longest_mark_read_walk counter improves from ~500 to ~5

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-04 01:27:37 +02:00
Alexei Starovoitov
9f4686c41b bpf: improve verification speed by droping states
Branch instructions, branch targets and calls in a bpf program are
the places where the verifier remembers states that led to successful
verification of the program.
These states are used to prune brute force program analysis.
For unprivileged programs there is a limit of 64 states per such
'branching' instructions (maximum length is tracked by max_states_per_insn
counter introduced in the previous patch).
Simply reducing this threshold to 32 or lower increases insn_processed
metric to the point that small valid programs get rejected.
For root programs there is no limit and cilium programs can have
max_states_per_insn to be 100 or higher.
Walking 100+ states multiplied by number of 'branching' insns during
verification consumes significant amount of cpu time.
Turned out simple LRU-like mechanism can be used to remove states
that unlikely will be helpful in future search pruning.
This patch introduces hit_cnt and miss_cnt counters:
hit_cnt - this many times this state successfully pruned the search
miss_cnt - this many times this state was not equivalent to other states
(and that other states were added to state list)

The heuristic introduced in this patch is:
if (sl->miss_cnt > sl->hit_cnt * 3 + 3)
  /* drop this state from future considerations */

Higher numbers increase max_states_per_insn (allow more states to be
considered for pruning) and slow verification speed, but do not meaningfully
reduce insn_processed metric.
Lower numbers drop too many states and insn_processed increases too much.
Many different formulas were considered.
This one is simple and works well enough in practice.
(the analysis was done on selftests/progs/* and on cilium programs)

The end result is this heuristic improves verification speed by 10 times.
Large synthetic programs that used to take a second more now take
1/10 of a second.
In cases where max_states_per_insn used to be 100 or more, now it's ~10.

There is a slight increase in insn_processed for cilium progs:
                       before   after
bpf_lb-DLB_L3.o 	1831	1838
bpf_lb-DLB_L4.o 	3029	3218
bpf_lb-DUNKNOWN.o 	1064	1064
bpf_lxc-DDROP_ALL.o	26309	26935
bpf_lxc-DUNKNOWN.o	33517	34439
bpf_netdev.o		9713	9721
bpf_overlay.o		6184	6184
bpf_lcx_jit.o		37335	39389
And 2-3 times improvement in the verification speed.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-04 01:27:37 +02:00
Alexei Starovoitov
06ee7115b0 bpf: add verifier stats and log_level bit 2
In order to understand the verifier bottlenecks add various stats
and extend log_level:
log_level 1 and 2 are kept as-is:
bit 0 - level=1 - print every insn and verifier state at branch points
bit 1 - level=2 - print every insn and verifier state at every insn
bit 2 - level=4 - print verifier error and stats at the end of verification

When verifier rejects the program the libbpf is trying to load the program twice.
Once with log_level=0 (no messages, only error code is reported to user space)
and second time with log_level=1 to tell the user why the verifier rejected it.

With introduction of bit 2 - level=4 the libbpf can choose to always use that
level and load programs once, since the verification speed is not affected and
in case of error the verbose message will be available.

Note that the verifier stats are not part of uapi just like all other
verbose messages. They're expected to change in the future.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-04 01:27:37 +02:00
Jann Horn
556a888a14
signal: don't silently convert SI_USER signals to non-current pidfd
The current sys_pidfd_send_signal() silently turns signals with explicit
SI_USER context that are sent to non-current tasks into signals with
kernel-generated siginfo.
This is unlike do_rt_sigqueueinfo(), which returns -EPERM in this case.
If a user actually wants to send a signal with kernel-provided siginfo,
they can do that with pidfd_send_signal(pidfd, sig, NULL, 0); so allowing
this case is unnecessary.

Instead of silently replacing the siginfo, just bail out with an error;
this is consistent with other interfaces and avoids special-casing behavior
based on security checks.

Fixes: 3eb39f4793 ("signal: add pidfd_send_signal() syscall")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner <christian@brauner.io>
2019-04-01 23:03:18 +02:00
Linus Torvalds
c29d85417c Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug fixes from Thomas Gleixner:
 "Two SMT/hotplug related fixes:

   - Prevent crash when HOTPLUG_CPU is disabled and the CPU bringup
     aborts. This is triggered with the 'nosmt' command line option, but
     can happen by any abort condition. As the real unplug code is not
     compiled in, prevent the fail by keeping the CPU in zombie state.

   - Enforce HOTPLUG_CPU for SMP on x86 to avoid the above situation
     completely. With 'nosmt' being a popular option it's required to
     unplug the half brought up sibling CPUs (due to the MCE wreckage)
     completely"

* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/smp: Enforce CONFIG_HOTPLUG_CPU when SMP=y
  cpu/hotplug: Prevent crash when CPU bringup fails on CONFIG_HOTPLUG_CPU=n
2019-03-31 08:22:12 -07:00
Linus Torvalds
f78b5be2a5 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core fixes from Thomas Gleixner:
 "A small set of core updates:

   - Make the watchdog respect the selected CPU mask again. That was
     broken by the rework of the watchdog thread management and caused
     inconsistent state and NMI watchdog being unstoppable.

   - Ensure that the objtool build can find the libelf location.

   - Remove dead kcore stub code"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  watchdog: Respect watchdog cpumask on CPU hotplug
  objtool: Query pkg-config for libelf location
  proc/kcore: Remove unused kclist_add_remap()
2019-03-31 07:47:21 -07:00
David S. Miller
22bdf7d459 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2019-03-29

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Bug fix in BTF deduplication that was mishandling an equivalence
   comparison, from Andrii.

2) libbpf Makefile fixes to properly link against libelf for the shared
   object and to actually export AF_XDP's xsk.h header, from Björn.

3) Fix use after free in bpf inode eviction, from Daniel.

4) Fix a bug in skb creation out of cpumap redirect, from Jesper.

5) Remove an unnecessary and triggerable WARN_ONCE() in max number
   of call stack frames checking in verifier, from Paul.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 21:00:28 -07:00
Jesper Dangaard Brouer
676e4a6fe7 xdp: fix cpumap redirect SKB creation bug
We want to avoid leaking pointer info from xdp_frame (that is placed in
top of frame) like commit 6dfb970d3d ("xdp: avoid leaking info stored in
frame data on page reuse"), and followup commit 97e19cce05 ("bpf:
reserve xdp_frame size in xdp headroom") that reserve this headroom.

These changes also affected how cpumap constructed SKBs, as xdpf->headroom
size changed, the skb data starting point were in-effect shifted with 32
bytes (sizeof xdp_frame). This was still okay, as the cpumap frame_size
calculation also included xdpf->headroom which were reduced by same amount.

A bug was introduced in commit 77ea5f4cbe ("bpf/cpumap: make sure
frame_size for build_skb is aligned if headroom isn't"), where the
xdpf->headroom became part of the SKB_DATA_ALIGN rounding up. This
round-up to find the frame_size is in principle still correct as it does
not exceed the 2048 bytes frame_size (which is max for ixgbe and i40e),
but the 32 bytes offset of pkt_data_start puts this over the 2048 bytes
limit. This cause skb_shared_info to spill into next frame. It is a little
hard to trigger, as the SKB need to use above 15 skb_shinfo->frags[] as
far as I calculate. This does happen in practise for TCP streams when
skb_try_coalesce() kicks in.

KASAN can be used to detect these wrong memory accesses, I've seen:
 BUG: KASAN: use-after-free in skb_try_coalesce+0x3cb/0x760
 BUG: KASAN: wild-memory-access in skb_release_data+0xe2/0x250

Driver veth also construct a SKB from xdp_frame in this way, but is not
affected, as it doesn't reserve/deduct the room (used by xdp_frame) from
the SKB headroom. Instead is clears the pointers via xdp_scrub_frame(),
and allows SKB to use this area.

The fix in this patch is to do like veth and instead allow SKB to (re)use
the area occupied by xdp_frame, by clearing via xdp_scrub_frame().  (This
does kill the idea of the SKB being able to access (mem) info from this
area, but I guess it was a bad idea anyhow, and it was already killed by
the veth changes.)

Fixes: 77ea5f4cbe ("bpf/cpumap: make sure frame_size for build_skb is aligned if headroom isn't")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-03-29 12:15:02 -07:00
Andrey Ignatov
2011fccfb6 bpf: Support variable offset stack access from helpers
Currently there is a difference in how verifier checks memory access for
helper arguments for PTR_TO_MAP_VALUE and PTR_TO_STACK with regard to
variable part of offset.

check_map_access, that is used for PTR_TO_MAP_VALUE, can handle variable
offsets just fine, so that BPF program can call a helper like this:

  some_helper(map_value_ptr + off, size);

, where offset is unknown at load time, but is checked by program to be
in a safe rage (off >= 0 && off + size < map_value_size).

But it's not the case for check_stack_boundary, that is used for
PTR_TO_STACK, and same code with pointer to stack is rejected by
verifier:

  some_helper(stack_value_ptr + off, size);

For example:
  0: (7a) *(u64 *)(r10 -16) = 0
  1: (7a) *(u64 *)(r10 -8) = 0
  2: (61) r2 = *(u32 *)(r1 +0)
  3: (57) r2 &= 4
  4: (17) r2 -= 16
  5: (0f) r2 += r10
  6: (18) r1 = 0xffff888111343a80
  8: (85) call bpf_map_lookup_elem#1
  invalid variable stack read R2 var_off=(0xfffffffffffffff0; 0x4)

Add support for variable offset access to check_stack_boundary so that
if offset is checked by program to be in a safe range it's accepted by
verifier.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-03-29 12:05:35 -07:00
Andrei Vagin
fcfc2aa018 ptrace: take into account saved_sigmask in PTRACE{GET,SET}SIGMASK
There are a few system calls (pselect, ppoll, etc) which replace a task
sigmask while they are running in a kernel-space

When a task calls one of these syscalls, the kernel saves a current
sigmask in task->saved_sigmask and sets a syscall sigmask.

On syscall-exit-stop, ptrace traps a task before restoring the
saved_sigmask, so PTRACE_GETSIGMASK returns the syscall sigmask and
PTRACE_SETSIGMASK does nothing, because its sigmask is replaced by
saved_sigmask, when the task returns to user-space.

This patch fixes this problem.  PTRACE_GETSIGMASK returns saved_sigmask
if it's set.  PTRACE_SETSIGMASK drops the TIF_RESTORE_SIGMASK flag.

Link: http://lkml.kernel.org/r/20181120060616.6043-1-avagin@gmail.com
Fixes: 29000caecb ("ptrace: add ability to get/set signal-blocked mask")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-29 10:01:37 -07:00
Thomas Gleixner
206b92353c cpu/hotplug: Prevent crash when CPU bringup fails on CONFIG_HOTPLUG_CPU=n
Tianyu reported a crash in a CPU hotplug teardown callback when booting a
kernel which has CONFIG_HOTPLUG_CPU disabled with the 'nosmt' boot
parameter.

It turns out that the SMP=y CONFIG_HOTPLUG_CPU=n case has been broken
forever in case that a bringup callback fails. Unfortunately this issue was
not recognized when the CPU hotplug code was reworked, so the shortcoming
just stayed in place.

When a bringup callback fails, the CPU hotplug code rolls back the
operation and takes the CPU offline.

The 'nosmt' command line argument uses a bringup failure to abort the
bringup of SMT sibling CPUs. This partial bringup is required due to the
MCE misdesign on Intel CPUs.

With CONFIG_HOTPLUG_CPU=y the rollback works perfectly fine, but
CONFIG_HOTPLUG_CPU=n lacks essential mechanisms to exercise the low level
teardown of a CPU including the synchronizations in various facilities like
RCU, NOHZ and others.

As a consequence the teardown callbacks which must be executed on the
outgoing CPU within stop machine with interrupts disabled are executed on
the control CPU in interrupt enabled and preemptible context causing the
kernel to crash and burn. The pre state machine code has a different
failure mode which is more subtle and resulting in a less obvious use after
free crash because the control side frees resources which are still in use
by the undead CPU.

But this is not a x86 only problem. Any architecture which supports the
SMP=y HOTPLUG_CPU=n combination suffers from the same issue. It's just less
likely to be triggered because in 99.99999% of the cases all bringup
callbacks succeed.

The easy solution of making HOTPLUG_CPU mandatory for SMP is not working on
all architectures as the following architectures have either no hotplug
support at all or not all subarchitectures support it:

 alpha, arc, hexagon, openrisc, riscv, sparc (32bit), mips (partial).

Crashing the kernel in such a situation is not an acceptable state
either.

Implement a minimal rollback variant by limiting the teardown to the point
where all regular teardown callbacks have been invoked and leave the CPU in
the 'dead' idle state. This has the following consequences:

 - the CPU is brought down to the point where the stop_machine takedown
   would happen.

 - the CPU stays there forever and is idle

 - The CPU is cleared in the CPU active mask, but not in the CPU online
   mask which is a legit state.

 - Interrupts are not forced away from the CPU

 - All facilities which only look at online mask would still see it, but
   that is the case during normal hotplug/unplug operations as well. It's
   just a (way) longer time frame.

This will expose issues, which haven't been exposed before or only seldom,
because now the normally transient state of being non active but online is
a permanent state. In testing this exposed already an issue vs. work queues
where the vmstat code schedules work on the almost dead CPU which ends up
in an unbound workqueue and triggers 'preemtible context' warnings. This is
not a problem of this change, it merily exposes an already existing issue.
Still this is better than crashing fully without a chance to debug it.

This is mainly thought as workaround for those architectures which do not
support HOTPLUG_CPU. All others should enforce HOTPLUG_CPU for SMP.

Fixes: 2e1a3483ce ("cpu/hotplug: Split out the state walk into functions")
Reported-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Konrad Wilk <konrad.wilk@oracle.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Mukesh Ojha <mojha@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Rik van Riel <riel@surriel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Micheal Kelley <michael.h.kelley@microsoft.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20190326163811.503390616@linutronix.de
2019-03-28 13:34:58 +01:00
Thomas Gleixner
7dd4761711 watchdog: Respect watchdog cpumask on CPU hotplug
The rework of the watchdog core to use cpu_stop_work broke the watchdog
cpumask on CPU hotplug.

The watchdog_enable/disable() functions are now called unconditionally from
the hotplug callback, i.e. even on CPUs which are not in the watchdog
cpumask. As a consequence the watchdog can become unstoppable.

Only invoke them when the plugged CPU is in the watchdog cpumask.

Fixes: 9cf57731b6 ("watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work")
Reported-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1903262245490.1789@nanos.tec.linutronix.de
2019-03-28 13:32:01 +01:00