linux/arch
Martin KaFai Lau 85d33df357 bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS
The patch introduces BPF_MAP_TYPE_STRUCT_OPS.  The map value
is a kernel struct with its func ptr implemented in bpf prog.
This new map is the interface to register/unregister/introspect
a bpf implemented kernel struct.

The kernel struct is actually embedded inside another new struct
(or called the "value" struct in the code).  For example,
"struct tcp_congestion_ops" is embbeded in:
struct bpf_struct_ops_tcp_congestion_ops {
	refcount_t refcnt;
	enum bpf_struct_ops_state state;
	struct tcp_congestion_ops data;  /* <-- kernel subsystem struct here */
}
The map value is "struct bpf_struct_ops_tcp_congestion_ops".
The "bpftool map dump" will then be able to show the
state ("inuse"/"tobefree") and the number of subsystem's refcnt (e.g.
number of tcp_sock in the tcp_congestion_ops case).  This "value" struct
is created automatically by a macro.  Having a separate "value" struct
will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding
"void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some
initialization works before registering the struct_ops to the kernel
subsystem).  The libbpf will take care of finding and populating the
"struct bpf_struct_ops_XYZ" from "struct XYZ".

Register a struct_ops to a kernel subsystem:
1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s)
2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id
   set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the
   running kernel.
   Instead of reusing the attr->btf_value_type_id,
   btf_vmlinux_value_type_id s added such that attr->btf_fd can still be
   used as the "user" btf which could store other useful sysadmin/debug
   info that may be introduced in the furture,
   e.g. creation-date/compiler-details/map-creator...etc.
3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described
   in the running kernel btf.  Populate the value of this object.
   The function ptr should be populated with the prog fds.
4. Call BPF_MAP_UPDATE with the object created in (3) as
   the map value.  The key is always "0".

During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's
args as an array of u64 is generated.  BPF_MAP_UPDATE also allows
the specific struct_ops to do some final checks in "st_ops->init_member()"
(e.g. ensure all mandatory func ptrs are implemented).
If everything looks good, it will register this kernel struct
to the kernel subsystem.  The map will not allow further update
from this point.

Unregister a struct_ops from the kernel subsystem:
BPF_MAP_DELETE with key "0".

Introspect a struct_ops:
BPF_MAP_LOOKUP_ELEM with key "0".  The map value returned will
have the prog _id_ populated as the func ptr.

The map value state (enum bpf_struct_ops_state) will transit from:
INIT (map created) =>
INUSE (map updated, i.e. reg) =>
TOBEFREE (map value deleted, i.e. unreg)

The kernel subsystem needs to call bpf_struct_ops_get() and
bpf_struct_ops_put() to manage the "refcnt" in the
"struct bpf_struct_ops_XYZ".  This patch uses a separate refcnt
for the purose of tracking the subsystem usage.  Another approach
is to reuse the map->refcnt and then "show" (i.e. during map_lookup)
the subsystem's usage by doing map->refcnt - map->usercnt to filter out
the map-fd/pinned-map usage.  However, that will also tie down the
future semantics of map->refcnt and map->usercnt.

The very first subsystem's refcnt (during reg()) holds one
count to map->refcnt.  When the very last subsystem's refcnt
is gone, it will also release the map->refcnt.  All bpf_prog will be
freed when the map->refcnt reaches 0 (i.e. during map_free()).

Here is how the bpftool map command will look like:
[root@arch-fb-vm1 bpf]# bpftool map show
6: struct_ops  name dctcp  flags 0x0
	key 4B  value 256B  max_entries 1  memlock 4096B
	btf_id 6
[root@arch-fb-vm1 bpf]# bpftool map dump id 6
[{
        "value": {
            "refcnt": {
                "refs": {
                    "counter": 1
                }
            },
            "state": 1,
            "data": {
                "list": {
                    "next": 0,
                    "prev": 0
                },
                "key": 0,
                "flags": 2,
                "init": 24,
                "release": 0,
                "ssthresh": 25,
                "cong_avoid": 30,
                "set_state": 27,
                "cwnd_event": 28,
                "in_ack_event": 26,
                "undo_cwnd": 29,
                "pkts_acked": 0,
                "min_tso_segs": 0,
                "sndbuf_expand": 0,
                "cong_control": 0,
                "get_info": 0,
                "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0
                ],
                "owner": 0
            }
        }
    }
]

Misc Notes:
* bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup.
  It does an inplace update on "*value" instead returning a pointer
  to syscall.c.  Otherwise, it needs a separate copy of "zero" value
  for the BPF_STRUCT_OPS_STATE_INIT to avoid races.

* The bpf_struct_ops_map_delete_elem() is also called without
  preempt_disable() from map_delete_elem().  It is because
  the "->unreg()" may requires sleepable context, e.g.
  the "tcp_unregister_congestion_control()".

* "const" is added to some of the existing "struct btf_func_model *"
  function arg to avoid a compiler warning caused by this patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200109003505.3855919-1-kafai@fb.com
2020-01-09 08:46:18 -08:00
..
alpha alpha: use pgtable-nopud instead of 4level-fixup 2019-12-04 19:44:14 -08:00
arc treewide: Use sizeof_field() macro 2019-12-09 10:36:44 -08:00
arm Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next 2019-12-27 14:20:10 -08:00
arm64 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next 2019-12-27 14:20:10 -08:00
c6x c6x: use pgtable-nopud instead of 4level-fixup 2019-12-04 19:44:15 -08:00
csky dma-mapping updates for 5.5-rc1 2019-11-28 11:16:43 -08:00
h8300 h8300: Move EXCEPTION_TABLE to RO_DATA segment 2019-11-04 18:12:55 +01:00
hexagon Kbuild updates for v5.5 2019-12-02 17:35:04 -08:00
ia64 drm msm + fixes for 5.5-rc1 2019-12-06 10:28:09 -08:00
m68k netdev: pass the stuck queue to the timeout handler 2019-12-12 21:38:57 -08:00
microblaze microblaze: use pgtable-nopmd instead of 4level-fixup 2019-12-04 19:44:15 -08:00
mips Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-12-22 15:15:05 -08:00
nds32 nds32: use pgtable-nopmd instead of 4level-fixup 2019-12-04 19:44:15 -08:00
nios2 nios2: Fix ioremap 2019-12-12 16:34:33 +08:00
openrisc OpenRISC updates for 5.5 2019-12-02 17:18:43 -08:00
parisc parisc: Fix compiler warnings in debug_core.c 2019-12-20 21:01:42 +01:00
powerpc PPC: 2019-12-22 10:26:59 -08:00
riscv Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-12-31 13:37:13 -08:00
s390 s390/ftrace: save traced function caller 2019-12-18 23:29:26 +01:00
sh Wimplicit-fallthrough patches for 5.5-rc2 2019-12-14 16:10:22 -08:00
sparc treewide: Use sizeof_field() macro 2019-12-09 10:36:44 -08:00
um netdev: pass the stuck queue to the timeout handler 2019-12-12 21:38:57 -08:00
unicore32 generic ioremap support 2019-11-28 10:57:12 -08:00
x86 bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS 2020-01-09 08:46:18 -08:00
xtensa netdev: pass the stuck queue to the timeout handler 2019-12-12 21:38:57 -08:00
.gitignore
Kconfig arch/Kconfig: fix indentation 2019-12-04 19:44:12 -08:00