Commit Graph

38740 Commits

Author SHA1 Message Date
Mykola Lysenko
7df5072cc0 bpf: Small BPF verifier log improvements
In particular these include:

  1) Remove output of inv for scalars in print_verifier_state
  2) Replace inv with scalar in verifier error messages
  3) Remove _value suffixes for umin/umax/s32_min/etc (except map_value)
  4) Remove output of id=0
  5) Remove output of ref_obj_id=0

Signed-off-by: Mykola Lysenko <mykolal@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220301222745.1667206-1-mykolal@fb.com
2022-03-03 16:54:10 +01:00
Randy Dunlap
80e4390981 dma-debug: fix return value of __setup handlers
When valid kernel command line parameters
  dma_debug=off dma_debug_entries=100
are used, they are reported as Unknown parameters and added to init's
environment strings, polluting it.

  Unknown kernel command line parameters "BOOT_IMAGE=/boot/bzImage-517rc5
    dma_debug=off dma_debug_entries=100", will be passed to user space.

and

 Run /sbin/init as init process
   with arguments:
     /sbin/init
   with environment:
     HOME=/
     TERM=linux
     BOOT_IMAGE=/boot/bzImage-517rc5
     dma_debug=off
     dma_debug_entries=100

Return 1 from these __setup handlers to indicate that the command line
option has been handled.

Fixes: 59d3daafa1 ("dma-debug: add kernel command line parameters")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Igor Zhbanov <i.zhbanov@omprussia.ru>
Link: lore.kernel.org/r/64644a2f-4a20-bab3-1e15-3b2cdd0defe3@omprussia.ru
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: iommu@lists.linux-foundation.org
Cc: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-03-03 14:01:45 +03:00
Christoph Hellwig
f5ff79fddf dma-mapping: remove CONFIG_DMA_REMAP
CONFIG_DMA_REMAP is used to build a few helpers around the core
vmalloc code, and to use them in case there is a highmem page in
dma-direct, and to make dma coherent allocations be able to use
non-contiguous pages allocations for DMA allocations in the dma-iommu
layer.

Right now it needs to be explicitly selected by architectures, and
is only done so by architectures that require remapping to deal
with devices that are not DMA coherent.  Make it unconditional for
builds with CONFIG_MMU as it is very little extra code, but makes
it much more likely that large DMA allocations succeed on x86.

This fixes hot plugging a NVMe thunderbolt SSD for me, which tries
to allocate a 1MB buffer that is otherwise hard to obtain due to
memory fragmentation on a heavily used laptop.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
2022-03-03 14:00:57 +03:00
Linus Torvalds
5859a2b199 Merge branch 'ucount-rlimit-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull ucounts fix from Eric Biederman:
 "Etienne Dechamps recently found a regression caused by enforcing
  RLIMIT_NPROC for root where the rlimit was not previously enforced.

  Michal Koutný had previously pointed out the inconsistency in
  enforcing the RLIMIT_NPROC that had been on the root owned process
  after the root user creates a user namespace.

  Which makes the fix for the regression simply removing the
  inconsistency"

* 'ucount-rlimit-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  ucounts: Fix systemd LimitNPROC with private users regression
2022-03-02 16:20:04 -08:00
Song Liu
676b2daaba bpf, x86: Set header->size properly before freeing it
On do_jit failure path, the header is freed by bpf_jit_binary_pack_free.
While bpf_jit_binary_pack_free doesn't require proper ro_header->size,
bpf_prog_pack_free still uses it. Set header->size in bpf_int_jit_compile
before calling bpf_jit_binary_pack_free.

Fixes: 1022a5498f ("bpf, x86_64: Use bpf_jit_binary_pack_alloc")
Fixes: 33c9805860 ("bpf: Introduce bpf_jit_binary_pack_[alloc|finalize|free]")
Reported-by: Kui-Feng Lee <kuifeng@fb.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20220302175126.247459-3-song@kernel.org
2022-03-02 13:24:37 -08:00
Randy Dunlap
b665eae7a7 printk: fix return value of printk.devkmsg __setup handler
If an invalid option value is used with "printk.devkmsg=<value>",
it is silently ignored.
If a valid option value is used, it is honored but the wrong return
value (0) is used, indicating that the command line option had an
error and was not handled. This string is not added to init's
environment strings due to init/main.c::unknown_bootoption()
checking for a '.' in the boot option string and then considering
that string to be an "Unused module parameter".

Print a warning message if a bad option string is used.
Always return 1 from the __setup handler to indicate that the command
line option has been handled.

Fixes: 750afe7bab ("printk: add kernel parameter to control writes to /dev/kmsg")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Igor Zhbanov <i.zhbanov@omprussia.ru>
Link: lore.kernel.org/r/64644a2f-4a20-bab3-1e15-3b2cdd0defe3@omprussia.ru
Cc: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: John Ogness <john.ogness@linutronix.de>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220228220556.23484-1-rdunlap@infradead.org
2022-03-02 09:01:37 +01:00
Steven Rostedt (Google)
1d1898f656 tracing/histogram: Fix sorting on old "cpu" value
When trying to add a histogram against an event with the "cpu" field, it
was impossible due to "cpu" being a keyword to key off of the running CPU.
So to fix this, it was changed to "common_cpu" to match the other generic
fields (like "common_pid"). But since some scripts used "cpu" for keying
off of the CPU (for events that did not have "cpu" as a field, which is
most of them), a backward compatibility trick was added such that if "cpu"
was used as a key, and the event did not have "cpu" as a field name, then
it would fallback and switch over to "common_cpu".

This fix has a couple of subtle bugs. One was that when switching over to
"common_cpu", it did not change the field name, it just set a flag. But
the code still found a "cpu" field. The "cpu" field is used for filtering
and is returned when the event does not have a "cpu" field.

This was found by:

  # cd /sys/kernel/tracing
  # echo hist:key=cpu,pid:sort=cpu > events/sched/sched_wakeup/trigger
  # cat events/sched/sched_wakeup/hist

Which showed the histogram unsorted:

{ cpu:         19, pid:       1175 } hitcount:          1
{ cpu:          6, pid:        239 } hitcount:          2
{ cpu:         23, pid:       1186 } hitcount:         14
{ cpu:         12, pid:        249 } hitcount:          2
{ cpu:          3, pid:        994 } hitcount:          5

Instead of hard coding the "cpu" checks, take advantage of the fact that
trace_event_field_field() returns a special field for "cpu" and "CPU" if
the event does not have "cpu" as a field. This special field has the
"filter_type" of "FILTER_CPU". Check that to test if the returned field is
of the CPU type instead of doing the string compare.

Also, fix the sorting bug by testing for the hist_field flag of
HIST_FIELD_FL_CPU when setting up the sort routine. Otherwise it will use
the special CPU field to know what compare routine to use, and since that
special field does not have a size, it returns tracing_map_cmp_none.

Cc: stable@vger.kernel.org
Fixes: 1e3bac71c5 ("tracing/histogram: Rename "cpu" to "common_cpu"")
Reported-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-03-01 22:48:30 -05:00
Randy Dunlap
7a64ca17e4 PM: suspend: fix return value of __setup handler
If an invalid option is given for "test_suspend=<option>", the entire
string is added to init's environment, so return 1 instead of 0 from
the __setup handler.

  Unknown kernel command line parameters "BOOT_IMAGE=/boot/bzImage-517rc5
    test_suspend=invalid"

and

 Run /sbin/init as init process
   with arguments:
     /sbin/init
   with environment:
     HOME=/
     TERM=linux
     BOOT_IMAGE=/boot/bzImage-517rc5
     test_suspend=invalid

Fixes: 2ce986892f ("PM / sleep: Enhance test_suspend option with repeat capability")
Fixes: 27ddcc6596 ("PM / sleep: Add state field to pm_states[] entries")
Fixes: a9d7052363 ("PM: Separate suspend to RAM functionality from core")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Igor Zhbanov <i.zhbanov@omprussia.ru>
Link: lore.kernel.org/r/64644a2f-4a20-bab3-1e15-3b2cdd0defe3@omprussia.ru
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-03-01 18:55:07 +01:00
Randy Dunlap
ba7ffcd4c4 PM: hibernate: fix __setup handler error handling
If an invalid value is used in "resumedelay=<seconds>", it is
silently ignored. Add a warning message and then let the __setup
handler return 1 to indicate that the kernel command line option
has been handled.

Fixes: 317cf7e5e8 ("PM / hibernate: convert simple_strtoul to kstrtoul")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Igor Zhbanov <i.zhbanov@omprussia.ru>
Link: lore.kernel.org/r/64644a2f-4a20-bab3-1e15-3b2cdd0defe3@omprussia.ru
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-03-01 18:51:11 +01:00
Jiapeng Chong
444e1154b2 PM: hibernate: Clean up non-kernel-doc comments
Address the following W=1 kernel build warning:

kernel/power/swap.c:120: warning: This comment starts with '/**', but
isn't a kernel-doc comment. Refer
Documentation/doc-guide/kernel-doc.rst.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-03-01 16:20:46 +01:00
Valentin Schneider
fa2c3254d7 sched/tracing: Don't re-read p->state when emitting sched_switch event
As of commit

  c6e7bd7afa ("sched/core: Optimize ttwu() spinning on p->on_cpu")

the following sequence becomes possible:

		      p->__state = TASK_INTERRUPTIBLE;
		      __schedule()
			deactivate_task(p);
  ttwu()
    READ !p->on_rq
    p->__state=TASK_WAKING
			trace_sched_switch()
			  __trace_sched_switch_state()
			    task_state_index()
			      return 0;

TASK_WAKING isn't in TASK_REPORT, so the task appears as TASK_RUNNING in
the trace event.

Prevent this by pushing the value read from __schedule() down the trace
event.

Reported-by: Abhijeet Dharmapurikar <adharmap@quicinc.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20220120162520.570782-2-valentin.schneider@arm.com
2022-03-01 16:18:39 +01:00
Valentin Schneider
49bef33e4b sched/rt: Plug rt_mutex_setprio() vs push_rt_task() race
John reported that push_rt_task() can end up invoking
find_lowest_rq(rq->curr) when curr is not an RT task (in this case a CFS
one), which causes mayhem down convert_prio().

This can happen when current gets demoted to e.g. CFS when releasing an
rt_mutex, and the local CPU gets hit with an rto_push_work irqwork before
getting the chance to reschedule. Exactly who triggers this work isn't
entirely clear to me - switched_from_rt() only invokes rt_queue_pull_task()
if there are no RT tasks on the local RQ, which means the local CPU can't
be in the rto_mask.

My current suspected sequence is something along the lines of the below,
with the demoted task being current.

  mark_wakeup_next_waiter()
    rt_mutex_adjust_prio()
      rt_mutex_setprio() // deboost originally-CFS task
	check_class_changed()
	  switched_from_rt() // Only rt_queue_pull_task() if !rq->rt.rt_nr_running
	  switched_to_fair() // Sets need_resched
      __balance_callbacks() // if pull_rt_task(), tell_cpu_to_push() can't select local CPU per the above
      raw_spin_rq_unlock(rq)

       // need_resched is set, so task_woken_rt() can't
       // invoke push_rt_tasks(). Best I can come up with is
       // local CPU has rt_nr_migratory >= 2 after the demotion, so stays
       // in the rto_mask, and then:

       <some other CPU running rto_push_irq_work_func() queues rto_push_work on this CPU>
	 push_rt_task()
	   // breakage follows here as rq->curr is CFS

Move an existing check to check rq->curr vs the next pushable task's
priority before getting anywhere near find_lowest_rq(). While at it, add an
explicit sched_class of rq->curr check prior to invoking
find_lowest_rq(rq->curr). Align the DL logic to also reschedule regardless
of next_task's migratability.

Fixes: a7c81556ec ("sched: Fix migrate_disable() vs rt/dl balancing")
Reported-by: John Keeping <john@metanate.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: John Keeping <john@metanate.com>
Link: https://lore.kernel.org/r/20220127154059.974729-1-valentin.schneider@arm.com
2022-03-01 16:18:38 +01:00
Chengming Zhou
3eba0505d0 sched/cpuacct: Remove redundant RCU read lock
The cpuacct_account_field() and it's cgroup v2 wrapper
cgroup_account_cputime_field() is only called from cputime
in task_group_account_field(), which is already in RCU read-side
critical section. So remove these redundant RCU read lock.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220220051426.5274-3-zhouchengming@bytedance.com
2022-03-01 16:18:38 +01:00
Chengming Zhou
dc6e0818bc sched/cpuacct: Optimize away RCU read lock
Since cpuacct_charge() is called from the scheduler update_curr(),
we must already have rq lock held, then the RCU read lock can
be optimized away.

And do the same thing in it's wrapper cgroup_account_cputime(),
but we can't use lockdep_assert_rq_held() there, which defined
in kernel/sched/sched.h.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220220051426.5274-2-zhouchengming@bytedance.com
2022-03-01 16:18:38 +01:00
Chengming Zhou
248cc9993d sched/cpuacct: Fix charge percpu cpuusage
The cpuacct_account_field() is always called by the current task
itself, so it's ok to use __this_cpu_add() to charge the tick time.

But cpuacct_charge() maybe called by update_curr() in load_balance()
on a random CPU, different from the CPU on which the task is running.
So __this_cpu_add() will charge that cputime to a random incorrect CPU.

Fixes: 73e6aafd9e ("sched/cpuacct: Simplify the cpuacct code")
Reported-by: Minye Zhu <zhuminye@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220220051426.5274-1-zhouchengming@bytedance.com
2022-03-01 16:18:37 +01:00
Tiezhu Yang
b664e255ba bpf: Add some description about BPF_JIT_ALWAYS_ON in Kconfig
When CONFIG_BPF_JIT_ALWAYS_ON is enabled, /proc/sys/net/core/bpf_jit_enable
is permanently set to 1 and setting any other value than that will return
failure.

Add the above description in the help text of config BPF_JIT_ALWAYS_ON, and
then we can distinguish between BPF_JIT_ALWAYS_ON and BPF_JIT_DEFAULT_ON.

Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/1645523826-18149-2-git-send-email-yangtiezhu@loongson.cn
2022-03-01 00:28:06 +01:00
Rafael J. Wysocki
075c3c483c Merge back cpufreq changes for v5.18. 2022-02-28 20:47:57 +01:00
Hao Luo
ceac059ed4 bpf: Cache the last valid build_id
For binaries that are statically linked, consecutive stack frames are
likely to be in the same VMA and therefore have the same build id.

On a real-world workload, we observed that 66% of CPU cycles in
__bpf_get_stackid() were spent on build_id_parse() and find_vma().

As an optimization for this case, we can cache the previous frame's
VMA, if the new frame has the same VMA as the previous one, reuse the
previous one's build id.

We are holding the MM locks as reader across the entire loop, so we
don't need to worry about VMA going away.

Tested through "stacktrace_build_id" and "stacktrace_build_id_nmi" in
test_progs.

Suggested-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/bpf/20220224000531.1265030-1-haoluo@google.com
2022-02-28 18:10:28 +01:00
Chuck Lever
f49169c97f NFSD: Remove svc_serv_ops::svo_module
struct svc_serv_ops is about to be removed.

Neil Brown says:
> I suspect svo_module can go as well - I don't think the thread is
> ever the thing that primarily keeps a module active.

A random sample of kthread_create() callers shows sunrpc is the only
one that manages module reference count in this way.

Suggested-by: Neil Brown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-02-28 10:26:40 -05:00
Steven Rostedt (Google)
c6ced22997 tracing: Update print fmt check to handle new __get_sockaddr() macro
A helper macro was added to make reading socket addresses easier in trace
events. It pairs %pISpc with __get_sockaddr() that reads the socket
address from the ring buffer into a human readable format.

The boot up check that makes sure that trace events do not reference
pointers to memory that can later be freed when the trace event is read,
incorrectly flagged this as a delayed reference. Update the check to
handle "__get_sockaddr" and not report an error on it.

Link: https://lore.kernel.org/all/20220125160505.068dbb52@canb.auug.org.au/

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-02-28 10:26:38 -05:00
Yu Kuai
3093929326 blktrace: fix use after free for struct blk_trace
When tracing the whole disk, 'dropped' and 'msg' will be created
under 'q->debugfs_dir' and 'bt->dir' is NULL, thus blk_trace_free()
won't remove those files. What's worse, the following UAF can be
triggered because of accessing stale 'dropped' and 'msg':

==================================================================
BUG: KASAN: use-after-free in blk_dropped_read+0x89/0x100
Read of size 4 at addr ffff88816912f3d8 by task blktrace/1188

CPU: 27 PID: 1188 Comm: blktrace Not tainted 5.17.0-rc4-next-20220217+ #469
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-4
Call Trace:
 <TASK>
 dump_stack_lvl+0x34/0x44
 print_address_description.constprop.0.cold+0xab/0x381
 ? blk_dropped_read+0x89/0x100
 ? blk_dropped_read+0x89/0x100
 kasan_report.cold+0x83/0xdf
 ? blk_dropped_read+0x89/0x100
 kasan_check_range+0x140/0x1b0
 blk_dropped_read+0x89/0x100
 ? blk_create_buf_file_callback+0x20/0x20
 ? kmem_cache_free+0xa1/0x500
 ? do_sys_openat2+0x258/0x460
 full_proxy_read+0x8f/0xc0
 vfs_read+0xc6/0x260
 ksys_read+0xb9/0x150
 ? vfs_write+0x3d0/0x3d0
 ? fpregs_assert_state_consistent+0x55/0x60
 ? exit_to_user_mode_prepare+0x39/0x1e0
 do_syscall_64+0x35/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fbc080d92fd
Code: ce 20 00 00 75 10 b8 00 00 00 00 0f 05 48 3d 01 f0 ff ff 73 31 c3 48 83 1
RSP: 002b:00007fbb95ff9cb0 EFLAGS: 00000293 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 00007fbb95ff9dc0 RCX: 00007fbc080d92fd
RDX: 0000000000000100 RSI: 00007fbb95ff9cc0 RDI: 0000000000000045
RBP: 0000000000000045 R08: 0000000000406299 R09: 00000000fffffffd
R10: 000000000153afa0 R11: 0000000000000293 R12: 00007fbb780008c0
R13: 00007fbb78000938 R14: 0000000000608b30 R15: 00007fbb780029c8
 </TASK>

Allocated by task 1050:
 kasan_save_stack+0x1e/0x40
 __kasan_kmalloc+0x81/0xa0
 do_blk_trace_setup+0xcb/0x410
 __blk_trace_setup+0xac/0x130
 blk_trace_ioctl+0xe9/0x1c0
 blkdev_ioctl+0xf1/0x390
 __x64_sys_ioctl+0xa5/0xe0
 do_syscall_64+0x35/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Freed by task 1050:
 kasan_save_stack+0x1e/0x40
 kasan_set_track+0x21/0x30
 kasan_set_free_info+0x20/0x30
 __kasan_slab_free+0x103/0x180
 kfree+0x9a/0x4c0
 __blk_trace_remove+0x53/0x70
 blk_trace_ioctl+0x199/0x1c0
 blkdev_common_ioctl+0x5e9/0xb30
 blkdev_ioctl+0x1a5/0x390
 __x64_sys_ioctl+0xa5/0xe0
 do_syscall_64+0x35/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

The buggy address belongs to the object at ffff88816912f380
 which belongs to the cache kmalloc-96 of size 96
The buggy address is located 88 bytes inside of
 96-byte region [ffff88816912f380, ffff88816912f3e0)
The buggy address belongs to the page:
page:000000009a1b4e7c refcount:1 mapcount:0 mapping:0000000000000000 index:0x0f
flags: 0x17ffffc0000200(slab|node=0|zone=2|lastcpupid=0x1fffff)
raw: 0017ffffc0000200 ffffea00044f1100 dead000000000002 ffff88810004c780
raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff88816912f280: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
 ffff88816912f300: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
>ffff88816912f380: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
                                                    ^
 ffff88816912f400: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
 ffff88816912f480: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
==================================================================

Fixes: c0ea57608b ("blktrace: remove debugfs file dentries from struct blk_trace")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20220228034354.4047385-1-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-28 06:36:33 -07:00
Connor O'Brien
5e214f2e43 bpf: Add config to allow loading modules with BTF mismatches
BTF mismatch can occur for a separately-built module even when the ABI is
otherwise compatible and nothing else would prevent successfully loading.

Add a new Kconfig to control how mismatches are handled. By default, preserve
the current behavior of refusing to load the module. If MODULE_ALLOW_BTF_MISMATCH
is enabled, load the module but ignore its BTF information.

Suggested-by: Yonghong Song <yhs@fb.com>
Suggested-by: Michal Suchánek <msuchanek@suse.de>
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/CAADnVQJ+OVPnBz8z3vNu8gKXX42jCUqfuvhWAyCQDu8N_yqqwQ@mail.gmail.com
Link: https://lore.kernel.org/bpf/20220223012814.1898677-1-connoro@google.com
2022-02-28 14:17:10 +01:00
Greg Kroah-Hartman
4a248f85b3 Merge 5.17-rc6 into driver-core-next
We need the driver core fix in here as well for future changes.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-28 07:45:41 +01:00
Greg Kroah-Hartman
085686fb84 Merge 5.17-rc6 into char-misc-next
We need the char-misc fixes in here.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-28 07:30:32 +01:00
Linus Torvalds
98f3e84f8d dma-mapping fix for Linux 5.17
- fix a swiotlb info leak (Halil Pasic)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmIbvm8LHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYMcBg/8DD+QsKEEv+D/+bCSWZVbW1ekFDiDyEdO9xuSymxT
 pf56Dy693G0BoPZ1JFiN17LIOp3vsHQfwE4ZDo/OHmyDTcziepD40SZiHD3eqEDB
 cl7RXhAcoxnUMkgK+hIb6Q7t/4dXaC3+rVXUuL//xUjXZu7GML46tRzuDoVr9MZp
 MBSGdZlXIzfvpzNf+yVzpk7sZP/w4nrzyuzeE5T6fdipjN2o78PhWmZzqsHGujeI
 ptMuZQwwDuXXetQNiA/t0+DkeWCz/+ERI7k/6lAma14aKHjM1Jn8HcNqW0UGUuGV
 148V4wKFqhaOXnoWuodrgjo9AJkOVqh+YH4ui60MMOUvZ4MPh4+5muobybzndtNy
 3bAM3JbcNUH1/vyyazvwr22My/TapbydwPuXda63Ls/2WbnV66xbgFEm6S9GXC4X
 6+ynK5CzWaTF5INgpd6WjrEMPXGCi0hXM9yQmJvlI1I0muFv9mBXhW/wP7OE3Na8
 eQXuP/v+QqYaDVTsGTtySNPkwBstFa9GQUTghrnxKzk3giLVZfGPX5Bs4/+gl4gl
 rrGBNhUAnMOdvBSkLUf4hDgyGvsCiYtB1YX9QYxq2aRCbIT03164VnXB0lQRKnmz
 Yu547NO96tBnoIT9/gQYMTHvwPU1WjWi3jVbT2zpMKgG4DBUEEIH33P9S1vbgecY
 vLY=
 =bR6d
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.17-1' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fix from Christoph Hellwig:

 - fix a swiotlb info leak (Halil Pasic)

* tag 'dma-mapping-5.17-1' of git://git.infradead.org/users/hch/dma-mapping:
  swiotlb: fix info leak with DMA_FROM_DEVICE
2022-02-27 12:42:37 -08:00
Linus Torvalds
2293be58d6 Tracing fixes for 5.17:
- rtla (Real-Time Linux Analysis tool): fix typo in man page
 
  - rtla: Update API -e to -E before it is released
 
  - rlla: Error message fix and memory leak fix
 
  - Partially uninline trace event soft disable to shrink text
 
  - Fix function graph start up test
 
  - Have triggers affect the trace instance they are in and not top level
 
  - Have osnoise sleep in the units it says it uses
 
  - Remove unused ftrace stub function
 
  - Remove event probe redundant info from event in the buffer
 
  - Fix group ownership setting in tracefs
 
  - Ensure trace buffer is minimum size to prevent crashes
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYho7XBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qiOhAQDbCbEjIYwkGCpckuGgSQiMU4bAWUzk
 jCz9PoaTxoIWJwEAsLWrAPb0pDzNwdEKjiC3fJoUJhz3NwlEjJ7hQ3BxzAI=
 =iXOQ
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:

 - rtla (Real-Time Linux Analysis tool):
    - fix typo in man page
    - Update API -e to -E before it is released
    - Error message fix and memory leak fix

 - Partially uninline trace event soft disable to shrink text

 - Fix function graph start up test

 - Have triggers affect the trace instance they are in and not top level

 - Have osnoise sleep in the units it says it uses

 - Remove unused ftrace stub function

 - Remove event probe redundant info from event in the buffer

 - Fix group ownership setting in tracefs

 - Ensure trace buffer is minimum size to prevent crashes

* tag 'trace-v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  rtla/osnoise: Fix error message when failing to enable trace instance
  rtla/osnoise: Free params at the exit
  rtla/hist: Make -E the short version of --entries
  tracing: Fix selftest config check for function graph start up test
  tracefs: Set the group ownership in apply_options() not parse_options()
  tracing/osnoise: Make osnoise_main to sleep for microseconds
  ftrace: Remove unused ftrace_startup_enable() stub
  tracing: Ensure trace buffer is at least 4096 bytes large
  tracing: Uninline trace_trigger_soft_disabled() partly
  eprobes: Remove redundant event type information
  tracing: Have traceon and traceoff trigger honor the instance
  tracing: Dump stacktrace trigger to the corresponding instance
  rtla: Fix systme -> system typo on man page
2022-02-26 12:10:17 -08:00
Christophe Leroy
c5229a0bd4 tracing: Fix selftest config check for function graph start up test
CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS is required to test
direct tramp.

Link: https://lkml.kernel.org/r/bdc7e594e13b0891c1d61bc8d56c94b1890eaed7.1640017960.git.christophe.leroy@csgroup.eu

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-02-25 21:05:29 -05:00
Yucong Sun
80bebebdac bpf: Fix issue with bpf preload module taking over stdout/stdin of kernel.
In cb80ddc671 ("bpf: Convert bpf_preload.ko to use light skeleton.")
BPF preload was switched from user mode process to use in-kernel light
skeleton instead. However, in the kernel context, early in the boot
sequence, the first available FD can start from 0, instead of normally
3 for user mode process. So FDs 0 and 1 are then used for loaded BPF
programs and prevent init process from setting up stdin/stdout/stderr on
FD 0, 1, and 2 as expected.

Before the fix:

ls -lah /proc/1/fd/*

lrwx------1 root root 64 Feb 23 17:20 /proc/1/fd/0 -> /dev/null
lrwx------ 1 root root 64 Feb 23 17:20 /proc/1/fd/1 -> /dev/null
lrwx------ 1 root root 64 Feb 23 17:20 /proc/1/fd/2 -> /dev/console
lrwx------ 1 root root 64 Feb 23 17:20 /proc/1/fd/6 -> /dev/console
lrwx------ 1 root root 64 Feb 23 17:20 /proc/1/fd/7 -> /dev/console

After the fix:

ls -lah /proc/1/fd/*

lrwx------ 1 root root 64 Feb 24 21:23 /proc/1/fd/0 -> /dev/console
lrwx------ 1 root root 64 Feb 24 21:23 /proc/1/fd/1 -> /dev/console
lrwx------ 1 root root 64 Feb 24 21:23 /proc/1/fd/2 -> /dev/console

Fix by closing prog FDs after initialization. struct bpf_prog's
themselves are kept alive through direct kernel references taken with
bpf_link_get_from_fd().

Fixes: cb80ddc671 ("bpf: Convert bpf_preload.ko to use light skeleton.")
Signed-off-by: Yucong Sun <fallentree@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220225185923.2535519-1-fallentree@fb.com
2022-02-25 12:48:35 -08:00
Daniel Bristot de Oliveira
dd990352f0 tracing/osnoise: Make osnoise_main to sleep for microseconds
osnoise's runtime and period are in the microseconds scale, but it is
currently sleeping in the millisecond's scale. This behavior roots in the
usage of hwlat as the skeleton for osnoise.

Make osnoise to sleep in the microseconds scale. Also, move the sleep to
a specialized function.

Link: https://lkml.kernel.org/r/302aa6c7bdf2d131719b22901905e9da122a11b2.1645197336.git.bristot@kernel.org

Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-02-25 12:07:01 -05:00
Nathan Chancellor
ab2f993c01 ftrace: Remove unused ftrace_startup_enable() stub
When building with clang + CONFIG_DYNAMIC_FTRACE=n + W=1, there is a
warning:

  kernel/trace/ftrace.c:7194:20: error: unused function 'ftrace_startup_enable' [-Werror,-Wunused-function]
  static inline void ftrace_startup_enable(int command) { }
                     ^
  1 error generated.

Clang warns on instances of static inline functions in .c files with W=1
after commit 6863f5643d ("kbuild: allow Clang to find unused static
inline functions for W=1 build").

The ftrace_startup_enable() stub has been unused since
commit e1effa0144 ("ftrace: Annotate the ops operation on update"),
where its use outside of the CONFIG_DYNAMIC_TRACE section was replaced
by ftrace_startup_all().  Remove it to resolve the warning.

Link: https://lkml.kernel.org/r/20220214192847.488166-1-nathan@kernel.org

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-02-25 12:07:01 -05:00
Sven Schnelle
7acf3a127b tracing: Ensure trace buffer is at least 4096 bytes large
Booting the kernel with 'trace_buf_size=1' give a warning at
boot during the ftrace selftests:

[    0.892809] Running postponed tracer tests:
[    0.892893] Testing tracer function:
[    0.901899] Callback from call_rcu_tasks_trace() invoked.
[    0.983829] Callback from call_rcu_tasks_rude() invoked.
[    1.072003] .. bad ring buffer .. corrupted trace buffer ..
[    1.091944] Callback from call_rcu_tasks() invoked.
[    1.097695] PASSED
[    1.097701] Testing dynamic ftrace: .. filter failed count=0 ..FAILED!
[    1.353474] ------------[ cut here ]------------
[    1.353478] WARNING: CPU: 0 PID: 1 at kernel/trace/trace.c:1951 run_tracer_selftest+0x13c/0x1b0

Therefore enforce a minimum of 4096 bytes to make the selftest pass.

Link: https://lkml.kernel.org/r/20220214134456.1751749-1-svens@linux.ibm.com

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-02-25 12:07:01 -05:00
Christophe Leroy
bc82c38a69 tracing: Uninline trace_trigger_soft_disabled() partly
On a powerpc32 build with CONFIG_CC_OPTIMISE_FOR_SIZE, the inline
keyword is not honored and trace_trigger_soft_disabled() appears
approx 50 times in vmlinux.

Adding -Winline to the build, the following message appears:

	./include/linux/trace_events.h:712:1: error: inlining failed in call to 'trace_trigger_soft_disabled': call is unlikely and code size would grow [-Werror=inline]

That function is rather big for an inlined function:

	c003df60 <trace_trigger_soft_disabled>:
	c003df60:	94 21 ff f0 	stwu    r1,-16(r1)
	c003df64:	7c 08 02 a6 	mflr    r0
	c003df68:	90 01 00 14 	stw     r0,20(r1)
	c003df6c:	bf c1 00 08 	stmw    r30,8(r1)
	c003df70:	83 e3 00 24 	lwz     r31,36(r3)
	c003df74:	73 e9 01 00 	andi.   r9,r31,256
	c003df78:	41 82 00 10 	beq     c003df88 <trace_trigger_soft_disabled+0x28>
	c003df7c:	38 60 00 00 	li      r3,0
	c003df80:	39 61 00 10 	addi    r11,r1,16
	c003df84:	4b fd 60 ac 	b       c0014030 <_rest32gpr_30_x>
	c003df88:	73 e9 00 80 	andi.   r9,r31,128
	c003df8c:	7c 7e 1b 78 	mr      r30,r3
	c003df90:	41 a2 00 14 	beq     c003dfa4 <trace_trigger_soft_disabled+0x44>
	c003df94:	38 c0 00 00 	li      r6,0
	c003df98:	38 a0 00 00 	li      r5,0
	c003df9c:	38 80 00 00 	li      r4,0
	c003dfa0:	48 05 c5 f1 	bl      c009a590 <event_triggers_call>
	c003dfa4:	73 e9 00 40 	andi.   r9,r31,64
	c003dfa8:	40 82 00 28 	bne     c003dfd0 <trace_trigger_soft_disabled+0x70>
	c003dfac:	73 ff 02 00 	andi.   r31,r31,512
	c003dfb0:	41 82 ff cc 	beq     c003df7c <trace_trigger_soft_disabled+0x1c>
	c003dfb4:	80 01 00 14 	lwz     r0,20(r1)
	c003dfb8:	83 e1 00 0c 	lwz     r31,12(r1)
	c003dfbc:	7f c3 f3 78 	mr      r3,r30
	c003dfc0:	83 c1 00 08 	lwz     r30,8(r1)
	c003dfc4:	7c 08 03 a6 	mtlr    r0
	c003dfc8:	38 21 00 10 	addi    r1,r1,16
	c003dfcc:	48 05 6f 6c 	b       c0094f38 <trace_event_ignore_this_pid>
	c003dfd0:	38 60 00 01 	li      r3,1
	c003dfd4:	4b ff ff ac 	b       c003df80 <trace_trigger_soft_disabled+0x20>

However it is located in a hot path so inlining it is important.
But forcing inlining of the entire function by using __always_inline
leads to increasing the text size by approx 20 kbytes.

Instead, split the fonction in two parts, one part with the likely
fast path, flagged __always_inline, and a second part out of line.

With this change, on a powerpc32 with CONFIG_CC_OPTIMISE_FOR_SIZE
vmlinux text increases by only 1,4 kbytes, which is partly
compensated by a decrease of vmlinux data by 7 kbytes.

On ppc64_defconfig which has CONFIG_CC_OPTIMISE_FOR_SPEED, this
change reduces vmlinux text by more than 30 kbytes.

Link: https://lkml.kernel.org/r/69ce0986a52d026d381d612801d978aa4f977460.1644563295.git.christophe.leroy@csgroup.eu

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-02-25 12:07:01 -05:00
Steven Rostedt (Google)
b61edd5774 eprobes: Remove redundant event type information
Currently, the event probes save the type of the event they are attached
to when recording the event. For example:

  # echo 'e:switch sched/sched_switch prev_state=$prev_state prev_prio=$prev_prio next_pid=$next_pid next_prio=$next_prio' > dynamic_events
  # cat events/eprobes/switch/format

 name: switch
 ID: 1717
 format:
        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;

        field:unsigned int __probe_type;        offset:8;       size:4; signed:0;
        field:u64 prev_state;   offset:12;      size:8; signed:0;
        field:u64 prev_prio;    offset:20;      size:8; signed:0;
        field:u64 next_pid;     offset:28;      size:8; signed:0;
        field:u64 next_prio;    offset:36;      size:8; signed:0;

 print fmt: "(%u) prev_state=0x%Lx prev_prio=0x%Lx next_pid=0x%Lx next_prio=0x%Lx", REC->__probe_type, REC->prev_state, REC->prev_prio, REC->next_pid, REC->next_prio

The __probe_type adds 4 bytes to every event.

One of the reasons for creating eprobes is to limit what is traced in an
event to be able to limit what is written into the ring buffer. Having
this redundant 4 bytes to every event takes away from this.

The event that is recorded can be retrieved from the event probe itself,
that is available when the trace is happening. For user space tools, it
could simply read the dynamic_event file to find the event they are for.
So there is really no reason to write this information into the ring
buffer for every event.

Link: https://lkml.kernel.org/r/20220218190057.2f5a19a8@gandalf.local.home

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-02-25 12:07:01 -05:00
Steven Rostedt (Google)
302e9edd54 tracing: Have traceon and traceoff trigger honor the instance
If a trigger is set on an event to disable or enable tracing within an
instance, then tracing should be disabled or enabled in the instance and
not at the top level, which is confusing to users.

Link: https://lkml.kernel.org/r/20220223223837.14f94ec3@rorschach.local.home

Cc: stable@vger.kernel.org
Fixes: ae63b31e4d ("tracing: Separate out trace events from global variables")
Tested-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-02-25 12:06:45 -05:00
Eric W. Biederman
0ac983f512 ucounts: Fix systemd LimitNPROC with private users regression
Long story short recursively enforcing RLIMIT_NPROC when it is not
enforced on the process that creates a new user namespace, causes
currently working code to fail.  There is no reason to enforce
RLIMIT_NPROC recursively when we don't enforce it normally so update
the code to detect this case.

I would like to simply use capable(CAP_SYS_RESOURCE) to detect when
RLIMIT_NPROC is not enforced upon the caller.  Unfortunately because
RLIMIT_NPROC is charged and checked for enforcement based upon the
real uid, using capable() which is euid based is inconsistent with reality.
Come as close as possible to testing for capable(CAP_SYS_RESOURCE) by
testing for when the real uid would match the conditions when
CAP_SYS_RESOURCE would be present if the real uid was the effective
uid.

Reported-by: Etienne Dechamps <etienne@edechamps.fr>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215596
Link: https://lkml.kernel.org/r/e9589141-cfeb-90cd-2d0e-83a62787239a@edechamps.fr
Link: https://lkml.kernel.org/r/87sfs8jmpz.fsf_-_@email.froward.int.ebiederm.org
Cc: stable@vger.kernel.org
Fixes: 21d1c5e386 ("Reimplement RLIMIT_NPROC on top of ucounts")
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-02-25 10:40:14 -06:00
Arnd Bergmann
543f7961c2 This cleans out the remaining board files from IXP4xx and
makes it an exclusive device tree subarchitecture without any
 special weirdness in arch/arm/mach-ixp4xx.
 
 The biggest noticeable change is the removal of the old PCI
 driver and along with that the removal of the special DMA
 coherency code and defines and the DMA bouncing.
 
 I tried to convert the IXP4xx to multiplatform on top of
 this but it didn't work because IXP4xx wants to be big
 endian and multiplatform config creates a problem like
 this:
 
 ../arch/arm/kernel/head.S: Assembler messages:
 ../arch/arm/kernel/head.S:94: Error: selected processor does not support `setend be' in ARM mode
 
 I think this is because MULTI_V5 turns on CPUs that cannot
 do big endian, and IXP4xx turn on big endian. (It crashes if
 I try to boot in little endian mode, sorry. It really wants
 to run big endian.)
 
 But before fixing multiplatform we can fix all of this!
 
 The networking patches are dependencies so I am requesting
 ACKs from the network maintainers on these.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEElDRnuGcz/wPCXQWMQRCzN7AZXXMFAmIH7KsACgkQQRCzN7AZ
 XXNGDQ//dfQIiKC0uE4UEbKeDG18c1NpSDHomQQ/xGjnl9nenmOzsqg9pu+hrD+y
 eHNp55PmL8WKjXYV8wrbQd65BsnMGpiIh4CoT17BOM0vjQ7+l0KM9FqCuYTIXkyE
 +AR8ZbOYZDSNZA6bKCy8dT1tYZakx2IK02YEphiAU2hPsvWYTE2VTDSYzskbW7pu
 +aaCzPOnteQZM4cRMPDOmhq9FedaCnluom9ldTei0nEkFuDi0v947zUVs8M3o7gi
 rsZBRxfXlY86G3b8WdqL0jCcrsWsnLnP6TehSH7r/TbsFlzdv+GQpfZ8craDpTTW
 X5zhUQYreAYNY88Vy1UoVSci6BSJF/c8Q75lcLOcIuf/XX6ufyTO1YHDXxxSm04P
 1cjunZzJs/YGx2dDgVfzSIXTBCMKovZcUVI5FfEP36wW3RA1WSa4cdWnI9alwlp4
 yX7AtOtbxIZ2jiNQMV4qx9Yz8SzlVWF7E63S16XAiFWr8BQex1Lmh2u5BZLE03XW
 k9yl+dhqmTf+MlHS7y70KdwxY0XvJNZiHoCEIA2A0B7ayvl9ThkXOWOkYBZLb0hq
 SmlIHJkRyFLNxtVTvadq3S8fa1haS9qTONwfS7KCkCPp8XpE4QCDCEAOK9EK0tCS
 p4SRGKD6SpoqEZKVItacGfLLAjJeGt6kIDBdRJ0J/nASVgcvaT8=
 =DMVU
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEo6/YBQwIrVS28WGKmmx57+YAGNkFAmIY9nQACgkQmmx57+YA
 GNkM8RAAqwYUKHl1pkAnTBU60aW9dWJpNdyIej22sdi+FcsGcF4tC/RQ+Zssl0VD
 eMvAM295eGZDb02B0f8EbPQHuDEp7Y9mpC+9QjoBCLCojho15ise3cKmyloG9/7T
 xo1J99lY/PW7ociVXqO4rZtWsplgy+FB3X+cCsk/UziwsF+HqjSSOi9VsEkHsBHI
 sJwDE90YetuC3S2+5/G9z2JbuqNchGebT66ZE13CGGBLb1cKBNMb2L1ZjwRlbhUK
 tYPBaTBnwJApLKoTQx4HlpAei5XROEB/3qxPH4e4RnW8MZyG7+ZNXnkIuZytJxVc
 zt5Qols2bLakHms0QdZhpeinvqPV4c88xGCZryQRHqbU1VLfygFFuoqnDfn/l0vW
 KWceaSMeL5EpvE+1sZsP7EcQn5AhCKckWoaL3TRJzeR/0JJswgBMMSBYof/eSP4u
 jJX+NBid6Hjred7B12QBFYc7L3LCDcmFmQM9LYa7IciedRx3fqw9yNRXzBVgTUDC
 t/cuLn06uLYTbn5fNnxxRvvDkLnXtf8OyzER0fy+Dm+terjIMATK3iQ0bGvwj46Y
 sQ74A7mQ+tsosmS+SEz79VD3AuzCqm3nfFTeU8W0QUj3IK1zqWKEjmJpeeniuL8L
 QBP7ZR8wlytfwq4fm2gI3wss1gFw8GYcJlkPhBAuHjMisiFWl10=
 =blMU
 -----END PGP SIGNATURE-----

Merge tag 'ixp4xx-cleanup-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-nomadik into arm/soc

This cleans out the remaining board files from IXP4xx and
makes it an exclusive device tree subarchitecture without any
special weirdness in arch/arm/mach-ixp4xx.

The biggest noticeable change is the removal of the old PCI
driver and along with that the removal of the special DMA
coherency code and defines and the DMA bouncing.

I tried to convert the IXP4xx to multiplatform on top of
this but it didn't work because IXP4xx wants to be big
endian and multiplatform config creates a problem like
this:

../arch/arm/kernel/head.S: Assembler messages:
../arch/arm/kernel/head.S:94: Error: selected processor does not support `setend be' in ARM mode

I think this is because MULTI_V5 turns on CPUs that cannot
do big endian, and IXP4xx turn on big endian. (It crashes if
I try to boot in little endian mode, sorry. It really wants
to run big endian.)

But before fixing multiplatform we can fix all of this!

The networking patches are dependencies so I am requesting
ACKs from the network maintainers on these.

* tag 'ixp4xx-cleanup-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-nomadik:
  ARM: ixp4xx: Convert to SPARSE_IRQ and P2V
  ARM: ixp4xx: Drop all common code
  ARM: ixp4xx: Drop custom DMA coherency and bouncing
  ARM: ixp4xx: Remove feature bit accessors
  net: ixp4xx_hss: Check features using syscon
  net: ixp4xx_eth: Drop platform data support
  soc: ixp4xx-npe: Access syscon regs using regmap
  soc: ixp4xx: Add features from regmap helper
  ARM: ixp4xx: Drop UDC info setting function
  ARM: ixp4xx: Drop stale Kconfig entry
  ARM: ixp4xx: Delete old PCI driver
  ARM: ixp4xx: Delete the Goramo MLR boardfile
  ARM: ixp4xx: Delete Gateway 7001 boardfiles

Link: https://lore.kernel.org/r/CACRpkdahK-jaHFqLCpSqiXwAtkSKbhWQZ9jaSo6rRzHfSiECkA@mail.gmail.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2022-02-25 16:32:03 +01:00
Marijn Suijten
3bdd6d5ad5 config: android-recommended: Disable BPF_UNPRIV_DEFAULT_OFF for netd
AOSP's `netd` process fails to start on Android S:

    E ClatdController: getClatEgress4MapFd() failure: Operation not permitted
    I netd    : Initializing ClatdController: 410us
    E netd    : Failed to start trafficcontroller: (Status[code: 1, msg: "Pinned map not accessible or does not exist: (/sys/fs/bpf/map_netd_cookie_tag_map): Operation not permitted"])
    E netd    : CRITICAL: sleeping 60 seconds, netd exiting with failure, crash loop likely!

And on Android R:

    I ClatdController: 4.9+ kernel and device shipped with P - clat ebpf might work.
    E ClatdController: getClatEgressMapFd() failure: Operation not permitted
    I netd    : Initializing ClatdController: 1409us
    E netd    : Failed to start trafficcontroller: (Status[code: 1, msg: "Pinned map not accessible or does not exist: (/sys/fs/bpf/map_netd_cookie_tag_map): Operation not permitted"])

These permission issues are caused by 08389d8882 ("bpf: Add kconfig
knob for disabling unpriv bpf by default") because AOSP does not provide
netd the `SYS_ADMIN` capability, and also has no userspace support for
the `BPF` capability yet.

Cc: Amit Pundir <amit.pundir@linaro.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Suggested-by: John Stultz <john.stultz@linaro.org>
[John suggested this in https://linaro.atlassian.net/browse/ACK-107?focusedCommentId=117382]
Signed-off-by: Marijn Suijten <marijn.suijten@somainline.org>
Link: https://lore.kernel.org/r/20220202100528.190794-2-marijn.suijten@somainline.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-25 12:10:27 +01:00
Marijn Suijten
4c4559b43c config: android-recommended: Don't explicitly disable CONFIG_AIO
Android nowadays (for a couple years already) requires AIO for at least
its `adb` "Android Debug Bridge" [1].  Without this config option
(`default y`) it simply refuses start, making users unable to connect to
their phone for debugging purposes when using these kernel fragments.

[1]: a2cb8de5e6

Cc: Amit Pundir <amit.pundir@linaro.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Marijn Suijten <marijn.suijten@somainline.org>
Link: https://lore.kernel.org/r/20220202100528.190794-1-marijn.suijten@somainline.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-25 12:10:27 +01:00
Arnd Bergmann
967747bbc0 uaccess: remove CONFIG_SET_FS
There are no remaining callers of set_fs(), so CONFIG_SET_FS
can be removed globally, along with the thread_info field and
any references to it.

This turns access_ok() into a cheaper check against TASK_SIZE_MAX.

As CONFIG_SET_FS is now gone, drop all remaining references to
set_fs()/get_fs(), mm_segment_t, user_addr_max() and uaccess_kernel().

Acked-by: Sam Ravnborg <sam@ravnborg.org> # for sparc32 changes
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Tested-by: Sergey Matyukevich <sergey.matyukevich@synopsys.com> # for arc changes
Acked-by: Stafford Horne <shorne@gmail.com> # [openrisc, asm-generic]
Acked-by: Dinh Nguyen <dinguyen@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2022-02-25 09:36:06 +01:00
Daniel Bristot de Oliveira
ce33c845b0 tracing: Dump stacktrace trigger to the corresponding instance
The stacktrace event trigger is not dumping the stacktrace to the instance
where it was enabled, but to the global "instance."

Use the private_data, pointing to the trigger file, to figure out the
corresponding trace instance, and use it in the trigger action, like
snapshot_trigger does.

Link: https://lkml.kernel.org/r/afbb0b4f18ba92c276865bc97204d438473f4ebc.1645396236.git.bristot@kernel.org

Cc: stable@vger.kernel.org
Fixes: ae63b31e4d ("tracing: Separate out trace events from global variables")
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Tested-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-02-24 22:07:03 -05:00
Jakub Kicinski
aaa25a2fa7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
tools/testing/selftests/net/mptcp/mptcp_join.sh
  34aa6e3bcc ("selftests: mptcp: add ip mptcp wrappers")

  857898eb4b ("selftests: mptcp: add missing join check")
  6ef84b1517 ("selftests: mptcp: more robust signal race test")
https://lore.kernel.org/all/20220221131842.468893-1-broonie@kernel.org/

drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/act.h
drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/ct.c
  fb7e76ea3f ("net/mlx5e: TC, Skip redundant ct clear actions")
  c63741b426 ("net/mlx5e: Fix MPLSoUDP encap to use MPLS action information")

  09bf979232 ("net/mlx5e: TC, Move pedit_headers_action to parse_attr")
  84ba8062e3 ("net/mlx5e: Test CT and SAMPLE on flow attr")
  efe6f961cd ("net/mlx5e: CT, Don't set flow flag CT for ct clear flow")
  3b49a7edec ("net/mlx5e: TC, Reject rules with multiple CT actions")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-24 17:54:25 -08:00
Linus Torvalds
f672ff9123 Networking fixes for 5.17-rc6, including fixes from bpf and netfilter.
Current release - regressions:
 
  - bpf: fix crash due to out of bounds access into reg2btf_ids
 
  - mvpp2: always set port pcs ops, avoid null-deref
 
  - eth: marvell: fix driver load from initrd
 
  - eth: intel: revert "Fix reset bw limit when DCB enabled with 1 TC"
 
 Current release - new code bugs:
 
  - mptcp: fix race in overlapping signal events
 
 Previous releases - regressions:
 
  - xen-netback: revert hotplug-status changes causing devices to
    not be configured
 
  - dsa:
    - avoid call to __dev_set_promiscuity() while rtnl_mutex isn't held
    - fix panic when removing unoffloaded port from bridge
 
  - dsa: microchip: fix bridging with more than two member ports
 
 Previous releases - always broken:
 
  - bpf:
   - fix crash due to incorrect copy_map_value when both spin lock
     and timer are present in a single value
   - fix a bpf_timer initialization issue with clang
   - do not try bpf_msg_push_data with len 0
   - add schedule points in batch ops
 
  - nf_tables:
    - unregister flowtable hooks on netns exit
    - correct flow offload action array size
    - fix a couple of memory leaks
 
  - vsock: don't check owner in vhost_vsock_stop() while releasing
 
  - gso: do not skip outer ip header in case of ipip and net_failover
 
  - smc: use a mutex for locking "struct smc_pnettable"
 
  - openvswitch: fix setting ipv6 fields causing hw csum failure
 
  - mptcp: fix race in incoming ADD_ADDR option processing
 
  - sysfs: add check for netdevice being present to speed_show
 
  - sched: act_ct: fix flow table lookup after ct clear or switching
    zones
 
  - eth: intel: fixes for SR-IOV forwarding offloads
 
  - eth: broadcom: fixes for selftests and error recovery
 
  - eth: mellanox: flow steering and SR-IOV forwarding fixes
 
 Misc:
 
  - make __pskb_pull_tail() & pskb_carve_frag_list() drop_monitor
    friends not report freed skbs as drops
 
  - force inlining of checksum functions in net/checksum.h
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmIX2ssACgkQMUZtbf5S
 IrvImQ//b+JILp0M/jz6q25n5U7qxuNmJypq659kR19jnwGH520XTwnFE9/FB3gw
 UnlCb28+jdMX1HHQJaUKkKYTilfFvyMoRPAMbLFO51Y02dVALTjD7C2wJ1AyEiTV
 eKhOcGHLbDzLom3+FnK566adOlGsIZfr4bR4zlGcthU0wTvU6S2K3WTkVJMASJzJ
 JizNgN+SvpdpmnYj+wsg2cj/5W4R/IPdxCrkZMkEMomJnVxA61RV+wsCcsT+Cjrf
 wu+cknUiVIGQNtCT4hz8VZ3tOoAeX+Xg/4YbaxVxnvunTQh+D+eIza40IEqewlEq
 KFOXGuPXsse6ZJ7IqVZt1hgBxJ8bpItxEBNSgU3KqJKMTTKOpWWjZxkTYeIERMry
 Ywb/ciZ7pwbo2CNhICh6+xefQvGbU0jgsiMgSkQvXZ9b9IsdPM4bwgvjFsyqnEMz
 0HVpqN02F7MM44mD4P0TQct9OSemu6sVqQFrpk8+CvPfaSEctCv/iJ6WR/xxUgSp
 uPvKYlv7BqOKZtqzGOk215WEvTUf8dy9cxcQwoYBOBxs8h2XQSRXEWCsGWCOg5+V
 xLnlnreXHXKWcUrAmsJlZh6XmWGk9lBDqLX7hKCYZzMgU8nNopSDKKcDpVDkaBzC
 DrK8Y3y+lBhpBwCHt/GZw8Qg9aDDsczFpOfPZBVJy+jH+7AGK7M=
 =LT/x
 -----END PGP SIGNATURE-----

Merge tag 'net-5.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf and netfilter.

  Current release - regressions:

   - bpf: fix crash due to out of bounds access into reg2btf_ids

   - mvpp2: always set port pcs ops, avoid null-deref

   - eth: marvell: fix driver load from initrd

   - eth: intel: revert "Fix reset bw limit when DCB enabled with 1 TC"

  Current release - new code bugs:

   - mptcp: fix race in overlapping signal events

  Previous releases - regressions:

   - xen-netback: revert hotplug-status changes causing devices to not
     be configured

   - dsa:
      - avoid call to __dev_set_promiscuity() while rtnl_mutex isn't
        held
      - fix panic when removing unoffloaded port from bridge

   - dsa: microchip: fix bridging with more than two member ports

  Previous releases - always broken:

   - bpf:
      - fix crash due to incorrect copy_map_value when both spin lock
        and timer are present in a single value
      - fix a bpf_timer initialization issue with clang
      - do not try bpf_msg_push_data with len 0
      - add schedule points in batch ops

   - nf_tables:
      - unregister flowtable hooks on netns exit
      - correct flow offload action array size
      - fix a couple of memory leaks

   - vsock: don't check owner in vhost_vsock_stop() while releasing

   - gso: do not skip outer ip header in case of ipip and net_failover

   - smc: use a mutex for locking "struct smc_pnettable"

   - openvswitch: fix setting ipv6 fields causing hw csum failure

   - mptcp: fix race in incoming ADD_ADDR option processing

   - sysfs: add check for netdevice being present to speed_show

   - sched: act_ct: fix flow table lookup after ct clear or switching
     zones

   - eth: intel: fixes for SR-IOV forwarding offloads

   - eth: broadcom: fixes for selftests and error recovery

   - eth: mellanox: flow steering and SR-IOV forwarding fixes

  Misc:

   - make __pskb_pull_tail() & pskb_carve_frag_list() drop_monitor
     friends not report freed skbs as drops

   - force inlining of checksum functions in net/checksum.h"

* tag 'net-5.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (85 commits)
  net: mv643xx_eth: process retval from of_get_mac_address
  ping: remove pr_err from ping_lookup
  Revert "i40e: Fix reset bw limit when DCB enabled with 1 TC"
  openvswitch: Fix setting ipv6 fields causing hw csum failure
  ipv6: prevent a possible race condition with lifetimes
  net/smc: Use a mutex for locking "struct smc_pnettable"
  bnx2x: fix driver load from initrd
  Revert "xen-netback: Check for hotplug-status existence before watching"
  Revert "xen-netback: remove 'hotplug-status' once it has served its purpose"
  net/mlx5e: Fix VF min/max rate parameters interchange mistake
  net/mlx5e: Add missing increment of count
  net/mlx5e: MPLSoUDP decap, fix check for unsupported matches
  net/mlx5e: Fix MPLSoUDP encap to use MPLS action information
  net/mlx5e: Add feature check for set fec counters
  net/mlx5e: TC, Skip redundant ct clear actions
  net/mlx5e: TC, Reject rules with forward and drop actions
  net/mlx5e: TC, Reject rules with drop and modify hdr action
  net/mlx5e: kTLS, Use CHECKSUM_UNNECESSARY for device-offloaded packets
  net/mlx5e: Fix wrong return value on ioctl EEPROM query failure
  net/mlx5: Fix possible deadlock on rule deletion
  ...
2022-02-24 12:45:32 -08:00
Paul E. McKenney
d5578190be Merge branches 'exp.2022.02.24a', 'fixes.2022.02.14a', 'rcu_barrier.2022.02.08a', 'rcu-tasks.2022.02.08a', 'rt.2022.02.01b', 'torture.2022.02.01b' and 'torturescript.2022.02.08a' into HEAD
exp.2022.02.24a: Expedited grace-period updates.
fixes.2022.02.14a: Miscellaneous fixes.
rcu_barrier.2022.02.08a: Make rcu_barrier() no longer exclude CPU hotplug.
rcu-tasks.2022.02.08a: RCU-tasks updates.
rt.2022.02.01b: Real-time-related updates.
torture.2022.02.01b: Torture-test updates.
torturescript.2022.02.08a: Torture-test scripting updates.
2022-02-24 09:38:46 -08:00
Steven Rostedt (Google)
9f8e5aee93 tracing: Fix allocation of last_cmd in last_cmd_set()
The strncat() used in last_cmd_set() includes the nul byte of length of
the string being copied in, when it should only hold the size of the
string being copied (not the nul byte). Change it to subtract the length
of the allocated space and the nul byte to pass that into the strncat().

Also, assign "len" instead of initializing it to zero and its first update
is to do a "+=".

Link: https://lore.kernel.org/all/202202140628.fj6e4w4v-lkp@intel.com/

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-02-23 23:18:54 -05:00
Tom Rix
c561d11063 bpf: Cleanup comments
Add leading space to spdx tag
Use // for spdx c file comment

Replacements
resereved to reserved
inbetween to in between
everytime to every time
intutivie to intuitive
currenct to current
encontered to encountered
referenceing to referencing
upto to up to
exectuted to executed

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20220220184055.3608317-1-trix@redhat.com
2022-02-23 15:17:51 -08:00
Greg Kroah-Hartman
f2eb478f2f kernfs: move struct kernfs_root out of the public view.
There is no need to have struct kernfs_root be part of kernfs.h for
the whole kernel to see and poke around it.  Move it internal to kernfs
code and provide a helper function, kernfs_root_to_node(), to handle the
one field that kernfs users were directly accessing from the structure.

Cc: Imran Khan <imran.f.khan@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220222070713.3517679-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-23 15:46:34 +01:00
Ingo Molnar
4ff8f2ca6c sched/headers: Reorganize, clean up and optimize kernel/sched/sched.h dependencies
Remove all headers, except the ones required to make this header
build standalone.

Also include stats.h in sched.h explicitly - dependencies already
require this.

Summary of the build speedup gained through the last ~15 scheduler build &
header dependency patches:

Cumulative scheduler (kernel/sched/) build time speedup on a
Linux distribution's config, which enables all scheduler features,
compared to the vanilla kernel:

  _____________________________________________________________________________
 |
 |  Vanilla kernel (v5.13-rc7):
 |_____________________________________________________________________________
 |
 |  Performance counter stats for 'make -j96 kernel/sched/' (3 runs):
 |
 |   126,975,564,374      instructions              #    1.45  insn per cycle           ( +-  0.00% )
 |    87,637,847,671      cycles                    #    3.959 GHz                      ( +-  0.30% )
 |         22,136.96 msec cpu-clock                 #    7.499 CPUs utilized            ( +-  0.29% )
 |
 |            2.9520 +- 0.0169 seconds time elapsed  ( +-  0.57% )
 |_____________________________________________________________________________
 |
 |  Patched kernel:
 |_____________________________________________________________________________
 |
 | Performance counter stats for 'make -j96 kernel/sched/' (3 runs):
 |
 |    50,420,496,914      instructions              #    1.47  insn per cycle           ( +-  0.00% )
 |    34,234,322,038      cycles                    #    3.946 GHz                      ( +-  0.31% )
 |          8,675.81 msec cpu-clock                 #    3.053 CPUs utilized            ( +-  0.45% )
 |
 |            2.8420 +- 0.0181 seconds time elapsed  ( +-  0.64% )
 |_____________________________________________________________________________

Summary:

  - CPU time used to build the scheduler dropped by -60.9%, a reduction
    from 22.1 clock-seconds to 8.7 clock-seconds.

  - Wall-clock time to build the scheduler dropped by -3.9%, a reduction
    from 2.95 seconds to 2.84 seconds.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
2022-02-23 10:58:34 +01:00
Ingo Molnar
e81daa7b64 sched/headers: Reorganize, clean up and optimize kernel/sched/build_utility.c dependencies
Use all generic headers from kernel/sched/sched.h that are required
for it to build.

Sort the sections alphabetically.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
2022-02-23 10:58:34 +01:00
Ingo Molnar
0dda4eeb48 sched/headers: Reorganize, clean up and optimize kernel/sched/build_policy.c dependencies
Use all generic headers from kernel/sched/sched.h that are required
for it to build.

Sort the sections alphabetically.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
2022-02-23 10:58:33 +01:00
Ingo Molnar
c4ad6fcb67 sched/headers: Reorganize, clean up and optimize kernel/sched/fair.c dependencies
Use all generic headers from kernel/sched/sched.h that are required
for it to build.

Sort the sections alphabetically.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
2022-02-23 10:58:33 +01:00
Ingo Molnar
e66f6481a8 sched/headers: Reorganize, clean up and optimize kernel/sched/core.c dependencies
Use all generic headers from kernel/sched/sched.h that are required
for it to build.

Sort the sections alphabetically.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
2022-02-23 10:58:33 +01:00
Ingo Molnar
b9e9c6ca6e sched/headers: Standardize kernel/sched/sched.h header dependencies
kernel/sched/sched.h is a weird mix of ad-hoc headers included
in the middle of the header.

Two of them rely on being included in the middle of kernel/sched/sched.h,
due to definitions they require:

 - "stat.h" needs the rq definitions.
 - "autogroup.h" needs the task_group definition.

Move the inclusion of these two files out of kernel/sched/sched.h, and
include them in all files that require them.

Move of the rest of the header dependencies to the top of the
kernel/sched/sched.h file.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
2022-02-23 10:58:33 +01:00
Ingo Molnar
f96eca4320 sched/headers: Introduce kernel/sched/build_policy.c and build multiple .c files there
Similarly to kernel/sched/build_utility.c, collect all 'scheduling policy' related
source code files into kernel/sched/build_policy.c:

    kernel/sched/idle.c

    kernel/sched/rt.c

    kernel/sched/cpudeadline.c
    kernel/sched/pelt.c

    kernel/sched/cputime.c
    kernel/sched/deadline.c

With the exception of fair.c, which we continue to build as a separate file
for build efficiency and parallelism reasons.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
2022-02-23 10:58:33 +01:00
Ingo Molnar
801c141955 sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there
Collect all utility functionality source code files into a single kernel/sched/build_utility.c file,
via #include-ing the .c files:

    kernel/sched/clock.c
    kernel/sched/completion.c
    kernel/sched/loadavg.c
    kernel/sched/swait.c
    kernel/sched/wait_bit.c
    kernel/sched/wait.c

CONFIG_CPU_FREQ:
    kernel/sched/cpufreq.c

CONFIG_CPU_FREQ_GOV_SCHEDUTIL:
    kernel/sched/cpufreq_schedutil.c

CONFIG_CGROUP_CPUACCT:
    kernel/sched/cpuacct.c

CONFIG_SCHED_DEBUG:
    kernel/sched/debug.c

CONFIG_SCHEDSTATS:
    kernel/sched/stats.c

CONFIG_SMP:
   kernel/sched/cpupri.c
   kernel/sched/stop_task.c
   kernel/sched/topology.c

CONFIG_SCHED_CORE:
   kernel/sched/core_sched.c

CONFIG_PSI:
   kernel/sched/psi.c

CONFIG_MEMBARRIER:
   kernel/sched/membarrier.c

CONFIG_CPU_ISOLATION:
   kernel/sched/isolation.c

CONFIG_SCHED_AUTOGROUP:
   kernel/sched/autogroup.c

The goal is to amortize the 60+ KLOC header bloat from over a dozen build units into
a single build unit.

The build time of build_utility.c also roughly matches the build time of core.c and
fair.c - allowing better load-balancing of scheduler-only rebuilds.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
2022-02-23 10:58:33 +01:00
Ingo Molnar
81de6572fe sched/headers: Fix comment typo in kernel/sched/cpudeadline.c
File name changed.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
2022-02-23 10:58:33 +01:00
Ingo Molnar
fa28abed7a sched/headers: sched/clock: Mark all functions 'notrace', remove CC_FLAGS_FTRACE build asymmetry
Mark all non-init functions in kernel/sched.c as 'notrace', instead of
turning them all off via CC_FLAGS_FTRACE.

This is going to allow the treatment of this file as any other scheduler
file, and it can be #include-ed in compound compilation units as well.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
2022-02-23 08:22:04 +01:00
Ingo Molnar
d90a2f160a sched/headers: Add header guard to kernel/sched/stats.h and kernel/sched/autogroup.h
Protect against multiple inclusion.

Also include "sched.h" in "stat.h", as it relies on it.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
2022-02-23 08:22:00 +01:00
Ingo Molnar
95458477f5 sched/headers: Add header guard to kernel/sched/sched.h
Use the canonical header guard naming of the full path to the header.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
2022-02-23 08:21:56 +01:00
Christoph Hellwig
73bd66d9c8 scsi: block: Remove REQ_OP_WRITE_SAME support
No more users of REQ_OP_WRITE_SAME or drivers implementing it are left,
so remove the infrastructure.

[mkp: fold in and tweak sysfs reporting fix]

Link: https://lore.kernel.org/r/20220209082828.2629273-8-hch@lst.de
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-02-22 21:11:08 -05:00
Linus Torvalds
5c1ee56966 Merge branch 'for-5.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:

 - Fix for a subtle bug in the recent release_agent permission check
   update

 - Fix for a long-standing race condition between cpuset and cpu hotplug

 - Comment updates

* 'for-5.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cpuset: Fix kernel-doc
  cgroup-v1: Correct privileges check in release_agent writes
  cgroup: clarify cgroup_css_set_fork()
  cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug
2022-02-22 16:14:35 -08:00
Sebastian Andrzej Siewior
0ce055f853 fork: Use IS_ENABLED() in account_kernel_stack()
Not strickly needed but checking CONFIG_VMAP_STACK instead of
task_stack_vm_area()' result allows the compiler the remove the else
path in the CONFIG_VMAP_STACK case where the pointer can't be NULL.

Check for CONFIG_VMAP_STACK in order to use the proper path.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20220217102406.3697941-9-bigeasy@linutronix.de
2022-02-22 22:25:02 +01:00
Sebastian Andrzej Siewior
e540bf3162 fork: Only cache the VMAP stack in finish_task_switch()
The task stack could be deallocated later, but for fork()/exec() kind of
workloads (say a shell script executing several commands) it is important
that the stack is released in finish_task_switch() so that in VMAP_STACK
case it can be cached and reused in the new task.

For PREEMPT_RT it would be good if the wake-up in vfree_atomic() could
be avoided in the scheduling path. Far worse are the other
free_thread_stack() implementations which invoke __free_pages()/
kmem_cache_free() with disabled preemption.

Cache the stack in free_thread_stack() in the VMAP_STACK case and
RCU-delay the free path otherwise. Free the stack in the RCU callback.
In the VMAP_STACK case this is another opportunity to fill the cache.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20220217102406.3697941-8-bigeasy@linutronix.de
2022-02-22 22:25:02 +01:00
Sebastian Andrzej Siewior
1a03d3f13f fork: Move task stack accounting to do_exit()
There is no need to perform the stack accounting of the outgoing task in
its final schedule() invocation which happens with preemption disabled.
The task is leaving, the resources will be freed and the accounting can
happen in do_exit() before the actual schedule invocation which
frees the stack memory.

Move the accounting of the stack memory from release_task_stack() to
exit_task_stack_account() which then can be invoked from do_exit().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20220217102406.3697941-7-bigeasy@linutronix.de
2022-02-22 22:25:02 +01:00
Sebastian Andrzej Siewior
f1c1a9ee00 fork: Move memcg_charge_kernel_stack() into CONFIG_VMAP_STACK
memcg_charge_kernel_stack() is only used in the CONFIG_VMAP_STACK case.

Move memcg_charge_kernel_stack() into the CONFIG_VMAP_STACK block and
invoke it from within alloc_thread_stack_node().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20220217102406.3697941-6-bigeasy@linutronix.de
2022-02-22 22:25:01 +01:00
Sebastian Andrzej Siewior
7865aba3ad fork: Don't assign the stack pointer in dup_task_struct()
All four versions of alloc_thread_stack_node() assign now
task_struct::stack in case the allocation was successful.

Let alloc_thread_stack_node() return an error code instead of the stack
pointer and remove the stack assignment in dup_task_struct().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20220217102406.3697941-5-bigeasy@linutronix.de
2022-02-22 22:25:01 +01:00
Sebastian Andrzej Siewior
2bb0529c0b fork, IA64: Provide alloc_thread_stack_node() for IA64
Provide a generic alloc_thread_stack_node() for IA64 and
CONFIG_ARCH_THREAD_STACK_ALLOCATOR which returns stack pointer and sets
task_struct::stack so it behaves exactly like the other implementations.

Rename IA64's alloc_thread_stack_node() and add the generic version to the
fork code so it is in one place _and_ to drastically lower the chances of
fat fingering the IA64 code.  Do the same for free_thread_stack().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20220217102406.3697941-4-bigeasy@linutronix.de
2022-02-22 22:25:01 +01:00
Sebastian Andrzej Siewior
546c42b2c5 fork: Duplicate task_struct before stack allocation
alloc_thread_stack_node() already populates the task_struct::stack
member except on IA64. The stack pointer is saved and populated again
because IA64 needs it and arch_dup_task_struct() overwrites it.

Allocate thread's stack after task_struct has been duplicated as a
preparation for further changes.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20220217102406.3697941-3-bigeasy@linutronix.de
2022-02-22 22:25:01 +01:00
Sebastian Andrzej Siewior
be9a2277ca fork: Redo ifdefs around task stack handling
The use of ifdef CONFIG_VMAP_STACK is confusing in terms what is
actually happenning and what can happen.

For instance from reading free_thread_stack() it appears that in the
CONFIG_VMAP_STACK case it may receive a non-NULL vm pointer but it may also
be NULL in which case __free_pages() is used to free the stack.  This is
however not the case because in the VMAP case a non-NULL pointer is always
returned here.  Since it looks like this might happen, the compiler creates
the correct dead code with the invocation to __free_pages() and everything
around it. Twice.

Add spaces between the ifdef and the identifer to recognize the ifdef
level which is currently in scope.

Add the current identifer as a comment behind #else and #endif.
Move the code within free_thread_stack() and alloc_thread_stack_node()
into the relevant ifdef blocks.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20220217102406.3697941-2-bigeasy@linutronix.de
2022-02-22 22:25:01 +01:00
Jiapeng Chong
c70cd039f1 cpuset: Fix kernel-doc
Fix the following W=1 kernel warnings:

kernel/cgroup/cpuset.c:3718: warning: expecting prototype for
cpuset_memory_pressure_bump(). Prototype was for
__cpuset_memory_pressure_bump() instead.

kernel/cgroup/cpuset.c:3568: warning: expecting prototype for
cpuset_node_allowed(). Prototype was for __cpuset_node_allowed()
instead.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-02-22 09:49:49 -10:00
Richard Guy Briggs
272ceeaea3 audit: log AUDIT_TIME_* records only from rules
AUDIT_TIME_* events are generated when there are syscall rules present
that are not related to time keeping.  This will produce noisy log
entries that could flood the logs and hide events we really care about.

Rather than immediately produce the AUDIT_TIME_* records, store the data
in the context and log it at syscall exit time respecting the filter
rules.

Note: This eats the audit_buffer, unlike any others in show_special().

Please see https://bugzilla.redhat.com/show_bug.cgi?id=1991919

Fixes: 7e8eda734d ("ntp: Audit NTP parameters adjustment")
Fixes: 2d87a0674b ("timekeeping: Audit clock adjustments")
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
[PM: fixed style/whitespace issues]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2022-02-22 13:51:40 -05:00
Michal Koutný
467a726b75 cgroup-v1: Correct privileges check in release_agent writes
The idea is to check: a) the owning user_ns of cgroup_ns, b)
capabilities in init_user_ns.

The commit 24f6008564 ("cgroup-v1: Require capabilities to set
release_agent") got this wrong in the write handler of release_agent
since it checked user_ns of the opener (may be different from the owning
user_ns of cgroup_ns).
Secondly, to avoid possibly confused deputy, the capability of the
opener must be checked.

Fixes: 24f6008564 ("cgroup-v1: Require capabilities to set release_agent")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/stable/20220216121142.GB30035@blackbody.suse.cz/
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Masami Ichikawa(CIP) <masami.ichikawa@cybertrust.co.jp>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-02-22 08:12:22 -10:00
Christian Brauner
6d3971dab2 cgroup: clarify cgroup_css_set_fork()
With recent fixes for the permission checking when moving a task into a cgroup
using a file descriptor to a cgroup's cgroup.procs file and calling write() it
seems a good idea to clarify CLONE_INTO_CGROUP permission checking with a
comment.

Cc: Tejun Heo <tj@kernel.org>
Cc: <cgroups@vger.kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-02-22 07:35:19 -10:00
Jason A. Donenfeld
3191dd5a11 random: clear fast pool, crng, and batches in cpuhp bring up
For the irq randomness fast pool, rather than having to use expensive
atomics, which were visibly the most expensive thing in the entire irq
handler, simply take care of the extreme edge case of resetting count to
zero in the cpuhp online handler, just after workqueues have been
reenabled. This simplifies the code a bit and lets us use vanilla
variables rather than atomics, and performance should be improved.

As well, very early on when the CPU comes up, while interrupts are still
disabled, we clear out the per-cpu crng and its batches, so that it
always starts with fresh randomness.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Sultan Alsawaf <sultan@kerneltoast.com>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-02-21 21:14:21 +01:00
Jiapeng Chong
ce06e863f3 printk: make suppress_panic_printk static
This symbol is not used outside of printk.c, so marks it static.

Fix the following sparse warning:

kernel/printk/printk.c💯19: warning: symbol 'suppress_panic_printk'
was not declared. Should it be static?

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Miguel Ojeda <ojeda@kernel.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220216031957.9761-1-jiapeng.chong@linux.alibaba.com
2022-02-21 16:55:07 +01:00
Andre Kalb
a5a763b2b2 printk: Set console_set_on_cmdline=1 when __add_preferred_console() is called with user_specified == true
In case of using console="" or console=null
set console_set_on_cmdline=1 to disable "stdout-path" node from DT.

We basically need to set it every time when __add_preferred_console()
is called with parameter 'user_specified' set.
Therefore we can move setting it into a helper function that is
called from __add_preferred_console().

Suggested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Andre Kalb <andre.kalb@sma.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/YgzU4ho8l6XapyG2@pc6682
2022-02-21 15:52:08 +01:00
Ingo Molnar
6255b48aeb Linux 5.17-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmISrYgeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGg20IAKDZr7rfSHBopjQV
 Cocw744tom0XuxpvSZpp2GGOOXF+tkswcNNaRIrbGOl1mkyxA7eBZCTMpDeDS9aQ
 wB0D0Gxx8QBAJp4KgB1W7TB+hIGes/rs8Ve+6iO4ulLLdCVWX/q2boI0aZ7QX9O9
 qNi8OsoZQtk6falRvciZFHwV5Av1p2Sy1AW57udQ7DvJ4H98AfKf1u8/z208WWW8
 1ixC+qJxQcUcM9vI+7P9Tt7NbFSKv8SvAmqjFY7P+DxQAsVw6KXoqVXykDzeOv0t
 fUNOE/t0oFZafwtn8h7KBQnwS9lH03+3KkslVZs+iMFyUj/Bar+NVVyKoDhWXtVg
 /PuMhEg=
 =eU1o
 -----END PGP SIGNATURE-----

Merge tag 'v5.17-rc5' into sched/core, to resolve conflicts

New conflicts in sched/core due to the following upstream fixes:

  44585f7bc0 ("psi: fix "defined but not used" warnings when CONFIG_PROC_FS=n")
  a06247c680 ("psi: Fix uaf issue when psi trigger is destroyed while being polled")

Conflicts:
	include/linux/psi_types.h
	kernel/sched/psi.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-02-21 11:53:51 +01:00
Thomas Gleixner
d2206fcabd Provide a tag for maintainers to pull the generic_handle_irq_safe() API.
-----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmITansTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoQObEACuqO4QewDX0s37vfKkB1Mq+FSQRCm0
 jm/D9wawJwRdvoMWIuo9jGKPzc6qfRTjO6DO3WZ1yCQyiXDhLpWgGT3pHzgtjkY2
 zP2OZond6ozXlIdHLV5RqRDoXjoz3G4Ttxjsua0Z21KwRJusgTE8Q7DZNF2URvlL
 kAyfwd2R4U6Zpzouid5ezYDphNFAqU2XZlI1Kz9YYPzo8uBQk9ADqqeJznV7JxE6
 P/ANkTBCZRFBEDawcNIV9Wy76nTWYWhilnbmPoUHcIgoULRGlzcMGTUnPzxo1fkD
 P3NH8ekzoXxFkPJl6SGvR3JOAHBJoxW+mebquohuwrL6i4x960Gp2ZrTu3BFQ194
 HEXyIN3XYJaghoz4y1u4Lnoq0rfx4OupxNUJ36Bnv7+XVhqVFXnchWGSHpUdEc9K
 eGCUlsD9f+vg8iKTCWWboAjlR8lXlARiwJZ4lQfQWAvTN5ibU43TokAQyNbb8PtW
 W6mTEDPmvot2e109bP5b7NePQxmdv8dMNHcGyWsBx8sUlRiCUOouBhHp9OKhVVLz
 eBQwA3OZJLw6M4E/YTz3L9yylGFt239jdcJUugieCuXJnyoNu0NelqSkvWocdE7M
 PvtCNlNrK1j2xk6kykX/xI14Ma/DzjhSGXGOYoqwcz1MydbowKAAqRR36+kPOb5q
 nZt2j4+pQGPAvQ==
 =UYWR
 -----END PGP SIGNATURE-----

Merge tag 'irq-api-2022-02-21' into irq/core

Merge the generic_handle_irq_safe() API back into irq/core.
2022-02-21 11:33:57 +01:00
Sebastian Andrzej Siewior
509853f9e1 genirq: Provide generic_handle_irq_safe()
Provide generic_handle_irq_safe() which can used from any context.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Reviewed-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Reviewed-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Link: https://lore.kernel.org/r/20220211181500.1856198-2-bigeasy@linutronix.de
2022-02-21 11:31:06 +01:00
Josh Poimboeuf
44a3918c82 x86/speculation: Include unprivileged eBPF status in Spectre v2 mitigation reporting
With unprivileged eBPF enabled, eIBRS (without retpoline) is vulnerable
to Spectre v2 BHB-based attacks.

When both are enabled, print a warning message and report it in the
'spectre_v2' sysfs vulnerabilities file.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2022-02-21 10:21:47 +01:00
Linus Torvalds
3324e6e803 - Fix a NULL ptr dereference when dumping lockdep chains through /proc/lockdep_chains
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmISMycACgkQEsHwGGHe
 VUrZ7w/+M/Fs4k2lkZhknLTIstg01cjI1Za+K1wQI4nPp0zRWjeR28rGQ5uEGfM7
 LgfdKGFh/SfPBnjasSF2yoejW+V1K4G/NGMD3xtzPAG8D8G44KsyzCXcbBcF6LyT
 dZTtexWqxQQ/A+ispBRK4eYpO4q7Z0ir+Wiwgt1Svlwq/Is3FT3H1bOyWeUpLZZY
 9p0liUmdX9JJyN6gmVWh9UfBwHdTi+ThUMMtFjvV/dOqgBN0VNDLQoPedXMMjcp/
 essLYswz26MVUF2mdnJ5K1a0FugKd/HWYDAaUhd9LLHFWQKR2/bUMPpy8/Zps4LA
 i8uevabCs5dX2eyfKxm9D1OMm8f3pXpzNuzZIW1txIR4+ofoLR7rPY9jCMsLhRAO
 F8YVrU815pp8ko2j2Hp2Whq4p3hNd0kprWe5hRiZXV3VnxEnFjSlBozJyfNohLov
 +EyYztPwVRHTIOVIuy0c448a5PYFo4zi4jsEw49gdYN4Jd7LvTGzNc7rY1Swb1Xy
 s9rEBbr+rVdOWCkAOQyjEc6zaTI7WTEjtHCs1udsq1aQ/5rV7mMGBAPljZCBZVn3
 gFTjISxXw8EAMNJ/c0j9++X3MdSa9W1ZKncKItUdmoKITdGzzDul6SWZSk/H8RxI
 KaFK/k22ebiVdUIyr3qa2n8cC59UyMz3kWFFGUUqb+AkMRm9pjQ=
 =Gc6E
 -----END PGP SIGNATURE-----

Merge tag 'locking_urgent_for_v5.17_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fix from Borislav Petkov:
 "Fix a NULL ptr dereference when dumping lockdep chains through
  /proc/lockdep_chains"

* tag 'locking_urgent_for_v5.17_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  lockdep: Correct lock_classes index mapping
2022-02-20 12:50:50 -08:00
Linus Torvalds
0b0894ff78 - Fix task exposure order when forking tasks
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmISLs0ACgkQEsHwGGHe
 VUrq+g/+NxLVl3QTmteNh4jEzA/3FDccRKbALNs09YX116nCOZ1sv1Z5FgtdOLUT
 ukIujAm1SKDXQrXoglmm3eH6ylWZ54DQo6qrP+vzavtIKSgbt+vPS63qDR4ehF8N
 Kh04qssWy+AzOSeZRQ6lXTiGei2nBKFbiwlwPnV5vbQmLBFvzv2dEErVFYkHwWN6
 foygRUOhslv5aizwpDriAvDq9ZTVUPXqzIi5zWnTON8y8Vy32eZjSJKez8SQH57w
 vbEZkJTw33tCzG/5+J5mh3UmP9Mcj34W/GDCKHxjO1Y38SbLTMy2Agl5hbRKjVZ8
 W4wElyXyXtLx1RsrZpOJ90mhqwADcxLgkq4ipl6uzikeQlQVnOOBsHZOi3NkTorg
 1YhpxZhqzWWotdGUQwzfZYSuoqQAyUxeqZKwqUD+dYsEqQWhTWSa41IWcfz+r8UD
 uD+pO03EbAEzBB+BSJL9xh0xnchdOnHbpTSFo1AJeQ+LUKanXxO2tXk9/3SwliMj
 M751vPtQH6DiVukSfywxJym+aaLbjeju6RvA69Atap51roYXPSsaCHGDwLeNUS+w
 5D6KC0IuCxsliLpAD4c4PXkRdTvwnG/0lUt+s83bGnyfh5nQzfXXbBp3I5+o0Fcq
 jIQ37uH2Mj5MKtmC8o4ekCtNTtWM8QNZvKzMzoju616Wqee0Ag4=
 =RagE
 -----END PGP SIGNATURE-----

Merge tag 'sched_urgent_for_v5.17_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Borislav Petkov:
 "Fix task exposure order when forking tasks"

* tag 'sched_urgent_for_v5.17_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix yet more sched_fork() races
2022-02-20 12:40:20 -08:00
Linus Torvalds
c1034d249d pidfd.v5.17-rc4
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYhDKMgAKCRCRxhvAZXjc
 opUwAPwORv7MD8rh5va7LFWUxX1UFpxVILWcC1umuhHAOKZ7YAEA3DTYFrQZhxA2
 nMBR6hBEDKRARIFv3zHZYflYK97FnQA=
 =NPXo
 -----END PGP SIGNATURE-----

Merge tag 'pidfd.v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull pidfd fix from Christian Brauner:
 "This fixes a problem reported by lockdep when installing a pidfd via
  fd_install() with siglock and the tasklisk write lock held in
  copy_process() when calling clone()/clone3() with CLONE_PIDFD.

  Originally a pidfd was created prior to holding any of these locks but
  this required a call to ksys_close(). So quite some time ago in
  6fd2fe494b ("copy_process(): don't use ksys_close() on cleanups") we
  switched to a get_unused_fd_flags() + fd_install() model.

  As part of that we moved fd_install() as late as possible. This was
  done for two main reasons. First, because we needed to ensure that we
  call fd_install() past the point of no return as once that's called
  the fd is live in the task's file table. Second, because we tried to
  ensure that the fd is visible in /proc/<pid>/fd/<pidfd> right when the
  task is visible.

  This fix moves the fd_install() to an even later point which means
  that a task will be visible in proc while the pidfd isn't yet under
  /proc/<pid>/fd/<pidfd>.

  While this is a user visible change it's very unlikely that this will
  have any impact. Nobody should be relying on that and if they do we
  need to come up with something better but again, it's doubtful this is
  relevant"

* tag 'pidfd.v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  copy_process(): Move fd_install() out of sighand->siglock critical section
2022-02-20 10:55:05 -08:00
Linus Torvalds
2d3409ebc8 Merge branch 'ucount-rlimit-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull ucounts fixes from Eric Biederman:
 "Michal Koutný recently found some bugs in the enforcement of
  RLIMIT_NPROC in the recent ucount rlimit implementation.

  In this set of patches I have developed a very conservative approach
  changing only what is necessary to fix the bugs that I can see
  clearly. Cleanups and anything that is making the code more consistent
  can follow after we have the code working as it has historically.

  The problem is not so much inconsistencies (although those exist) but
  that it is very difficult to figure out what the code should be doing
  in the case of RLIMIT_NPROC.

  All other rlimits are only enforced where the resource is acquired
  (allocated). RLIMIT_NPROC by necessity needs to be enforced in an
  additional location, and our current implementation stumbled it's way
  into that implementation"

* 'ucount-rlimit-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  ucounts: Handle wrapping in is_ucounts_overlimit
  ucounts: Move RLIMIT_NPROC handling after set_user
  ucounts: Base set_cred_ucounts changes on the real user
  ucounts: Enforce RLIMIT_NPROC not RLIMIT_NPROC+1
  rlimit: Fix RLIMIT_NPROC enforcement failure caused by capability calls in set_user
2022-02-20 10:44:11 -08:00
Souptick Joarder (HPE)
d0b3822902 bpf: Initialize ret to 0 inside btf_populate_kfunc_set()
Kernel test robot reported below error ->

kernel/bpf/btf.c:6718 btf_populate_kfunc_set()
error: uninitialized symbol 'ret'.

Initialize ret to 0.

Fixes: dee872e124 ("bpf: Populate kfunc BTF ID sets in struct btf")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Souptick Joarder (HPE) <jrdr.linux@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20220219163915.125770-1-jrdr.linux@gmail.com
2022-02-19 17:26:52 -08:00
Mark Rutland
99cf983cc8 sched/preempt: Add PREEMPT_DYNAMIC using static keys
Where an architecture selects HAVE_STATIC_CALL but not
HAVE_STATIC_CALL_INLINE, each static call has an out-of-line trampoline
which will either branch to a callee or return to the caller.

On such architectures, a number of constraints can conspire to make
those trampolines more complicated and potentially less useful than we'd
like. For example:

* Hardware and software control flow integrity schemes can require the
  addition of "landing pad" instructions (e.g. `BTI` for arm64), which
  will also be present at the "real" callee.

* Limited branch ranges can require that trampolines generate or load an
  address into a register and perform an indirect branch (or at least
  have a slow path that does so). This loses some of the benefits of
  having a direct branch.

* Interaction with SW CFI schemes can be complicated and fragile, e.g.
  requiring that we can recognise idiomatic codegen and remove
  indirections understand, at least until clang proves more helpful
  mechanisms for dealing with this.

For PREEMPT_DYNAMIC, we don't need the full power of static calls, as we
really only need to enable/disable specific preemption functions. We can
achieve the same effect without a number of the pain points above by
using static keys to fold early returns into the preemption functions
themselves rather than in an out-of-line trampoline, effectively
inlining the trampoline into the start of the function.

For arm64, this results in good code generation. For example, the
dynamic_cond_resched() wrapper looks as follows when enabled. When
disabled, the first `B` is replaced with a `NOP`, resulting in an early
return.

| <dynamic_cond_resched>:
|        bti     c
|        b       <dynamic_cond_resched+0x10>     // or `nop`
|        mov     w0, #0x0
|        ret
|        mrs     x0, sp_el0
|        ldr     x0, [x0, #8]
|        cbnz    x0, <dynamic_cond_resched+0x8>
|        paciasp
|        stp     x29, x30, [sp, #-16]!
|        mov     x29, sp
|        bl      <preempt_schedule_common>
|        mov     w0, #0x1
|        ldp     x29, x30, [sp], #16
|        autiasp
|        ret

... compared to the regular form of the function:

| <__cond_resched>:
|        bti     c
|        mrs     x0, sp_el0
|        ldr     x1, [x0, #8]
|        cbz     x1, <__cond_resched+0x18>
|        mov     w0, #0x0
|        ret
|        paciasp
|        stp     x29, x30, [sp, #-16]!
|        mov     x29, sp
|        bl      <preempt_schedule_common>
|        mov     w0, #0x1
|        ldp     x29, x30, [sp], #16
|        autiasp
|        ret

Any architecture which implements static keys should be able to use this
to implement PREEMPT_DYNAMIC with similar cost to non-inlined static
calls. Since this is likely to have greater overhead than (inlined)
static calls, PREEMPT_DYNAMIC is only defaulted to enabled when
HAVE_PREEMPT_DYNAMIC_CALL is selected.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20220214165216.2231574-6-mark.rutland@arm.com
2022-02-19 11:11:08 +01:00
Mark Rutland
33c64734be sched/preempt: Decouple HAVE_PREEMPT_DYNAMIC from GENERIC_ENTRY
Now that the enabled/disabled states for the preemption functions are
declared alongside their definitions, the core PREEMPT_DYNAMIC logic is
no longer tied to GENERIC_ENTRY, and can safely be selected so long as
an architecture provides enabled/disabled states for
irqentry_exit_cond_resched().

Make it possible to select HAVE_PREEMPT_DYNAMIC without GENERIC_ENTRY.

For existing users of HAVE_PREEMPT_DYNAMIC there should be no functional
change as a result of this patch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20220214165216.2231574-5-mark.rutland@arm.com
2022-02-19 11:11:08 +01:00
Mark Rutland
4624a14f4d sched/preempt: Simplify irqentry_exit_cond_resched() callers
Currently callers of irqentry_exit_cond_resched() need to be aware of
whether the function should be indirected via a static call, leading to
ugly ifdeffery in callers.

Save them the hassle with a static inline wrapper that does the right
thing. The raw_irqentry_exit_cond_resched() will also be useful in
subsequent patches which will add conditional wrappers for preemption
functions.

Note: in arch/x86/entry/common.c, xen_pv_evtchn_do_upcall() always calls
irqentry_exit_cond_resched() directly, even when PREEMPT_DYNAMIC is in
use. I believe this is a latent bug (which this patch corrects), but I'm
not entirely certain this wasn't deliberate.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20220214165216.2231574-4-mark.rutland@arm.com
2022-02-19 11:11:08 +01:00
Mark Rutland
8a69fe0be1 sched/preempt: Refactor sched_dynamic_update()
Currently sched_dynamic_update needs to open-code the enabled/disabled
function names for each preemption model it supports, when in practice
this is a boolean enabled/disabled state for each function.

Make this clearer and avoid repetition by defining the enabled/disabled
states at the function definition, and using helper macros to perform the
static_call_update(). Where x86 currently overrides the enabled
function, it is made to provide both the enabled and disabled states for
consistency, with defaults provided by the core code otherwise.

In subsequent patches this will allow us to support PREEMPT_DYNAMIC
without static calls.

There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20220214165216.2231574-3-mark.rutland@arm.com
2022-02-19 11:11:07 +01:00
Mark Rutland
4c7485584d sched/preempt: Move PREEMPT_DYNAMIC logic later
The PREEMPT_DYNAMIC logic in kernel/sched/core.c patches static calls
for a bunch of preemption functions. While most are defined prior to
this, the definition of cond_resched() is later in the file, and so we
only have its declarations from include/linux/sched.h.

In subsequent patches we'd like to define some macros alongside the
definition of each of the preemption functions, which we can use within
sched_dynamic_update(). For this to be possible, the PREEMPT_DYNAMIC
logic needs to be placed after the various preemption functions.

As a preparatory step, this patch moves the PREEMPT_DYNAMIC logic after
the various preemption functions, with no other changes -- this is
purely a move.

There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20220214165216.2231574-2-mark.rutland@arm.com
2022-02-19 11:11:07 +01:00
Peter Zijlstra
b1e8206582 sched: Fix yet more sched_fork() races
Where commit 4ef0c5c6b5 ("kernel/sched: Fix sched_fork() access an
invalid sched_task_group") fixed a fork race vs cgroup, it opened up a
race vs syscalls by not placing the task on the runqueue before it
gets exposed through the pidhash.

Commit 13765de814 ("sched/fair: Fix fault in reweight_entity") is
trying to fix a single instance of this, instead fix the whole class
of issues, effectively reverting this commit.

Fixes: 4ef0c5c6b5 ("kernel/sched: Fix sched_fork() access an invalid sched_task_group")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Tested-by: Zhang Qiao <zhangqiao22@huawei.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/YgoeCbwj5mbCR0qA@hirez.programming.kicks-ass.net
2022-02-19 11:11:05 +01:00
Eric Dumazet
9087c6ff8d bpf: Call maybe_wait_bpf_programs() only once from generic_map_delete_batch()
As stated in the comment found in maybe_wait_bpf_programs(),
the synchronize_rcu() barrier is only needed before returning
to userspace, not after each deletion in the batch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20220218181801.2971275-1-eric.dumazet@gmail.com
2022-02-18 20:37:26 +01:00
Paolo Bonzini
5be2226f41 KVM: x86: allow defining return-0 static calls
A few vendor callbacks are only used by VMX, but they return an integer
or bool value.  Introduce KVM_X86_OP_OPTIONAL_RET0 for them: if a func is
NULL in struct kvm_x86_ops, it will be changed to __static_call_return0
when updating static calls.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-18 12:44:22 -05:00
Jakub Kicinski
a3fc4b1d09 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
bpf-next 2022-02-17

We've added 29 non-merge commits during the last 8 day(s) which contain
a total of 34 files changed, 1502 insertions(+), 524 deletions(-).

The main changes are:

1) Add BTFGen support to bpftool which allows to use CO-RE in kernels without
   BTF info, from Mauricio Vásquez, Rafael David Tinoco, Lorenzo Fontana and
   Leonardo Di Donato. (Details: https://lpc.events/event/11/contributions/948/)

2) Prepare light skeleton to be used in both kernel module and user space
   and convert bpf_preload.ko to use light skeleton, from Alexei Starovoitov.

3) Rework bpftool's versioning scheme and align with libbpf's version number;
   also add linked libbpf version info to "bpftool version", from Quentin Monnet.

4) Add minimal C++ specific additions to bpftool's skeleton codegen to
   facilitate use of C skeletons in C++ applications, from Andrii Nakryiko.

5) Add BPF verifier sanity check whether relative offset on kfunc calls overflows
   desc->imm and reject the BPF program if the case, from Hou Tao.

6) Fix libbpf to use a dynamically allocated buffer for netlink messages to
   avoid receiving truncated messages on some archs, from Toke Høiland-Jørgensen.

7) Various follow-up fixes to the JIT bpf_prog_pack allocator, from Song Liu.

8) Various BPF selftest and vmtest.sh fixes, from Yucong Sun.

9) Fix bpftool pretty print handling on dumping map keys/values when no BTF
   is available, from Jiri Olsa and Yinjun Zhang.

10) Extend XDP frags selftest to check for invalid length, from Lorenzo Bianconi.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (29 commits)
  bpf: bpf_prog_pack: Set proper size before freeing ro_header
  selftests/bpf: Fix crash in core_reloc when bpftool btfgen fails
  selftests/bpf: Fix vmtest.sh to launch smp vm.
  libbpf: Fix memleak in libbpf_netlink_recv()
  bpftool: Fix C++ additions to skeleton
  bpftool: Fix pretty print dump for maps without BTF loaded
  selftests/bpf: Test "bpftool gen min_core_btf"
  bpftool: Gen min_core_btf explanation and examples
  bpftool: Implement btfgen_get_btf()
  bpftool: Implement "gen min_core_btf" logic
  bpftool: Add gen min_core_btf command
  libbpf: Expose bpf_core_{add,free}_cands() to bpftool
  libbpf: Split bpf_core_apply_relo()
  bpf: Reject kfunc calls that overflow insn->imm
  selftests/bpf: Add Skeleton templated wrapper as an example
  bpftool: Add C++-specific open/load/etc skeleton wrappers
  selftests/bpf: Fix GCC11 compiler warnings in -O2 mode
  bpftool: Fix the error when lookup in no-btf maps
  libbpf: Use dynamically allocated buffer when receiving netlink messages
  libbpf: Fix libbpf.map inheritance chain for LIBBPF_0.7.0
  ...
====================

Link: https://lore.kernel.org/r/20220217232027.29831-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-17 17:23:52 -08:00
Song Liu
d24d2a2b0a bpf: bpf_prog_pack: Set proper size before freeing ro_header
bpf_prog_pack_free() uses header->size to decide whether the header
should be freed with module_memfree() or the bpf_prog_pack logic.
However, in kvmalloc() failure path of bpf_jit_binary_pack_alloc(),
header->size is not set yet. As a result, bpf_prog_pack_free() may treat
a slice of a pack as a standalone kvmalloc'd header and call
module_memfree() on the whole pack. This in turn causes use-after-free by
other users of the pack.

Fix this by setting ro_header->size before freeing ro_header.

Fixes: 33c9805860 ("bpf: Introduce bpf_jit_binary_pack_[alloc|finalize|free]")
Reported-by: syzbot+2f649ec6d2eea1495a8f@syzkaller.appspotmail.com
Reported-by: syzbot+ecb1e7e51c52f68f7481@syzkaller.appspotmail.com
Reported-by: syzbot+87f65c75f4a72db05445@syzkaller.appspotmail.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220217183001.1876034-1-song@kernel.org
2022-02-17 13:15:36 -08:00
Jakub Kicinski
93d11e0d76 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Fast path bpf marge for some -next work.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-17 12:22:28 -08:00
Jakub Kicinski
7a2fb91285 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2022-02-17

We've added 8 non-merge commits during the last 7 day(s) which contain
a total of 8 files changed, 119 insertions(+), 15 deletions(-).

The main changes are:

1) Add schedule points in map batch ops, from Eric.

2) Fix bpf_msg_push_data with len 0, from Felix.

3) Fix crash due to incorrect copy_map_value, from Kumar.

4) Fix crash due to out of bounds access into reg2btf_ids, from Kumar.

5) Fix a bpf_timer initialization issue with clang, from Yonghong.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Add schedule points in batch ops
  bpf: Fix crash due to out of bounds access into reg2btf_ids.
  selftests: bpf: Check bpf_msg_push_data return value
  bpf: Fix a bpf_timer initialization issue
  bpf: Emit bpf_timer in vmlinux BTF
  selftests/bpf: Add test for bpf_timer overwriting crash
  bpf: Fix crash due to incorrect copy_map_value
  bpf: Do not try bpf_msg_push_data with len 0
====================

Link: https://lore.kernel.org/r/20220217190000.37925-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-17 12:01:55 -08:00
Jakub Kicinski
6b5567b1b2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-17 11:44:20 -08:00
Eric Dumazet
75134f16e7 bpf: Add schedule points in batch ops
syzbot reported various soft lockups caused by bpf batch operations.

 INFO: task kworker/1:1:27 blocked for more than 140 seconds.
 INFO: task hung in rcu_barrier

Nothing prevents batch ops to process huge amount of data,
we need to add schedule points in them.

Note that maybe_wait_bpf_programs(map) calls from
generic_map_delete_batch() can be factorized by moving
the call after the loop.

This will be done later in -next tree once we get this fix merged,
unless there is strong opinion doing this optimization sooner.

Fixes: aa2e93b8e5 ("bpf: Add generic support for update and delete batch ops")
Fixes: cb4d03ab49 ("bpf: Add generic support for lookup batch op")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Brian Vazquez <brianvv@google.com>
Link: https://lore.kernel.org/bpf/20220217181902.808742-1-eric.dumazet@gmail.com
2022-02-17 10:48:26 -08:00
Hugh Dickins
cea86fe246 mm/munlock: rmap call mlock_vma_page() munlock_vma_page()
Add vma argument to mlock_vma_page() and munlock_vma_page(), make them
inline functions which check (vma->vm_flags & VM_LOCKED) before calling
mlock_page() and munlock_page() in mm/mlock.c.

Add bool compound to mlock_vma_page() and munlock_vma_page(): this is
because we have understandable difficulty in accounting pte maps of THPs,
and if passed a PageHead page, mlock_page() and munlock_page() cannot
tell whether it's a pmd map to be counted or a pte map to be ignored.

Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the
others, and use that to call mlock_vma_page() at the end of the page
adds, and munlock_vma_page() at the end of page_remove_rmap() (end or
beginning? unimportant, but end was easier for assertions in testing).

No page lock is required (although almost all adds happen to hold it):
delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s.
Certainly page lock did serialize with page migration, but I'm having
difficulty explaining why that was ever important.

Mlock accounting on THPs has been hard to define, differed between anon
and file, involved PageDoubleMap in some places and not others, required
clear_page_mlock() at some points.  Keep it simple now: just count the
pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks.

page_add_new_anon_rmap() callers unchanged: they have long been calling
lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED
handling (it also checks for not VM_SPECIAL: I think that's overcautious,
and inconsistent with other checks, that mmap_region() already prevents
VM_LOCKED on VM_SPECIAL; but haven't quite convinced myself to change it).

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-02-17 11:56:48 -05:00
Eric W. Biederman
0cbae9e24f ucounts: Handle wrapping in is_ucounts_overlimit
While examining is_ucounts_overlimit and reading the various messages
I realized that is_ucounts_overlimit fails to deal with counts that
may have wrapped.

Being wrapped should be a transitory state for counts and they should
never be wrapped for long, but it can happen so handle it.

Cc: stable@vger.kernel.org
Fixes: 21d1c5e386 ("Reimplement RLIMIT_NPROC on top of ucounts")
Link: https://lkml.kernel.org/r/20220216155832.680775-5-ebiederm@xmission.com
Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-02-17 09:11:57 -06:00
Eric W. Biederman
c923a8e7ed ucounts: Move RLIMIT_NPROC handling after set_user
During set*id() which cred->ucounts to charge the the current process
to is not known until after set_cred_ucounts.  So move the
RLIMIT_NPROC checking into a new helper flag_nproc_exceeded and call
flag_nproc_exceeded after set_cred_ucounts.

This is very much an arbitrary subset of the places where we currently
change the RLIMIT_NPROC accounting, designed to preserve the existing
logic.

Fixing the existing logic will be the subject of another series of
changes.

Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20220216155832.680775-4-ebiederm@xmission.com
Fixes: 21d1c5e386 ("Reimplement RLIMIT_NPROC on top of ucounts")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-02-17 09:11:26 -06:00
Eric W. Biederman
a55d07294f ucounts: Base set_cred_ucounts changes on the real user
Michal Koutný <mkoutny@suse.com> wrote:
> Tasks are associated to multiple users at once. Historically and as per
> setrlimit(2) RLIMIT_NPROC is enforce based on real user ID.
>
> The commit 21d1c5e386 ("Reimplement RLIMIT_NPROC on top of ucounts")
> made the accounting structure "indexed" by euid and hence potentially
> account tasks differently.
>
> The effective user ID may be different e.g. for setuid programs but
> those are exec'd into already existing task (i.e. below limit), so
> different accounting is moot.
>
> Some special setresuid(2) users may notice the difference, justifying
> this fix.

I looked at cred->ucount and it is only used for rlimit operations
that were previously stored in cred->user.  Making the fact
cred->ucount can refer to a different user from cred->user a bug,
affecting all uses of cred->ulimit not just RLIMIT_NPROC.

Fix set_cred_ucounts to always use the real uid not the effective uid.

Further simplify set_cred_ucounts by noticing that set_cred_ucounts
somehow retained a draft version of the check to see if alloc_ucounts
was needed that checks the new->user and new->user_ns against the
current_real_cred().  Remove that draft version of the check.

All that matters for setting the cred->ucounts are the user_ns and uid
fields in the cred.

Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20220207121800.5079-4-mkoutny@suse.com
Link: https://lkml.kernel.org/r/20220216155832.680775-3-ebiederm@xmission.com
Reported-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Fixes: 21d1c5e386 ("Reimplement RLIMIT_NPROC on top of ucounts")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-02-17 09:11:02 -06:00
Eric W. Biederman
8f2f9c4d82 ucounts: Enforce RLIMIT_NPROC not RLIMIT_NPROC+1
Michal Koutný <mkoutny@suse.com> wrote:

> It was reported that v5.14 behaves differently when enforcing
> RLIMIT_NPROC limit, namely, it allows one more task than previously.
> This is consequence of the commit 21d1c5e386 ("Reimplement
> RLIMIT_NPROC on top of ucounts") that missed the sharpness of
> equality in the forking path.

This can be fixed either by fixing the test or by moving the increment
to be before the test.  Fix it my moving copy_creds which contains
the increment before is_ucounts_overlimit.

In the case of CLONE_NEWUSER the ucounts in the task_cred changes.
The function is_ucounts_overlimit needs to use the final version of
the ucounts for the new process.  Which means moving the
is_ucounts_overlimit test after copy_creds is necessary.

Both the test in fork and the test in set_user were semantically
changed when the code moved to ucounts.  The change of the test in
fork was bad because it was before the increment.  The test in
set_user was wrong and the change to ucounts fixed it.  So this
fix only restores the old behavior in one lcation not two.

Link: https://lkml.kernel.org/r/20220204181144.24462-1-mkoutny@suse.com
Link: https://lkml.kernel.org/r/20220216155832.680775-2-ebiederm@xmission.com
Cc: stable@vger.kernel.org
Reported-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Fixes: 21d1c5e386 ("Reimplement RLIMIT_NPROC on top of ucounts")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-02-17 09:10:33 -06:00
Eric W. Biederman
c16bdeb5a3 rlimit: Fix RLIMIT_NPROC enforcement failure caused by capability calls in set_user
Solar Designer <solar@openwall.com> wrote:
> I'm not aware of anyone actually running into this issue and reporting
> it.  The systems that I personally know use suexec along with rlimits
> still run older/distro kernels, so would not yet be affected.
>
> So my mention was based on my understanding of how suexec works, and
> code review.  Specifically, Apache httpd has the setting RLimitNPROC,
> which makes it set RLIMIT_NPROC:
>
> https://httpd.apache.org/docs/2.4/mod/core.html#rlimitnproc
>
> The above documentation for it includes:
>
> "This applies to processes forked from Apache httpd children servicing
> requests, not the Apache httpd children themselves. This includes CGI
> scripts and SSI exec commands, but not any processes forked from the
> Apache httpd parent, such as piped logs."
>
> In code, there are:
>
> ./modules/generators/mod_cgid.c:        ( (cgid_req.limits.limit_nproc_set) && ((rc = apr_procattr_limit_set(procattr, APR_LIMIT_NPROC,
> ./modules/generators/mod_cgi.c:        ((rc = apr_procattr_limit_set(procattr, APR_LIMIT_NPROC,
> ./modules/filters/mod_ext_filter.c:    rv = apr_procattr_limit_set(procattr, APR_LIMIT_NPROC, conf->limit_nproc);
>
> For example, in mod_cgi.c this is in run_cgi_child().
>
> I think this means an httpd child sets RLIMIT_NPROC shortly before it
> execs suexec, which is a SUID root program.  suexec then switches to the
> target user and execs the CGI script.
>
> Before 2863643fb8, the setuid() in suexec would set the flag, and the
> target user's process count would be checked against RLIMIT_NPROC on
> execve().  After 2863643fb8, the setuid() in suexec wouldn't set the
> flag because setuid() is (naturally) called when the process is still
> running as root (thus, has those limits bypass capabilities), and
> accordingly execve() would not check the target user's process count
> against RLIMIT_NPROC.

In commit 2863643fb8 ("set_user: add capability check when
rlimit(RLIMIT_NPROC) exceeds") capable calls were added to set_user to
make it more consistent with fork.  Unfortunately because of call site
differences those capable calls were checking the credentials of the
user before set*id() instead of after set*id().

This breaks enforcement of RLIMIT_NPROC for applications that set the
rlimit and then call set*id() while holding a full set of
capabilities.  The capabilities are only changed in the new credential
in security_task_fix_setuid().

The code in apache suexec appears to follow this pattern.

Commit 909cc4ae86f3 ("[PATCH] Fix two bugs with process limits
(RLIMIT_NPROC)") where this check was added describes the targes of this
capability check as:

  2/ When a root-owned process (e.g. cgiwrap) sets up process limits and then
      calls setuid, the setuid should fail if the user would then be running
      more than rlim_cur[RLIMIT_NPROC] processes, but it doesn't.  This patch
      adds an appropriate test.  With this patch, and per-user process limit
      imposed in cgiwrap really works.

So the original use case of this check also appears to match the broken
pattern.

Restore the enforcement of RLIMIT_NPROC by removing the bad capable
checks added in set_user.  This unfortunately restores the
inconsistent state the code has been in for the last 11 years, but
dealing with the inconsistencies looks like a larger problem.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20210907213042.GA22626@openwall.com/
Link: https://lkml.kernel.org/r/20220212221412.GA29214@openwall.com
Link: https://lkml.kernel.org/r/20220216155832.680775-1-ebiederm@xmission.com
Fixes: 2863643fb8 ("set_user: add capability check when rlimit(RLIMIT_NPROC) exceeds")
History-Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Reviewed-by: Solar Designer <solar@openwall.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-02-17 09:08:05 -06:00
Dmitry Torokhov
a8e8f851e8 module: fix building with sysfs disabled
Sysfs support might be disabled so we need to guard the code that
instantiates "compression" attribute with an #ifdef.

Fixes: b1ae6dc41e ("module: add in-kernel support for decompressing")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-02-16 12:51:32 -08:00
Kumar Kartikeya Dwivedi
45ce4b4f90 bpf: Fix crash due to out of bounds access into reg2btf_ids.
When commit e6ac2450d6 ("bpf: Support bpf program calling kernel function") added
kfunc support, it defined reg2btf_ids as a cheap way to translate the verifier
reg type to the appropriate btf_vmlinux BTF ID, however
commit c25b2ae136 ("bpf: Replace PTR_TO_XXX_OR_NULL with PTR_TO_XXX | PTR_MAYBE_NULL")
moved the __BPF_REG_TYPE_MAX from the last member of bpf_reg_type enum to after
the base register types, and defined other variants using type flag
composition. However, now, the direct usage of reg->type to index into
reg2btf_ids may no longer fall into __BPF_REG_TYPE_MAX range, and hence lead to
out of bounds access and kernel crash on dereference of bad pointer.

Fixes: c25b2ae136 ("bpf: Replace PTR_TO_XXX_OR_NULL with PTR_TO_XXX | PTR_MAYBE_NULL")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220216201943.624869-1-memxor@gmail.com
2022-02-16 12:46:51 -08:00
Ye Bin
3f51aa9e29 PM: hibernate: fix load_image_and_restore() error path
As 'swsusp_check' open 'hib_resume_bdev', if call 'create_basic_memory_bitmaps'
failed, we need to close 'hib_resume_bdev' in 'load_image_and_restore' function.

Signed-off-by: Ye Bin <yebin10@huawei.com>
[ rjw: Subject ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-02-16 19:47:52 +01:00
Mauricio Vásquez
adb8fa195e libbpf: Split bpf_core_apply_relo()
BTFGen needs to run the core relocation logic in order to understand
what are the types involved in a given relocation.

Currently bpf_core_apply_relo() calculates and **applies** a relocation
to an instruction. Having both operations in the same function makes it
difficult to only calculate the relocation without patching the
instruction. This commit splits that logic in two different phases: (1)
calculate the relocation and (2) patch the instruction.

For the first phase bpf_core_apply_relo() is renamed to
bpf_core_calc_relo_insn() who is now only on charge of calculating the
relocation, the second phase uses the already existing
bpf_core_patch_insn(). bpf_object__relocate_core() uses both of them and
the BTFGen will use only bpf_core_calc_relo_insn().

Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>
Signed-off-by: Rafael David Tinoco <rafael.tinoco@aquasec.com>
Signed-off-by: Lorenzo Fontana <lorenzo.fontana@elastic.co>
Signed-off-by: Leonardo Di Donato <leonardo.didonato@elastic.co>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220215225856.671072-2-mauricio@kinvolk.io
2022-02-16 10:05:42 -08:00
Waiman Long
fb7275acd6 locking/lockdep: Iterate lock_classes directly when reading lockdep files
When dumping lock_classes information via /proc/lockdep, we can't take
the lockdep lock as the lock hold time is indeterminate. Iterating
over all_lock_classes without holding lock can be dangerous as there
is a slight chance that it may branch off to other lists leading to
infinite loop or even access invalid memory if changes are made to
all_lock_classes list in parallel.

To avoid this problem, iteration of lock classes is now done directly
on the lock_classes array itself. The lock_classes_in_use bitmap is
checked to see if the lock class is being used. To avoid iterating
the full array all the times, a new max_lock_class_idx value is added
to track the maximum lock_class index that is currently being used.

We can theoretically take the lockdep lock for iterating all_lock_classes
when other lockdep files (lockdep_stats and lock_stat) are accessed as
the lock hold time will be shorter for them. For consistency, they are
also modified to iterate the lock_classes array directly.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220211035526.1329503-2-longman@redhat.com
2022-02-16 15:57:58 +01:00
Frederic Weisbecker
ed3b362d54 sched/isolation: Split housekeeping cpumask per isolation features
To prepare for supporting each housekeeping feature toward cpuset, split
the global housekeeping cpumask per HK_TYPE_* entry.

This will later allow, for example, to runtime modify the cpulist passed
through "isolcpus=", "nohz_full=" and "rcu_nocbs=" kernel boot
parameters.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20220207155910.527133-9-frederic@kernel.org
2022-02-16 15:57:56 +01:00
Frederic Weisbecker
65e53f869e sched/isolation: Fix housekeeping_mask memory leak
If "nohz_full=" or "isolcpus=nohz" are called with CONFIG_NO_HZ_FULL=n,
housekeeping_mask doesn't get freed despite it being unused if
housekeeping_setup() is called for the first time.

Check this scenario first to fix this, so that no useless allocation
is performed.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20220207155910.527133-8-frederic@kernel.org
2022-02-16 15:57:56 +01:00
Frederic Weisbecker
0cd3e59de1 sched/isolation: Consolidate error handling
Centralize the mask freeing and return value for the error path. This
makes potential leaks more visible.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20220207155910.527133-7-frederic@kernel.org
2022-02-16 15:57:55 +01:00
Frederic Weisbecker
6367b600e3 sched/isolation: Consolidate check for housekeeping minimum service
There can be two subsequent calls to housekeeping_setup() due to
"nohz_full=" and "isolcpus=" that can mix up.  The two passes each have
their own way to deal with an empty housekeeping set of CPUs.
Consolidate this part and remove the awful "tmp" based naming.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20220207155910.527133-6-frederic@kernel.org
2022-02-16 15:57:55 +01:00
Frederic Weisbecker
04d4e665a6 sched/isolation: Use single feature type while referring to housekeeping cpumask
Refer to housekeeping APIs using single feature types instead of flags.
This prevents from passing multiple isolation features at once to
housekeeping interfaces, which soon won't be possible anymore as each
isolation features will have their own cpumask.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20220207155910.527133-5-frederic@kernel.org
2022-02-16 15:57:55 +01:00
Frederic Weisbecker
7b45b51e77 workqueue: Decouple HK_FLAG_WQ and HK_FLAG_DOMAIN cpumask fetch
To prepare for supporting each feature of the housekeeping cpumask
toward cpuset, prepare each of the HK_FLAG_* entries to move to their
own cpumask with enforcing to fetch them individually. The new
constraint is that multiple HK_FLAG_* entries can't be mixed together
anymore in a single call to housekeeping cpumask().

This will later allow, for example, to runtime modify the cpulist passed
through "isolcpus=", "nohz_full=" and "rcu_nocbs=" kernel boot
parameters.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220207155910.527133-3-frederic@kernel.org
2022-02-16 15:57:54 +01:00
Zhaoyang Huang
e6df4ead85 psi: fix possible trigger missing in the window
When a new threshold breaching stall happens after a psi event was
generated and within the window duration, the new event is not
generated because the events are rate-limited to one per window. If
after that no new stall is recorded then the event will not be
generated even after rate-limiting duration has passed. This is
happening because with no new stall, window_update will not be called
even though threshold was previously breached. To fix this, record
threshold breaching occurrence and generate the event once window
duration is passed.

Suggested-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Link: https://lore.kernel.org/r/1643093818-19835-1-git-send-email-huangzhaoyang@gmail.com
2022-02-16 15:57:54 +01:00
Huang Ying
5c7b1aaf13 sched/numa: Avoid migrating task to CPU-less node
In a typical memory tiering system, there's no CPU in slow (PMEM) NUMA
nodes.  But if the number of the hint page faults on a PMEM node is
the max for a task, The current NUMA balancing policy may try to place
the task on the PMEM node instead of DRAM node.  This is unreasonable,
because there's no CPU in PMEM NUMA nodes.  To fix this, CPU-less
nodes are ignored when searching the migration target node for a task
in this patch.

To test the patch, we run a workload that accesses more memory in PMEM
node than memory in DRAM node.  Without the patch, the PMEM node will
be chosen as preferred node in task_numa_placement().  While the DRAM
node will be chosen instead with the patch.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220214121553.582248-2-ying.huang@intel.com
2022-02-16 15:57:53 +01:00
Huang Ying
0fb3978b0a sched/numa: Fix NUMA topology for systems with CPU-less nodes
The NUMA topology parameters (sched_numa_topology_type,
sched_domains_numa_levels, and sched_max_numa_distance, etc.)
identified by scheduler may be wrong for systems with CPU-less nodes.

For example, the ACPI SLIT of a system with CPU-less persistent
memory (Intel Optane DCPMM) nodes is as follows,

[000h 0000   4]                    Signature : "SLIT"    [System Locality Information Table]
[004h 0004   4]                 Table Length : 0000042C
[008h 0008   1]                     Revision : 01
[009h 0009   1]                     Checksum : 59
[00Ah 0010   6]                       Oem ID : "XXXX"
[010h 0016   8]                 Oem Table ID : "XXXXXXX"
[018h 0024   4]                 Oem Revision : 00000001
[01Ch 0028   4]              Asl Compiler ID : "INTL"
[020h 0032   4]        Asl Compiler Revision : 20091013

[024h 0036   8]                   Localities : 0000000000000004
[02Ch 0044   4]                 Locality   0 : 0A 15 11 1C
[030h 0048   4]                 Locality   1 : 15 0A 1C 11
[034h 0052   4]                 Locality   2 : 11 1C 0A 1C
[038h 0056   4]                 Locality   3 : 1C 11 1C 0A

While the `numactl -H` output is as follows,

available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 0 size: 64136 MB
node 0 free: 5981 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 1 size: 64466 MB
node 1 free: 10415 MB
node 2 cpus:
node 2 size: 253952 MB
node 2 free: 253920 MB
node 3 cpus:
node 3 size: 253952 MB
node 3 free: 253951 MB
node distances:
node   0   1   2   3
  0:  10  21  17  28
  1:  21  10  28  17
  2:  17  28  10  28
  3:  28  17  28  10

In this system, there are only 2 sockets.  In each memory controller,
both DRAM and PMEM DIMMs are installed.  Although the physical NUMA
topology is simple, the logical NUMA topology becomes a little
complex.  Because both the distance(0, 1) and distance (1, 3) are less
than the distance (0, 3), it appears that node 1 sits between node 0
and node 3.  And the whole system appears to be a glueless mesh NUMA
topology type.  But it's definitely not, there is even no CPU in node 3.

This isn't a practical problem now yet.  Because the PMEM nodes (node
2 and node 3 in example system) are offlined by default during system
boot.  So init_numa_topology_type() called during system boot will
ignore them and set sched_numa_topology_type to NUMA_DIRECT.  And
init_numa_topology_type() is only called at runtime when a CPU of a
never-onlined-before node gets plugged in.  And there's no CPU in the
PMEM nodes.  But it appears better to fix this to make the code more
robust.

To test the potential problem.  We have used a debug patch to call
init_numa_topology_type() when the PMEM node is onlined (in
__set_migration_target_nodes()).  With that, the NUMA parameters
identified by scheduler is as follows,

sched_numa_topology_type:	NUMA_GLUELESS_MESH
sched_domains_numa_levels:	4
sched_max_numa_distance:	28

To fix the issue, the CPU-less nodes are ignored when the NUMA topology
parameters are identified.  Because a node may become CPU-less or not
at run time because of CPU hotplug, the NUMA topology parameters need
to be re-initialized at runtime for CPU hotplug too.

With the patch, the NUMA parameters identified for the example system
above is as follows,

sched_numa_topology_type:	NUMA_DIRECT
sched_domains_numa_levels:	2
sched_max_numa_distance:	21

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220214121553.582248-1-ying.huang@intel.com
2022-02-16 15:57:53 +01:00
Yury Norov
1087ad4e3f sched: replace cpumask_weight with cpumask_empty where appropriate
In some places, kernel/sched code calls cpumask_weight() to check if
any bit of a given cpumask is set. We can do it more efficiently with
cpumask_empty() because cpumask_empty() stops traversing the cpumask as
soon as it finds first set bit, while cpumask_weight() counts all bits
unconditionally.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220210224933.379149-23-yury.norov@gmail.com
2022-02-16 15:57:53 +01:00
Christophe Leroy
b64913394f lkdtm: Really write into kernel text in WRITE_KERN
WRITE_KERN is supposed to overwrite some kernel text, namely
do_overwritten() function.

But at the time being it overwrites do_overwritten() function
descriptor, not function text.

Fix it by dereferencing the function descriptor to obtain
function text pointer. Export dereference_function_descriptor()
for when LKDTM is built as a module.

And make do_overwritten() noinline so that it is really
do_overwritten() which is called by lkdtm_WRITE_KERN().

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/31e58eaffb5bc51c07d8d4891d1982100ade8cfc.1644928018.git.christophe.leroy@csgroup.eu
2022-02-16 23:25:12 +11:00
Christophe Leroy
e1478d8eaf asm-generic: Refactor dereference_[kernel]_function_descriptor()
dereference_function_descriptor() and
dereference_kernel_function_descriptor() are identical on the
three architectures implementing them.

Make them common and put them out-of-line in kernel/extable.c
which is one of the users and has similar type of functions.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Helge Deller <deller@gmx.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/449db09b2eba57f4ab05f80102a67d8675bc8bcd.1644928018.git.christophe.leroy@csgroup.eu
2022-02-16 23:25:11 +11:00
Hou Tao
8cbf062a25 bpf: Reject kfunc calls that overflow insn->imm
Now kfunc call uses s32 to represent the offset between the address of
kfunc and __bpf_call_base, but it doesn't check whether or not s32 will
be overflowed. The overflow is possible when kfunc is in module and the
offset between module and kernel is greater than 2GB. Take arm64 as an
example, before commit b2eed9b588 ("arm64/kernel: kaslr: reduce module
randomization range to 2 GB"), the offset between module symbol and
__bpf_call_base will in 4GB range due to KASLR and may overflow s32.

So add an extra checking to reject these invalid kfunc calls.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20220215065732.3179408-1-houtao1@huawei.com
2022-02-15 10:05:11 -08:00
John Ogness
2ba3673d70 printk: use atomic updates for klogd work
The per-cpu @printk_pending variable can be updated from
sleepable contexts, such as:

  get_random_bytes()
    warn_unseeded_randomness()
      printk_deferred()
        defer_console_output()

and can be updated from interrupt contexts, such as:

  handle_irq_event_percpu()
    __irq_wake_thread()
      wake_up_process()
        try_to_wake_up()
          select_task_rq()
            select_fallback_rq()
              printk_deferred()
                defer_console_output()

and can be updated from NMI contexts, such as:

  vprintk()
    if (in_nmi()) defer_console_output()

Therefore the atomic variant of the updating functions must be used.

Replace __this_cpu_xchg() with this_cpu_xchg().
Replace __this_cpu_or() with this_cpu_or().

Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/87iltld4ue.fsf@jogness.linutronix.de
2022-02-15 16:10:55 +01:00
Marc Zyngier
0a25cb5544 genirq/debugfs: Use irq_print_chip() when provided by irqchip
Since irqchips have the option to implement irq_print_chip, use this
when available to output the irqchip name in debugfs.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220215112154.1360040-1-maz@kernel.org
2022-02-15 11:22:34 +00:00
Marc Zyngier
393e1280f7 genirq: Allow irq_chip registration functions to take a const irq_chip
In order to let a const irqchip be fed to the irqchip layer, adjust
the various prototypes. An extra cast in irq_set_chip()() is required
to avoid a warning.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Link: https://lore.kernel.org/r/20220209162607.1118325-3-maz@kernel.org
2022-02-15 11:10:21 +00:00
Marc Zyngier
45ec846c1c irqdomain: Let irq_domain_set_{info,hwirq_and_chip} take a const irq_chip
In order to let a const irqchip be fed to the irqchip layer, adjust
the various prototypes. An extra cast in irq_domain_set_hwirq_and_chip()
is required to avoid a warning.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Link: https://lore.kernel.org/r/20220209162607.1118325-2-maz@kernel.org
2022-02-15 11:10:21 +00:00
Peter Zijlstra
a3d29e8291 sched: Define and initialize a flag to identify valid PASID in the task
Add a new single bit field to the task structure to track whether this task
has initialized the IA32_PASID MSR to the mm's PASID.

Initialize the field to zero when creating a new task with fork/clone.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220207230254.3342514-8-fenghua.yu@intel.com
2022-02-15 11:31:43 +01:00
Fenghua Yu
701fac4038 iommu/sva: Assign a PASID to mm on PASID allocation and free it on mm exit
PASIDs are process-wide. It was attempted to use refcounted PASIDs to
free them when the last thread drops the refcount. This turned out to
be complex and error prone. Given the fact that the PASID space is 20
bits, which allows up to 1M processes to have a PASID associated
concurrently, PASID resource exhaustion is not a realistic concern.

Therefore, it was decided to simplify the approach and stick with lazy
on demand PASID allocation, but drop the eager free approach and make an
allocated PASID's lifetime bound to the lifetime of the process.

Get rid of the refcounting mechanisms and replace/rename the interfaces
to reflect this new approach.

  [ bp: Massage commit message. ]

Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Joerg Roedel <jroedel@suse.de>
Link: https://lore.kernel.org/r/20220207230254.3342514-6-fenghua.yu@intel.com
2022-02-15 11:31:35 +01:00
Zhang Qiao
05c7b7a92c cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug
As previously discussed(https://lkml.org/lkml/2022/1/20/51),
cpuset_attach() is affected with similar cpu hotplug race,
as follow scenario:

     cpuset_attach()				cpu hotplug
    ---------------------------            ----------------------
    down_write(cpuset_rwsem)
    guarantee_online_cpus() // (load cpus_attach)
					sched_cpu_deactivate
					  set_cpu_active()
					  // will change cpu_active_mask
    set_cpus_allowed_ptr(cpus_attach)
      __set_cpus_allowed_ptr_locked()
       // (if the intersection of cpus_attach and
         cpu_active_mask is empty, will return -EINVAL)
    up_write(cpuset_rwsem)

To avoid races such as described above, protect cpuset_attach() call
with cpu_hotplug_lock.

Fixes: be367d0992 ("cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time")
Cc: stable@vger.kernel.org # v2.6.32+
Reported-by: Zhao Gongyi <zhaogongyi@huawei.com>
Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-02-14 09:48:04 -10:00
Fenghua Yu
a6cbd44093 kernel/fork: Initialize mm's PASID
A new mm doesn't have a PASID yet when it's created. Initialize
the mm's PASID on fork() or for init_mm to INVALID_IOASID (-1).

INIT_PASID (0) is reserved for kernel legacy DMA PASID. It cannot be
allocated to a user process. Initializing the process's PASID to 0 may
cause confusion that's why the process uses the reserved kernel legacy
DMA PASID. Initializing the PASID to INVALID_IOASID (-1) explicitly
tells the process doesn't have a valid PASID yet.

Even though the only user of mm_pasid_init() is in fork.c, define it in
<linux/sched/mm.h> as the first of three mm/pasid life cycle functions
(init/set/drop) to keep these all together.

Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220207230254.3342514-5-fenghua.yu@intel.com
2022-02-14 19:51:47 +01:00
Yury Norov
6a2c1d450a rcu: Replace cpumask_weight with cpumask_empty where appropriate
In some places, RCU code calls cpumask_weight() to check if any bit of a
given cpumask is set. We can do it more efficiently with cpumask_empty()
because cpumask_empty() stops traversing the cpumask as soon as it finds
first set bit, while cpumask_weight() counts all bits unconditionally.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-14 10:36:58 -08:00
Ingo Molnar
58d4292bd0 rcu: Uninline multi-use function: finish_rcuwait()
This is a rarely used function, so uninlining its 3 instructions
is probably a win or a wash - but the main motivation is to
make <linux/rcuwait.h> independent of task_struct details.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-14 10:36:58 -08:00
Paul E. McKenney
c099290310 rcu: Mark writes to the rcu_segcblist structure's ->flags field
KCSAN reports data races between the rcu_segcblist_clear_flags() and
rcu_segcblist_set_flags() functions, though misreporting the latter
as a call to rcu_segcblist_is_enabled() from call_rcu().  This commit
converts the updates of this field to WRITE_ONCE(), relying on the
resulting unmarked reads to continue to detect buggy concurrent writes
to this field.

Reported-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
2022-02-14 10:36:58 -08:00
Zqiang
d818cc76e2 kasan: Record work creation stack trace with interrupts enabled
Recording the work creation stack trace for KASAN reports in
call_rcu() is expensive, due to unwinding the stack, but also
due to acquiring depot_lock inside stackdepot (which may be contended).
Because calling kasan_record_aux_stack_noalloc() does not require
interrupts to already be disabled, this may unnecessarily extend
the time with interrupts disabled.

Therefore, move calling kasan_record_aux_stack() before the section
with interrupts disabled.

Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-14 10:36:58 -08:00
Paul E. McKenney
1fe09ebe7a rcu: Inline __call_rcu() into call_rcu()
Because __call_rcu() is invoked only by call_rcu(), this commit inlines
the former into the latter.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-14 10:36:58 -08:00
David Woodhouse
218b957a69 rcu: Add mutex for rcu boost kthread spawning and affinity setting
As we handle parallel CPU bringup, we will need to take care to avoid
spawning multiple boost threads, or race conditions when setting their
affinity. Spotted by Paul McKenney.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-14 10:36:35 -08:00
Fenghua Yu
7a853c2d59 mm: Change CONFIG option for mm->pasid field
This currently depends on CONFIG_IOMMU_SUPPORT. But it is only
needed when CONFIG_IOMMU_SVA option is enabled.

Change the CONFIG guards around definition and initialization
of mm->pasid field.

Suggested-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20220207230254.3342514-3-fenghua.yu@intel.com
2022-02-14 19:23:50 +01:00
Stephen Brennan
8ebc476fd5 printk: Drop console_sem during panic
If another CPU is in panic, we are about to be halted. Try to gracefully
abandon the console_sem, leaving it free for the panic CPU to grab.

Suggested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220202171821.179394-5-stephen.s.brennan@oracle.com
2022-02-14 13:39:20 +01:00
Stephen Brennan
13fb0f74d7 printk: Avoid livelock with heavy printk during panic
During panic(), if another CPU is writing heavily the kernel log (e.g.
via /dev/kmsg), then the panic CPU may livelock writing out its messages
to the console. Note when too many messages are dropped during panic and
suppress further printk, except from the panic CPU. This could result in
some important messages being dropped. However, messages are already
being dropped, so this approach at least prevents a livelock.

Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220202171821.179394-4-stephen.s.brennan@oracle.com
2022-02-14 13:39:20 +01:00
Stephen Brennan
d51507098f printk: disable optimistic spin during panic
A CPU executing with console lock spinning enabled might be halted
during a panic. Before the panicking CPU calls console_flush_on_panic(),
it may call console_trylock(), which attempts to optimistically spin,
deadlocking the panic CPU:

CPU 0 (panic CPU)             CPU 1
-----------------             ------
                              printk() {
                                vprintk_func() {
                                  vprintk_default() {
                                    vprintk_emit() {
                                      console_unlock() {
                                        console_lock_spinning_enable();
                                        ... printing to console ...
panic() {
  crash_smp_send_stop() {
    NMI  -------------------> HALT
  }
  atomic_notifier_call_chain() {
    printk() {
      ...
      console_trylock_spinnning() {
        // optimistic spin infinitely

This hang during panic can be induced when a kdump kernel is loaded, and
crash_kexec_post_notifiers=1 is present on the kernel command line. The
following script which concurrently writes to /dev/kmsg, and triggers a
panic, can result in this hang:

    #!/bin/bash
    date
    # 991 chars (based on log buffer size):
    chars="$(printf 'a%.0s' {1..991})"
    while :; do
        echo $chars > /dev/kmsg
    done &
    echo c > /proc/sysrq-trigger &
    date
    exit

To avoid this deadlock, ensure that console_trylock_spinning() does not
allow spinning once a panic has begun.

Fixes: dbdda842fe ("printk: Add console owner and waiter logic to load balance console writes")

Suggested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220202171821.179394-3-stephen.s.brennan@oracle.com
2022-02-14 13:39:20 +01:00
Stephen Brennan
7749861785 printk: Add panic_in_progress helper
This will be used help avoid deadlocks during panics. Although it would
be better to include this in linux/panic.h, it would require that header
to include linux/atomic.h as well. On some architectures, this results
in a circular dependency as well. So instead add the helper directly to
printk.c.

Suggested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220202171821.179394-2-stephen.s.brennan@oracle.com
2022-02-14 13:39:20 +01:00
Halil Pasic
ddbd89deb7 swiotlb: fix info leak with DMA_FROM_DEVICE
The problem I'm addressing was discovered by the LTP test covering
cve-2018-1000204.

A short description of what happens follows:
1) The test case issues a command code 00 (TEST UNIT READY) via the SG_IO
   interface with: dxfer_len == 524288, dxdfer_dir == SG_DXFER_FROM_DEV
   and a corresponding dxferp. The peculiar thing about this is that TUR
   is not reading from the device.
2) In sg_start_req() the invocation of blk_rq_map_user() effectively
   bounces the user-space buffer. As if the device was to transfer into
   it. Since commit a45b599ad8 ("scsi: sg: allocate with __GFP_ZERO in
   sg_build_indirect()") we make sure this first bounce buffer is
   allocated with GFP_ZERO.
3) For the rest of the story we keep ignoring that we have a TUR, so the
   device won't touch the buffer we prepare as if the we had a
   DMA_FROM_DEVICE type of situation. My setup uses a virtio-scsi device
   and the  buffer allocated by SG is mapped by the function
   virtqueue_add_split() which uses DMA_FROM_DEVICE for the "in" sgs (here
   scatter-gather and not scsi generics). This mapping involves bouncing
   via the swiotlb (we need swiotlb to do virtio in protected guest like
   s390 Secure Execution, or AMD SEV).
4) When the SCSI TUR is done, we first copy back the content of the second
   (that is swiotlb) bounce buffer (which most likely contains some
   previous IO data), to the first bounce buffer, which contains all
   zeros.  Then we copy back the content of the first bounce buffer to
   the user-space buffer.
5) The test case detects that the buffer, which it zero-initialized,
  ain't all zeros and fails.

One can argue that this is an swiotlb problem, because without swiotlb
we leak all zeros, and the swiotlb should be transparent in a sense that
it does not affect the outcome (if all other participants are well
behaved).

Copying the content of the original buffer into the swiotlb buffer is
the only way I can think of to make swiotlb transparent in such
scenarios. So let's do just that if in doubt, but allow the driver
to tell us that the whole mapped buffer is going to be overwritten,
in which case we can preserve the old behavior and avoid the performance
impact of the extra bounce.

Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-02-14 10:22:28 +01:00
Linus Torvalds
6f35736723 - Fix a NULL-ptr dereference when recalculating a sched entity's weight
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmII+xoACgkQEsHwGGHe
 VUqiEQ/+JI1pSIk5+arj5v7B1GaiqUg2UE0MNgFErPfMsMCWMT1/3R0P+ULzy/Dl
 Tv/mQnHznfSdntUVoJ2iYCayLFAJC7AnDVq00inHQI/vIvQnwjUuoE6AYDBJx5FG
 oLu9W1eUjXt6RVlhVA3+OY4PJlxoEXCNbai7cDYoHbOYwQDGfPPuNkdcvLsnPC8Z
 xIICP4+ncTgMw4unI0edqCVYYtuKv8GUklaFlNyPv/PNXYnf1mFvGQ8w4Zq2f3og
 ndVFDj3HMijYhpiBlcLQCvq1aT11ubmCaEcKufzwXWzQfEnMhHw4he/vobuuycSG
 i9vGUO8Qo+7sRCFL0CGI3UBchTWbzUe57Aj/rWijmKl4zFWkYcc7PBDEZetfU/rS
 CrYD83WzfS+DV7NRThIpdqG8fpqGcIp40ot724XZN05NEmX3oxPNT/UabYyBx/0h
 IGb63PeQndsfzFxVx5fl6GeiuMvqLHtgVbJV3hWE5Ea8zrOBWANAyTdRet486IL/
 JKkFb8VGtCjRzjfpAoWt81U6JL6bh0yR+tfLu4QNOK3ELhDxTVsS3JnsEUJjgZvk
 4tOJhuIR4ApGYiwKGTNo9J9GfppMUcwHOxwmtjFgG17MS5Iqam7XrHdnpInB8CCF
 WrcFlEg2EiufSOrEp4Ub9Dc/uqmxmYO/NNWZxXCvNgDOlO0sKHs=
 =a9P6
 -----END PGP SIGNATURE-----

Merge tag 'sched_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Borislav Petkov:
 "Fix a NULL-ptr dereference when recalculating a sched entity's weight"

* tag 'sched_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix fault in reweight_entity
2022-02-13 09:27:26 -08:00
Linus Torvalds
f5e02656b1 - Prevent cgroup event list corruption when switching events
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmII+WQACgkQEsHwGGHe
 VUofdA//QLbJMYy5BsRFTtN7uFMmeDX8n0/9Wv06laRkq+emTTZCvKks9IkbGuXj
 nuaaI/PIxFrfrQpgY8AVgopW/DdQFjSlnKY+SyJKhVzpzR7QxyrJVj+HpYEAa9bM
 E41Hw2IS6CQ+Iy1cdE3LLd7HagAf935w1OQm+P7x7WhYtnFwpbW0u/J3QDsFUU/h
 t52Ap/NxoMfd1LSAh8ZSUfKn22pnl0NfY1FtZ42mNM0K/LLGfX1oRiAnZpLzW56c
 lS/dPCaaiqYuGRhQTtdPk9tS6tfpeeF2FzXsnrXCNBPRQbg2CxAKqne7ohTHA3uY
 eaQAS4Fxq3QN3ooVaPsiHRNbAPPPNLdaRAgy8IWM5g3CbKhzWRLd0faerkBwTQ82
 bT1OAjglLxj564dHATM5/tQsNU39EBnTBx35Y2XKuyu3XffojCvfHPtFCPik+xV/
 kscaXyMBSlvld2V57LkKZUbtL/vqkobKTfuu5TNKVp1lpKBbQOePldpe4nWPLPKR
 /egH3BUaQMMGmAIZu1I55pLpv3DaOqsaIQHYuwyRpfOi2Vsu4HHl0OzenfiKggzK
 FgFjbPABsDS42hAojx7ua+QsrjycdMh0nBQkPMT1ZTityKipqgUhbNEjIclf7oau
 V/FEroTF1HhZR6enyVbzY+9woQq4hwL++Cs32QX0xSGKsZAs1sM=
 =yedI
 -----END PGP SIGNATURE-----

Merge tag 'perf_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fix from Borislav Petkov:
 "Prevent cgroup event list corruption when switching events"

* tag 'perf_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix list corruption in perf_cgroup_switch()
2022-02-13 09:25:26 -08:00
Linus Walleij
00ba9357d1 ARM: ixp4xx: Drop custom DMA coherency and bouncing
The new PCI driver does not need any of this stuff, so just
drop it.

Cc: iommu@lists.linux-foundation.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Link: https://lore.kernel.org/r/20220211223238.648934-12-linus.walleij@linaro.org
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2022-02-12 18:20:04 +01:00
Linus Torvalds
eef8cffcab seccomp fixes for v5.17-rc4
- Force HANDLER_EXIT even for SIGNAL_UNKILLABLE.
 - Make seccomp self-destruct after fatal filter results.
 - Update seccomp samples for easier behavioral demonstration.
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmIHIccWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJkuFD/9wY3voM9ryF6WW6XRz0tJ8qG6X
 YUPMp+mzyDOlF6OGcSXQL1P0gV8frfAJx23sBW0dTayWGiwEOSqi83gHVnxU6jpn
 KXfS7JOkcbY3x1A7hL38O4nIh0wkbtuHslkp6iWRrBsE4KsFY1r2pR2hYUEOJsgn
 AALsnnFdQ57JD9tbfxrTzwHMgB9Rx3ROoKoTmoG78GU5vG3RvXYyeGN0JXrX2lM3
 9N6oSYp+j2X6DpKjciQ8dsyBUpPNTfUad5v+MhsYbffwKbk+xPulJnK21KpqrfPY
 rBLvR8DAFBZAGgXuO91m8Pb5PoBOkh6bb0ImbGB1W2u11/5dFriiAEl2YNuZin1z
 d+BuuvPZ2i4g+plr1Inl/xkKPLIlobSBDGO/XGsZ+CQnI4ZzJm0ilJaxTPCwJ7Ky
 4YzyaI66AKYcGUpTg5hNScEoRuC5kXCnIyFS+DLaBT/b1/C17MW7xsGW44UcXpS2
 ROlDKgZ+fXwKlIuz0QCwjMMv5EIe8sjRpoHGd4CctodbRN3bznRNs4IXkeYytOTU
 3vjkaYoyYCoPP370qm0NkDByh0JY5tQwLKlBfb4f5mZP8Gsi7kHQW6mBk0a0xwkB
 i3PtzbRcIcRAGs2nm8qBxsAaEtCePY+V080+bEuV8NwI7Dz5C+xMtKCWa6CxmJtu
 D//y/M2p/TLc42oZAA==
 =zpOl
 -----END PGP SIGNATURE-----

Merge tag 'seccomp-v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull seccomp fixes from Kees Cook:
 "This fixes a corner case of fatal SIGSYS being ignored since v5.15.
  Along with the signal fix is a change to seccomp so that seeing
  another syscall after a fatal filter result will cause seccomp to kill
  the process harder.

  Summary:

   - Force HANDLER_EXIT even for SIGNAL_UNKILLABLE

   - Make seccomp self-destruct after fatal filter results

   - Update seccomp samples for easier behavioral demonstration"

* tag 'seccomp-v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  samples/seccomp: Adjust sample to also provide kill option
  seccomp: Invalidate seccomp mode to catch death failures
  signal: HANDLER_EXIT should clear SIGNAL_UNKILLABLE
2022-02-12 09:04:05 -08:00
Mel Gorman
e496132ebe sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
Commit 7d2b5dd0bc ("sched/numa: Allow a floating imbalance between NUMA
nodes") allowed an imbalance between NUMA nodes such that communicating
tasks would not be pulled apart by the load balancer. This works fine when
there is a 1:1 relationship between LLC and node but can be suboptimal
for multiple LLCs if independent tasks prematurely use CPUs sharing cache.

Zen* has multiple LLCs per node with local memory channels and due to
the allowed imbalance, it's far harder to tune some workloads to run
optimally than it is on hardware that has 1 LLC per node. This patch
allows an imbalance to exist up to the point where LLCs should be balanced
between nodes.

On a Zen3 machine running STREAM parallelised with OMP to have on instance
per LLC the results and without binding, the results are

                            5.17.0-rc0             5.17.0-rc0
                               vanilla       sched-numaimb-v6
MB/sec copy-16    162596.94 (   0.00%)   580559.74 ( 257.05%)
MB/sec scale-16   136901.28 (   0.00%)   374450.52 ( 173.52%)
MB/sec add-16     157300.70 (   0.00%)   564113.76 ( 258.62%)
MB/sec triad-16   151446.88 (   0.00%)   564304.24 ( 272.61%)

STREAM can use directives to force the spread if the OpenMP is new
enough but that doesn't help if an application uses threads and
it's not known in advance how many threads will be created.

Coremark is a CPU and cache intensive benchmark parallelised with
threads. When running with 1 thread per core, the vanilla kernel
allows threads to contend on cache. With the patch;

                               5.17.0-rc0             5.17.0-rc0
                                  vanilla       sched-numaimb-v5
Min       Score-16   368239.36 (   0.00%)   389816.06 (   5.86%)
Hmean     Score-16   388607.33 (   0.00%)   427877.08 *  10.11%*
Max       Score-16   408945.69 (   0.00%)   481022.17 (  17.62%)
Stddev    Score-16    15247.04 (   0.00%)    24966.82 ( -63.75%)
CoeffVar  Score-16        3.92 (   0.00%)        5.82 ( -48.48%)

It can also make a big difference for semi-realistic workloads
like specjbb which can execute arbitrary numbers of threads without
advance knowledge of how they should be placed. Even in cases where
the average performance is neutral, the results are more stable.

                               5.17.0-rc0             5.17.0-rc0
                                  vanilla       sched-numaimb-v6
Hmean     tput-1      71631.55 (   0.00%)    73065.57 (   2.00%)
Hmean     tput-8     582758.78 (   0.00%)   556777.23 (  -4.46%)
Hmean     tput-16   1020372.75 (   0.00%)  1009995.26 (  -1.02%)
Hmean     tput-24   1416430.67 (   0.00%)  1398700.11 (  -1.25%)
Hmean     tput-32   1687702.72 (   0.00%)  1671357.04 (  -0.97%)
Hmean     tput-40   1798094.90 (   0.00%)  2015616.46 *  12.10%*
Hmean     tput-48   1972731.77 (   0.00%)  2333233.72 (  18.27%)
Hmean     tput-56   2386872.38 (   0.00%)  2759483.38 (  15.61%)
Hmean     tput-64   2909475.33 (   0.00%)  2925074.69 (   0.54%)
Hmean     tput-72   2585071.36 (   0.00%)  2962443.97 (  14.60%)
Hmean     tput-80   2994387.24 (   0.00%)  3015980.59 (   0.72%)
Hmean     tput-88   3061408.57 (   0.00%)  3010296.16 (  -1.67%)
Hmean     tput-96   3052394.82 (   0.00%)  2784743.41 (  -8.77%)
Hmean     tput-104  2997814.76 (   0.00%)  2758184.50 (  -7.99%)
Hmean     tput-112  2955353.29 (   0.00%)  2859705.09 (  -3.24%)
Hmean     tput-120  2889770.71 (   0.00%)  2764478.46 (  -4.34%)
Hmean     tput-128  2871713.84 (   0.00%)  2750136.73 (  -4.23%)
Stddev    tput-1       5325.93 (   0.00%)     2002.53 (  62.40%)
Stddev    tput-8       6630.54 (   0.00%)    10905.00 ( -64.47%)
Stddev    tput-16     25608.58 (   0.00%)     6851.16 (  73.25%)
Stddev    tput-24     12117.69 (   0.00%)     4227.79 (  65.11%)
Stddev    tput-32     27577.16 (   0.00%)     8761.05 (  68.23%)
Stddev    tput-40     59505.86 (   0.00%)     2048.49 (  96.56%)
Stddev    tput-48    168330.30 (   0.00%)    93058.08 (  44.72%)
Stddev    tput-56    219540.39 (   0.00%)    30687.02 (  86.02%)
Stddev    tput-64    121750.35 (   0.00%)     9617.36 (  92.10%)
Stddev    tput-72    223387.05 (   0.00%)    34081.13 (  84.74%)
Stddev    tput-80    128198.46 (   0.00%)    22565.19 (  82.40%)
Stddev    tput-88    136665.36 (   0.00%)    27905.97 (  79.58%)
Stddev    tput-96    111925.81 (   0.00%)    99615.79 (  11.00%)
Stddev    tput-104   146455.96 (   0.00%)    28861.98 (  80.29%)
Stddev    tput-112    88740.49 (   0.00%)    58288.23 (  34.32%)
Stddev    tput-120   186384.86 (   0.00%)    45812.03 (  75.42%)
Stddev    tput-128    78761.09 (   0.00%)    57418.48 (  27.10%)

Similarly, for embarassingly parallel problems like NPB-ep, there are
improvements due to better spreading across LLC when the machine is not
fully utilised.

                              vanilla       sched-numaimb-v6
Min       ep.D       31.79 (   0.00%)       26.11 (  17.87%)
Amean     ep.D       31.86 (   0.00%)       26.17 *  17.86%*
Stddev    ep.D        0.07 (   0.00%)        0.05 (  24.41%)
CoeffVar  ep.D        0.22 (   0.00%)        0.20 (   7.97%)
Max       ep.D       31.93 (   0.00%)       26.21 (  17.91%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220208094334.16379-3-mgorman@techsingularity.net
2022-02-11 23:30:08 +01:00
Mel Gorman
2cfb7a1b03 sched/fair: Improve consistency of allowed NUMA balance calculations
There are inconsistencies when determining if a NUMA imbalance is allowed
that should be corrected.

o allow_numa_imbalance changes types and is not always examining
  the destination group so both the type should be corrected as
  well as the naming.
o find_idlest_group uses the sched_domain's weight instead of the
  group weight which is different to find_busiest_group
o find_busiest_group uses the source group instead of the destination
  which is different to task_numa_find_cpu
o Both find_idlest_group and find_busiest_group should account
  for the number of running tasks if a move was allowed to be
  consistent with task_numa_find_cpu

Fixes: 7d2b5dd0bc ("sched/numa: Allow a floating imbalance between NUMA nodes")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Link: https://lore.kernel.org/r/20220208094334.16379-2-mgorman@techsingularity.net
2022-02-11 23:30:08 +01:00
Cheng Jui Wang
28df029d53 lockdep: Correct lock_classes index mapping
A kernel exception was hit when trying to dump /proc/lockdep_chains after
lockdep report "BUG: MAX_LOCKDEP_CHAIN_HLOCKS too low!":

Unable to handle kernel paging request at virtual address 00054005450e05c3
...
00054005450e05c3] address between user and kernel address ranges
...
pc : [0xffffffece769b3a8] string+0x50/0x10c
lr : [0xffffffece769ac88] vsnprintf+0x468/0x69c
...
 Call trace:
  string+0x50/0x10c
  vsnprintf+0x468/0x69c
  seq_printf+0x8c/0xd8
  print_name+0x64/0xf4
  lc_show+0xb8/0x128
  seq_read_iter+0x3cc/0x5fc
  proc_reg_read_iter+0xdc/0x1d4

The cause of the problem is the function lock_chain_get_class() will
shift lock_classes index by 1, but the index don't need to be shifted
anymore since commit 01bb6f0af9 ("locking/lockdep: Change the range
of class_idx in held_lock struct") already change the index to start
from 0.

The lock_classes[-1] located at chain_hlocks array. When printing
lock_classes[-1] after the chain_hlocks entries are modified, the
exception happened.

The output of lockdep_chains are incorrect due to this problem too.

Fixes: f611e8cf98 ("lockdep: Take read/write status in consideration when generate chainkey")
Signed-off-by: Cheng Jui Wang <cheng-jui.wang@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Link: https://lore.kernel.org/r/20220210105011.21712-1-cheng-jui.wang@mediatek.com
2022-02-11 23:30:02 +01:00
Yonghong Song
3bd916ee0e bpf: Emit bpf_timer in vmlinux BTF
Currently the following code in check_and_init_map_value()
  *(struct bpf_timer *)(dst + map->timer_off) =
      (struct bpf_timer){};
can help generate bpf_timer definition in vmlinuxBTF.
But the code above may not zero the whole structure
due to anonymour members and that code will be replaced
by memset in the subsequent patch and
bpf_timer definition will disappear from vmlinuxBTF.
Let us emit the type explicitly so bpf program can continue
to use it from vmlinux.h.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220211194948.3141529-1-yhs@fb.com
2022-02-11 13:21:47 -08:00
Linus Torvalds
883fd0aba1 ACPI fixes for 5.17-rc4
- Revert a recent change that attempted to avoid issues with
    conflicting address ranges during PCI initialization, because it
    turned out to introduce a regression (Hans de Goede).
 
  - Revert a change that limited EC GPE wakeups from suspend-to-idle
    to systems based on Intel hardware, because it turned out that
    systems based on hardware from other vendors depended on that
    functionality too (Mario Limonciello).
 
  - Fix two issues related to the handling of wakeup interrupts and
    wakeup events signaled through the EC GPE during suspend-to-idle
    on x86 (Rafael Wysocki).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmIGkq4SHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxs3MQAKhmnM5mHOqFIM7VM9SoutEoOIXPWLCz
 CNjjqfwA63GvxTp//Dks4KGCLF90mfFOv8Mt7vlzVHyQDvTv2w0YddmBxQAgXTU4
 v1yfHcRssND1w2kgIg7wThWSxSKvP/4XgqwrhTmEKHb1qPFrpq9pt7+4RIco9362
 pacxW1QPX2+5yNDs67JuTsjh2mksKC8CUgiA8BUa+57jrIXvQYWqqZ490PpkbXW0
 Nh1naNwT1xDqte5U98PrzZZRt2qUMuKrG2ro0lE97rb067zSoIQo2XCODlo7T12O
 7vFzypOTvNJZjd57U9SoKyEDCzDnHkNh2O2jiNKaqzMJHqgh0bkMrMHysPbdTwMu
 VH0fw+VElRONyQ5knUcvTG3IRfEuTBl2iqDoO+hLb+cmkL48KL9yJXH7dTRMrNJ8
 zqHSCqpON8rZgLfDzrxoVvMXv9Al5ra5wM41EljiUWUDFEB1HbtleysHIMFMxEKy
 gBvO6zAyEoh5ie2pxYdSAyY+pq+d1JFeAgpbXpF3miQFE2gchC1BjNjUCguekfty
 t0kwOefLOphgShpIm/04F2WBsYNM+QSIoJWRvE5LsEVTdD/j59MweCTJN2KNSIpj
 vapl8ymAr6umMrPhxpjY6K4DD000+LYpV9WJ8Q1Wv2HWV9OkevL5KnEXoQ94czpb
 ybb/Nd4SwHV+
 =aGYp
 -----END PGP SIGNATURE-----

Merge tag 'acpi-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI fixes from Rafael Wysocki:
 "These revert two commits that turned out to be problematic and fix two
  issues related to wakeup from suspend-to-idle on x86.

  Specifics:

   - Revert a recent change that attempted to avoid issues with
     conflicting address ranges during PCI initialization, because it
     turned out to introduce a regression (Hans de Goede).

   - Revert a change that limited EC GPE wakeups from suspend-to-idle to
     systems based on Intel hardware, because it turned out that systems
     based on hardware from other vendors depended on that functionality
     too (Mario Limonciello).

   - Fix two issues related to the handling of wakeup interrupts and
     wakeup events signaled through the EC GPE during suspend-to-idle on
     x86 (Rafael Wysocki)"

* tag 'acpi-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  x86/PCI: revert "Ignore E820 reservations for bridge windows on newer systems"
  PM: s2idle: ACPI: Fix wakeup interrupts handling
  ACPI: PM: s2idle: Cancel wakeup before dispatching EC GPE
  ACPI: PM: Revert "Only mark EC GPE for wakeup on Intel systems"
2022-02-11 11:48:13 -08:00
Linus Torvalds
32f6c5d037 Tracing fixes:
- Fixes to the RTLA tooling.
 
  - A fix to a tp_printk overriding tp_printk_stop_on_boot on command line.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYgWtxxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qsaaAQD+4lcpIRKdkfGb09xMlh8Gr8OvRoVb
 5XAhzHVpETjGUAEAyIJAG+7Epw/St8FCSupNAEWTzGghjhoJhFblTd17jAg=
 =CCON
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fixes to the RTLA tooling

 - A fix to a tp_printk overriding tp_printk_stop_on_boot on the
   command line

* tag 'trace-v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix tp_printk option related with tp_printk_stop_on_boot
  MAINTAINERS: Add RTLA entry
  rtla: Fix segmentation fault when failing to enable -t
  rtla/trace: Error message fixup
  rtla/utils: Fix session duration parsing
  rtla: Follow kernel version
2022-02-11 10:22:48 -08:00
Minchan Kim
c441e934b6 locking: Add missing __sched attributes
This patch adds __sched attributes to a few missing places
to show blocked function rather than locking function
in get_wchan.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220115231657.84828-1-minchan@kernel.org
2022-02-11 12:13:55 +01:00
Waiman Long
ddc204b517
copy_process(): Move fd_install() out of sighand->siglock critical section
I was made aware of the following lockdep splat:

[ 2516.308763] =====================================================
[ 2516.309085] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
[ 2516.309433] 5.14.0-51.el9.aarch64+debug #1 Not tainted
[ 2516.309703] -----------------------------------------------------
[ 2516.310149] stress-ng/153663 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[ 2516.310512] ffff0000e422b198 (&newf->file_lock){+.+.}-{2:2}, at: fd_install+0x368/0x4f0
[ 2516.310944]
               and this task is already holding:
[ 2516.311248] ffff0000c08140d8 (&sighand->siglock){-.-.}-{2:2}, at: copy_process+0x1e2c/0x3e80
[ 2516.311804] which would create a new lock dependency:
[ 2516.312066]  (&sighand->siglock){-.-.}-{2:2} -> (&newf->file_lock){+.+.}-{2:2}
[ 2516.312446]
               but this new dependency connects a HARDIRQ-irq-safe lock:
[ 2516.312983]  (&sighand->siglock){-.-.}-{2:2}
   :
[ 2516.330700]  Possible interrupt unsafe locking scenario:

[ 2516.331075]        CPU0                    CPU1
[ 2516.331328]        ----                    ----
[ 2516.331580]   lock(&newf->file_lock);
[ 2516.331790]                                local_irq_disable();
[ 2516.332231]                                lock(&sighand->siglock);
[ 2516.332579]                                lock(&newf->file_lock);
[ 2516.332922]   <Interrupt>
[ 2516.333069]     lock(&sighand->siglock);
[ 2516.333291]
                *** DEADLOCK ***
[ 2516.389845]
               stack backtrace:
[ 2516.390101] CPU: 3 PID: 153663 Comm: stress-ng Kdump: loaded Not tainted 5.14.0-51.el9.aarch64+debug #1
[ 2516.390756] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[ 2516.391155] Call trace:
[ 2516.391302]  dump_backtrace+0x0/0x3e0
[ 2516.391518]  show_stack+0x24/0x30
[ 2516.391717]  dump_stack_lvl+0x9c/0xd8
[ 2516.391938]  dump_stack+0x1c/0x38
[ 2516.392247]  print_bad_irq_dependency+0x620/0x710
[ 2516.392525]  check_irq_usage+0x4fc/0x86c
[ 2516.392756]  check_prev_add+0x180/0x1d90
[ 2516.392988]  validate_chain+0x8e0/0xee0
[ 2516.393215]  __lock_acquire+0x97c/0x1e40
[ 2516.393449]  lock_acquire.part.0+0x240/0x570
[ 2516.393814]  lock_acquire+0x90/0xb4
[ 2516.394021]  _raw_spin_lock+0xe8/0x154
[ 2516.394244]  fd_install+0x368/0x4f0
[ 2516.394451]  copy_process+0x1f5c/0x3e80
[ 2516.394678]  kernel_clone+0x134/0x660
[ 2516.394895]  __do_sys_clone3+0x130/0x1f4
[ 2516.395128]  __arm64_sys_clone3+0x5c/0x7c
[ 2516.395478]  invoke_syscall.constprop.0+0x78/0x1f0
[ 2516.395762]  el0_svc_common.constprop.0+0x22c/0x2c4
[ 2516.396050]  do_el0_svc+0xb0/0x10c
[ 2516.396252]  el0_svc+0x24/0x34
[ 2516.396436]  el0t_64_sync_handler+0xa4/0x12c
[ 2516.396688]  el0t_64_sync+0x198/0x19c
[ 2517.491197] NET: Registered PF_ATMPVC protocol family
[ 2517.491524] NET: Registered PF_ATMSVC protocol family
[ 2591.991877] sched: RT throttling activated

One way to solve this problem is to move the fd_install() call out of
the sighand->siglock critical section.

Before commit 6fd2fe494b ("copy_process(): don't use ksys_close()
on cleanups"), the pidfd installation was done without holding both
the task_list lock and the sighand->siglock. Obviously, holding these
two locks are not really needed to protect the fd_install() call.
So move the fd_install() call down to after the releases of both locks.

Link: https://lore.kernel.org/r/20220208163912.1084752-1-longman@redhat.com
Fixes: 6fd2fe494b ("copy_process(): don't use ksys_close() on cleanups")
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2022-02-11 09:28:32 +01:00
Beau Belgrave
2467cda1b5 user_events: Validate user payloads for size and null termination
Add validation to ensure data is at or greater than the min size for the
fields of the event. If a dynamic array is used and is a type of char,
ensure null termination of the array exists.

Link: https://lkml.kernel.org/r/20220118204326.2169-7-beaub@linux.microsoft.com

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-02-10 22:37:38 -05:00
Beau Belgrave
0279400ad3 user_events: Optimize writing events by only copying data once
Pass iterator through to probes to allow copying data directly to the
probe buffers instead of taking multiple copies. Enables eBPF user and
raw iterator types out to programs for no-copy scenarios.

Link: https://lkml.kernel.org/r/20220118204326.2169-6-beaub@linux.microsoft.com

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-02-10 22:37:31 -05:00
Beau Belgrave
3207d0459e user_events: Add basic perf and eBPF support
Adds support to write out user_event data to perf_probe/perf files as
well as to any attached eBPF program.

Link: https://lkml.kernel.org/r/20220118204326.2169-5-beaub@linux.microsoft.com

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-02-10 22:37:22 -05:00
Beau Belgrave
9aed4e157d user_events: Handle matching arguments from dyn_events
Ensures that when dynamic events requests a match with arguments that
they match what is in the user_event.

Link: https://lkml.kernel.org/r/20220118204326.2169-4-beaub@linux.microsoft.com

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-02-10 22:37:16 -05:00
Beau Belgrave
aa3b2b4c66 user_events: Add print_fmt generation support for basic types
Addes print_fmt format generation for basic types that are supported for
user processes. Only supports sizes that are the same on 32 and 64 bit.

Link: https://lkml.kernel.org/r/20220118204326.2169-3-beaub@linux.microsoft.com

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-02-10 22:37:08 -05:00
Beau Belgrave
7f5a08c79d user_events: Add minimal support for trace_event into ftrace
Minimal support for interacting with dynamic events, trace_event and
ftrace. Core outline of flow between user process, ioctl and trace_event
APIs.

User mode processes that wish to use trace events to get data into
ftrace, perf, eBPF, etc are limited to uprobes today. The user events
features enables an ABI for user mode processes to create and write to
trace events that are isolated from kernel level trace events. This
enables a faster path for tracing from user mode data as well as opens
managed code to participate in trace events, where stub locations are
dynamic.

User processes often want to trace only when it's useful. To enable this
a set of pages are mapped into the user process space that indicate the
current state of the user events that have been registered. User
processes can check if their event is hooked to a trace/probe, and if it
is, emit the event data out via the write() syscall.

Two new files are introduced into tracefs to accomplish this:
user_events_status - This file is mmap'd into participating user mode
processes to indicate event status.

user_events_data - This file is opened and register/delete ioctl's are
issued to create/open/delete trace events that can be used for tracing.

The typical scenario is on process start to mmap user_events_status. Processes
then register the events they plan to use via the REG ioctl. The ioctl reads
and updates the passed in user_reg struct. The status_index of the struct is
used to know the byte in the status page to check for that event. The
write_index of the struct is used to describe that event when writing out to
the fd that was used for the ioctl call. The data must always include this
index first when writing out data for an event. Data can be written either by
write() or by writev().

For example, in memory:
int index;
char data[];

Psuedo code example of typical usage:
struct user_reg reg;

int page_fd = open("user_events_status", O_RDWR);
char *page_data = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, page_fd, 0);
close(page_fd);

int data_fd = open("user_events_data", O_RDWR);

reg.size = sizeof(reg);
reg.name_args = (__u64)"test";

ioctl(data_fd, DIAG_IOCSREG, &reg);
int status_id = reg.status_index;
int write_id = reg.write_index;

struct iovec io[2];
io[0].iov_base = &write_id;
io[0].iov_len = sizeof(write_id);
io[1].iov_base = payload;
io[1].iov_len = sizeof(payload);

if (page_data[status_id])
	writev(data_fd, io, 2);

User events are also exposed via the dynamic_events tracefs file for
both create and delete. Current status is exposed via the user_events_status
tracefs file.

Simple example to register a user event via dynamic_events:
	echo u:test >> dynamic_events
	cat dynamic_events
	u:test

If an event is hooked to a probe, the probe hooked shows up:
	echo 1 > events/user_events/test/enable
	cat user_events_status
	1:test # Used by ftrace

	Active: 1
	Busy: 1
	Max: 4096

If an event is not hooked to a probe, no probe status shows up:
	echo 0 > events/user_events/test/enable
	cat user_events_status
	1:test

	Active: 1
	Busy: 0
	Max: 4096

Users can describe the trace event format via the following format:
	name[:FLAG1[,FLAG2...] [field1[;field2...]]

Each field has the following format:
	type name

Example for char array with a size of 20 named msg:
	echo 'u:detailed char[20] msg' >> dynamic_events
	cat dynamic_events
	u:detailed char[20] msg

Data offsets are based on the data written out via write() and will be
updated to reflect the correct offset in the trace_event fields. For dynamic
data it is recommended to use the new __rel_loc data type. This type will be
the same as __data_loc, but the offset is relative to this entry. This allows
user_events to not worry about what common fields are being inserted before
the data.

The above format is valid for both the ioctl and the dynamic_events file.

Link: https://lkml.kernel.org/r/20220118204326.2169-2-beaub@linux.microsoft.com

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-02-10 22:36:46 -05:00
Steven Rostedt (Google)
55bc8384d3 tracing: Save both wakee and current on wakeup events
Use the sched_switch function to save both the wakee and the waker comms
in the saved cmdlines list when sched_wakeup is done.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-02-10 22:27:41 -05:00
Tom Zanussi
27c888da98 tracing: Remove size restriction on synthetic event cmd error logging
Currently, synthetic event command error strings are restricted to a
length of MAX_FILTER_STR_VAL (256), which is too short for some
commands already seen in the wild (with cmd strings longer than that
showing up truncated in err_log).

Remove the restriction so that no synthetic event command error string
is ever truncated.

Link: https://lkml.kernel.org/r/0376692396a81d0b795127c66ea92ca5bf60f481.1643399022.git.zanussi@kernel.org

Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-02-10 22:27:40 -05:00
Tom Zanussi
edfeed318d tracing: Remove size restriction on hist trigger cmd error logging
Currently, hist trigger command error strings are restricted to a
length of MAX_FILTER_STR_VAL (256), which is too short for some
commands already seen in the wild (with cmd strings longer than that
showing up truncated in err_log).

Remove the restriction so that no hist trigger command error string is
ever truncated.

Link: https://lkml.kernel.org/r/0f9d46407222eaf6632cd3b417bc50a11f401b71.1643399022.git.zanussi@kernel.org

Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-02-10 22:27:33 -05:00
Tom Zanussi
1581a884b7 tracing: Remove size restriction on tracing_log_err cmd strings
Currently, tracing_log_err.cmd strings are restricted to a length of
MAX_FILTER_STR_VAL (256), which is too short for some commands already
seen in the wild (with cmd strings longer than that showing up
truncated).

Remove the restriction so that no command string is ever truncated.

Link: https://lkml.kernel.org/r/ca965f23256b350ebd94b3dc1a319f28e8267f5f.1643319703.git.zanussi@kernel.org

Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-02-10 22:27:17 -05:00
Kees Cook
495ac3069a seccomp: Invalidate seccomp mode to catch death failures
If seccomp tries to kill a process, it should never see that process
again. To enforce this proactively, switch the mode to something
impossible. If encountered: WARN, reject all syscalls, and attempt to
kill the process again even harder.

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
Fixes: 8112c4f140 ("seccomp: remove 2-phase API")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2022-02-10 19:09:12 -08:00
Kees Cook
5c72263ef2 signal: HANDLER_EXIT should clear SIGNAL_UNKILLABLE
Fatal SIGSYS signals (i.e. seccomp RET_KILL_* syscall filter actions)
were not being delivered to ptraced pid namespace init processes. Make
sure the SIGNAL_UNKILLABLE doesn't get set for these cases.

Reported-by: Robert Święcki <robert@swiecki.net>
Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com>
Fixes: 00b06da29c ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lore.kernel.org/lkml/878rui8u4a.fsf@email.froward.int.ebiederm.org
2022-02-10 19:08:54 -08:00
Song Liu
4cc0991abd bpf: Fix bpf_prog_pack build for ppc64_defconfig
bpf_prog_pack causes build error with powerpc ppc64_defconfig:

kernel/bpf/core.c:830:23: error: variably modified 'bitmap' at file scope
  830 |         unsigned long bitmap[BITS_TO_LONGS(BPF_PROG_CHUNK_COUNT)];
      |                       ^~~~~~

This is because the marco expands as:

unsigned long bitmap[((((((1UL) << (16 + __pte_index_size)) / (1 << 6))) \
     + ((sizeof(long) * 8)) - 1) / ((sizeof(long) * 8)))];

where __pte_index_size is a global variable.

Fix it by turning bitmap into a 0-length array.

Fixes: 57631054fa ("bpf: Introduce bpf_prog_pack allocator")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220211024939.2962537-1-song@kernel.org
2022-02-10 18:59:32 -08:00
Jakub Kicinski
5b91c5cc0e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-10 17:29:56 -08:00
Alexei Starovoitov
cb80ddc671 bpf: Convert bpf_preload.ko to use light skeleton.
The main change is a move of the single line
  #include "iterators.lskel.h"
from iterators/iterators.c to bpf_preload_kern.c.
Which means that generated light skeleton can be used from user space or
user mode driver like iterators.c or from the kernel module or the kernel itself.
The direct use of light skeleton from the kernel module simplifies the code,
since UMD is no longer necessary. The libbpf.a required user space and UMD. The
CO-RE in the kernel and generated "loader bpf program" used by the light
skeleton are capable to perform complex loading operations traditionally
provided by libbpf. In addition UMD approach was launching UMD process
every time bpffs has to be mounted. With light skeleton in the kernel
the bpf_preload kernel module loads bpf iterators once and pins them
multiple times into different bpffs mounts.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220209232001.27490-6-alexei.starovoitov@gmail.com
2022-02-10 23:31:51 +01:00
Alexei Starovoitov
d7beb3d6ab bpf: Update iterators.lskel.h.
Light skeleton and skel_internal.h have changed.
Update iterators.lskel.h.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220209232001.27490-5-alexei.starovoitov@gmail.com
2022-02-10 23:31:51 +01:00
Alexei Starovoitov
b1d18a7574 bpf: Extend sys_bpf commands for bpf_syscall programs.
bpf_sycall programs can be used directly by the kernel modules
to load programs and create maps via kernel skeleton.
. Export bpf_sys_bpf syscall wrapper to be used in kernel skeleton.
. Export bpf_map_get to be used in kernel skeleton.
. Allow prog_run cmd for bpf_syscall programs with recursion check.
. Enable link_create and raw_tp_open cmds.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220209232001.27490-2-alexei.starovoitov@gmail.com
2022-02-10 23:31:51 +01:00
Linus Torvalds
252787201e audit/stable-5.17 PR 20220209
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmIEf2AUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNB3A/7BzHTEAqMeOZx9x8L0f4dFpqsyRba
 1wSyWwH6cPqGZew6IRbrj5hSv8XTAQKtqmhJJNfs4CoKV5pX+XnyKuPDGblBAkFn
 uRgzQb6SQSQU7Bm/qWziwa/PVFmkkwbEG/08O5nKQ3cK4JEXzLPuDtN1T196Nlxf
 IrbC3E18juIoFlzKk2oLwoYWB6JkQ3GmTnIQ7g91/673U8w4mrAH3ZBTFl8vKaqu
 TiWMIujjVrEhgpFmKZTRz+5EeHxuB8XAY6NZH7qI3FcMWhyk/at7orF6JsaKZeyX
 /UW0m2Ztr9d4r4U0H/sMJgLmEbj0JApboIAp/O35KsfQi2yquH71c/fY1qW54WD6
 Gsn503VkkgZ0J+DWHSk1hQxAylpJ8Jv2W7Rukk74uCXQAiPEI4xq00ZyueIEF2fM
 V10oSJtISS9nAwuoZVMQbre+mFsHjuA6qIPutoWJtEsk2f9fnFxKvOtAbhjNLKG0
 4CmPIXXKm3h4umdMkh6wsnqPk0UcP6FSDX/bbea+YGKELKYXbbIDhDUbFnfdViie
 wuKFFMRAH8N6IiPa7jyKhvLYUdp2coCxgsiaN+sZ5Z9+0O/tLu40quH0TCsjvrog
 Ol/en8rIdccgkBCCo/N2y3h4CzYZ0R5ODUTOTfx9MxciBzfUeRhnvN1bNoQxYSjH
 KSw7Ppu8b98Iflk=
 =iaaT
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20220209' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit fix from Paul Moore:
 "Another audit fix, this time a single rather small but important fix
  for an oops/page-fault caused by improperly accessing userspace
  memory"

* tag 'audit-pr-20220209' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: don't deref the syscall args when checking the openat2 open_how::flags
2022-02-10 05:43:43 -08:00
Marc Zyngier
beb0622138 genirq: Kill irq_chip::parent_device
Now that noone is using irq_chip::parent_device in the tree, get
rid of it.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Acked-by: Bartosz Golaszewski <brgl@bgdev.pl>
Link: https://lore.kernel.org/r/20220201120310.878267-13-maz@kernel.org
2022-02-10 11:07:04 +00:00
Jakub Kicinski
1127170d45 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2022-02-09

We've added 126 non-merge commits during the last 16 day(s) which contain
a total of 201 files changed, 4049 insertions(+), 2215 deletions(-).

The main changes are:

1) Add custom BPF allocator for JITs that pack multiple programs into a huge
   page to reduce iTLB pressure, from Song Liu.

2) Add __user tagging support in vmlinux BTF and utilize it from BPF
   verifier when generating loads, from Yonghong Song.

3) Add per-socket fast path check guarding from cgroup/BPF overhead when
   used by only some sockets, from Pavel Begunkov.

4) Continued libbpf deprecation work of APIs/features and removal of their
   usage from samples, selftests, libbpf & bpftool, from Andrii Nakryiko
   and various others.

5) Improve BPF instruction set documentation by adding byte swap
   instructions and cleaning up load/store section, from Christoph Hellwig.

6) Switch BPF preload infra to light skeleton and remove libbpf dependency
   from it, from Alexei Starovoitov.

7) Fix architecture-agnostic macros in libbpf for accessing syscall
   arguments from BPF progs for non-x86 architectures,
   from Ilya Leoshkevich.

8) Rework port members in struct bpf_sk_lookup and struct bpf_sock to be
   of 16-bit field with anonymous zero padding, from Jakub Sitnicki.

9) Add new bpf_copy_from_user_task() helper to read memory from a different
   task than current. Add ability to create sleepable BPF iterator progs,
   from Kenny Yu.

10) Implement XSK batching for ice's zero-copy driver used by AF_XDP and
    utilize TX batching API from XSK buffer pool, from Maciej Fijalkowski.

11) Generate temporary netns names for BPF selftests to avoid naming
    collisions, from Hangbin Liu.

12) Implement bpf_core_types_are_compat() with limited recursion for
    in-kernel usage, from Matteo Croce.

13) Simplify pahole version detection and finally enable CONFIG_DEBUG_INFO_DWARF5
    to be selected with CONFIG_DEBUG_INFO_BTF, from Nathan Chancellor.

14) Misc minor fixes to libbpf and selftests from various folks.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (126 commits)
  selftests/bpf: Cover 4-byte load from remote_port in bpf_sk_lookup
  bpf: Make remote_port field in struct bpf_sk_lookup 16-bit wide
  libbpf: Fix compilation warning due to mismatched printf format
  selftests/bpf: Test BPF_KPROBE_SYSCALL macro
  libbpf: Add BPF_KPROBE_SYSCALL macro
  libbpf: Fix accessing the first syscall argument on s390
  libbpf: Fix accessing the first syscall argument on arm64
  libbpf: Allow overriding PT_REGS_PARM1{_CORE}_SYSCALL
  selftests/bpf: Skip test_bpf_syscall_macro's syscall_arg1 on arm64 and s390
  libbpf: Fix accessing syscall arguments on riscv
  libbpf: Fix riscv register names
  libbpf: Fix accessing syscall arguments on powerpc
  selftests/bpf: Use PT_REGS_SYSCALL_REGS in bpf_syscall_macro
  libbpf: Add PT_REGS_SYSCALL_REGS macro
  selftests/bpf: Fix an endianness issue in bpf_syscall_macro test
  bpf: Fix bpf_prog_pack build HPAGE_PMD_SIZE
  bpf: Fix leftover header->pages in sparc and powerpc code.
  libbpf: Fix signedness bug in btf_dump_array_data()
  selftests/bpf: Do not export subtest as standalone test
  bpf, x86_64: Fail gracefully on bpf_jit_binary_pack_finalize failures
  ...
====================

Link: https://lore.kernel.org/r/20220209210050.8425-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-09 18:40:56 -08:00
Paul Moore
7a82f89de9 audit: don't deref the syscall args when checking the openat2 open_how::flags
As reported by Jeff, dereferencing the openat2 syscall argument in
audit_match_perm() to obtain the open_how::flags can result in an
oops/page-fault.  This patch fixes this by using the open_how struct
that we store in the audit_context with audit_openat2_how().

Independent of this patch, Richard Guy Briggs posted a similar patch
to the audit mailing list roughly 40 minutes after this patch was
posted.

Cc: stable@vger.kernel.org
Fixes: 1c30e3af8a ("audit: add support for the openat2 syscall")
Reported-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2022-02-09 16:04:26 -05:00
Marc Zyngier
1f8863bfb5 genirq: Allow the PM device to originate from irq domain
As a preparation to moving the reference to the device used for
runtime power management, add a new 'dev' field to the irqdomain
structure for that exact purpose.

The irq_chip_pm_{get,put}() helpers are made aware of the dual
location via a new private helper.

No functional change intended.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Tested-by: Tony Lindgren <tony@atomide.com>
Acked-by: Bartosz Golaszewski <brgl@bgdev.pl>
Link: https://lore.kernel.org/r/20220201120310.878267-2-maz@kernel.org
2022-02-09 13:35:56 +00:00
Song Liu
c1b13a9451 bpf: Fix bpf_prog_pack build HPAGE_PMD_SIZE
Fix build with CONFIG_TRANSPARENT_HUGEPAGE=n with BPF_PROG_PACK_SIZE as
PAGE_SIZE.

Fixes: 57631054fa ("bpf: Introduce bpf_prog_pack allocator")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220208220509.4180389-3-song@kernel.org
2022-02-08 18:37:10 -08:00
JaeSang Yoo
3203ce39ac tracing: Fix tp_printk option related with tp_printk_stop_on_boot
The kernel parameter "tp_printk_stop_on_boot" starts with "tp_printk" which is
the same as another kernel parameter "tp_printk". If "tp_printk" setup is
called before the "tp_printk_stop_on_boot", it will override the latter
and keep it from being set.

This is similar to other kernel parameter issues, such as:
  Commit 745a600cf1 ("um: console: Ignore console= option")
or init/do_mounts.c:45 (setup function of "ro" kernel param)

Fix it by checking for a "_" right after the "tp_printk" and if that
exists do not process the parameter.

Link: https://lkml.kernel.org/r/20220208195421.969326-1-jsyoo5b@gmail.com

Signed-off-by: JaeSang Yoo <jsyoo5b@gmail.com>
[ Fixed up change log and added space after if condition ]
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-02-08 15:47:00 -05:00
Paul E. McKenney
00a8b4b54c rcu-tasks: Set ->percpu_enqueue_shift to zero upon contention
Currently, call_rcu_tasks_generic() sets ->percpu_enqueue_shift to
order_base_2(nr_cpu_ids) upon encountering sufficient contention.
This does not shift to use of non-CPU-0 callback queues as intended, but
rather continues using only CPU 0's queue.  Although this does provide
some decrease in contention due to spreading work over multiple locks,
it is not the dramatic decrease that was intended.

This commit therefore makes call_rcu_tasks_generic() set
->percpu_enqueue_shift to 0.

Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-08 10:13:22 -08:00
Paul E. McKenney
2bcd18e041 rcu-tasks: Use order_base_2() instead of ilog2()
The ilog2() function can be used to generate a shift count, but it will
generate the same count for a power of two as for one greater than a power
of two.  This results in shift counts that are larger than necessary for
systems with a power-of-two number of CPUs because the CPUs are numbered
from zero, so that the maximum CPU number is one less than that power
of two.

This commit therefore substitutes order_base_2(), which appears to have
been designed for exactly this use case.

Suggested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-08 10:13:12 -08:00
Paul E. McKenney
5ae0f1b58b rcu: Create and use an rcu_rdp_cpu_online()
The pattern "rdp->grpmask & rcu_rnp_online_cpus(rnp)" occurs frequently
in RCU code in order to determine whether rdp->cpu is online from an
RCU perspective.  This commit therefore creates an rcu_rdp_cpu_online()
function to replace it.

[ paulmck: Apply kernel test robot unused-variable feedback. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-08 10:12:28 -08:00
Paul E. McKenney
80b3fd474c rcu: Make rcu_barrier() no longer block CPU-hotplug operations
This commit removes the cpus_read_lock() and cpus_read_unlock() calls
from rcu_barrier(), thus allowing CPUs to come and go during the course
of rcu_barrier() execution.  Posting of the ->barrier_head callbacks does
synchronize with portions of RCU's CPU-hotplug notifiers, but these locks
are held for short time periods on both sides.  Thus, full CPU-hotplug
operations could both start and finish during the execution of a given
rcu_barrier() invocation.

Additional synchronization is provided by a global ->barrier_lock.
Since the ->barrier_lock is only used during rcu_barrier() execution and
during onlining/offlining a CPU, the contention for this lock should
be low.  It might be tempting to make use of a per-CPU lock just on
general principles, but straightforward attempts to do this have the
problems shown below.

Initial state: 3 CPUs present, CPU 0 and CPU1 do not have
any callback and CPU2 has callbacks.

1. CPU0 calls rcu_barrier().

2. CPU1 starts offlining for CPU2. CPU1 calls
   rcutree_migrate_callbacks(). rcu_barrier_entrain() is called
   from rcutree_migrate_callbacks(), with CPU2's rdp->barrier_lock.
   It does not entrain ->barrier_head for CPU2, as rcu_barrier()
   on CPU0 hasn't started the barrier sequence (by calling
   rcu_seq_start(&rcu_state.barrier_sequence)) yet.

3. CPU0 starts new barrier sequence. It iterates over
   CPU0 and CPU1, after acquiring their per-cpu ->barrier_lock
   and finds 0 segcblist length. It updates ->barrier_seq_snap
   for CPU0 and CPU1 and continues loop iteration to CPU2.

    for_each_possible_cpu(cpu) {
        raw_spin_lock_irqsave(&rdp->barrier_lock, flags);
        if (!rcu_segcblist_n_cbs(&rdp->cblist)) {
            WRITE_ONCE(rdp->barrier_seq_snap, gseq);
            raw_spin_unlock_irqrestore(&rdp->barrier_lock, flags);
            rcu_barrier_trace(TPS("NQ"), cpu, rcu_state.barrier_sequence);
            continue;
        }

4. rcutree_migrate_callbacks() completes execution on CPU1.
   Segcblist len for CPU2 becomes 0.

5. The loop iteration on CPU0, checks rcu_segcblist_n_cbs(&rdp->cblist)
   for CPU2 and completes the loop iteration after setting
   ->barrier_seq_snap.

6. As there isn't any ->barrier_head callback entrained; at
   this point, rcu_barrier() in CPU0 returns.

7. The callbacks, which migrated from CPU2 to CPU1, execute.

Straightforward per-CPU locking is also subject to the following race
condition noted by Boqun Feng:

1. CPU0 calls rcu_barrier(), starting a new barrier sequence by invoking
   rcu_seq_start() and init_completion(), but does not yet initialize
   rcu_state.barrier_cpu_count.

2. CPU1 starts offlining for CPU2, calling rcutree_migrate_callbacks(),
   which in turn calls rcu_barrier_entrain() holding CPU2's.
   rdp->barrier_lock.  It then entrains ->barrier_head for CPU2
   and atomically increments rcu_state.barrier_cpu_count, which is
   unfortunately not yet initialized to the value 2.

3. The just-entrained RCU callback is invoked.  It atomically
   decrements rcu_state.barrier_cpu_count and sees that it is
   now zero.  This callback therefore invokes complete().

4. CPU0 continues executing rcu_barrier(), but is not blocked
   by its call to wait_for_completion().  This results in rcu_barrier()
   returning before all pre-existing callbacks have been invoked,
   which is a bug.

Therefore, synchronization is provided by rcu_state.barrier_lock,
which is also held across the initialization sequence, especially the
rcu_seq_start() and the atomic_set() that sets rcu_state.barrier_cpu_count
to the value 2.  In addition, this lock is held when entraining the
rcu_barrier() callback, when deciding whether or not a CPU has callbacks
that rcu_barrier() must wait on, when setting the ->qsmaskinitnext for
incoming CPUs, and when migrating callbacks from a CPU that is going
offline.

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-08 10:12:28 -08:00
Paul E. McKenney
a16578dd5e rcu: Rework rcu_barrier() and callback-migration logic
This commit reworks rcu_barrier() and callback-migration logic to
permit allowing rcu_barrier() to run concurrently with CPU-hotplug
operations.  The key trick is for callback migration to check to see if
an rcu_barrier() is in flight, and, if so, enqueue the ->barrier_head
callback on its behalf.

This commit adds synchronization with RCU's CPU-hotplug notifiers.  Taken
together, this will permit a later commit to remove the cpus_read_lock()
and cpus_read_unlock() calls from rcu_barrier().

[ paulmck: Updated per kbuild test robot feedback. ]
[ paulmck: Updated per reviews session with Neeraj, Frederic, Uladzislau, and Boqun. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-08 10:12:28 -08:00
Paul E. McKenney
0cabb47af3 rcu: Refactor rcu_barrier() empty-list handling
This commit saves a few lines by checking first for an empty callback
list.  If the callback list is empty, then that CPU is taken care of,
regardless of its online or nocb state.  Also simplify tracing accordingly
and fold a few lines together.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-08 10:12:28 -08:00
David Woodhouse
82980b1622 rcu: Kill rnp->ofl_seq and use only rcu_state.ofl_lock for exclusion
If we allow architectures to bring APs online in parallel, then we end
up requiring rcu_cpu_starting() to be reentrant. But currently, the
manipulation of rnp->ofl_seq is not thread-safe.

However, rnp->ofl_seq is also fairly much pointless anyway since both
rcu_cpu_starting() and rcu_report_dead() hold rcu_state.ofl_lock for
fairly much the whole time that rnp->ofl_seq is set to an odd number
to indicate that an operation is in progress.

So drop rnp->ofl_seq completely, and use only rcu_state.ofl_lock.

This has a couple of minor complexities: lockdep will complain when we
take rcu_state.ofl_lock, and currently accepts the 'excuse' of having
an odd value in rnp->ofl_seq. So switch it to an arch_spinlock_t to
avoid that false positive complaint. Since we're killing rnp->ofl_seq
of course that 'excuse' has to be changed too, so make it check for
arch_spin_is_locked(rcu_state.ofl_lock).

There's no arch_spin_lock_irqsave() so we have to manually save and
restore local interrupts around the locking.

At Paul's request based on Neeraj's analysis, make rcu_gp_init not just
wait but *exclude* any CPU online/offline activity, which was fairly
much true already by virtue of it holding rcu_state.ofl_lock.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-08 10:11:41 -08:00
Song Liu
33c9805860 bpf: Introduce bpf_jit_binary_pack_[alloc|finalize|free]
This is the jit binary allocator built on top of bpf_prog_pack.

bpf_prog_pack allocates RO memory, which cannot be used directly by the
JIT engine. Therefore, a temporary rw buffer is allocated for the JIT
engine. Once JIT is done, bpf_jit_binary_pack_finalize is used to copy
the program to the RO memory.

bpf_jit_binary_pack_alloc reserves 16 bytes of extra space for illegal
instructions, which is small than the 128 bytes space reserved by
bpf_jit_binary_alloc. This change is necessary for bpf_jit_binary_hdr
to find the correct header. Also, flag use_bpf_prog_pack is added to
differentiate a program allocated by bpf_jit_binary_pack_alloc.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220204185742.271030-9-song@kernel.org
2022-02-07 18:13:01 -08:00
Song Liu
57631054fa bpf: Introduce bpf_prog_pack allocator
Most BPF programs are small, but they consume a page each. For systems
with busy traffic and many BPF programs, this could add significant
pressure to instruction TLB. High iTLB pressure usually causes slow down
for the whole system, which includes visible performance degradation for
production workloads.

Introduce bpf_prog_pack allocator to pack multiple BPF programs in a huge
page. The memory is then allocated in 64 byte chunks.

Memory allocated by bpf_prog_pack allocator is RO protected after initial
allocation. To write to it, the user (jit engine) need to use text poke
API.

Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220204185742.271030-8-song@kernel.org
2022-02-07 18:13:01 -08:00
Song Liu
ebc1415d9b bpf: Introduce bpf_arch_text_copy
This will be used to copy JITed text to RO protected module memory. On
x86, bpf_arch_text_copy is implemented with text_poke_copy.

bpf_arch_text_copy returns pointer to dst on success, and ERR_PTR(errno)
on errors.

Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220204185742.271030-7-song@kernel.org
2022-02-07 18:13:01 -08:00
Song Liu
d00c6473b1 bpf: Use prog->jited_len in bpf_prog_ksym_set_addr()
Using prog->jited_len is simpler and more accurate than current
estimation (header + header->size).

Also, fix missing prog->jited_len with multi function program. This hasn't
been a real issue before this.

Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220204185742.271030-5-song@kernel.org
2022-02-07 18:13:01 -08:00
Song Liu
ed2d9e1a26 bpf: Use size instead of pages in bpf_binary_header
This is necessary to charge sub page memory for the BPF program.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220204185742.271030-4-song@kernel.org
2022-02-07 18:13:01 -08:00
Song Liu
3486bedd99 bpf: Use bytes instead of pages for bpf_jit_[charge|uncharge]_modmem
This enables sub-page memory charge and allocation.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220204185742.271030-3-song@kernel.org
2022-02-07 18:13:01 -08:00
Rafael J. Wysocki
cb1f65c1e1 PM: s2idle: ACPI: Fix wakeup interrupts handling
After commit e3728b50cd ("ACPI: PM: s2idle: Avoid possible race
related to the EC GPE") wakeup interrupts occurring immediately after
the one discarded by acpi_s2idle_wake() may be missed.  Moreover, if
the SCI triggers again immediately after the rearming in
acpi_s2idle_wake(), that wakeup may be missed too.

The problem is that pm_system_irq_wakeup() only calls pm_system_wakeup()
when pm_wakeup_irq is 0, but that's not the case any more after the
interrupt causing acpi_s2idle_wake() to run until pm_wakeup_irq is
cleared by the pm_wakeup_clear() call in s2idle_loop().  However,
there may be wakeup interrupts occurring in that time frame and if
that happens, they will be missed.

To address that issue first move the clearing of pm_wakeup_irq to
the point at which it is known that the interrupt causing
acpi_s2idle_wake() to tun will be discarded, before rearming the SCI
for wakeup.  Moreover, because that only reduces the size of the
time window in which the issue may manifest itself, allow
pm_system_irq_wakeup() to register two second wakeup interrupts in
a row and, when discarding the first one, replace it with the second
one.  [Of course, this assumes that only one wakeup interrupt can be
discarded in one go, but currently that is the case and I am not
aware of any plans to change that.]

Fixes: e3728b50cd ("ACPI: PM: s2idle: Avoid possible race related to the EC GPE")
Cc: 5.4+ <stable@vger.kernel.org> # 5.4+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-02-07 21:02:31 +01:00
Christophe Leroy
2f293651ec livepatch: Fix build failure on 32 bits processors
Trying to build livepatch on powerpc/32 results in:

	kernel/livepatch/core.c: In function 'klp_resolve_symbols':
	kernel/livepatch/core.c:221:23: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
	  221 |                 sym = (Elf64_Sym *)sechdrs[symndx].sh_addr + ELF_R_SYM(relas[i].r_info);
	      |                       ^
	kernel/livepatch/core.c:221:21: error: assignment to 'Elf32_Sym *' {aka 'struct elf32_sym *'} from incompatible pointer type 'Elf64_Sym *' {aka 'struct elf64_sym *'} [-Werror=incompatible-pointer-types]
	  221 |                 sym = (Elf64_Sym *)sechdrs[symndx].sh_addr + ELF_R_SYM(relas[i].r_info);
	      |                     ^
	kernel/livepatch/core.c: In function 'klp_apply_section_relocs':
	kernel/livepatch/core.c:312:35: error: passing argument 1 of 'klp_resolve_symbols' from incompatible pointer type [-Werror=incompatible-pointer-types]
	  312 |         ret = klp_resolve_symbols(sechdrs, strtab, symndx, sec, sec_objname);
	      |                                   ^~~~~~~
	      |                                   |
	      |                                   Elf32_Shdr * {aka struct elf32_shdr *}
	kernel/livepatch/core.c:193:44: note: expected 'Elf64_Shdr *' {aka 'struct elf64_shdr *'} but argument is of type 'Elf32_Shdr *' {aka 'struct elf32_shdr *'}
	  193 | static int klp_resolve_symbols(Elf64_Shdr *sechdrs, const char *strtab,
	      |                                ~~~~~~~~~~~~^~~~~~~

Fix it by using the right types instead of forcing 64 bits types.

Fixes: 7c8e2bdd5f ("livepatch: Apply vmlinux-specific KLP relocations early")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Petr Mladek <pmladek@suse.com>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/5288e11b018a762ea3351cc8fb2d4f15093a4457.1640017960.git.christophe.leroy@csgroup.eu
2022-02-07 21:03:10 +11:00
Song Liu
5f4e5ce638 perf: Fix list corruption in perf_cgroup_switch()
There's list corruption on cgrp_cpuctx_list. This happens on the
following path:

  perf_cgroup_switch: list_for_each_entry(cgrp_cpuctx_list)
      cpu_ctx_sched_in
         ctx_sched_in
            ctx_pinned_sched_in
              merge_sched_in
                  perf_cgroup_event_disable: remove the event from the list

Use list_for_each_entry_safe() to allow removing an entry during
iteration.

Fixes: 058fe1c044 ("perf/core: Make cgroup switch visit only cpuctxs with cgroup events")
Signed-off-by: Song Liu <song@kernel.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220204004057.2961252-1-song@kernel.org
2022-02-06 22:37:27 +01:00
Tadeusz Struk
13765de814 sched/fair: Fix fault in reweight_entity
Syzbot found a GPF in reweight_entity. This has been bisected to
commit 4ef0c5c6b5 ("kernel/sched: Fix sched_fork() access an invalid
sched_task_group")

There is a race between sched_post_fork() and setpriority(PRIO_PGRP)
within a thread group that causes a null-ptr-deref in
reweight_entity() in CFS. The scenario is that the main process spawns
number of new threads, which then call setpriority(PRIO_PGRP, 0, -20),
wait, and exit.  For each of the new threads the copy_process() gets
invoked, which adds the new task_struct and calls sched_post_fork()
for it.

In the above scenario there is a possibility that
setpriority(PRIO_PGRP) and set_one_prio() will be called for a thread
in the group that is just being created by copy_process(), and for
which the sched_post_fork() has not been executed yet. This will
trigger a null pointer dereference in reweight_entity(), as it will
try to access the run queue pointer, which hasn't been set.

Before the mentioned change the cfs_rq pointer for the task  has been
set in sched_fork(), which is called much earlier in copy_process(),
before the new task is added to the thread_group.  Now it is done in
the sched_post_fork(), which is called after that.  To fix the issue
the remove the update_load param from the update_load param() function
and call reweight_task() only if the task flag doesn't have the
TASK_NEW flag set.

Fixes: 4ef0c5c6b5 ("kernel/sched: Fix sched_fork() access an invalid sched_task_group")
Reported-by: syzbot+af7a719bc92395ee41b3@syzkaller.appspotmail.com
Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20220203161846.1160750-1-tadeusz.struk@linaro.org
2022-02-06 22:37:26 +01:00
Linus Torvalds
c3bf8a1440 perf/urgent contains 3 fixups:
- Intel/PT: filters could crash the kernel
 
  - Intel: default disable the PMU for SMM, some new-ish EFI firmware has
    started using CPL3 and the PMU CPL filters don't discriminate against
    SMM, meaning that CPL3 (userspace only) events now also count EFI/SMM
    cycles.
 
  - Fixup for perf_event_attr::sig_data
 
 (Peter Zijlstra)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmH/vpgACgkQEsHwGGHe
 VUqIGw/9EWg7Ek89BG9ZZui8EEDAzx3x0s/tyxiz0z18YfvtTnex5I87uJUYpw2s
 hFhxxmGN+rwhcMGQDc0sDLLLxp170Yg0383N6OBBBMWPtNyxMWihBOHQgz8hQzbW
 KtwoiBewmvAycHw0aoOtDMqFZTn5RToONnG9h7yV9rUIGKq75XNh72MBy9sCSE2F
 w8lA3WWVTrv91YTPSMbsrm/tMC6eQCRiJGRMHTapxrWxkVu/H8O42pxJgS6dlo+h
 vw025hXcf0KGBLzwVSHYdZg8jMn7uD2oSMh+wQ+Jy15XjKVWDfF1m3sA5S+zSJsS
 THHtmqni5mF5xn0H7eOK9nYmRXR013zx6weo9miK4SN1pcoJq+PTNdSZOIwBm3Nh
 eUXR/bXFYL0GGuPOk0QHA9AjqbCBPrkiw1nfppbJem2rrZ0uKHyKa8REVcVg/Xzy
 e/nDy8I2y2bnwU9Ugk9BNWBRmn54Q2kb4/egmtLME6oYiqOXumQ4ZB/CmwRaSwxG
 bB99/tBKblrWSA6wcgATkqYFSg4ZJniDxKipnrEYX8ePkGODKHoIQS4EUyjxuPW/
 fO2G4Oe8aO/qYS/yei8XcubyEFaSPyUo+th+ZiPODCt15JKzQCAxeOYxqnEI4I4s
 5afDBmAo47bs9Eem7GRjZOgrDOP88+lISZ1rZidp5paDwWAmL2E=
 =0tH5
 -----END PGP SIGNATURE-----

Merge tag 'perf_urgent_for_v5.17_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Borislav Petkov:

 - Intel/PT: filters could crash the kernel

 - Intel: default disable the PMU for SMM, some new-ish EFI firmware has
   started using CPL3 and the PMU CPL filters don't discriminate against
   SMM, meaning that CPL3 (userspace only) events now also count EFI/SMM
   cycles.

 - Fixup for perf_event_attr::sig_data

* tag 'perf_urgent_for_v5.17_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/pt: Fix crash with stop filters in single-range mode
  perf: uapi: Document perf_event_attr::sig_data truncation on 32 bit architectures
  selftests/perf_events: Test modification of perf_event_attr::sig_data
  perf: Copy perf_event_attr::sig_data on modification
  x86/perf: Default set FREEZE_ON_SMI for all
2022-02-06 10:11:14 -08:00
Matteo Croce
e70e13e7d4 bpf: Implement bpf_core_types_are_compat().
Adopt libbpf's bpf_core_types_are_compat() for kernel duty by adding
explicit recursion limit of 2 which is enough to handle 2 levels of
function prototypes.

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220204005519.60361-2-mcroce@linux.microsoft.com
2022-02-04 11:26:26 -08:00
Kevin Hao
53725c4cbd cpufreq: schedutil: Use to_gov_attr_set() to get the gov_attr_set
The to_gov_attr_set() has been moved to the cpufreq.h, so use it to get
the gov_attr_set.

Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-02-04 19:22:34 +01:00
Jakub Kicinski
c59400a68c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-03 17:36:16 -08:00
Kees Cook
dcb85f85fa gcc-plugins/stackleak: Use noinstr in favor of notrace
While the stackleak plugin was already using notrace, objtool is now a
bit more picky.  Update the notrace uses to noinstr.  Silences the
following objtool warnings when building with:

CONFIG_DEBUG_ENTRY=y
CONFIG_STACK_VALIDATION=y
CONFIG_VMLINUX_VALIDATION=y
CONFIG_GCC_PLUGIN_STACKLEAK=y

  vmlinux.o: warning: objtool: do_syscall_64()+0x9: call to stackleak_track_stack() leaves .noinstr.text section
  vmlinux.o: warning: objtool: do_int80_syscall_32()+0x9: call to stackleak_track_stack() leaves .noinstr.text section
  vmlinux.o: warning: objtool: exc_general_protection()+0x22: call to stackleak_track_stack() leaves .noinstr.text section
  vmlinux.o: warning: objtool: fixup_bad_iret()+0x20: call to stackleak_track_stack() leaves .noinstr.text section
  vmlinux.o: warning: objtool: do_machine_check()+0x27: call to stackleak_track_stack() leaves .noinstr.text section
  vmlinux.o: warning: objtool: .text+0x5346e: call to stackleak_erase() leaves .noinstr.text section
  vmlinux.o: warning: objtool: .entry.text+0x143: call to stackleak_erase() leaves .noinstr.text section
  vmlinux.o: warning: objtool: .entry.text+0x10eb: call to stackleak_erase() leaves .noinstr.text section
  vmlinux.o: warning: objtool: .entry.text+0x17f9: call to stackleak_erase() leaves .noinstr.text section

Note that the plugin's addition of calls to stackleak_track_stack() from
noinstr functions is expected to be safe, as it isn't runtime
instrumentation and is self-contained.

Cc: Alexander Popov <alex.popov@linux.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-02-03 17:02:21 -08:00
Linus Torvalds
eb2eb5161c Networking fixes for 5.17-rc3, including fixes from bpf, netfilter,
and ieee802154.
 
 Current release - regressions:
 
  - Partially revert "net/smc: Add netlink net namespace support",
    fix uABI breakage
 
  - netfilter:
      - nft_ct: fix use after free when attaching zone template
      - nft_byteorder: track register operations
 
 Previous releases - regressions:
 
  - ipheth: fix EOVERFLOW in ipheth_rcvbulk_callback
 
  - phy: qca8081: fix speeds lower than 2.5Gb/s
 
  - sched: fix use-after-free in tc_new_tfilter()
 
 Previous releases - always broken:
 
  - tcp: fix mem under-charging with zerocopy sendmsg()
 
  - tcp: add missing tcp_skb_can_collapse() test in tcp_shift_skb_data()
 
  - neigh: do not trigger immediate probes on NUD_FAILED from
    neigh_managed_work, avoid a deadlock
 
  - bpf: use VM_MAP instead of VM_ALLOC for ringbuf, avoid KASAN
    false-positives
 
  - netfilter: nft_reject_bridge: fix for missing reply from prerouting
 
  - smc: forward wakeup to smc socket waitqueue after fallback
 
  - ieee802154:
      - return meaningful error codes from the netlink helpers
      - mcr20a: fix lifs/sifs periods
      - at86rf230, ca8210: stop leaking skbs on error paths
 
  - macsec: add missing un-offload call for NETDEV_UNREGISTER of parent
 
  - ax25: add refcount in ax25_dev to avoid UAF bugs
 
  - eth: mlx5e:
      - fix SFP module EEPROM query
      - fix broken SKB allocation in HW-GRO
      - IPsec offload: fix tunnel mode crypto for non-TCP/UDP flows
 
  - eth: amd-xgbe:
      - fix skb data length underflow
      - ensure reset of the tx_timer_active flag, avoid Tx timeouts
 
  - eth: stmmac: fix runtime pm use in stmmac_dvr_remove()
 
  - eth: e1000e: handshake with CSME starts from Alder Lake platforms
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmH8X9UACgkQMUZtbf5S
 IrsxuhAAlAvFHGL6y5Y2gAmhKvVUvCYjiIJBcvk7R66CwYVRxofvlhmxi6GM/Czs
 9SrVSaN4RXu3p3d7UtAl1gAQwHqzLIHH3m2g5dSKVvHZWQgkm/+n74x0aZQ9Fll7
 mWs9uu5fWsQr/qZBnnjoQTvUxRUNVd4trBy7nXGzkNqJL5j0+2TT4BhH4qalhE28
 iPc9YFCyKPdjoWFksteZqD3hAQbXxK/xRRr6xuvFHENlZdEHM6ARftHnJthTG/fY
 32rdn9YUkQ9lNtOBJNMN9yP2z1B7TcxASBqjjk55I7XtT1QAI9/PskszavHC0hOk
 BCSMX779bLNW4+G0wiSKVB4tq4tvswtawq8Hxa6zdU4TKIzfQ84ZL/Nf66GtH+4W
 C0mbZohmyJV9hQFkNT0ZLeihljd7i4BkDttlbK3uz2IL9tHeX3uSo5V7AgS/Xaf6
 frXgbGgjQTaR6IL9AUhfN3GTCx60mzpH/aRpFho8A5xAl3EtHWCJcRhbY/CEhQBR
 zyCndcLcG5mUzbhx/TxlKrrpRCLxqCUG/Tsb2wCh5jMxO1zonW9Hhv4P1ie6EFuI
 h+XiJT2WWObS/KTze9S86WOR0zcqrtRqaOGJlNB+/+K8ClZU8UsDTFXLQ0dqpVZF
 Mvp7VchBzyFFJrrvO8WkkJgLTKdaPJmM9wuWUZb4J6d2MWlmDkE=
 =qKvf
 -----END PGP SIGNATURE-----

Merge tag 'net-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf, netfilter, and ieee802154.

  Current release - regressions:

   - Partially revert "net/smc: Add netlink net namespace support", fix
     uABI breakage

   - netfilter:
      - nft_ct: fix use after free when attaching zone template
      - nft_byteorder: track register operations

  Previous releases - regressions:

   - ipheth: fix EOVERFLOW in ipheth_rcvbulk_callback

   - phy: qca8081: fix speeds lower than 2.5Gb/s

   - sched: fix use-after-free in tc_new_tfilter()

  Previous releases - always broken:

   - tcp: fix mem under-charging with zerocopy sendmsg()

   - tcp: add missing tcp_skb_can_collapse() test in
     tcp_shift_skb_data()

   - neigh: do not trigger immediate probes on NUD_FAILED from
     neigh_managed_work, avoid a deadlock

   - bpf: use VM_MAP instead of VM_ALLOC for ringbuf, avoid KASAN
     false-positives

   - netfilter: nft_reject_bridge: fix for missing reply from prerouting

   - smc: forward wakeup to smc socket waitqueue after fallback

   - ieee802154:
      - return meaningful error codes from the netlink helpers
      - mcr20a: fix lifs/sifs periods
      - at86rf230, ca8210: stop leaking skbs on error paths

   - macsec: add missing un-offload call for NETDEV_UNREGISTER of parent

   - ax25: add refcount in ax25_dev to avoid UAF bugs

   - eth: mlx5e:
      - fix SFP module EEPROM query
      - fix broken SKB allocation in HW-GRO
      - IPsec offload: fix tunnel mode crypto for non-TCP/UDP flows

   - eth: amd-xgbe:
      - fix skb data length underflow
      - ensure reset of the tx_timer_active flag, avoid Tx timeouts

   - eth: stmmac: fix runtime pm use in stmmac_dvr_remove()

   - eth: e1000e: handshake with CSME starts from Alder Lake platforms"

* tag 'net-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (69 commits)
  ax25: fix reference count leaks of ax25_dev
  net: stmmac: ensure PTP time register reads are consistent
  net: ipa: request IPA register values be retained
  dt-bindings: net: qcom,ipa: add optional qcom,qmp property
  tools/resolve_btfids: Do not print any commands when building silently
  bpf: Use VM_MAP instead of VM_ALLOC for ringbuf
  net, neigh: Do not trigger immediate probes on NUD_FAILED from neigh_managed_work
  tcp: add missing tcp_skb_can_collapse() test in tcp_shift_skb_data()
  net: sparx5: do not refer to skb after passing it on
  Partially revert "net/smc: Add netlink net namespace support"
  net/mlx5e: Avoid field-overflowing memcpy()
  net/mlx5e: Use struct_group() for memcpy() region
  net/mlx5e: Avoid implicit modify hdr for decap drop rule
  net/mlx5e: IPsec: Fix tunnel mode crypto offload for non TCP/UDP traffic
  net/mlx5e: IPsec: Fix crypto offload for non TCP/UDP encapsulated traffic
  net/mlx5e: Don't treat small ceil values as unlimited in HTB offload
  net/mlx5: E-Switch, Fix uninitialized variable modact
  net/mlx5e: Fix handling of wrong devices during bond netevent
  net/mlx5e: Fix broken SKB allocation in HW-GRO
  net/mlx5e: Fix wrong calculation of header index in HW_GRO
  ...
2022-02-03 16:54:18 -08:00
Jakub Kicinski
77b1b8b43e Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2022-02-03

We've added 6 non-merge commits during the last 10 day(s) which contain
a total of 7 files changed, 11 insertions(+), 236 deletions(-).

The main changes are:

1) Fix BPF ringbuf to allocate its area with VM_MAP instead of VM_ALLOC
   flag which otherwise trips over KASAN, from Hou Tao.

2) Fix unresolved symbol warning in resolve_btfids due to LSM callback
   rename, from Alexei Starovoitov.

3) Fix a possible race in inc_misses_counter() when IRQ would trigger
   during counter update, from He Fengqing.

4) Fix tooling infra for cross-building with clang upon probing whether
   gcc provides the standard libraries, from Jean-Philippe Brucker.

5) Fix silent mode build for resolve_btfids, from Nathan Chancellor.

6) Drop unneeded and outdated lirc.h header copy from tooling infra as
   BPF does not require it anymore, from Sean Young.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  tools/resolve_btfids: Do not print any commands when building silently
  bpf: Use VM_MAP instead of VM_ALLOC for ringbuf
  tools: Ignore errors from `which' when searching a GCC toolchain
  tools headers UAPI: remove stale lirc.h
  bpf: Fix possible race in inc_misses_counter
  bpf: Fix renaming task_getsecid_subj->current_getsecid_subj.
====================

Link: https://lore.kernel.org/r/20220203155815.25689-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-03 13:42:38 -08:00
Yonghong Song
d7e7b42f4f bpf: Fix a btf decl_tag bug when tagging a function
syzbot reported a btf decl_tag bug with stack trace below:

  general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN
  KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
  CPU: 0 PID: 3592 Comm: syz-executor914 Not tainted 5.16.0-syzkaller-11424-gb7892f7d5cb2 #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  RIP: 0010:btf_type_vlen include/linux/btf.h:231 [inline]
  RIP: 0010:btf_decl_tag_resolve+0x83e/0xaa0 kernel/bpf/btf.c:3910
  ...
  Call Trace:
   <TASK>
   btf_resolve+0x251/0x1020 kernel/bpf/btf.c:4198
   btf_check_all_types kernel/bpf/btf.c:4239 [inline]
   btf_parse_type_sec kernel/bpf/btf.c:4280 [inline]
   btf_parse kernel/bpf/btf.c:4513 [inline]
   btf_new_fd+0x19fe/0x2370 kernel/bpf/btf.c:6047
   bpf_btf_load kernel/bpf/syscall.c:4039 [inline]
   __sys_bpf+0x1cbb/0x5970 kernel/bpf/syscall.c:4679
   __do_sys_bpf kernel/bpf/syscall.c:4738 [inline]
   __se_sys_bpf kernel/bpf/syscall.c:4736 [inline]
   __x64_sys_bpf+0x75/0xb0 kernel/bpf/syscall.c:4736
   do_syscall_x64 arch/x86/entry/common.c:50 [inline]
   do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
   entry_SYSCALL_64_after_hwframe+0x44/0xae

The kasan error is triggered with an illegal BTF like below:
   type 0: void
   type 1: int
   type 2: decl_tag to func type 3
   type 3: func to func_proto type 8
The total number of types is 4 and the type 3 is illegal
since its func_proto type is out of range.

Currently, the target type of decl_tag can be struct/union, var or func.
Both struct/union and var implemented their own 'resolve' callback functions
and hence handled properly in kernel.
But func type doesn't have 'resolve' callback function. When
btf_decl_tag_resolve() tries to check func type, it tries to get
vlen of its func_proto type, which triggered the above kasan error.

To fix the issue, btf_decl_tag_resolve() needs to do btf_func_check()
before trying to accessing func_proto type.
In the current implementation, func type is checked with
btf_func_check() in the main checking function btf_check_all_types().
To fix the above kasan issue, let us implement 'resolve' callback
func type properly. The 'resolve' callback will be also called
in btf_check_all_types() for func types.

Fixes: b5ea834dde ("bpf: Support for new btf kind BTF_KIND_TAG")
Reported-by: syzbot+53619be9444215e785ed@syzkaller.appspotmail.com
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20220203191727.741862-1-yhs@fb.com
2022-02-03 13:06:04 -08:00
Mickaël Salaün
1f2cfdd349 printk: Fix incorrect __user type in proc_dointvec_minmax_sysadmin()
The move of proc_dointvec_minmax_sysadmin() from kernel/sysctl.c to
kernel/printk/sysctl.c introduced an incorrect __user attribute to the
buffer argument.  I spotted this change in [1] as well as the kernel
test robot.  Revert this change to please sparse:

  kernel/printk/sysctl.c:20:51: warning: incorrect type in argument 3 (different address spaces)
  kernel/printk/sysctl.c:20:51:    expected void *
  kernel/printk/sysctl.c:20:51:    got void [noderef] __user *buffer

Fixes: faaa357a55 ("printk: move printk sysctl to printk/sysctl.c")
Link: https://lore.kernel.org/r/20220104155024.48023-2-mic@digikod.net [1]
Reported-by: kernel test robot <lkp@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Mickaël Salaün <mic@linux.microsoft.com>
Link: https://lore.kernel.org/r/20220203145029.272640-1-mic@digikod.net
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-02-03 11:27:38 -08:00
Igor Pylypiv
67d6212afd Revert "module, async: async_synchronize_full() on module init iff async is used"
This reverts commit 774a1221e8.

We need to finish all async code before the module init sequence is
done.  In the reverted commit the PF_USED_ASYNC flag was added to mark a
thread that called async_schedule().  Then the PF_USED_ASYNC flag was
used to determine whether or not async_synchronize_full() needs to be
invoked.  This works when modprobe thread is calling async_schedule(),
but it does not work if module dispatches init code to a worker thread
which then calls async_schedule().

For example, PCI driver probing is invoked from a worker thread based on
a node where device is attached:

	if (cpu < nr_cpu_ids)
		error = work_on_cpu(cpu, local_pci_probe, &ddi);
	else
		error = local_pci_probe(&ddi);

We end up in a situation where a worker thread gets the PF_USED_ASYNC
flag set instead of the modprobe thread.  As a result,
async_synchronize_full() is not invoked and modprobe completes without
waiting for the async code to finish.

The issue was discovered while loading the pm80xx driver:
(scsi_mod.scan=async)

modprobe pm80xx                      worker
...
  do_init_module()
  ...
    pci_call_probe()
      work_on_cpu(local_pci_probe)
                                     local_pci_probe()
                                       pm8001_pci_probe()
                                         scsi_scan_host()
                                           async_schedule()
                                           worker->flags |= PF_USED_ASYNC;
                                     ...
      < return from worker >
  ...
  if (current->flags & PF_USED_ASYNC) <--- false
  	async_synchronize_full();

Commit 21c3c5d280 ("block: don't request module during elevator init")
fixed the deadlock issue which the reverted commit 774a1221e8
("module, async: async_synchronize_full() on module init iff async is
used") tried to fix.

Since commit 0fdff3ec6d ("async, kmod: warn on synchronous
request_module() from async workers") synchronous module loading from
async is not allowed.

Given that the original deadlock issue is fixed and it is no longer
allowed to call synchronous request_module() from async we can remove
PF_USED_ASYNC flag to make module init consistently invoke
async_synchronize_full() unless async module probe is requested.

Signed-off-by: Igor Pylypiv <ipylypiv@google.com>
Reviewed-by: Changyuan Lyu <changyuanl@google.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-02-03 11:20:34 -08:00
Linus Torvalds
305e6c42e8 Merge branch 'for-5.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:

 - Eric's fix for a long standing cgroup1 permission issue where it only
   checks for uid 0 instead of CAP which inadvertently allows
   unprivileged userns roots to modify release_agent userhelper

 - Fixes for the fallout from Waiman's recent cpuset work

* 'for-5.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/cpuset: Fix "suspicious RCU usage" lockdep warning
  cgroup-v1: Require capabilities to set release_agent
  cpuset: Fix the bug that subpart_cpus updated wrongly in update_cpumask()
  cgroup/cpuset: Make child cpusets restrict parents on v1 hierarchy
2022-02-03 08:15:13 -08:00
Waiman Long
2bdfd2825c cgroup/cpuset: Fix "suspicious RCU usage" lockdep warning
It was found that a "suspicious RCU usage" lockdep warning was issued
with the rcu_read_lock() call in update_sibling_cpumasks().  It is
because the update_cpumasks_hier() function may sleep. So we have
to release the RCU lock, call update_cpumasks_hier() and reacquire
it afterward.

Also add a percpu_rwsem_assert_held() in update_sibling_cpumasks()
instead of stating that in the comment.

Fixes: 4716909cc5 ("cpuset: Track cpusets that use parent's effective_cpus")
Signed-off-by: Waiman Long <longman@redhat.com>
Tested-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-02-03 05:59:01 -10:00
Hou Tao
b293dcc473 bpf: Use VM_MAP instead of VM_ALLOC for ringbuf
After commit 2fd3fb0be1d1 ("kasan, vmalloc: unpoison VM_ALLOC pages
after mapping"), non-VM_ALLOC mappings will be marked as accessible
in __get_vm_area_node() when KASAN is enabled. But now the flag for
ringbuf area is VM_ALLOC, so KASAN will complain out-of-bound access
after vmap() returns. Because the ringbuf area is created by mapping
allocated pages, so use VM_MAP instead.

After the change, info in /proc/vmallocinfo also changes from
  [start]-[end]   24576 ringbuf_map_alloc+0x171/0x290 vmalloc user
to
  [start]-[end]   24576 ringbuf_map_alloc+0x171/0x290 vmap user

Fixes: 457f44363a ("bpf: Implement BPF ring buffer and verifier support for it")
Reported-by: syzbot+5ad567a418794b9b5983@syzkaller.appspotmail.com
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220202060158.6260-1-houtao1@huawei.com
2022-02-02 23:15:24 -08:00
Changbin Du
fe13889c39 genirq, softirq: Use in_hardirq() instead of in_irq()
Replace the obsolete and ambiguos macro in_irq() with the new macro
in_hardirq().

Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220128110727.5110-1-changbin.du@gmail.com
2022-02-02 21:34:19 +01:00
Christoph Hellwig
aa8dcccaf3 block: check that there is a plug in blk_flush_plug
Rename blk_flush_plug to __blk_flush_plug and add a wrapper that includes
the NULL check instead of open coding that check everywhere.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220127070549.1377856-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-02 07:50:00 -07:00
Christoph Hellwig
b1f866b013 block: remove blk_needs_flush_plug
blk_needs_flush_plug fails to account for the cb_list, which needs
flushing as well.  Remove it and just check if there is a plug instead
of poking into the internals of the plug structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220127070549.1377856-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-02 07:50:00 -07:00
Christoph Hellwig
07888c665b block: pass a block_device and opf to bio_alloc
Pass the block_device and operation that we plan to use this bio for to
bio_alloc to optimize the assignment.  NULL/0 can be passed, both for the
passthrough case on a raw request_queue and to temporarily avoid
refactoring some nasty code.

Also move the gfp_mask argument after the nr_vecs argument for a much
more logical calling convention matching what most of the kernel does.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-02 07:49:59 -07:00
Christoph Hellwig
322cbb50de block: remove genhd.h
There is no good reason to keep genhd.h separate from the main blkdev.h
header that includes it.  So fold the contents of genhd.h into blkdev.h
and remove genhd.h entirely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20220124093913.742411-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-02 07:49:59 -07:00
Adrian Hunter
58b2ff2c18 perf/core: Allow kernel address filter when not filtering the kernel
The so-called 'kernel' address filter can also be useful for filtering
fixed addresses in user space.  Allow that.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220131072453.2839535-6-adrian.hunter@intel.com
2022-02-02 13:11:43 +01:00
Adrian Hunter
d680ff24e9 perf/core: Fix address filter parser for multiple filters
Reset appropriate variables in the parser loop between parsing separate
filters, so that they do not interfere with parsing the next filter.

Fixes: 375637bc52 ("perf/core: Introduce address range filtering")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220131072453.2839535-4-adrian.hunter@intel.com
2022-02-02 13:11:42 +01:00
Marco Elver
3c25fc97f5 perf: Copy perf_event_attr::sig_data on modification
The intent has always been that perf_event_attr::sig_data should also be
modifiable along with PERF_EVENT_IOC_MODIFY_ATTRIBUTES, because it is
observable by user space if SIGTRAP on events is requested.

Currently only PERF_TYPE_BREAKPOINT is modifiable, and explicitly copies
relevant breakpoint-related attributes in hw_breakpoint_copy_attr().
This misses copying perf_event_attr::sig_data.

Since sig_data is not specific to PERF_TYPE_BREAKPOINT, introduce a
helper to copy generic event-type-independent attributes on
modification.

Fixes: 97ba62b278 ("perf: Add support for SIGTRAP on perf events")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Link: https://lore.kernel.org/r/20220131103407.1971678-1-elver@google.com
2022-02-02 13:11:40 +01:00
Zhen Ni
c8eaf6ac76 sched: move autogroup sysctls into its own file
move autogroup sysctls to autogroup.c and use the new
register_sysctl_init() to register the sysctl interface.

Signed-off-by: Zhen Ni <nizhen@uniontech.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220128095025.8745-1-nizhen@uniontech.com
2022-02-02 13:11:37 +01:00
Mathieu Desnoyers
bfdf4e6208 rseq: Remove broken uapi field layout on 32-bit little endian
The rseq rseq_cs.ptr.{ptr32,padding} uapi endianness handling is
entirely wrong on 32-bit little endian: a preprocessor logic mistake
wrongly uses the big endian field layout on 32-bit little endian
architectures.

Fortunately, those ptr32 accessors were never used within the kernel,
and only meant as a convenience for user-space.

Remove those and replace the whole rseq_cs union by a __u64 type, as
this is the only thing really needed to express the ABI. Document how
32-bit architectures are meant to interact with this field.

Fixes: ec9c82e03a ("rseq: uapi: Declare rseq_cs field as union, update includes")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220127152720.25898-1-mathieu.desnoyers@efficios.com
2022-02-02 13:11:34 +01:00
Waiman Long
fc153c1c58 clocksource: Add a Kconfig option for WATCHDOG_MAX_SKEW
A watchdog maximum skew of 100us may still be too small for
some systems or archs. It may also be too small when some kernel
debug config options are enabled.  So add a new Kconfig option
CLOCKSOURCE_WATCHDOG_MAX_SKEW_US to allow kernel builders to have more
control on the threshold for marking clocksource as unstable.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:35:43 -08:00
Paul E. McKenney
9c0f1c7fd7 rcutorture: Enable limited callback-flooding tests of SRCU
This commit allows up to 50,000 callbacks worth of callback-flooding
tests of SRCU.  The goal of this change is to exercise Tree SRCU's
ability to transition from SRCU_SIZE_SMALL to SRCU_SIZE_BIG triggered
by callback-queue-time lock contention.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:24:39 -08:00
Paul E. McKenney
6b8646a9d3 torture: Wake up kthreads after storing task_struct pointer
Currently, _torture_create_kthread() uses kthread_run() to create
torture-test kthreads, which means that the resulting task_struct
pointer is stored after the newly created kthread has been marked
runnable.  This in turn can cause spurious failure of checks for
code being run by a particular kthread.  This commit therefore changes
_torture_create_kthread() to use kthread_create(), then to do an explicit
wake_up_process() after the task_struct pointer has been stored.

Reported-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:24:39 -08:00
Paul E. McKenney
89440d2dad rcutorture: Fix rcu_fwd_mutex deadlock
The rcu_torture_fwd_cb_hist() function acquires rcu_fwd_mutex, but is
invoked from rcutorture_oom_notify() function, which hold this same
mutex across this call.  This commit fixes the resulting deadlock.

Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Oliver Sang <oliver.sang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:24:39 -08:00
Paul E. McKenney
02b51a1cf4 rcutorture: Add end-of-test check to rcu_torture_fwd_prog() loop
The second and subsequent forward-progress kthreads loop waiting for
the first forward-progress kthread to start the next test interval.
Unfortunately, if the test ends while one of those kthreads is waiting,
the test will hang.  This hang occurs because that wait loop fails to
check for the end of the test.  This commit therefore adds an end-of-test
check to that wait loop.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:24:38 -08:00
Paul E. McKenney
e22ef8df41 rcutorture: Make rcu_fwd_cb_nodelay be a counter
Back when only one rcutorture kthread could do forward-progress testing,
it was just fine for rcu_fwd_cb_nodelay to be a non-atomic bool.  It was
set at the start of forward-progress testing and cleared at the end.
But now that there are multiple threads, the value can be cleared while
one of the threads is still doing forward-progress testing.  This commit
therefore makes rcu_fwd_cb_nodelay be an atomic counter, replacing the
WRITE_ONCE() operations with atomic_inc() and atomic_dec().

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:24:38 -08:00
Paul E. McKenney
05b724655b rcutorture: Increase visibility of forward-progress hangs
This commit adds a few pr_alert() calls to rcutorture's forward-progress
testing in order to better diagnose shutdown-time hangs.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:24:38 -08:00
Paul E. McKenney
2b4a7f20f1 torture: Distinguish kthread stopping and being asked to stop
Right now, if a given kthread (call it "kthread") realizes that it needs
to stop, "Stopping kthread" is written to the console.  When the cleanup
code decides that it is time to stop that kthread, "Stopping kthread
tasks" is written to the console.  These two events might happen in
either order, especially in the case of time-based torture-test shutdown.

But it is hard to distinguish these, especially for those unfamiliar with
the torture tests.  This commit therefore changes the first case from
"Stopping kthread" to "kthread is stopping" to make things more clear.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:24:38 -08:00
Paul E. McKenney
6f81bd6a4e rcutorture: Print message before invoking ->cb_barrier()
The various ->cb_barrier() functions, for example, rcu_barrier(),
sometimes cause rcutorture hangs.  But currently, the last console
message is the unenlightening "Stopping rcu_torture_stats".  This commit
therefore prints a message of the form "rcu_torture_cleanup: Invoking
rcu_barrier+0x0/0x1e0()" to help point people in the right direction.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:24:38 -08:00
Zqiang
c951587585 rcu: Add per-CPU rcuc task dumps to RCU CPU stall warnings
When the rcutree.use_softirq kernel boot parameter is set to zero, all
RCU_SOFTIRQ processing is carried out by the per-CPU rcuc kthreads.
If these kthreads are being starved, quiescent states will not be
reported, which in turn means that the grace period will not end, which
can in turn trigger RCU CPU stall warnings.  This commit therefore dumps
stack traces of stalled CPUs' rcuc kthreads, which can help identify
what is preventing those kthreads from running.

Suggested-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Reviewed-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:22:17 -08:00
Paul E. McKenney
10c5357874 rcu: Don't deboost before reporting expedited quiescent state
Currently rcu_preempt_deferred_qs_irqrestore() releases rnp->boost_mtx
before reporting the expedited quiescent state.  Under heavy real-time
load, this can result in this function being preempted before the
quiescent state is reported, which can in turn prevent the expedited grace
period from completing.  Tim Murray reports that the resulting expedited
grace periods can take hundreds of milliseconds and even more than one
second, when they should normally complete in less than a millisecond.

This was fine given that there were no particular response-time
constraints for synchronize_rcu_expedited(), as it was designed
for throughput rather than latency.  However, some users now need
sub-100-millisecond response-time constratints.

This patch therefore follows Neeraj's suggestion (seconded by Tim and
by Uladzislau Rezki) of simply reversing the two operations.

Reported-by: Tim Murray <timmurray@google.com>
Reported-by: Joel Fernandes <joelaf@google.com>
Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Tested-by: Tim Murray <timmurray@google.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: <stable@vger.kernel.org> # 5.4.x
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:19:41 -08:00
Alison Chaiken
c8b16a6526 rcu: Elevate priority of offloaded callback threads
When CONFIG_PREEMPT_RT=y, the rcutree.kthread_prio command-line
parameter signals initialization code to boost the priority of rcuc
callbacks to the designated value.  With the additional
CONFIG_RCU_NOCB_CPU=y configuration and an additional rcu_nocbs
command-line parameter, the callbacks on the listed cores are
offloaded to new rcuop kthreads that are not pinned to the cores whose
post-grace-period work is performed.  While the rcuop kthreads perform
the same function as the rcuc kthreads they offload, the kthread_prio
parameter only boosts the priority of the rcuc kthreads.  Fix this
inconsistency by elevating rcuop kthreads to the same priority as the rcuc
kthreads.

Signed-off-by: Alison Chaiken <achaiken@aurora.tech>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:19:25 -08:00
Alison Chaiken
54577e23fa rcu: Make priority of grace-period thread consistent
The priority of RCU grace period threads is set to kthread_prio when
they are launched from rcu_spawn_gp_kthread().  The same is not true
of rcu_spawn_one_nocb_kthread().  Accordingly, add priority elevation
to rcu_spawn_one_nocb_kthread().

Signed-off-by: Alison Chaiken <achaiken@aurora.tech>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:19:17 -08:00
Alison Chaiken
c8db27dd0e rcu: Move kthread_prio bounds-check to a separate function
Move the bounds-check of the kthread_prio cmdline parameter to a new
function in order to faciliate a different callsite.

Signed-off-by: Alison Chaiken <achaiken@aurora.tech>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:19:02 -08:00
Zqiang
4b4399b245 rcu: Create per-cpu rcuc kthreads only when rcutree.use_softirq=0
The per-CPU "rcuc" kthreads are used only by kernels booted with
rcutree.use_softirq=0, but they are nevertheless unconditionally created
by kernels built with CONFIG_RCU_BOOST=y.  This results in "rcuc"
kthreads being created that are never actually used.  This commit
therefore refrains from creating these kthreads unless the kernel
is actually booted with rcutree.use_softirq=0.

Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:19:02 -08:00
Neeraj Upadhyay
eae9f147a4 rcu: Remove unused rcu_state.boost
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:19:02 -08:00
Neeraj Upadhyay
02e3024175 rcu/nocb: Handle concurrent nocb kthreads creation
When multiple CPUs in the same nocb gp/cb group concurrently
come online, they might try to concurrently create the same
rcuog kthread. Fix this by using nocb gp CPU's spawn mutex to
provide mutual exclusion for the rcuog kthread creation code.

[ paulmck: Whitespace fixes per kernel test robot feedback. ]

Acked-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:19:02 -08:00
Paul E. McKenney
a47f9f131d rcu: Mark accesses to boost_starttime
The boost_starttime shared variable has conflicting unmarked C-language
accesses, which are dangerous at best.  This commit therefore adds
appropriate marking.  This was found by KCSAN.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:19:02 -08:00
Paul E. McKenney
63c564da11 rcu: Mark ->expmask access in synchronize_rcu_expedited_wait()
This commit adds a READ_ONCE() to an access to the rcu_node structure's
->expmask field to prevent compiler mischief.  Detected by KCSAN.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:05:10 -08:00
Neeraj Upadhyay
4d266c247d rcu/exp: Fix check for idle context in rcu_exp_handler
For PREEMPT_RCU, the rcu_exp_handler() function checks
whether the current CPU is in idle, by calling
rcu_dynticks_curr_cpu_in_eqs(). However, rcu_exp_handler()
is called in IPI handler context. So, it should be checking
the idle context using rcu_is_cpu_rrupt_from_idle(). Fix this
by using rcu_is_cpu_rrupt_from_idle() instead of
rcu_dynticks_curr_cpu_in_eqs(). Non-preempt configuration
already uses the correct check.

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-02-01 17:05:10 -08:00
Alexei Starovoitov
e96f2d64c8 bpf: Drop libbpf, libelf, libz dependency from bpf preload.
Drop libbpf, libelf, libz dependency from bpf preload.
This reduces bpf_preload_umd binary size
from 1.7M to 30k unstripped with debug info
and from 300k to 19k stripped.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20220131220528.98088-8-alexei.starovoitov@gmail.com
2022-02-01 23:56:18 +01:00
Alexei Starovoitov
18ef5dac93 bpf: Open code obj_get_info_by_fd in bpf preload.
Open code obj_get_info_by_fd in bpf preload.
It's the last part of libbpf that preload/iterators were using.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20220131220528.98088-7-alexei.starovoitov@gmail.com
2022-02-01 23:56:18 +01:00
Alexei Starovoitov
79b203926d bpf: Convert bpf preload to light skeleton.
Convert bpffs preload iterators to light skeleton.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20220131220528.98088-6-alexei.starovoitov@gmail.com
2022-02-01 23:56:18 +01:00
Alexei Starovoitov
1ddbddd706 bpf: Remove unnecessary setrlimit from bpf preload.
BPF programs and maps are memcg accounted. setrlimit is obsolete.
Remove its use from bpf preload.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20220131220528.98088-5-alexei.starovoitov@gmail.com
2022-02-01 23:56:18 +01:00
Linus Torvalds
61fda95541 audit/stable-5.17 PR 20220131
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmH4I8oUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXP9zw//ZAU42ylQXvGfLJbCsbZHKq7H/Ljo
 QKbwlSKh+aP+IXcRxnh9fV3vTkc7LkHZfrhGx35aHlS1HK8XIa/XHHNTPHfaEKai
 iDiXwICP5j1JysWCKJXu0uhe7juW3ko7+bQyI8MYBAeLKrbC1F04oswSgVNuX2bq
 aMyD9+GdAv7PiVVw0oc5+hKFS/8q/VRdbCsJmESKUDhthS3hqH8wZJqy37aYIpF1
 /yqkvEIts+GzeQrSWvsGL+O720GIyZ8V2/cEH2y+pnGYgoYdlXhD1CccXOOCocb8
 M/6uQZTgQiEVf1LMuu/WIW6CzrRQIjOt4SNU0cXLkWlgxAN1p5b9sP+YcncWU23N
 zbIACFiFVz1ZhxHT0AEVs+thdZrF6CJX0xfsb+GvJJeYy9aw11s7VGlYhaM+1haG
 8oeYmtjQ+rjjkEKMUcOWQYxRvCZIsI6z5JlqoFC0zuJda1k3418LyDZARwCUrm1c
 6QY35M0HHxa8k7TWtvJ6aopxM4pg+ZL8WS0shULHRqw/NprvYc6KeZc0/VNyFojJ
 S4wa+Z6rXoYIvGReeDkUOiJjigW78/kyQR2rsxHWaGlDfL8+bALdnorkTbU2G4oG
 Jl1HQdOSuAeAT/D7w/UmKFXDBPXHU77sLnjxiCDMmbmVY6Vxja/hZD/wy9RyyK4K
 UAXDJNwhndHsgX4=
 =sFi4
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20220131' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit fix from Paul Moore:
 "A single audit patch to fix problems relating to audit queuing and
  system responsiveness when "audit=1" is specified on the kernel
  command line and the audit daemon is SIGSTOP'd for an extended period
  of time"

* tag 'audit-pr-20220131' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: improve audit queue handling when "audit=1" on cmdline
2022-02-01 11:07:09 -08:00
Eric W. Biederman
24f6008564 cgroup-v1: Require capabilities to set release_agent
The cgroup release_agent is called with call_usermodehelper.  The function
call_usermodehelper starts the release_agent with a full set fo capabilities.
Therefore require capabilities when setting the release_agaent.

Reported-by: Tabitha Sable <tabitha.c.sable@gmail.com>
Tested-by: Tabitha Sable <tabitha.c.sable@gmail.com>
Fixes: 81a6a5cdd2 ("Task Control Groups: automatic userspace notification of idle cgroups")
Cc: stable@vger.kernel.org # v2.6.24+
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-02-01 07:28:00 -10:00
Kenta Tada
0407a65f35 bpf: make bpf_copy_from_user_task() gpl only
access_process_vm() is exported by EXPORT_SYMBOL_GPL().

Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
Link: https://lore.kernel.org/r/20220128170906.21154-1-Kenta.Tada@sony.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-31 12:44:37 -08:00
Yury Norov
1c4cafd115 padata: replace cpumask_weight with cpumask_empty in padata.c
padata_do_parallel() calls cpumask_weight() to check if any bit of a
given cpumask is set. We can do it more efficiently with cpumask_empty()
because cpumask_empty() stops traversing the cpumask as soon as it finds
first set bit, while cpumask_weight() counts all bits unconditionally.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2022-01-31 11:21:46 +11:00
Linus Torvalds
27a96c4feb - Prevent accesses to the per-CPU cgroup context list from another CPU
except the one it belongs to, to avoid list corruption
 
 - Make sure parent events are always woken up to avoid indefinite hangs
 in the traced workload
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmH2cukACgkQEsHwGGHe
 VUrQZA//WGHM7aNDp9xNkEhSsH6tDZqYSzg8yxzQGrBx3hNuTraj1imHVsCXLeU1
 yCS5hBr3LdnhsAL+IaNtgTfOyM7m9SLtvFWTLINY2j7L/DkljagIB3Ur+8No565U
 a/gd1UbeW2btRYUidFmXokIXMxMcsHCawX1hf1FKm7tP2b+DvD9uGePyxHmwk9BB
 o5HTc1ft6qPt7LFfTxrtXEg76Z8DFovusvaPgISmv5RFnvLLwMnY6CN/oMUP+Fr+
 PyCxFatgscEDgiDqRWNYvDyKOBV9dDa72Pab/yFoREM7Pc88sLhy9ZSPaf8ZiJDz
 fvqBVl9d93i2UExNSdedC4dq8l3s0ioHOd4soSdBTVkrUxYTPvx31WfNLUigNxws
 q7vJulsOUCPApuRvybTBy8A12ksjlT8qNDC3RLYfEKNuNMeYzXiZnijJ98FdVirF
 FFv+5pPCzz1dhcsu9rMyuNgFwaIik8ptnt4ffaTsy70mT5h1UmKZsLr3clurRX8v
 XpzAVCcJKiDBbc4NdtweQqm2zoY9Uxwhez3+wxmlZXndOPdcYbACKQ65U0bl6OYw
 Qir8WZMOzHnlJLItPuH+TMuoJ7TdFwD9Sndiih+sKK7PEfU3XvWIRATvyjCoPINA
 FfqX59lLDVBpjapdODVCt7St9e7VKAxPuc9yZIBJOo7K0QRLquE=
 =+ssI
 -----END PGP SIGNATURE-----

Merge tag 'perf_urgent_for_v5.17_rc2_p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Borislav Petkov:

 - Prevent accesses to the per-CPU cgroup context list from another CPU
   except the one it belongs to, to avoid list corruption

 - Make sure parent events are always woken up to avoid indefinite hangs
   in the traced workload

* tag 'perf_urgent_for_v5.17_rc2_p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Fix cgroup event list management
  perf: Always wake the parent event
2022-01-30 15:02:32 +02:00
Linus Torvalds
24f4db1f3a - Make sure the membarrier-rseq fence commands are part of the reported
set when querying membarrier(2) commands through MEMBARRIER_CMD_QUERY
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmH2cDYACgkQEsHwGGHe
 VUo7YhAAs4trrOaq5KCr2YW7agAHqYjYsMfPLEBUOvJ4hGggMAYCwGXYDdzMo8UH
 tAxLfh2fyy3VUNKPeLMqz6XVn697oTQnSAkVBDJIcMJlctZGvXmwNnv3g133xlML
 n4xLhVAfC76HA0Q0zkPZ6gdN/VBrRl94n3mAWxB9FVFUDwFIMr6PkGrjw9Olml/r
 Xm2WUuKfvwHqEhNUDu2rS0qqal3mPO6DV+4Y7JGyQL5fYAu3HfabZV4CBfYt5z3P
 m3y4Y2yfpWcB4D6KfXHPQwLhWOoxVZ0X2YiAjfj8x/+dYaTIoBSg/EG+44nX6tBV
 RcYK9gARwfCgSdcfWZFYYfHlGbNv1x/HHOVvkuJEzfCya4+FQTRdrqkoOWsPUPk1
 jqa9Ybz9wQxJ2GI57wT7W+fyl2M+agyvwywrELLy6w6nwAKdWpbIGW4b1ummfIk2
 MGqL99Hge6aX9ONFT6IA4rHd0fvWNC00l4evzNVfyHQnYN3f0ul8h8p950F09I25
 apSalhyz40LbJdKsRFGt8CjIgM2Y+rP87JI8ZawjauWFIp+lLiXcLOuLXTNJjAe6
 Sw9EWkkr3sTHzOxudrFHz/QwM+m7KkoYnuGQw0gDqzXdqc1Gpc91llLqqWNuyX++
 y3NWzEdmnjlV+H1Fls94UhrdsHAiI8d+OTH3fQVY6VEtGcLQJ2c=
 =kI2Z
 -----END PGP SIGNATURE-----

Merge tag 'sched_urgent_for_v5.17_rc2_p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Borislav Petkov:
 "Make sure the membarrier-rseq fence commands are part of the reported
  set when querying membarrier(2) commands through MEMBARRIER_CMD_QUERY"

* tag 'sched_urgent_for_v5.17_rc2_p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/membarrier: Fix membarrier-rseq fence command missing from query bitmask
2022-01-30 13:09:00 +02:00
Suren Baghdasaryan
44585f7bc0 psi: fix "defined but not used" warnings when CONFIG_PROC_FS=n
When CONFIG_PROC_FS is disabled psi code generates the following
warnings:

  kernel/sched/psi.c:1364:30: warning: 'psi_cpu_proc_ops' defined but not used [-Wunused-const-variable=]
      1364 | static const struct proc_ops psi_cpu_proc_ops = {
           |                              ^~~~~~~~~~~~~~~~
  kernel/sched/psi.c:1355:30: warning: 'psi_memory_proc_ops' defined but not used [-Wunused-const-variable=]
      1355 | static const struct proc_ops psi_memory_proc_ops = {
           |                              ^~~~~~~~~~~~~~~~~~~
  kernel/sched/psi.c:1346:30: warning: 'psi_io_proc_ops' defined but not used [-Wunused-const-variable=]
      1346 | static const struct proc_ops psi_io_proc_ops = {
           |                              ^~~~~~~~~~~~~~~

Make definitions of these structures and related functions conditional
on CONFIG_PROC_FS config.

Link: https://lkml.kernel.org/r/20220119223940.787748-3-surenb@google.com
Fixes: 0e94682b73 ("psi: introduce psi monitor")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-30 09:56:58 +02:00
Linus Torvalds
a7b4b0076b Power management updates for 5.17-rc2
- Make the buffer handling in pm_show_wakelocks() more robust by
    using sysfs_emit_at() in it to generate output (Greg Kroah-Hartman).
 
  - Drop register_nosave_region_late() which is not used (Amadeusz
    Sławiński).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmHz6CASHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxwEMP/2WnqXjyNvhpw8x0qqz7qHU+2SJuqZBT
 FRyY574tGQlBqsatfJAvWVtC8hFOHAKClKNinxemwXNzHHUv5era4WNTu0xvdykf
 w6pCgHfHGd+eHB9hWD4Q4lXL3ohv16A3859ldprChN0vygmnoPOBXcvJMO277dNn
 b1IobLhr3HMNGK3d4eZY/+xN2SP/dOyJmzK+R8reUGwzVa52PNPii0jxF0N2FyWg
 hKIUKNnryQ56TMJyNbpJM47zOe7IgN7YGjSuI1Vx8P5HfSYCmQ+EVmv7qhN6THX/
 s1ONWUwZxn/8cgGX/5XCK9b4cAwHZZ2peC/AWueL4qFmXgHHVNRC2rm5Ywn2eVFd
 pyyNsZGWLXxpS02g69CR32fiUn3Hh/YKFRDOtg/nibTJ1Gu2xFeGo8xBYqDp26Zl
 z2squ2zWBLxnT6cg2ie9mjy7KTcIcCecWiEv20n3b3qQ+kd2e0AKvV+LN8IeisFI
 q1cHnAwzgG9oMayDMDl2LdzpH9UvxIGJyT8ucSFoShK2TlV/ZPGz0HkZHRLOvFPc
 VAhAP+E9dNr2bd11Eme9Ts02PvKeIJ7tCQ4xa/PfPwQXsg030VldQC+YXKO3CPga
 70EsSyk+ryFRG7nhTVqFVSdcnKKiUFHrxWtC7a/ePvPVCjvvkEeqnqam0dRisLhg
 HFDWrKFJ78i+
 =P7NO
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These make the buffer handling in pm_show_wakelocks() more robust and
  drop an unused hibernation-related function.

  Specifics:

   - Make the buffer handling in pm_show_wakelocks() more robust by
     using sysfs_emit_at() in it to generate output (Greg
     Kroah-Hartman).

   - Drop register_nosave_region_late() which is not used (Amadeusz
     Sławiński)"

* tag 'pm-5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: hibernate: Remove register_nosave_region_late()
  PM: wakeup: simplify the output logic of pm_show_wakelocks()
2022-01-28 20:44:07 +02:00