Commit Graph

45242 Commits

Author SHA1 Message Date
Linus Torvalds
b831f83e40 bpf-6.11-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmbaYFMACgkQ6rmadz2v
 bTq7JBAAipwHeOL3IYproQxGy+f0W3Uik9FNlavSQ3zpJHmTJcpf0ysXkqH23g2q
 26CF0R44gmGMkdbZsxbk3HLI2qRmzxmznYCDH0g7d9qwzQMhFHIiY7TW7UD/XbKx
 UHdHLb5PYrj+j94T1WGiQdvbZYDlpmdz5rFA9K/TBtBArqYp9mA4D/cIlTDBfFpk
 cjhSGVl9x/BKbiHKApxSGcR7Fh/+ux9mVdlssWQNhRfm3V2tbRSAw1i1/ydTG+4c
 bf/m0RSIDfPMxy1i7D0lNRbclzWVisTqNzDXHfQoRUJMuMDfsK4UZB/6gvh+2LKy
 D60vT8AfN5ygjJbLdFbwFGnEymjfsXWguyqfQB0d9Hj/2/EsZ01rI2ikJv9J+qKl
 wwZM3YeA3Q/V0mZ5wCONp2dn+s+82nga+fdvCRFz6SLkWQwgbW5BYHFF1c60V9MH
 Pbd9Y5VfCOEZRzR6RxbmguPrnoU1+BUwQeIAp9L73bllrzhtmh/aL/b03uw8/wUh
 I+peLxJ+DVp6wTudgvSMviMySWcztuz397G7TnFyG0V4nKe1+QxSaQWWw2HKvpy3
 i+m98qoWqbuJqz49FpEtX6x/17gZZNA0LK648D77nrOfsGWOLTKOZUDbNWbTPw9a
 Gojg5obJ8P82yO9UCYQLyGsAJxJrKZv3OEmqy0mRG1hrSMsozxg=
 =5Quw
 -----END PGP SIGNATURE-----

Merge tag 'bpf-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Alexei Starovoitov:

 - Fix crash when btf_parse_base() returns an error (Martin Lau)

 - Fix out of bounds access in btf_name_valid_section() (Jeongjun Park)

* tag 'bpf-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Add a selftest to check for incorrect names
  bpf: add check for invalid name in btf_name_valid_section()
  bpf: Fix a crash when btf_parse_base() returns an error pointer
2024-09-05 20:10:53 -07:00
Linus Torvalds
e4b42053b7 Tracing fixes for 6.11:
- Fix adding a new fgraph callback after function graph tracing has
   already started.
 
   If the new caller does not initialize its hash before registering the
   fgraph_ops, it can cause a NULL pointer dereference. Fix this by adding
   a new parameter to ftrace_graph_enable_direct() passing in the newly
   added gops directly and not rely on using the fgraph_array[], as entries
   in the fgraph_array[] must be initialized. Assign the new gops to the
   fgraph_array[] after it goes through ftrace_startup_subops() as that
   will properly initialize the gops->ops and initialize its hashes.
 
 - Fix a memory leak in fgraph storage memory test.
 
   If the "multiple fgraph storage on a function" boot up selftest
   fails in the registering of the function graph tracer, it will
   not free the memory it allocated for the filter. Break the loop
   up into two where it allocates the filters first and then registers
   the functions where any errors will do the appropriate clean ups.
 
 - Only clear the timerlat timers if it has an associated kthread.
 
   In the rtla tool that uses timerlat, if it was killed just as it
   was shutting down, the signals can free the kthread and the timer.
   But the closing of the timerlat files could cause the hrtimer_cancel()
   to be called on the already freed timer. As the kthread variable is
   is set to NULL when the kthreads are stopped and the timers are freed
   it can be used to know not to call hrtimer_cancel() on the timer if
   the kthread variable is NULL.
 
 - Use a cpumask to keep track of osnoise/timerlat kthreads
 
   The timerlat tracer can use user space threads for its analysis.
   With the killing of the rtla tool, the kernel can get confused
   between if it is using a user space thread to analyze or one of its
   own kernel threads. When this confusion happens, kthread_stop()
   can be called on a user space thread and bad things happen.
   As the kernel threads are per-cpu, a bitmask can be used to know
   when a kernel thread is used or when a user space thread is used.
 
 - Add missing interface_lock to osnoise/timerlat stop_kthread()
 
   The stop_kthread() function in osnoise/timerlat clears the
   osnoise kthread variable, and if it was a user space thread does
   a put_task on it. But this can race with the closing of the timerlat
   files that also does a put_task on the kthread, and if the race happens
   the task will have put_task called on it twice and oops.
 
 - Add cond_resched() to the tracing_iter_reset() loop.
 
   The latency tracers keep writing to the ring buffer without resetting
   when it issues a new "start" event (like interrupts being disabled).
   When reading the buffer with an iterator, the tracing_iter_reset()
   sets its pointer to that start event by walking through all the events
   in the buffer until it gets to the time stamp of the start event.
   In the case of a very large buffer, the loop that looks for the start
   event has been reported taking a very long time with a non preempt kernel
   that it can trigger a soft lock up warning. Add a cond_resched() into
   that loop to make sure that doesn't happen.
 
 - Use list_del_rcu() for eventfs ei->list variable
 
   It was reported that running loops of creating and deleting  kprobe events
   could cause a crash due to the eventfs list iteration hitting a LIST_POISON
   variable. This is because the list is protected by SRCU but when an item is
   deleted from the list, it was using list_del() which poisons the "next"
   pointer. This is what list_del_rcu() was to prevent.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZtohNBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qtoNAQDQKjomYLCpLz2EqgHZ6VB81QVrHuqt
 cU7xuEfUJDzyyAEA/n0t6quIdjYRd6R2/KxGkP6By/805Coq4IZMTgNQmw0=
 =nZ7k
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix adding a new fgraph callback after function graph tracing has
   already started.

   If the new caller does not initialize its hash before registering the
   fgraph_ops, it can cause a NULL pointer dereference. Fix this by
   adding a new parameter to ftrace_graph_enable_direct() passing in the
   newly added gops directly and not rely on using the fgraph_array[],
   as entries in the fgraph_array[] must be initialized.

   Assign the new gops to the fgraph_array[] after it goes through
   ftrace_startup_subops() as that will properly initialize the
   gops->ops and initialize its hashes.

 - Fix a memory leak in fgraph storage memory test.

   If the "multiple fgraph storage on a function" boot up selftest fails
   in the registering of the function graph tracer, it will not free the
   memory it allocated for the filter. Break the loop up into two where
   it allocates the filters first and then registers the functions where
   any errors will do the appropriate clean ups.

 - Only clear the timerlat timers if it has an associated kthread.

   In the rtla tool that uses timerlat, if it was killed just as it was
   shutting down, the signals can free the kthread and the timer. But
   the closing of the timerlat files could cause the hrtimer_cancel() to
   be called on the already freed timer. As the kthread variable is is
   set to NULL when the kthreads are stopped and the timers are freed it
   can be used to know not to call hrtimer_cancel() on the timer if the
   kthread variable is NULL.

 - Use a cpumask to keep track of osnoise/timerlat kthreads

   The timerlat tracer can use user space threads for its analysis. With
   the killing of the rtla tool, the kernel can get confused between if
   it is using a user space thread to analyze or one of its own kernel
   threads. When this confusion happens, kthread_stop() can be called on
   a user space thread and bad things happen. As the kernel threads are
   per-cpu, a bitmask can be used to know when a kernel thread is used
   or when a user space thread is used.

 - Add missing interface_lock to osnoise/timerlat stop_kthread()

   The stop_kthread() function in osnoise/timerlat clears the osnoise
   kthread variable, and if it was a user space thread does a put_task
   on it. But this can race with the closing of the timerlat files that
   also does a put_task on the kthread, and if the race happens the task
   will have put_task called on it twice and oops.

 - Add cond_resched() to the tracing_iter_reset() loop.

   The latency tracers keep writing to the ring buffer without resetting
   when it issues a new "start" event (like interrupts being disabled).
   When reading the buffer with an iterator, the tracing_iter_reset()
   sets its pointer to that start event by walking through all the
   events in the buffer until it gets to the time stamp of the start
   event. In the case of a very large buffer, the loop that looks for
   the start event has been reported taking a very long time with a non
   preempt kernel that it can trigger a soft lock up warning. Add a
   cond_resched() into that loop to make sure that doesn't happen.

 - Use list_del_rcu() for eventfs ei->list variable

   It was reported that running loops of creating and deleting kprobe
   events could cause a crash due to the eventfs list iteration hitting
   a LIST_POISON variable. This is because the list is protected by SRCU
   but when an item is deleted from the list, it was using list_del()
   which poisons the "next" pointer. This is what list_del_rcu() was to
   prevent.

* tag 'trace-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/timerlat: Add interface_lock around clearing of kthread in stop_kthread()
  tracing/timerlat: Only clear timer if a kthread exists
  tracing/osnoise: Use a cpumask to know what threads are kthreads
  eventfs: Use list_del_rcu() for SRCU protected list variable
  tracing: Avoid possible softlockup in tracing_iter_reset()
  tracing: Fix memory leak in fgraph storage selftest
  tracing: fgraph: Fix to add new fgraph_ops to array after ftrace_startup_subops()
2024-09-05 16:29:41 -07:00
Steven Rostedt
5bfbcd1ee5 tracing/timerlat: Add interface_lock around clearing of kthread in stop_kthread()
The timerlat interface will get and put the task that is part of the
"kthread" field of the osn_var to keep it around until all references are
released. But here's a race in the "stop_kthread()" code that will call
put_task_struct() on the kthread if it is not a kernel thread. This can
race with the releasing of the references to that task struct and the
put_task_struct() can be called twice when it should have been called just
once.

Take the interface_lock() in stop_kthread() to synchronize this change.
But to do so, the function stop_per_cpu_kthreads() needs to change the
loop from for_each_online_cpu() to for_each_possible_cpu() and remove the
cpu_read_lock(), as the interface_lock can not be taken while the cpu
locks are held. The only side effect of this change is that it may do some
extra work, as the per_cpu variables of the offline CPUs would not be set
anyway, and would simply be skipped in the loop.

Remove unneeded "return;" in stop_kthread().

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
Link: https://lore.kernel.org/20240905113359.2b934242@gandalf.local.home
Fixes: e88ed227f6 ("tracing/timerlat: Add user-space interface")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-09-05 12:01:37 -04:00
Steven Rostedt
e6a53481da tracing/timerlat: Only clear timer if a kthread exists
The timerlat tracer can use user space threads to check for osnoise and
timer latency. If the program using this is killed via a SIGTERM, the
threads are shutdown one at a time and another tracing instance can start
up resetting the threads before they are fully closed. That causes the
hrtimer assigned to the kthread to be shutdown and freed twice when the
dying thread finally closes the file descriptors, causing a use-after-free
bug.

Only cancel the hrtimer if the associated thread is still around. Also add
the interface_lock around the resetting of the tlat_var->kthread.

Note, this is just a quick fix that can be backported to stable. A real
fix is to have a better synchronization between the shutdown of old
threads and the starting of new ones.

Link: https://lore.kernel.org/all/20240820130001.124768-1-tglozar@redhat.com/

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
Link: https://lore.kernel.org/20240905085330.45985730@gandalf.local.home
Fixes: e88ed227f6 ("tracing/timerlat: Add user-space interface")
Reported-by: Tomas Glozar <tglozar@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-09-05 11:30:23 -04:00
Steven Rostedt
177e1cc2f4 tracing/osnoise: Use a cpumask to know what threads are kthreads
The start_kthread() and stop_thread() code was not always called with the
interface_lock held. This means that the kthread variable could be
unexpectedly changed causing the kthread_stop() to be called on it when it
should not have been, leading to:

 while true; do
   rtla timerlat top -u -q & PID=$!;
   sleep 5;
   kill -INT $PID;
   sleep 0.001;
   kill -TERM $PID;
   wait $PID;
  done

Causing the following OOPS:

 Oops: general protection fault, probably for non-canonical address 0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN PTI
 KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
 CPU: 5 UID: 0 PID: 885 Comm: timerlatu/5 Not tainted 6.11.0-rc4-test-00002-gbc754cc76d1b-dirty #125 a533010b71dab205ad2f507188ce8c82203b0254
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
 RIP: 0010:hrtimer_active+0x58/0x300
 Code: 48 c1 ee 03 41 54 48 01 d1 48 01 d6 55 53 48 83 ec 20 80 39 00 0f 85 30 02 00 00 49 8b 6f 30 4c 8d 75 10 4c 89 f0 48 c1 e8 03 <0f> b6 3c 10 4c 89 f0 83 e0 07 83 c0 03 40 38 f8 7c 09 40 84 ff 0f
 RSP: 0018:ffff88811d97f940 EFLAGS: 00010202
 RAX: 0000000000000002 RBX: ffff88823c6b5b28 RCX: ffffed10478d6b6b
 RDX: dffffc0000000000 RSI: ffffed10478d6b6c RDI: ffff88823c6b5b28
 RBP: 0000000000000000 R08: ffff88823c6b5b58 R09: ffff88823c6b5b60
 R10: ffff88811d97f957 R11: 0000000000000010 R12: 00000000000a801d
 R13: ffff88810d8b35d8 R14: 0000000000000010 R15: ffff88823c6b5b28
 FS:  0000000000000000(0000) GS:ffff88823c680000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000561858ad7258 CR3: 000000007729e001 CR4: 0000000000170ef0
 Call Trace:
  <TASK>
  ? die_addr+0x40/0xa0
  ? exc_general_protection+0x154/0x230
  ? asm_exc_general_protection+0x26/0x30
  ? hrtimer_active+0x58/0x300
  ? __pfx_mutex_lock+0x10/0x10
  ? __pfx_locks_remove_file+0x10/0x10
  hrtimer_cancel+0x15/0x40
  timerlat_fd_release+0x8e/0x1f0
  ? security_file_release+0x43/0x80
  __fput+0x372/0xb10
  task_work_run+0x11e/0x1f0
  ? _raw_spin_lock+0x85/0xe0
  ? __pfx_task_work_run+0x10/0x10
  ? poison_slab_object+0x109/0x170
  ? do_exit+0x7a0/0x24b0
  do_exit+0x7bd/0x24b0
  ? __pfx_migrate_enable+0x10/0x10
  ? __pfx_do_exit+0x10/0x10
  ? __pfx_read_tsc+0x10/0x10
  ? ktime_get+0x64/0x140
  ? _raw_spin_lock_irq+0x86/0xe0
  do_group_exit+0xb0/0x220
  get_signal+0x17ba/0x1b50
  ? vfs_read+0x179/0xa40
  ? timerlat_fd_read+0x30b/0x9d0
  ? __pfx_get_signal+0x10/0x10
  ? __pfx_timerlat_fd_read+0x10/0x10
  arch_do_signal_or_restart+0x8c/0x570
  ? __pfx_arch_do_signal_or_restart+0x10/0x10
  ? vfs_read+0x179/0xa40
  ? ksys_read+0xfe/0x1d0
  ? __pfx_ksys_read+0x10/0x10
  syscall_exit_to_user_mode+0xbc/0x130
  do_syscall_64+0x74/0x110
  ? __pfx___rseq_handle_notify_resume+0x10/0x10
  ? __pfx_ksys_read+0x10/0x10
  ? fpregs_restore_userregs+0xdb/0x1e0
  ? fpregs_restore_userregs+0xdb/0x1e0
  ? syscall_exit_to_user_mode+0x116/0x130
  ? do_syscall_64+0x74/0x110
  ? do_syscall_64+0x74/0x110
  ? do_syscall_64+0x74/0x110
  entry_SYSCALL_64_after_hwframe+0x71/0x79
 RIP: 0033:0x7ff0070eca9c
 Code: Unable to access opcode bytes at 0x7ff0070eca72.
 RSP: 002b:00007ff006dff8c0 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
 RAX: 0000000000000000 RBX: 0000000000000005 RCX: 00007ff0070eca9c
 RDX: 0000000000000400 RSI: 00007ff006dff9a0 RDI: 0000000000000003
 RBP: 00007ff006dffde0 R08: 0000000000000000 R09: 00007ff000000ba0
 R10: 00007ff007004b08 R11: 0000000000000246 R12: 0000000000000003
 R13: 00007ff006dff9a0 R14: 0000000000000007 R15: 0000000000000008
  </TASK>
 Modules linked in: snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hwdep snd_hda_core
 ---[ end trace 0000000000000000 ]---

This is because it would mistakenly call kthread_stop() on a user space
thread making it "exit" before it actually exits.

Since kthreads are created based on global behavior, use a cpumask to know
when kthreads are running and that they need to be shutdown before
proceeding to do new work.

Link: https://lore.kernel.org/all/20240820130001.124768-1-tglozar@redhat.com/

This was debugged by using the persistent ring buffer:

Link: https://lore.kernel.org/all/20240823013902.135036960@goodmis.org/

Note, locking was originally used to fix this, but that proved to cause too
many deadlocks to work around:

  https://lore.kernel.org/linux-trace-kernel/20240823102816.5e55753b@gandalf.local.home/

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
Link: https://lore.kernel.org/20240904103428.08efdf4c@gandalf.local.home
Fixes: e88ed227f6 ("tracing/timerlat: Add user-space interface")
Reported-by: Tomas Glozar <tglozar@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-09-05 11:30:22 -04:00
Zheng Yejian
49aa8a1f4d tracing: Avoid possible softlockup in tracing_iter_reset()
In __tracing_open(), when max latency tracers took place on the cpu,
the time start of its buffer would be updated, then event entries with
timestamps being earlier than start of the buffer would be skipped
(see tracing_iter_reset()).

Softlockup will occur if the kernel is non-preemptible and too many
entries were skipped in the loop that reset every cpu buffer, so add
cond_resched() to avoid it.

Cc: stable@vger.kernel.org
Fixes: 2f26ebd549 ("tracing: use timestamp to determine start of latency traces")
Link: https://lore.kernel.org/20240827124654.3817443-1-zhengyejian@huaweicloud.com
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Zheng Yejian <zhengyejian@huaweicloud.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-09-05 10:18:48 -04:00
Jeongjun Park
bb6705c3f9 bpf: add check for invalid name in btf_name_valid_section()
If the length of the name string is 1 and the value of name[0] is NULL
byte, an OOB vulnerability occurs in btf_name_valid_section() and the
return value is true, so the invalid name passes the check.

To solve this, you need to check if the first position is NULL byte and
if the first character is printable.

Suggested-by: Eduard Zingerman <eddyz87@gmail.com>
Fixes: bd70a8fb7c ("bpf: Allow all printable characters in BTF DATASEC names")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Link: https://lore.kernel.org/r/20240831054702.364455-1-aha310510@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
2024-09-04 11:56:34 -07:00
Linus Torvalds
76c0f27d06 17 hotfixes, 15 of which are cc:stable.
Mostly MM, no identifiable theme.  And a few nilfs2 fixups.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZtfR/wAKCRDdBJ7gKXxA
 jofjAP9rUlliIcn8zcy7vmBTuMaH4SkoULB64QWAUddaWV+SCAEA+q0sntLPnTIZ
 My3sfihR6mbvhkgKbvIHm6YYQI56NAc=
 =b4Lr
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2024-09-03-20-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "17 hotfixes, 15 of which are cc:stable.

  Mostly MM, no identifiable theme.  And a few nilfs2 fixups"

* tag 'mm-hotfixes-stable-2024-09-03-20-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  alloc_tag: fix allocation tag reporting when CONFIG_MODULES=n
  mm: vmalloc: optimize vmap_lazy_nr arithmetic when purging each vmap_area
  mailmap: update entry for Jan Kuliga
  codetag: debug: mark codetags for poisoned page as empty
  mm/memcontrol: respect zswap.writeback setting from parent cg too
  scripts: fix gfp-translate after ___GFP_*_BITS conversion to an enum
  Revert "mm: skip CMA pages when they are not available"
  maple_tree: remove rcu_read_lock() from mt_validate()
  kexec_file: fix elfcorehdr digest exclusion when CONFIG_CRASH_HOTPLUG=y
  mm/slub: add check for s->flags in the alloc_tagging_slab_free_hook
  nilfs2: fix state management in error path of log writing function
  nilfs2: fix missing cleanup on rollforward recovery error
  nilfs2: protect references to superblock parameters exposed in sysfs
  userfaultfd: don't BUG_ON() if khugepaged yanks our page table
  userfaultfd: fix checks for huge PMDs
  mm: vmalloc: ensure vmap_block is initialised before adding to queue
  selftests: mm: fix build errors on armhf
2024-09-04 08:37:33 -07:00
Petr Tesarik
6dacd79d28 kexec_file: fix elfcorehdr digest exclusion when CONFIG_CRASH_HOTPLUG=y
Fix the condition to exclude the elfcorehdr segment from the SHA digest
calculation.

The j iterator is an index into the output sha_regions[] array, not into
the input image->segment[] array.  Once it reaches
image->elfcorehdr_index, all subsequent segments are excluded.  Besides,
if the purgatory segment precedes the elfcorehdr segment, the elfcorehdr
may be wrongly included in the calculation.

Link: https://lkml.kernel.org/r/20240805150750.170739-1-petr.tesarik@suse.com
Fixes: f7cc804a9f ("kexec: exclude elfcorehdr from the segment digest")
Signed-off-by: Petr Tesarik <ptesarik@suse.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
Cc: Eric DeVolder <eric_devolder@yahoo.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01 17:59:01 -07:00
Linus Torvalds
c9f016e72b A set of X86 fixes:
- x2apic_disable() clears x2apic_state and x2apic_mode unconditionally,
     even when the state is X2APIC_ON_LOCKED, which prevents the kernel to
     disable it thereby creating inconsistent state.
 
     Reorder the logic so it actually works correctly
 
   - The XSTATE logic for handling LBR is incorrect as it assumes that
     XSAVES supports LBR when the CPU supports LBR. In fact both conditions
     need to be true. Otherwise the enablement of LBR in the IA32_XSS MSR
     fails and subsequently the machine crashes on the next XRSTORS
     operation because IA32_XSS is not initialized.
 
     Cache the XSTATE support bit during init and make the related functions
     use this cached information and the LBR CPU feature bit to cure this.
 
   - Cure a long standing bug in KASLR
 
     KASLR uses the full address space between PAGE_OFFSET and vaddr_end to
     randomize the starting points of the direct map, vmalloc and vmemmap
     regions.  It thereby limits the size of the direct map by using the
     installed memory size plus an extra configurable margin for hot-plug
     memory.  This limitation is done to gain more randomization space
     because otherwise only the holes between the direct map, vmalloc,
     vmemmap and vaddr_end would be usable for randomizing.
 
     The limited direct map size is not exposed to the rest of the kernel, so
     the memory hot-plug and resource management related code paths still
     operate under the assumption that the available address space can be
     determined with MAX_PHYSMEM_BITS.
 
     request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1
     downwards.  That means the first allocation happens past the end of the
     direct map and if unlucky this address is in the vmalloc space, which
     causes high_memory to become greater than VMALLOC_START and consequently
     causes iounmap() to fail for valid ioremap addresses.
 
     Cure this by exposing the end of the direct map via PHYSMEM_END and use
     that for the memory hot-plug and resource management related places
     instead of relying on MAX_PHYSMEM_BITS. In the KASLR case PHYSMEM_END
     maps to a variable which is initialized by the KASLR initialization and
     otherwise it is based on MAX_PHYSMEM_BITS as before.
 
   - Prevent a data leak in mmio_read(). The TDVMCALL exposes the value of
     an initialized variabled on the stack to the VMM. The variable is only
     required as output value, so it does not have to exposed to the VMM in
     the first place.
 
   - Prevent an array overrun in the resource control code on systems with
     Sub-NUMA Clustering enabled because the code failed to adjust the index
     by the number of SNC nodes per L3 cache.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbUUu0THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYodFsEADFgxq2wjnH+VpuaIhLiQIfUa7iVeUl
 bwHAakZRMJ+Cb8BsvaRCMdAWWF+cRdLabAHuh7MRJFFzzdwrVTswnxT9baUBBjEe
 Kd3ZeQOS4AvWxpJNQEDg9r7tYtavmml9ix+Jh0OF+YmXLIweQk5RhDN+ncha07cJ
 0DuPt4ngI24iyAyUX+7gZsRZiwoOm0HqImaRiisaspTbGpNwnrwFQCEioCdwnAv0
 H5S7WTAlsZURCINLBNT+fV5oPjk2E3Ckj/CCJGoG1LYedGUD/44M1Hj0Xsqm4pHF
 Zd0+CuFyYpGqkAuBY6moWOheYP8V2U+yhf9Rtvh8/+h3qxZ/yon5i0ycO/2wMjiF
 0NBomMeKh4PNyefYq8lHWK3kcXphrXH3yv09wVBDdLMXDy98beuS5NScGgza8148
 /nqq0l1uLUyM9TkWg9H+4wW73EzQW1DYIliDU3tC98u+E77kQbyCx+2f0WI2k+ar
 3wy7nYzyEJXl38NUTB+La4xXbhsELcaYQ/Q6scIsWAL+6+KlRb3FNBn+HT+KmOmF
 y702km/28C0uxrLk2OQCjX/zXQtXe2/4aoUzGqFf9atsifa0IBrc8YBzdIDB49Jt
 zz/MOAZTcz4jfyD3sRfYuG2QhBbdTz3f/kd3OryquitdAGozpoeztMIGs1PU2Y6s
 zInlLtUwaosadg==
 =T4i1
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2024-09-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:

 - x2apic_disable() clears x2apic_state and x2apic_mode unconditionally,
   even when the state is X2APIC_ON_LOCKED, which prevents the kernel to
   disable it thereby creating inconsistent state.

   Reorder the logic so it actually works correctly

 - The XSTATE logic for handling LBR is incorrect as it assumes that
   XSAVES supports LBR when the CPU supports LBR. In fact both
   conditions need to be true. Otherwise the enablement of LBR in the
   IA32_XSS MSR fails and subsequently the machine crashes on the next
   XRSTORS operation because IA32_XSS is not initialized.

   Cache the XSTATE support bit during init and make the related
   functions use this cached information and the LBR CPU feature bit to
   cure this.

 - Cure a long standing bug in KASLR

   KASLR uses the full address space between PAGE_OFFSET and vaddr_end
   to randomize the starting points of the direct map, vmalloc and
   vmemmap regions. It thereby limits the size of the direct map by
   using the installed memory size plus an extra configurable margin for
   hot-plug memory. This limitation is done to gain more randomization
   space because otherwise only the holes between the direct map,
   vmalloc, vmemmap and vaddr_end would be usable for randomizing.

   The limited direct map size is not exposed to the rest of the kernel,
   so the memory hot-plug and resource management related code paths
   still operate under the assumption that the available address space
   can be determined with MAX_PHYSMEM_BITS.

   request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1
   downwards. That means the first allocation happens past the end of
   the direct map and if unlucky this address is in the vmalloc space,
   which causes high_memory to become greater than VMALLOC_START and
   consequently causes iounmap() to fail for valid ioremap addresses.

   Cure this by exposing the end of the direct map via PHYSMEM_END and
   use that for the memory hot-plug and resource management related
   places instead of relying on MAX_PHYSMEM_BITS. In the KASLR case
   PHYSMEM_END maps to a variable which is initialized by the KASLR
   initialization and otherwise it is based on MAX_PHYSMEM_BITS as
   before.

 - Prevent a data leak in mmio_read(). The TDVMCALL exposes the value of
   an initialized variabled on the stack to the VMM. The variable is
   only required as output value, so it does not have to exposed to the
   VMM in the first place.

 - Prevent an array overrun in the resource control code on systems with
   Sub-NUMA Clustering enabled because the code failed to adjust the
   index by the number of SNC nodes per L3 cache.

* tag 'x86-urgent-2024-09-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/resctrl: Fix arch_mbm_* array overrun on SNC
  x86/tdx: Fix data leak in mmio_read()
  x86/kaslr: Expose and use the end of the physical memory address space
  x86/fpu: Avoid writing LBR bit to IA32_XSS unless supported
  x86/apic: Make x2apic_disable() work correctly
2024-09-01 14:43:08 -07:00
Linus Torvalds
51859c5aa6 A single fix for rt_mutex:
The deadlock detection code drops into an infinite scheduling loop while
   still holding rt_mutex::wait_lock, which rightfully triggers a
   'scheduling in atomic' warning. Unlock it before that.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmbLKh0THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoSh0D/sHm7YsexyVAXRxtm1vCJ1rE+ditn/K
 X60QuH8EyLa2abUY9xKKKaM3YuE+4juN1dyej6uBcET8n9olqGQn0pyxukTDy3Hq
 Khy7OIAzYfGd7J0kJXA6DHMWAi16e4GduMcDYCw0BwYCQG4D1Ybe/lfsm0mgdcd/
 aLFSD8/+oAK7bIWy9VE4wXUPT51v7iFnwajvcKsOL1c3u2StF6A0zCLbjkU5Di20
 6Ko4f+B8dh3Bh9yIP/uiwHQHdVUjXp5Y9pVuFLemC8BqGVrwMi71jEEFYQva+evU
 UTQ8VCbhxIAcz6pqWYA3P5WeqFIQLAXq5UNJLK+63qm/Wf4eoaXNhK+Nr2fCigYu
 R6T+H5nx2WATATXfPORgfHYgHWwyWaj6nUPK0kaJYUTHFK5Nlg/ir5siqE/Gz9qq
 ldajnvKpheeuu6rsrdHdJrdI84XVOmjeAJ+5A8i7VMv2jEE50txlxXKt5jeVt2yE
 xm2+NABT4Dlycf/56e6OYZUEADQfX1YlvFGBQe1UYpcmOC1nTKaWWeQ1xlA3ZR92
 plUgXLtRSKaZ3vEFQ3L+/1w0Af3P3/mapb+IgTxW/FEt8WAw6UuwBS9gMZDOhTI+
 GZF0EFj8tUmTNDcWkNAu3m3y5Qmp3iVFYBbYXmyJdKVRJbaMu/uq51JnOOIxmaLf
 L6KovDUg8eF3bQ==
 =B3iX
 -----END PGP SIGNATURE-----

Merge tag 'locking-urgent-2024-08-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fix from Thomas Gleixner:
 "A single fix for rt_mutex.

  The deadlock detection code drops into an infinite scheduling loop
  while still holding rt_mutex::wait_lock, which rightfully triggers a
  'scheduling in atomic' warning.

  Unlock it before that"

* tag 'locking-urgent-2024-08-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rtmutex: Drop rt_mutex::wait_lock before scheduling
2024-09-01 14:26:33 -07:00
Martin KaFai Lau
b408473ea0 bpf: Fix a crash when btf_parse_base() returns an error pointer
The pointer returned by btf_parse_base could be an error pointer.
IS_ERR() check is needed before calling btf_free(base_btf).

Fixes: 8646db2389 ("libbpf,bpf: Share BTF relocate-related code with kernel")
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240830012214.1646005-1-martin.lau@linux.dev
2024-08-30 10:34:47 -07:00
Linus Torvalds
3e9bff3bbe vfs-6.11-rc6.fixes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZsxg5QAKCRCRxhvAZXjc
 olSiAQDvFvim4YtMmUDagC3yWTBsf+o3lYdAIuzNE0NtSn4vpAEAl/HVhQCaEDjv
 mcE3jokEsbvyXLnzs78PrY0Heua2mQg=
 =AHAd
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.11-rc6.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:
 "VFS:

   - Ensure that backing files uses file->f_ops->splice_write() for
     splice

  netfs:

   - Revert the removal of PG_private_2 from netfs_release_folio() as
     cephfs still relies on this

   - When AS_RELEASE_ALWAYS is set on a mapping the folio needs to
     always be invalidated during truncation

   - Fix losing untruncated data in a folio by making letting
     netfs_release_folio() return false if the folio is dirty

   - Fix trimming of streaming-write folios in netfs_inval_folio()

   - Reset iterator before retrying a short read

   - Fix interaction of streaming writes with zero-point tracker

  afs:

   - During truncation afs currently calls truncate_setsize() which sets
     i_size, expands the pagecache and truncates it. The first two
     operations aren't needed because they will have already been done.
     So call truncate_pagecache() instead and skip the redundant parts

  overlayfs:

   - Fix checking of the number of allowed lower layers so 500 layers
     can actually be used instead of just 499

   - Add missing '\n' to pr_err() output

   - Pass string to ovl_parse_layer() and thus allow it to be used for
     Opt_lowerdir as well

  pidfd:

   - Revert blocking the creation of pidfds for kthread as apparently
     userspace relies on this. Specifically, it breaks systemd during
     shutdown

  romfs:

   - Fix romfs_read_folio() to use the correct offset with
     folio_zero_tail()"

* tag 'vfs-6.11-rc6.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
  netfs: Fix interaction of streaming writes with zero-point tracker
  netfs: Fix missing iterator reset on retry of short read
  netfs: Fix trimming of streaming-write folios in netfs_inval_folio()
  netfs: Fix netfs_release_folio() to say no if folio dirty
  afs: Fix post-setattr file edit to do truncation correctly
  mm: Fix missing folio invalidation calls during truncation
  ovl: ovl_parse_param_lowerdir: Add missed '\n' for pr_err
  ovl: fix wrong lowerdir number check for parameter Opt_lowerdir
  ovl: pass string to ovl_parse_layer()
  backing-file: convert to using fops->splice_write
  Revert "pidfd: prevent creation of pidfds for kthreads"
  romfs: fix romfs_read_folio()
  netfs, ceph: Partially revert "netfs: Replace PG_fscache by setting folio->private and marking dirty"
2024-08-27 16:57:35 +12:00
Linus Torvalds
d2bafcf224 cgroup: Fixes for v6.11-rc4
Three patches addressing cpuset corner cases.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZskvjQ4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGTRnAQCVJj+pPLO76ofJC51p4TcITsDD37trYHPyxaCB
 zZ7XdAEA82NhGgy+kdlICrsiBYKK10jGDNGkXWicdCI8GmEe1Qo=
 =axit
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.11-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:
 "Three patches addressing cpuset corner cases"

* tag 'cgroup-for-6.11-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/cpuset: Eliminate unncessary sched domains rebuilds in hotplug
  cgroup/cpuset: Clear effective_xcpus on cpus_allowed clearing only if cpus.exclusive not set
  cgroup/cpuset: fix panic caused by partcmd_update
2024-08-24 10:39:18 +08:00
Linus Torvalds
cb2c84b380 workqueue: Fixes for v6.11-rc4
Nothing too interesting. One patch to remove spurious warning and others to
 address static checker warnings.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZskq8g4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGfTVAP42MsAOyrlND+cH/zQpSc8OhGbm3v0gJFnPn4UE
 Y3B4kgD/W68n57MQ5uWh1vHHvsqjizbXfRez1dVJoGqa/q88GQs=
 =Uwdx
 -----END PGP SIGNATURE-----

Merge tag 'wq-for-6.11-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue fixes from Tejun Heo:
 "Nothing too interesting. One patch to remove spurious warning and
  others to address static checker warnings"

* tag 'wq-for-6.11-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Correct declaration of cpu_pwq in struct workqueue_struct
  workqueue: Fix spruious data race in __flush_work()
  workqueue: Remove incorrect "WARN_ON_ONCE(!list_empty(&worker->entry));" from dying worker
  workqueue: Fix UBSAN 'subtraction overflow' error in shift_and_mask()
  workqueue: doc: Fix function name, remove markers
2024-08-24 10:35:57 +08:00
Christian Brauner
232590ea7f
Revert "pidfd: prevent creation of pidfds for kthreads"
This reverts commit 3b5bbe798b.

Eric reported that systemd-shutdown gets broken by blocking the creating
of pidfds for kthreads as older versions seems to rely on being able to
create a pidfd for any process in /proc.

Reported-by: Eric Biggers <ebiggers@kernel.org>
Link: https://lore.kernel.org/r/20240818035818.GA1929@sol.localdomain
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-21 22:32:58 +02:00
Masami Hiramatsu (Google)
bc754cc76d tracing: Fix memory leak in fgraph storage selftest
With ftrace boot-time selftest, kmemleak reported some memory leaks in
the new test case for function graph storage for multiple tracers.

unreferenced object 0xffff888005060080 (size 32):
  comm "swapper/0", pid 1, jiffies 4294676440
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 20 10 06 05 80 88 ff ff  ........ .......
    54 0c 1e 81 ff ff ff ff 00 00 00 00 00 00 00 00  T...............
  backtrace (crc 7c93416c):
    [<000000000238ee6f>] __kmalloc_cache_noprof+0x11f/0x2a0
    [<0000000033d2b6c5>] enter_record+0xe8/0x150
    [<0000000054c38424>] match_records+0x1cd/0x230
    [<00000000c775b63d>] ftrace_set_hash+0xff/0x380
    [<000000007bf7208c>] ftrace_set_filter+0x70/0x90
    [<00000000a5c08dda>] test_graph_storage_multi+0x2e/0xf0
    [<000000006ba028ca>] trace_selftest_startup_function_graph+0x1e8/0x260
    [<00000000a715d3eb>] run_tracer_selftest+0x111/0x190
    [<00000000395cbf90>] register_tracer+0xdf/0x1f0
    [<0000000093e67f7b>] do_one_initcall+0x141/0x3b0
    [<00000000c591b682>] do_initcall_level+0x82/0xa0
    [<000000004e4c6600>] do_initcalls+0x43/0x70
    [<0000000034f3c4e4>] kernel_init_freeable+0x170/0x1f0
    [<00000000c7a5dab2>] kernel_init+0x1a/0x1a0
    [<00000000ea105947>] ret_from_fork+0x3a/0x50
    [<00000000a1932e84>] ret_from_fork_asm+0x1a/0x30
...

This means filter hash allocated for the fixtures are not correctly
released after the test.

Free those hash lists after tests are done and split the loop for
initialize fixture and register fixture for rollback.

Fixes: dd120af2d5 ("ftrace: Add multiple fgraph storage selftest")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/172411539857.28895.13119957560263401102.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-08-21 15:10:33 -04:00
Masami Hiramatsu (Google)
a069a22f39 tracing: fgraph: Fix to add new fgraph_ops to array after ftrace_startup_subops()
Since the register_ftrace_graph() assigns a new fgraph_ops to
fgraph_array before registring it by ftrace_startup_subops(), the new
fgraph_ops can be used in function_graph_enter().

In most cases, it is still OK because those fgraph_ops's hashtable is
already initialized by ftrace_set_filter*() etc.

But if a user registers a new fgraph_ops which does not initialize the
hash list, ftrace_ops_test() in function_graph_enter() causes a NULL
pointer dereference BUG because fgraph_ops->ops.func_hash is NULL.

This can be reproduced by the below commands because function profiler's
fgraph_ops does not initialize the hash list;

 # cd /sys/kernel/tracing
 # echo function_graph > current_tracer
 # echo 1 > function_profile_enabled

To fix this problem, add a new fgraph_ops to fgraph_array after
ftrace_startup_subops(). Thus, until the new fgraph_ops is initialized,
we will see fgraph_stub on the corresponding fgraph_array entry.

Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: bpf <bpf@vger.kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alan Maguire <alan.maguire@oracle.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Guo Ren <guoren@kernel.org>
Link: https://lore.kernel.org/172398528350.293426.8347220120333730248.stgit@devnote2
Fixes: c132be2c4f ("function_graph: Have the instances use their own ftrace_ops for filtering")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-08-21 13:49:59 -04:00
Thomas Gleixner
ea72ce5da2 x86/kaslr: Expose and use the end of the physical memory address space
iounmap() on x86 occasionally fails to unmap because the provided valid
ioremap address is not below high_memory. It turned out that this
happens due to KASLR.

KASLR uses the full address space between PAGE_OFFSET and vaddr_end to
randomize the starting points of the direct map, vmalloc and vmemmap
regions.  It thereby limits the size of the direct map by using the
installed memory size plus an extra configurable margin for hot-plug
memory.  This limitation is done to gain more randomization space
because otherwise only the holes between the direct map, vmalloc,
vmemmap and vaddr_end would be usable for randomizing.

The limited direct map size is not exposed to the rest of the kernel, so
the memory hot-plug and resource management related code paths still
operate under the assumption that the available address space can be
determined with MAX_PHYSMEM_BITS.

request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1
downwards.  That means the first allocation happens past the end of the
direct map and if unlucky this address is in the vmalloc space, which
causes high_memory to become greater than VMALLOC_START and consequently
causes iounmap() to fail for valid ioremap addresses.

MAX_PHYSMEM_BITS cannot be changed for that because the randomization
does not align with address bit boundaries and there are other places
which actually require to know the maximum number of address bits.  All
remaining usage sites of MAX_PHYSMEM_BITS have been analyzed and found
to be correct.

Cure this by exposing the end of the direct map via PHYSMEM_END and use
that for the memory hot-plug and resource management related places
instead of relying on MAX_PHYSMEM_BITS. In the KASLR case PHYSMEM_END
maps to a variable which is initialized by the KASLR initialization and
otherwise it is based on MAX_PHYSMEM_BITS as before.

To prevent future hickups add a check into add_pages() to catch callers
trying to add memory above PHYSMEM_END.

Fixes: 0483e1fa6e ("x86/mm: Implement ASLR for kernel memory regions")
Reported-by: Max Ramanouski <max8rr8@gmail.com>
Reported-by: Alistair Popple <apopple@nvidia.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-By: Max Ramanouski <max8rr8@gmail.com>
Tested-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/87ed6soy3z.ffs@tglx
2024-08-20 13:44:57 +02:00
Linus Torvalds
b0da640826 printk fixup for 6.11-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmbDBfYACgkQUqAMR0iA
 lPL3ohAArEJ46nPdGWXEZ+K78biXlz/F3IXT+FH95YgtpIk0Tha6Jc5xybGerf/N
 91GzWGbFweEFIIHq9i/CeBnmUEYsMocDF2hlmPiCvaqvMl1J6EuXgERUaPWqaQTS
 fPZab7x8MitH64hFGWbMbvt8ZDJXyQaixtkQyA0AoRPMTpiQy0mFWbFIhtN9M+Cx
 dov2l4N9je8X46X7SWDdKNvVEXHPnpWpq5NeMr9FW7yM4Kun3Hdb3Ks58sHS2oLm
 EmPFQ6kNuxpHyXNvfjeE/JdXQZvK2gGOCNS4zykpGVYJJvhmfrNSwR7iGhm0z/Zw
 sFObF46fK2NTkD5UZ9jQK8+uTiOwpiZSka8v55LocLa7gg2e1G7owaRSIMKjeNYT
 GVmcdkgLqdtfKo3D3rM+auWXlP9o+ioqM52HCewWzMXd0HC2nLx28X/66oHbif9U
 qJSjDPTtvlVEfIcbLr0bRX9KrYeqwtXD74zxB+msbi3Z2C/O9CrFfnGaI0h6+8cb
 RwAptjiO8QdbKkL06CW5RjM5ulNqtPmRETziwA01gh5h6AE5oR1PHCf0DM12ulYK
 /gY/rMznZ6qK0G+BYQyRhMgZh5P5KPvL77a7kxknuj4va2s6c2EsnG8u5iYcYAdo
 YHWN6Jad1OPfQyHsqQ7IL+zlQzTPKmuy3PHQcZwBezUPWRY96kI=
 =2wc2
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux

Pull printk fix from Petr Mladek:

 - Do not block printk on non-panic CPUs when they are dumping
   backtraces

* tag 'printk-for-6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  printk/panic: Allow cpu backtraces to be written into ringbuffer during panic
2024-08-19 09:26:35 -07:00
Linus Torvalds
c3f2d783a4 16 hotfixes. All except one are for MM. 10 of these are cc:stable and
the others pertain to post-6.10 issues.
 
 As usual with these merges, singletons and doubletons all over the place,
 no identifiable-by-me theme.  Please see the lovingly curated changelogs
 to get the skinny.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZsFf8wAKCRDdBJ7gKXxA
 jvEUAP97y/sqKD8rQNc0R8fRGSPNPamwyok8RHwohb0JEHovlAD9HsQ9Ad57EpqR
 wBexMxJRFc7Dt73Tu6IkLQ1iNGqABAc=
 =8KNp
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2024-08-17-19-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "16 hotfixes. All except one are for MM. 10 of these are cc:stable and
  the others pertain to post-6.10 issues.

  As usual with these merges, singletons and doubletons all over the
  place, no identifiable-by-me theme. Please see the lovingly curated
  changelogs to get the skinny"

* tag 'mm-hotfixes-stable-2024-08-17-19-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm/migrate: fix deadlock in migrate_pages_batch() on large folios
  alloc_tag: mark pages reserved during CMA activation as not tagged
  alloc_tag: introduce clear_page_tag_ref() helper function
  crash: fix riscv64 crash memory reserve dead loop
  selftests: memfd_secret: don't build memfd_secret test on unsupported arches
  mm: fix endless reclaim on machines with unaccepted memory
  selftests/mm: compaction_test: fix off by one in check_compaction()
  mm/numa: no task_numa_fault() call if PMD is changed
  mm/numa: no task_numa_fault() call if PTE is changed
  mm/vmalloc: fix page mapping if vm_area_alloc_pages() with high order fallback to order 0
  mm/memory-failure: use raw_spinlock_t in struct memory_failure_cpu
  mm: don't account memmap per-node
  mm: add system wide stats items category
  mm: don't account memmap on failure
  mm/hugetlb: fix hugetlb vs. core-mm PT locking
  mseal: fix is_madv_discard()
2024-08-17 19:50:16 -07:00
Linus Torvalds
810996a363 powerpc fixes for 6.11 #2
- Fix crashes on 85xx with some configs since the recent hugepd rework.
 
  - Fix boot warning with hugepages and CONFIG_DEBUG_VIRTUAL on some platforms.
 
  - Don't enable offline cores when changing SMT modes, to match existing
    userspace behaviour.
 
 Thanks to: Christophe Leroy, Dr. David Alan Gilbert, Guenter Roeck, Nysal Jan
 K.A, Shrikanth Hegde, Thomas Gleixner, Tyrel Datwyler.
 -----BEGIN PGP SIGNATURE-----
 
 iQJLBAABCAA1FiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmbBN48XHG1pY2hhZWxA
 ZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYDFhA/7ByodEuDtTZRAhQxJbzTlEMMk
 OdEURo5MqJZo2P9A3G1KKQKUUy1cQwKLcOaCa7nSh3IXHswXEGZK/Do1lgUj8BAx
 BcaTlm6aAgMnxkEXIGMNBCGn54IxA7pQV7TUUdr+3CJU0udtYceej03beWZuQVvN
 DxdoHflNojU+h8AUWEm5KW6X/o8C+DI6rMAP5zW8Xvsbz/QmSSn1frAs+Dgnacyh
 niAToWbW4ibw0LJ8NBDIxIgqDXZHGUY9/KMSAn1WgpERcbY8FUD3PWw2FzJxjqKw
 h/sjDRpFhY7mImZtzTKez2OHMPiq+730OVEmgfoER/smknnIYi/tO4e2r+wA9YS7
 IIpyl42sdTPV6ke1DDT5sUlWq4LjPLobB+2WKwgDkSOnTRjF1/9nf4AVdtwh2cuS
 Y/Sttz3YjtfeSPG3sWnn5HkMbBksMoSSO+Q9BqB2BQAIHWHPDZWwadGhSw1omV7/
 poYoR3KbmomLL39qk49P0thmhhCDhF64j7XN4ESFUK7tFL1BHCZ2vXSI5vIi0CHZ
 z65pJxsid/0oz04abINAsrDOyZTIkPBTDawda4UEHfXpUOOM9iFPfQfcFnJYRCPk
 xiOYAhRj10l7eQeSXOcaP1TXraW+DCs4N5neCaZ0zI/4vwTcrFMn37bB7DVYLjkB
 08vDj12ybMrz51mjCj4=
 =sZ+f
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:

 - Fix crashes on 85xx with some configs since the recent hugepd rework.

 - Fix boot warning with hugepages and CONFIG_DEBUG_VIRTUAL on some
   platforms.

 - Don't enable offline cores when changing SMT modes, to match existing
   userspace behaviour.

Thanks to Christophe Leroy, Dr. David Alan Gilbert, Guenter Roeck, Nysal
Jan K.A, Shrikanth Hegde, Thomas Gleixner, and Tyrel Datwyler.

* tag 'powerpc-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/topology: Check if a core is online
  cpu/SMT: Enable SMT only if a core is online
  powerpc/mm: Fix boot warning with hugepages and CONFIG_DEBUG_VIRTUAL
  powerpc/mm: Fix size of allocated PGDIR
  soc: fsl: qbman: remove unused struct 'cgr_comp'
2024-08-17 19:23:02 -07:00
Linus Torvalds
4a621e2910 A couple of fixes for tracing:
- Prevent a NULL pointer dereference in the error path of RTLA tool
 
 - Fix an infinite loop bug when reading from the ring buffer when closed.
   If there's a thread trying to read the ring buffer and it gets closed
   by another thread, the one reading will go into an infinite loop
   when the buffer is empty instead of exiting back to user space.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZr9fuRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qqV8AQCoAmS7Mov+BLtL1am5HcGvqv60E9IL
 1BlGQAsRYeLmMgD/UjUOXx3PfrQaKt7O479NT7NxOm6vPFA5e7W611M4KQw=
 =QGI+
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:
 "A couple of fixes for tracing:

   - Prevent a NULL pointer dereference in the error path of RTLA tool

   - Fix an infinite loop bug when reading from the ring buffer when
     closed. If there's a thread trying to read the ring buffer and it
     gets closed by another thread, the one reading will go into an
     infinite loop when the buffer is empty instead of exiting back to
     user space"

* tag 'trace-v6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  rtla/osnoise: Prevent NULL dereference in error handling
  tracing: Return from tracing_buffers_read() if the file has been closed
2024-08-16 11:12:29 -07:00
Jinjie Ruan
edb907a613 crash: fix riscv64 crash memory reserve dead loop
On RISCV64 Qemu machine with 512MB memory, cmdline "crashkernel=500M,high"
will cause system stall as below:

	 Zone ranges:
	   DMA32    [mem 0x0000000080000000-0x000000009fffffff]
	   Normal   empty
	 Movable zone start for each node
	 Early memory node ranges
	   node   0: [mem 0x0000000080000000-0x000000008005ffff]
	   node   0: [mem 0x0000000080060000-0x000000009fffffff]
	 Initmem setup node 0 [mem 0x0000000080000000-0x000000009fffffff]
	(stall here)

commit 5d99cadf1568 ("crash: fix x86_32 crash memory reserve dead loop
bug") fix this on 32-bit architecture.  However, the problem is not
completely solved.  If `CRASH_ADDR_LOW_MAX = CRASH_ADDR_HIGH_MAX` on
64-bit architecture, for example, when system memory is equal to
CRASH_ADDR_LOW_MAX on RISCV64, the following infinite loop will also
occur:

	-> reserve_crashkernel_generic() and high is true
	   -> alloc at [CRASH_ADDR_LOW_MAX, CRASH_ADDR_HIGH_MAX] fail
	      -> alloc at [0, CRASH_ADDR_LOW_MAX] fail and repeatedly
	         (because CRASH_ADDR_LOW_MAX = CRASH_ADDR_HIGH_MAX).

As Catalin suggested, do not remove the ",high" reservation fallback to
",low" logic which will change arm64's kdump behavior, but fix it by
skipping the above situation similar to commit d2f32f23190b ("crash: fix
x86_32 crash memory reserve dead loop").

After this patch, it print:
	cannot allocate crashkernel (size:0x1f400000)

Link: https://lkml.kernel.org/r/20240812062017.2674441-1-ruanjinjie@huawei.com
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Dave Young <dyoung@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-15 22:16:16 -07:00
Linus Torvalds
e724918b37 hardening fixes for v6.11-rc4
- gcc-plugins: randstruct: Remove GCC 4.7 or newer requirement
   (Thorsten Blum)
 
 - kallsyms: Clean up interaction with LTO suffixes (Song Liu)
 
 - refcount: Report UAF for refcount_sub_and_test(0) when counter==0
   (Petr Pavlu)
 
 - kunit/overflow: Avoid misallocation of driver name (Ivan Orlov)
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRSPkdeREjth1dHnSE2KwveOeQkuwUCZr5D6wAKCRA2KwveOeQk
 u5dXAQC9ddd3iHqDAWfbCLY41/5K3KByFspVqf8hw2sFK3Uq9wD/eWU0hWFIk1gq
 1hUSb7vExo+oiahYPKIUMx5Zf69hHAk=
 =dmVd
 -----END PGP SIGNATURE-----

Merge tag 'hardening-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardening fixes from Kees Cook:

 - gcc-plugins: randstruct: Remove GCC 4.7 or newer requirement
   (Thorsten Blum)

 - kallsyms: Clean up interaction with LTO suffixes (Song Liu)

 - refcount: Report UAF for refcount_sub_and_test(0) when counter==0
   (Petr Pavlu)

 - kunit/overflow: Avoid misallocation of driver name (Ivan Orlov)

* tag 'hardening-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  kallsyms: Match symbols exactly with CONFIG_LTO_CLANG
  kallsyms: Do not cleanup .llvm.<hash> suffix before sorting symbols
  kunit/overflow: Fix UB in overflow_allocation_test
  gcc-plugins: randstruct: Remove GCC 4.7 or newer requirement
  refcount: Report UAF for refcount_sub_and_test(0) when counter==0
2024-08-15 11:50:07 -07:00
Song Liu
fb6a421fb6 kallsyms: Match symbols exactly with CONFIG_LTO_CLANG
With CONFIG_LTO_CLANG=y, the compiler may add .llvm.<hash> suffix to
function names to avoid duplication. APIs like kallsyms_lookup_name()
and kallsyms_on_each_match_symbol() tries to match these symbol names
without the .llvm.<hash> suffix, e.g., match "c_stop" with symbol
c_stop.llvm.17132674095431275852. This turned out to be problematic
for use cases that require exact match, for example, livepatch.

Fix this by making the APIs to match symbols exactly.

Also cleanup kallsyms_selftests accordingly.

Signed-off-by: Song Liu <song@kernel.org>
Fixes: 8cc32a9bbf ("kallsyms: strip LTO-only suffixes from promoted global functions")
Tested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20240807220513.3100483-3-song@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
2024-08-15 09:33:35 -07:00
Roland Xu
d33d26036a rtmutex: Drop rt_mutex::wait_lock before scheduling
rt_mutex_handle_deadlock() is called with rt_mutex::wait_lock held.  In the
good case it returns with the lock held and in the deadlock case it emits a
warning and goes into an endless scheduling loop with the lock held, which
triggers the 'scheduling in atomic' warning.

Unlock rt_mutex::wait_lock in the dead lock case before issuing the warning
and dropping into the schedule for ever loop.

[ tglx: Moved unlock before the WARN(), removed the pointless comment,
  	massaged changelog, added Fixes tag ]

Fixes: 3d5c9340d1 ("rtmutex: Handle deadlock detection smarter")
Signed-off-by: Roland Xu <mu001999@outlook.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/ME0P300MB063599BEF0743B8FA339C2CECC802@ME0P300MB0635.AUSP300.PROD.OUTLOOK.COM
2024-08-15 15:38:53 +02:00
Linus Torvalds
4ac0f08f44 vfs-6.11-rc4.fixes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZrym4AAKCRCRxhvAZXjc
 oqT3AP9ydoUNavaZcRayH8r3ybvz9+aJGJ6Q7NznFVCk71vn0gD/buLzmq96Muns
 M5DWHbft2AFwK0Rz2nx8j5OXUeHwrQg=
 =HZBL
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:
 "VFS:

   - Fix the name of file lease slab cache. When file leases were split
     out of file locks the name of the file lock slab cache was used for
     the file leases slab cache as well.

   - Fix a type in take_fd() helper.

   - Fix infinite directory iteration for stable offsets in tmpfs.

   - When the icache is pruned all reclaimable inodes are marked with
     I_FREEING and other processes that try to lookup such inodes will
     block.

     But some filesystems like ext4 can trigger lookups in their inode
     evict callback causing deadlocks. Ext4 does such lookups if the
     ea_inode feature is used whereby a separate inode may be used to
     store xattrs.

     Introduce I_LRU_ISOLATING which pins the inode while its pages are
     reclaimed. This avoids inode deletion during inode_lru_isolate()
     avoiding the deadlock and evict is made to wait until
     I_LRU_ISOLATING is done.

  netfs:

   - Fault in smaller chunks for non-large folio mappings for
     filesystems that haven't been converted to large folios yet.

   - Fix the CONFIG_NETFS_DEBUG config option. The config option was
     renamed a short while ago and that introduced two minor issues.
     First, it depended on CONFIG_NETFS whereas it wants to depend on
     CONFIG_NETFS_SUPPORT. The former doesn't exist, while the latter
     does. Second, the documentation for the config option wasn't fixed
     up.

   - Revert the removal of the PG_private_2 writeback flag as ceph is
     using it and fix how that flag is handled in netfs.

   - Fix DIO reads on 9p. A program watching a file on a 9p mount
     wouldn't see any changes in the size of the file being exported by
     the server if the file was changed directly in the source
     filesystem. Fix this by attempting to read the full size specified
     when a DIO read is requested.

   - Fix a NULL pointer dereference bug due to a data race where a
     cachefiles cookies was retired even though it was still in use.
     Check the cookie's n_accesses counter before discarding it.

  nsfs:

   - Fix ioctl declaration for NS_GET_MNTNS_ID from _IO() to _IOR() as
     the kernel is writing to userspace.

  pidfs:

   - Prevent the creation of pidfds for kthreads until we have a
     use-case for it and we know the semantics we want. It also confuses
     userspace why they can get pidfds for kthreads.

  squashfs:

   - Fix an unitialized value bug reported by KMSAN caused by a
     corrupted symbolic link size read from disk. Check that the
     symbolic link size is not larger than expected"

* tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  Squashfs: sanity check symbolic link size
  9p: Fix DIO read through netfs
  vfs: Don't evict inode under the inode lru traversing context
  netfs: Fix handling of USE_PGPRIV2 and WRITE_TO_CACHE flags
  netfs, ceph: Revert "netfs: Remove deprecated use of PG_private_2 as a second writeback flag"
  file: fix typo in take_fd() comment
  pidfd: prevent creation of pidfds for kthreads
  netfs: clean up after renaming FSCACHE_DEBUG config
  libfs: fix infinite directory reads for offset dir
  nsfs: fix ioctl declaration
  fs/netfs/fscache_cookie: add missing "n_accesses" check
  filelock: fix name of file_lease slab cache
  netfs: Fault in smaller chunks for non-large folio mappings
2024-08-14 09:06:28 -07:00
Kyle Huey
100bff2381 perf/bpf: Don't call bpf_overflow_handler() for tracing events
The regressing commit is new in 6.10. It assumed that anytime event->prog
is set bpf_overflow_handler() should be invoked to execute the attached bpf
program. This assumption is false for tracing events, and as a result the
regressing commit broke bpftrace by invoking the bpf handler with garbage
inputs on overflow.

Prior to the regression the overflow handlers formed a chain (of length 0,
1, or 2) and perf_event_set_bpf_handler() (the !tracing case) added
bpf_overflow_handler() to that chain, while perf_event_attach_bpf_prog()
(the tracing case) did not. Both set event->prog. The chain of overflow
handlers was replaced by a single overflow handler slot and a fixed call to
bpf_overflow_handler() when appropriate. This modifies the condition there
to check event->prog->type == BPF_PROG_TYPE_PERF_EVENT, restoring the
previous behavior and fixing bpftrace.

Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Reported-by: Joe Damato <jdamato@fastly.com>
Closes: https://lore.kernel.org/lkml/ZpFfocvyF3KHaSzF@LQ3V64L9R2/
Fixes: f11f10bfa1 ("perf/bpf: Call BPF handler directly, not through overflow machinery")
Cc: stable@vger.kernel.org
Tested-by: Joe Damato <jdamato@fastly.com> # bpftrace
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240813151727.28797-1-jdamato@fastly.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-08-13 10:25:28 -07:00
Ryo Takakura
bcc954c6ca printk/panic: Allow cpu backtraces to be written into ringbuffer during panic
commit 779dbc2e78 ("printk: Avoid non-panic CPUs writing
to ringbuffer") disabled non-panic CPUs to further write messages to
ringbuffer after panicked.

Since the commit, non-panicked CPU's are not allowed to write to
ring buffer after panicked and CPU backtrace which is triggered
after panicked to sample non-panicked CPUs' backtrace no longer
serves its function as it has nothing to print.

Fix the issue by allowing non-panicked CPUs to write into ringbuffer
while CPU backtrace is in flight.

Fixes: 779dbc2e78 ("printk: Avoid non-panic CPUs writing to ringbuffer")
Signed-off-by: Ryo Takakura <takakura@valinux.co.jp>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240812072703.339690-1-takakura@valinux.co.jp
Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-13 14:16:22 +02:00
Yonghong Song
bed2eb964c bpf: Fix a kernel verifier crash in stacksafe()
Daniel Hodges reported a kernel verifier crash when playing with sched-ext.
Further investigation shows that the crash is due to invalid memory access
in stacksafe(). More specifically, it is the following code:

    if (exact != NOT_EXACT &&
        old->stack[spi].slot_type[i % BPF_REG_SIZE] !=
        cur->stack[spi].slot_type[i % BPF_REG_SIZE])
            return false;

The 'i' iterates old->allocated_stack.
If cur->allocated_stack < old->allocated_stack the out-of-bound
access will happen.

To fix the issue add 'i >= cur->allocated_stack' check such that if
the condition is true, stacksafe() should fail. Otherwise,
cur->stack[spi].slot_type[i % BPF_REG_SIZE] memory access is legal.

Fixes: 2793a8b015 ("bpf: exact states comparison for iterator convergence checks")
Cc: Eduard Zingerman <eddyz87@gmail.com>
Reported-by: Daniel Hodges <hodgesd@meta.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240812214847.213612-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-08-12 18:09:48 -07:00
Nysal Jan K.A
6c17ea1f3e cpu/SMT: Enable SMT only if a core is online
If a core is offline then enabling SMT should not online CPUs of
this core. By enabling SMT, what is intended is either changing the SMT
value from "off" to "on" or setting the SMT level (threads per core) from a
lower to higher value.

On PowerPC the ppc64_cpu utility can be used, among other things, to
perform the following functions:

ppc64_cpu --cores-on                # Get the number of online cores
ppc64_cpu --cores-on=X              # Put exactly X cores online
ppc64_cpu --offline-cores=X[,Y,...] # Put specified cores offline
ppc64_cpu --smt={on|off|value}      # Enable, disable or change SMT level

If the user has decided to offline certain cores, enabling SMT should
not online CPUs in those cores. This patch fixes the issue and changes
the behaviour as described, by introducing an arch specific function
topology_is_core_online(). It is currently implemented only for PowerPC.

Fixes: 73c58e7e14 ("powerpc: Add HOTPLUG_SMT support")
Reported-by: Tyrel Datwyler <tyreld@linux.ibm.com>
Closes: https://groups.google.com/g/powerpc-utils-devel/c/wrwVzAAnRlI/m/5KJSoqP4BAAJ
Signed-off-by: Nysal Jan K.A <nysal@linux.ibm.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20240731030126.956210-2-nysal@linux.ibm.com
2024-08-13 10:31:24 +10:00
Christian Brauner
3b5bbe798b
pidfd: prevent creation of pidfds for kthreads
It's currently possible to create pidfds for kthreads but it is unclear
what that is supposed to mean. Until we have use-cases for it and we
figured out what behavior we want block the creation of pidfds for
kthreads.

Link: https://lore.kernel.org/r/20240731-gleis-mehreinnahmen-6bbadd128383@brauner
Fixes: 32fcb426ec ("pid: add pidfd_open()")
Cc: stable@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-12 22:03:26 +02:00
Linus Torvalds
7270e931b5 Updates for time keeping:
- Fix a couple of issues in the NTP code where user supplied values are
     neither sanity checked nor clamped to the operating range. This results
     in integer overflows and eventualy NTP getting out of sync.
 
     According to the history the sanity checks had been removed in favor of
     clamping the values, but the clamping never worked correctly under all
     circumstances. The NTP people asked to not bring the sanity checks back
     as it might break existing applications.
 
     Make the clamping work correctly and add it where it's missing
 
   - If adjtimex() sets the clock it has to trigger the hrtimer subsystem so
     it can adjust and if the clock was set into the future expire timers if
     needed. The caller should provide a bitmask to tell hrtimers which
     clocks have been adjusted. adjtimex() uses not the proper constant and
     uses CLOCK_REALTIME instead, which is 0. So hrtimers adjusts only the
     clocks, but does not check for expired timers, which might make them
     expire really late. Use the proper bitmask constant instead.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAma4wQ0THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoWNmEACMeq/vMoqbbhfgmTK2+XKfUarF5AX8
 61uK/rY6ysO/Qz1P+3K4j+coxhuz2t0ekjIL6htgPE0yU5JR3/VjjUpGIbBLUZfa
 aY9Ciy0OHFyTaoduyLKyiO/O7GyI6j8vMMhhNyQDaK5Zm+pIin18FqW6udg79HYh
 bDkVtCWg27M1zFd9aqRAc1EX+uFfCrSUi+1oc+E3/knDrNFUVwKCznAeDQQZii6Y
 pGmt733o7RRkABSf5T1bNOEVpbMlZowcf7zF3J57otz/lZFuwjRtTdmuG4ha3grs
 B+4FLNRZFEIEFPW0We43gAW1jLNjIL8xgZ6CMUwkUYOGQ21wmMxTOUCwg6/YMa9Y
 vBceijrICOa1EsyO28XqgRkfIvhdoNsp+c5rAN4LcQd5T7F0SoQCn9A71LXpPXgK
 ulnWjAgpt+ovD2+OFX0Ul5ySY04TgPcNVeJfnZeYxpuShlPg0GX+z0RuMl9aLbc3
 y11P0PDJiguZaoUZ8lUU2W6XA+eFEA2ZOqP+L6FZwIaDwutmXSjHR//ZkTcNg4/h
 rIbB8SFsq3BSMo3Ry2p/KMYWoZ1fF3Tm3Qp9/wpiAx1YSTJ6x8LGkHHq5c9qP5ba
 qJWi0vz8dgTGd2ta/xzglvPVWwT08rvrwACHCTcJp3Jq8uvJ27mQbTvZs6p3cFE6
 RkEBGDvEIfADew==
 =EY09
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2024-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull time keeping fixes from Thomas Gleixner:

 - Fix a couple of issues in the NTP code where user supplied values are
   neither sanity checked nor clamped to the operating range. This
   results in integer overflows and eventualy NTP getting out of sync.

   According to the history the sanity checks had been removed in favor
   of clamping the values, but the clamping never worked correctly under
   all circumstances. The NTP people asked to not bring the sanity
   checks back as it might break existing applications.

   Make the clamping work correctly and add it where it's missing

 - If adjtimex() sets the clock it has to trigger the hrtimer subsystem
   so it can adjust and if the clock was set into the future expire
   timers if needed. The caller should provide a bitmask to tell
   hrtimers which clocks have been adjusted.

   adjtimex() uses not the proper constant and uses CLOCK_REALTIME
   instead, which is 0. So hrtimers adjusts only the clocks, but does
   not check for expired timers, which might make them expire really
   late. Use the proper bitmask constant instead.

* tag 'timers-urgent-2024-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping: Fix bogus clock_was_set() invocation in do_adjtimex()
  ntp: Safeguard against time_constant overflow
  ntp: Clamp maxerror and esterror to operating range
2024-08-11 10:15:34 -07:00
Linus Torvalds
56fe0a6a9f Three small fixes for interrupt core and drivers:
- The interrupt core fails to honor caller supplied affinity hints for
       non-managed interrupts and uses the system default affinity on
       startup instead. Set the missing flag in the descriptor to tell the
       core to use the provided affinity.
 
     - Fix a shift out of bounds error in the Xilinx driver
 
     - Handle switching to level trigger correctly in the RISCV APLIC
       driver. It failed to retrigger the interrupt which causes it to
       become stale.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAma4vEUTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoQpoEACcirhCU0x7jfGj7KWJtnx1dko1gG9G
 AN86+1lZaHa63vBysAvvEPFVhrbC9JI09SLFTNYrhFTWk9lZeTr7HC9ZgvH2U/Yp
 YrYci/5PMBZow7rHjJUcohGM25xFppskMwtUnp1udNsPbXQvY/cFkzi/p5xwfB7b
 S4P10UuZTLBiHYDylphIjIQpf2ltQiXDcdxLGeeYnMVdQ4W5sJVqj39YfZmq+Au3
 E2IwDuA6SyPIMuEbs+rxKHNl30QmaGhU4CmzOE6A6bgcZ9u4AbvSf1+3maeOrOQf
 Erq3oMPhKemWXHdeTIZiufOjJZjph2qJfMNSzEoYnOO1edA+I9y5BkirngIwUOKX
 iAl3Oh5f6GwcNuFeVCAW6xr1jMnRDQ3SJUk89wxfgxtZjTVUTjbbtegm97XirhSf
 +QXXgVX9zpPYwbAVdwsCoSYi+Ne9XPj+ylixRKBzx6+4McnAdx3OllyfRhH7Hk53
 BuXGmSdy9/n+093jyNzhdyQ/5U1lL2XrUXoNh79M/duBp6RI0jpet406Ui/Q96VB
 mXKXG/0imRd/WCWR9KDzKjjWdNcToRcsfta7ZzeekJtFIab6e2+G65lIPALB3rNp
 1vNfFEWWTjpHpzN2pmmaRQBbwRBpLUrr89wc2KENuHs7DnQu75O1u5SZLPGwEMEI
 VuBxXQUKGxZkTA==
 =hR0M
 -----END PGP SIGNATURE-----

Merge tag 'irq-urgent-2024-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fixes from Thomas Gleixner:
 "Three small fixes for interrupt core and drivers:

   - The interrupt core fails to honor caller supplied affinity hints
     for non-managed interrupts and uses the system default affinity on
     startup instead. Set the missing flag in the descriptor to tell the
     core to use the provided affinity.

   - Fix a shift out of bounds error in the Xilinx driver

   - Handle switching to level trigger correctly in the RISCV APLIC
     driver. It failed to retrigger the interrupt which causes it to
     become stale"

* tag 'irq-urgent-2024-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/riscv-aplic: Retrigger MSI interrupt on source configuration
  irqchip/xilinx: Fix shift out of bounds
  genirq/irqdesc: Honor caller provided affinity in alloc_desc()
2024-08-11 10:07:52 -07:00
Linus Torvalds
0409cc53c4 dma-mapping fix for Linux 6.11
- avoid a deadlock with dma-debug and netconsole (Rik van Riel)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAma3CHILHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYPMEg/+M/k+Z+KcFgUNBmsyiPckRHAEBwzC+zpHWcaX0uEJ
 xYy3F2JW/+ba2p8GzTAwb5gE1DZTmBp0GC9PFYaolRBhhoeRZnWXimO4OynxFf2b
 vMrqyFixBYG3GeX+pnLFsT5WB+ZoZVknF/Fvxl9RmJjUh8p4KMJw7CLflu4xqsc6
 6lX+IlcV637vZOToFXm2h5todAgqoatz1VhyLekGOL5BEUuDw8QjydqpXd7XAG+T
 S0/rIga0fcmTjn+6+5RUzpcyaVAxy/KVzvHx731kjO/ZUswritxlSydZgtD0Tabj
 z8N/3ge+TGvekpffSZ6K1EmldCypuQu/WRDlwxDx5LQu4DP4vwkUURToggiQPHlQ
 7YP0roOLLfc5zgjQsmpzj/DmymFw+E3bFz6DRw+9f8ftt6rB58ICCO+YXjL4W4aL
 wu5IrUsIwoc5W7nBkrlUQZbRTrTrvl62HbuO1pMIirZ4bntuJVYLyOTJ+n7RwYFl
 TukNu5WlkHnHvnMt96TGW5lVKBTDGz1aUYUju41USyLpYCZYsKYrHiEAdf0WFB8q
 WXprsL6JSSRZ+ukIvucHDdZlBptqaxrLtj3UeALPle05dq12ykG6KOix3FZGVAWA
 0WD6SKUV7Z+Cs+WcCnW2zLNuq3NNTiSRCMSvPmSH7soxu3BLRUxPkwTTthgeGlFx
 DZs=
 =tNDn
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-6.11-2024-08-10' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fix from Christoph Hellwig:

 - avoid a deadlock with dma-debug and netconsole (Rik van Riel)

* tag 'dma-mapping-6.11-2024-08-10' of git://git.infradead.org/users/hch/dma-mapping:
  dma-debug: avoid deadlock between dma debug vs printk and netconsole
2024-08-10 10:19:05 -07:00
Steven Rostedt
d0949cd44a tracing: Return from tracing_buffers_read() if the file has been closed
When running the following:

 # cd /sys/kernel/tracing/
 # echo 1 > events/sched/sched_waking/enable
 # echo 1 > events/sched/sched_switch/enable
 # echo 0 > tracing_on
 # dd if=per_cpu/cpu0/trace_pipe_raw of=/tmp/raw0.dat

The dd task would get stuck in an infinite loop in the kernel. What would
happen is the following:

When ring_buffer_read_page() returns -1 (no data) then a check is made to
see if the buffer is empty (as happens when the page is not full), it will
call wait_on_pipe() to wait until the ring buffer has data. When it is it
will try again to read data (unless O_NONBLOCK is set).

The issue happens when there's a reader and the file descriptor is closed.
The wait_on_pipe() will return when that is the case. But this loop will
continue to try again and wait_on_pipe() will again return immediately and
the loop will continue and never stop.

Simply check if the file was closed before looping and exit out if it is.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20240808235730.78bf63e5@rorschach.local.home
Fixes: 2aa043a55b ("tracing/ring-buffer: Fix wait_on_pipe() race")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-08-09 12:59:35 -04:00
Linus Torvalds
146430a0c2 Probes fixes for v6.11-rc2:
- kprobes: Fix misusing str_has_prefix() parameter order to check symbol
   prefix correctly.
 - bpf: kprobe: remove unused declaring of bpf_kprobe_override.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAma2M6UbHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bGkIH/3ldvZswD2Fl+XA7tvvC
 7DJbeYBiYXDo1AqcpbC9dgoQL4EBZOoyXBMMks2von/Qekrq1wU8wQQTvFEfpz9m
 7RzYYy8tKZa6/RzHf+vfM8yDCMgka3C4NlFyVaohIBOXDKpIhx1cfvmXixPx1Q9S
 9IEdqxWdMrA5FPZH6ks13s+yHqQoAvyN40cmDL9bVETHe4vH4oMABfBjppUzlRcz
 C9fLv7Aw3GTmkwX8mQYeHRG4sntUcqSjn2Ik1uvWizq2yYAIMe7RAbHXP/Wvl01h
 p6U8kSb/Q7nFIdF5cHJy/XMH/2wb/1MtqPzXeNrGGhHgnejDPS+0GRJzk1CDr9sc
 ur0=
 =/tVN
 -----END PGP SIGNATURE-----

Merge tag 'probes-fixes-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull kprobe fixes from Masami Hiramatsu:

 - Fix misusing str_has_prefix() parameter order to check symbol prefix
   correctly

 - bpf: remove unused declaring of bpf_kprobe_override

* tag 'probes-fixes-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  kprobes: Fix to check symbol prefixes correctly
  bpf: kprobe: remove unused declaring of bpf_kprobe_override
2024-08-09 09:43:46 -07:00
Linus Torvalds
2124d84db2 module: make waiting for a concurrent module loader interruptible
The recursive aes-arm-bs module load situation reported by Russell King
is getting fixed in the crypto layer, but this in the meantime fixes the
"recursive load hangs forever" by just making the waiting for the first
module load be interruptible.

This should now match the old behavior before commit 9b9879fc03
("modules: catch concurrent module loads, treat them as idempotent"),
which used the different "wait for module to be ready" code in
module_patient_check_exists().

End result: a recursive module load will still block, but now a signal
will interrupt it and fail the second module load, at which point the
first module will successfully complete loading.

Fixes: 9b9879fc03 ("modules: catch concurrent module loads, treat them as idempotent")
Cc: Russell King <linux@armlinux.org.uk>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-08-09 08:33:28 -07:00
Linus Torvalds
9466b6ae6b tracing fixes for v6.11:
- Have reading of event format files test if the meta data still exists.
   When a event is freed, a flag (EVENT_FILE_FL_FREED) in the meta data is
   set to state that it is to prevent any new references to it from happening
   while waiting for existing references to close. When the last reference
   closes, the meta data is freed. But the "format" was missing a check to
   this flag (along with some other files) that allowed new references to
   happen, and a use-after-free bug to occur.
 
 - Have the trace event meta data use the refcount infrastructure instead
   of relying on its own atomic counters.
 
 - Have tracefs inodes use alloc_inode_sb() for allocation instead of
   using kmem_cache_alloc() directly.
 
 - Have eventfs_create_dir() return an ERR_PTR instead of NULL as
   the callers expect a real object or an ERR_PTR.
 
 - Have release_ei() use call_srcu() and not call_rcu() as all the
   protection is on SRCU and not RCU.
 
 - Fix ftrace_graph_ret_addr() to use the task passed in and not current.
 
 - Fix overflow bug in get_free_elt() where the counter can overflow
   the integer and cause an infinite loop.
 
 - Remove unused function ring_buffer_nr_pages()
 
 - Have tracefs freeing use the inode RCU infrastructure instead of
   creating its own. When the kernel had randomize structure fields
   enabled, the rcu field of the tracefs_inode was overlapping the
   rcu field of the inode structure, and corrupting it. Instead,
   use the destroy_inode() callback to do the initial cleanup of
   the code, and then have free_inode() free it.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZrTvXxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qu39AP9ze6ELpShDrxbXhf0adbNqG2IXMepa
 MMLqfq8tU8E/vAEAuZXJ6rKXeGvKeONa06ocvWJ0dpb2cy/n4hmx+KtM5gI=
 =Pkh4
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Have reading of event format files test if the metadata still exists.

   When a event is freed, a flag (EVENT_FILE_FL_FREED) in the metadata
   is set to state that it is to prevent any new references to it from
   happening while waiting for existing references to close. When the
   last reference closes, the metadata is freed. But the "format" was
   missing a check to this flag (along with some other files) that
   allowed new references to happen, and a use-after-free bug to occur.

 - Have the trace event meta data use the refcount infrastructure
   instead of relying on its own atomic counters.

 - Have tracefs inodes use alloc_inode_sb() for allocation instead of
   using kmem_cache_alloc() directly.

 - Have eventfs_create_dir() return an ERR_PTR instead of NULL as the
   callers expect a real object or an ERR_PTR.

 - Have release_ei() use call_srcu() and not call_rcu() as all the
   protection is on SRCU and not RCU.

 - Fix ftrace_graph_ret_addr() to use the task passed in and not
   current.

 - Fix overflow bug in get_free_elt() where the counter can overflow the
   integer and cause an infinite loop.

 - Remove unused function ring_buffer_nr_pages()

 - Have tracefs freeing use the inode RCU infrastructure instead of
   creating its own.

   When the kernel had randomize structure fields enabled, the rcu field
   of the tracefs_inode was overlapping the rcu field of the inode
   structure, and corrupting it. Instead, use the destroy_inode()
   callback to do the initial cleanup of the code, and then have
   free_inode() free it.

* tag 'trace-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracefs: Use generic inode RCU for synchronizing freeing
  ring-buffer: Remove unused function ring_buffer_nr_pages()
  tracing: Fix overflow in get_free_elt()
  function_graph: Fix the ret_stack used by ftrace_graph_ret_addr()
  eventfs: Use SRCU for freeing eventfs_inodes
  eventfs: Don't return NULL in eventfs_create_dir()
  tracefs: Fix inode allocation
  tracing: Use refcount for trace_event_file reference counter
  tracing: Have format file honor EVENT_FILE_FL_FREED
2024-08-08 13:32:59 -07:00
Linus Torvalds
b3f5620f76 bcachefs fixes for 6.11-rc3
Assorted little stuff:
 - lockdep fixup for lockdep_set_notrack_class()
 - we can now remove a device when using erasure coding without
   deadlocking, though we still hit other issues
 - the "allocator stuck" timeout is now configurable, and messages are
   ratelimited; default timeout has been increased from 10 seconds to 30
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAma05JwACgkQE6szbY3K
 bnYE+Q//VJ6R/UxDoxjk8zgftRcdgwXod6U+/E0Kj3ZBKLYXkcGaWWmmGMkFafBp
 eL7Y3wtHSKiMsHYX9KEdFUZFLe1KI4c16RgNIXk9nwkF+3/+8pEDHKPFuoGHJH3O
 HComHGqwVg8Zx2jRNvEkvQ980iH7OBGhCjMFXhJ3xbMGLdw91TQQi49a+Q/vz7QT
 y3Cl1dgX5xBl7fqKefsYa+X6mpWi4/6t60vJvatI+bvDfznjI6jN3qGVLlQye7tC
 6VbJAjHsPPyNMlWa99UaHqDdaM325zR2ES0bsfHd8Up4iAwO8OgjzYQxpYTgi51i
 6DTiGEOV2S8gF+Rnprnbzsnau0hEvrtQY2Ub85TCIGbZJa8b+aDIlq9k8jF36O2E
 2CUTleQ/E129RxXpkZGsVRpNmemdCi6rHAcluaFEgezX4FJH8BVOwQQq2Xz7rd7E
 3ZP5iAWmX0IgOL0VOCP/ZXl/lEMwSk0VAED3jEbT7f2K7rU9nXDO2bIEx1wXDCm1
 b32kvmUi2FBjqLHSqvAPEb52tvvZuliMUY7z9dEx+AX9AVC9kGE+amGexosKb/LY
 nWzey+D0cKHtgbkMFrCClkpg75Tnt9ISJbad53+5qhN8an/a71djdj8Zk0UQnQjv
 6Amv4Ns1lDo3XGC1QtYkF5HqiWaupbUXAftptpS4Av4X1zZEQIc=
 =q1dD
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2024-08-08' of git://evilpiepirate.org/bcachefs

Pull bcachefs fixes from Kent Overstreet:
 "Assorted little stuff:

   - lockdep fixup for lockdep_set_notrack_class()

   - we can now remove a device when using erasure coding without
     deadlocking, though we still hit other issues

   - the 'allocator stuck' timeout is now configurable, and messages are
     ratelimited. The default timeout has been increased from 10 seconds
     to 30"

* tag 'bcachefs-2024-08-08' of git://evilpiepirate.org/bcachefs:
  bcachefs: Use bch2_wait_on_allocator() in btree node alloc path
  bcachefs: Make allocator stuck timeout configurable, ratelimit messages
  bcachefs: Add missing path_traverse() to btree_iter_next_node()
  bcachefs: ec should not allocate from ro devs
  bcachefs: Improved allocator debugging for ec
  bcachefs: Add missing bch2_trans_begin() call
  bcachefs: Add a comment for bucket helper types
  bcachefs: Don't rely on implicit unsigned -> signed integer conversion
  lockdep: Fix lockdep_set_notrack_class() for CONFIG_LOCK_STAT
  bcachefs: Fix double free of ca->buckets_nouse
2024-08-08 13:27:31 -07:00
Linus Torvalds
cb5b81bc9a module: warn about excessively long module waits
Russell King reported that the arm cbc(aes) crypto module hangs when
loaded, and Herbert Xu bisected it to commit 9b9879fc03 ("modules:
catch concurrent module loads, treat them as idempotent"), and noted:

 "So what's happening here is that the first modprobe tries to load a
  fallback CBC implementation, in doing so it triggers a load of the
  exact same module due to module aliases.

  IOW we're loading aes-arm-bs which provides cbc(aes). However, this
  needs a fallback of cbc(aes) to operate, which is made out of the
  generic cbc module + any implementation of aes, or ecb(aes). The
  latter happens to also be provided by aes-arm-cb so that's why it
  tries to load the same module again"

So loading the aes-arm-bs module ends up wanting to recursively load
itself, and the recursive load then ends up waiting for the original
module load to complete.

This is a regression, in that it used to be that we just tried to load
the module multiple times, and then as we went on to install it the
second time we would instead just error out because the module name
already existed.

That is actually also exactly what the original "catch concurrent loads"
patch did in commit 9828ed3f69 ("module: error out early on concurrent
load of the same module file"), but it turns out that it ends up being
racy, in that erroring out before the module has been fully initialized
will cause failures in dependent module loading.

See commit ac2263b588 (which was the revert of that "error out early")
commit for details about why erroring out before the module has been
initialized is actually fundamentally racy.

Now, for the actual recursive module load (as opposed to just
concurrently loading the same module twice), the race is not an issue.

At the same time it's hard for the kernel to see that this is recursion,
because the module load is always done from a usermode helper, so the
recursion is not some simple callchain within the kernel.

End result: this is not the real fix, but this at least adds a warning
for the situation (admittedly much too late for all the debugging pain
that Russell and Herbert went through) and if we can come to a
resolution on how to detect the recursion properly, this re-organizes
the code to make that easier.

Link: https://lore.kernel.org/all/ZrFHLqvFqhzykuYw@shell.armlinux.org.uk/
Reported-by: Russell King <linux@armlinux.org.uk>
Debugged-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-08-08 12:29:40 -07:00
Linus Torvalds
660e4b18a7 9 hotfixes. 5 are cc:stable, 4 either pertain to post-6.10 material or
aren't considered necessary for earlier kernels.  5 are MM and 4 are
 non-MM.  No identifiable theme here - please see the individual changelogs.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZrQhyAAKCRDdBJ7gKXxA
 jvLLAP46sQ/HspAbx+5JoeKBTiX6XW4Hfd+MAk++EaTAyAhnxQD+Mfq7rPOIHm/G
 wiXPVvLO8FEx0lbq06rnXvdotaWFrQg=
 =mlE4
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2024-08-07-18-32' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "Nine hotfixes. Five are cc:stable, the others either pertain to
  post-6.10 material or aren't considered necessary for earlier kernels.

  Five are MM and four are non-MM. No identifiable theme here - please
  see the individual changelogs"

* tag 'mm-hotfixes-stable-2024-08-07-18-32' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  padata: Fix possible divide-by-0 panic in padata_mt_helper()
  mailmap: update entry for David Heidelberg
  memcg: protect concurrent access to mem_cgroup_idr
  mm: shmem: fix incorrect aligned index when checking conflicts
  mm: shmem: avoid allocating huge pages larger than MAX_PAGECACHE_ORDER for shmem
  mm: list_lru: fix UAF for memory cgroup
  kcov: properly check for softirq context
  MAINTAINERS: Update LTP members and web
  selftests: mm: add s390 to ARCH check
2024-08-08 07:32:20 -07:00
Waiman Long
6d45e1c948 padata: Fix possible divide-by-0 panic in padata_mt_helper()
We are hit with a not easily reproducible divide-by-0 panic in padata.c at
bootup time.

  [   10.017908] Oops: divide error: 0000 1 PREEMPT SMP NOPTI
  [   10.017908] CPU: 26 PID: 2627 Comm: kworker/u1666:1 Not tainted 6.10.0-15.el10.x86_64 #1
  [   10.017908] Hardware name: Lenovo ThinkSystem SR950 [7X12CTO1WW]/[7X12CTO1WW], BIOS [PSE140J-2.30] 07/20/2021
  [   10.017908] Workqueue: events_unbound padata_mt_helper
  [   10.017908] RIP: 0010:padata_mt_helper+0x39/0xb0
    :
  [   10.017963] Call Trace:
  [   10.017968]  <TASK>
  [   10.018004]  ? padata_mt_helper+0x39/0xb0
  [   10.018084]  process_one_work+0x174/0x330
  [   10.018093]  worker_thread+0x266/0x3a0
  [   10.018111]  kthread+0xcf/0x100
  [   10.018124]  ret_from_fork+0x31/0x50
  [   10.018138]  ret_from_fork_asm+0x1a/0x30
  [   10.018147]  </TASK>

Looking at the padata_mt_helper() function, the only way a divide-by-0
panic can happen is when ps->chunk_size is 0.  The way that chunk_size is
initialized in padata_do_multithreaded(), chunk_size can be 0 when the
min_chunk in the passed-in padata_mt_job structure is 0.

Fix this divide-by-0 panic by making sure that chunk_size will be at least
1 no matter what the input parameters are.

Link: https://lkml.kernel.org/r/20240806174647.1050398-1-longman@redhat.com
Fixes: 004ed42638 ("padata: add basic support for multithreaded jobs")
Signed-off-by: Waiman Long <longman@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Waiman Long <longman@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-07 18:33:56 -07:00
Andrey Konovalov
7d4df2dad3 kcov: properly check for softirq context
When collecting coverage from softirqs, KCOV uses in_serving_softirq() to
check whether the code is running in the softirq context.  Unfortunately,
in_serving_softirq() is > 0 even when the code is running in the hardirq
or NMI context for hardirqs and NMIs that happened during a softirq.

As a result, if a softirq handler contains a remote coverage collection
section and a hardirq with another remote coverage collection section
happens during handling the softirq, KCOV incorrectly detects a nested
softirq coverate collection section and prints a WARNING, as reported by
syzbot.

This issue was exposed by commit a7f3813e58 ("usb: gadget: dummy_hcd:
Switch to hrtimer transfer scheduler"), which switched dummy_hcd to using
hrtimer and made the timer's callback be executed in the hardirq context.

Change the related checks in KCOV to account for this behavior of
in_serving_softirq() and make KCOV ignore remote coverage collection
sections in the hardirq and NMI contexts.

This prevents the WARNING printed by syzbot but does not fix the inability
of KCOV to collect coverage from the __usb_hcd_giveback_urb when dummy_hcd
is in use (caused by a7f3813e58); a separate patch is required for that.

Link: https://lkml.kernel.org/r/20240729022158.92059-1-andrey.konovalov@linux.dev
Fixes: 5ff3b30ab5 ("kcov: collect coverage from interrupts")
Signed-off-by: Andrey Konovalov <andreyknvl@gmail.com>
Reported-by: syzbot+2388cdaeb6b10f0c13ac@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=2388cdaeb6b10f0c13ac
Acked-by: Marco Elver <elver@google.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Aleksandr Nogikh <nogikh@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Marcello Sylvester Bauer <sylv@sylv.io>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-07 18:33:56 -07:00
Jianhui Zhou
58f7e4d7ba ring-buffer: Remove unused function ring_buffer_nr_pages()
Because ring_buffer_nr_pages() is not an inline function and user accesses
buffer->buffers[cpu]->nr_pages directly, the function ring_buffer_nr_pages
is removed.

Signed-off-by: Jianhui Zhou <912460177@qq.com>
Link: https://lore.kernel.org/tencent_F4A7E9AB337F44E0F4B858D07D19EF460708@qq.com
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-08-07 20:26:44 -04:00
Tze-nan Wu
bcf86c01ca tracing: Fix overflow in get_free_elt()
"tracing_map->next_elt" in get_free_elt() is at risk of overflowing.

Once it overflows, new elements can still be inserted into the tracing_map
even though the maximum number of elements (`max_elts`) has been reached.
Continuing to insert elements after the overflow could result in the
tracing_map containing "tracing_map->max_size" elements, leaving no empty
entries.
If any attempt is made to insert an element into a full tracing_map using
`__tracing_map_insert()`, it will cause an infinite loop with preemption
disabled, leading to a CPU hang problem.

Fix this by preventing any further increments to "tracing_map->next_elt"
once it reaches "tracing_map->max_elt".

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 08d43a5fa0 ("tracing: Add lock-free tracing_map")
Co-developed-by: Cheng-Jui Wang <cheng-jui.wang@mediatek.com>
Link: https://lore.kernel.org/20240805055922.6277-1-Tze-nan.Wu@mediatek.com
Signed-off-by: Cheng-Jui Wang <cheng-jui.wang@mediatek.com>
Signed-off-by: Tze-nan Wu <Tze-nan.Wu@mediatek.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-08-07 20:23:12 -04:00
Petr Pavlu
604b72b325 function_graph: Fix the ret_stack used by ftrace_graph_ret_addr()
When ftrace_graph_ret_addr() is invoked to convert a found stack return
address to its original value, the function can end up producing the
following crash:

[   95.442712] BUG: kernel NULL pointer dereference, address: 0000000000000028
[   95.442720] #PF: supervisor read access in kernel mode
[   95.442724] #PF: error_code(0x0000) - not-present page
[   95.442727] PGD 0 P4D 0-
[   95.442731] Oops: Oops: 0000 [#1] PREEMPT SMP PTI
[   95.442736] CPU: 1 UID: 0 PID: 2214 Comm: insmod Kdump: loaded Tainted: G           OE K    6.11.0-rc1-default #1 67c62a3b3720562f7e7db5f11c1fdb40b7a2857c
[   95.442747] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE, [K]=LIVEPATCH
[   95.442750] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
[   95.442754] RIP: 0010:ftrace_graph_ret_addr+0x42/0xc0
[   95.442766] Code: [...]
[   95.442773] RSP: 0018:ffff979b80ff7718 EFLAGS: 00010006
[   95.442776] RAX: ffffffff8ca99b10 RBX: ffff979b80ff7760 RCX: ffff979b80167dc0
[   95.442780] RDX: ffffffff8ca99b10 RSI: ffff979b80ff7790 RDI: 0000000000000005
[   95.442783] RBP: 0000000000000001 R08: 0000000000000005 R09: 0000000000000000
[   95.442786] R10: 0000000000000005 R11: 0000000000000000 R12: ffffffff8e9491e0
[   95.442790] R13: ffffffff8d6f70f0 R14: ffff979b80167da8 R15: ffff979b80167dc8
[   95.442793] FS:  00007fbf83895740(0000) GS:ffff8a0afdd00000(0000) knlGS:0000000000000000
[   95.442797] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   95.442800] CR2: 0000000000000028 CR3: 0000000005070002 CR4: 0000000000370ef0
[   95.442806] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   95.442809] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   95.442816] Call Trace:
[   95.442823]  <TASK>
[   95.442896]  unwind_next_frame+0x20d/0x830
[   95.442905]  arch_stack_walk_reliable+0x94/0xe0
[   95.442917]  stack_trace_save_tsk_reliable+0x7d/0xe0
[   95.442922]  klp_check_and_switch_task+0x55/0x1a0
[   95.442931]  task_call_func+0xd3/0xe0
[   95.442938]  klp_try_switch_task.part.5+0x37/0x150
[   95.442942]  klp_try_complete_transition+0x79/0x2d0
[   95.442947]  klp_enable_patch+0x4db/0x890
[   95.442960]  do_one_initcall+0x41/0x2e0
[   95.442968]  do_init_module+0x60/0x220
[   95.442975]  load_module+0x1ebf/0x1fb0
[   95.443004]  init_module_from_file+0x88/0xc0
[   95.443010]  idempotent_init_module+0x190/0x240
[   95.443015]  __x64_sys_finit_module+0x5b/0xc0
[   95.443019]  do_syscall_64+0x74/0x160
[   95.443232]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   95.443236] RIP: 0033:0x7fbf82f2c709
[   95.443241] Code: [...]
[   95.443247] RSP: 002b:00007fffd5ea3b88 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[   95.443253] RAX: ffffffffffffffda RBX: 000056359c48e750 RCX: 00007fbf82f2c709
[   95.443257] RDX: 0000000000000000 RSI: 000056356ed4efc5 RDI: 0000000000000003
[   95.443260] RBP: 000056356ed4efc5 R08: 0000000000000000 R09: 00007fffd5ea3c10
[   95.443263] R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000
[   95.443267] R13: 000056359c48e6f0 R14: 0000000000000000 R15: 0000000000000000
[   95.443272]  </TASK>
[   95.443274] Modules linked in: [...]
[   95.443385] Unloaded tainted modules: intel_uncore_frequency(E):1 isst_if_common(E):1 skx_edac(E):1
[   95.443414] CR2: 0000000000000028

The bug can be reproduced with kselftests:

 cd linux/tools/testing/selftests
 make TARGETS='ftrace livepatch'
 (cd ftrace; ./ftracetest test.d/ftrace/fgraph-filter.tc)
 (cd livepatch; ./test-livepatch.sh)

The problem is that ftrace_graph_ret_addr() is supposed to operate on the
ret_stack of a selected task but wrongly accesses the ret_stack of the
current task. Specifically, the above NULL dereference occurs when
task->curr_ret_stack is non-zero, but current->ret_stack is NULL.

Correct ftrace_graph_ret_addr() to work with the right ret_stack.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reported-by: Miroslav Benes <mbenes@suse.cz>
Link: https://lore.kernel.org/20240803131211.17255-1-petr.pavlu@suse.com
Fixes: 7aa1eaef9f ("function_graph: Allow multiple users to attach to function graph")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-08-07 20:20:30 -04:00
Steven Rostedt
6e2fdceffd tracing: Use refcount for trace_event_file reference counter
Instead of using an atomic counter for the trace_event_file reference
counter, use the refcount interface. It has various checks to make sure
the reference counting is correct, and will warn if it detects an error
(like refcount_inc() on '0').

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20240726144208.687cce24@rorschach.local.home
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-08-07 18:12:46 -04:00
Steven Rostedt
b156040869 tracing: Have format file honor EVENT_FILE_FL_FREED
When eventfs was introduced, special care had to be done to coordinate the
freeing of the file meta data with the files that are exposed to user
space. The file meta data would have a ref count that is set when the file
is created and would be decremented and freed after the last user that
opened the file closed it. When the file meta data was to be freed, it
would set a flag (EVENT_FILE_FL_FREED) to denote that the file is freed,
and any new references made (like new opens or reads) would fail as it is
marked freed. This allowed other meta data to be freed after this flag was
set (under the event_mutex).

All the files that were dynamically created in the events directory had a
pointer to the file meta data and would call event_release() when the last
reference to the user space file was closed. This would be the time that it
is safe to free the file meta data.

A shortcut was made for the "format" file. It's i_private would point to
the "call" entry directly and not point to the file's meta data. This is
because all format files are the same for the same "call", so it was
thought there was no reason to differentiate them.  The other files
maintain state (like the "enable", "trigger", etc). But this meant if the
file were to disappear, the "format" file would be unaware of it.

This caused a race that could be trigger via the user_events test (that
would create dynamic events and free them), and running a loop that would
read the user_events format files:

In one console run:

 # cd tools/testing/selftests/user_events
 # while true; do ./ftrace_test; done

And in another console run:

 # cd /sys/kernel/tracing/
 # while true; do cat events/user_events/__test_event/format; done 2>/dev/null

With KASAN memory checking, it would trigger a use-after-free bug report
(which was a real bug). This was because the format file was not checking
the file's meta data flag "EVENT_FILE_FL_FREED", so it would access the
event that the file meta data pointed to after the event was freed.

After inspection, there are other locations that were found to not check
the EVENT_FILE_FL_FREED flag when accessing the trace_event_file. Add a
new helper function: event_file_file() that will make sure that the
event_mutex is held, and will return NULL if the trace_event_file has the
EVENT_FILE_FL_FREED flag set. Have the first reference of the struct file
pointer use event_file_file() and check for NULL. Later uses can still use
the event_file_data() helper function if the event_mutex is still held and
was not released since the event_file_file() call.

Link: https://lore.kernel.org/all/20240719204701.1605950-1-minipli@grsecurity.net/

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers   <mathieu.desnoyers@efficios.com>
Cc: Ajay Kaher <ajay.kaher@broadcom.com>
Cc: Ilkka Naulapää    <digirigawa@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al   Viro <viro@zeniv.linux.org.uk>
Cc: Dan Carpenter   <dan.carpenter@linaro.org>
Cc: Beau Belgrave <beaub@linux.microsoft.com>
Cc: Florian Fainelli  <florian.fainelli@broadcom.com>
Cc: Alexey Makhalov    <alexey.makhalov@broadcom.com>
Cc: Vasavi Sirnapalli    <vasavi.sirnapalli@broadcom.com>
Link: https://lore.kernel.org/20240730110657.3b69d3c1@gandalf.local.home
Fixes: b63db58e2f ("eventfs/tracing: Add callback for release of an eventfs_inode")
Reported-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-08-07 18:12:46 -04:00