There are three sets of updates for 5.18 in the asm-generic tree:
- The set_fs()/get_fs() infrastructure gets removed for good. This
was already gone from all major architectures, but now we can
finally remove it everywhere, which loses some particularly
tricky and error-prone code.
There is a small merge conflict against a parisc cleanup, the
solution is to use their new version.
- The nds32 architecture ends its tenure in the Linux kernel. The
hardware is still used and the code is in reasonable shape, but
the mainline port is not actively maintained any more, as all
remaining users are thought to run vendor kernels that would never
be updated to a future release.
There are some obvious conflicts against changes to the removed
files.
- A series from Masahiro Yamada cleans up some of the uapi header
files to pass the compile-time checks.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEo6/YBQwIrVS28WGKmmx57+YAGNkFAmI69BsACgkQmmx57+YA
GNn/zA//f4d5VTT0ThhRxRWTu9BdThGHoB8TUcY7iOhbsWu0X/913NItRC3UeWNl
IdmisaXgVtirg1dcC2pWUmrcHdoWOCEGfK4+Zr2NhSWfuZDWvODHK9pGWk4WLnhe
cQgUNBvIuuAMryGtrOBwHPO4TpfCyy2ioeVP36ZfcsWXdDxTrqfaq/56mk3sxIP6
sUTk1UEjut9NG4C9xIIvcSU50R3l6LryQE/H9kyTLtaSvfvTOvprcVYCq0GPmSzo
DtQ1Wwa9zbJ+4EqoMiP5RrgQwWvOTg2iRByLU8ytwlX3e/SEF0uihvMv1FQbL8zG
G8RhGUOKQSEhaBfc3lIkm8GpOVPh0uHzB6zhn7daVmAWtazRD2Nu59BMjipa+ims
a8Z58iHH7jRAnKeEkVZqXKb1CEiUxaQx/IeVPzN4QlwMhDtwrI76LY7ZJ1zCqTGY
ENG0yRLav1XselYBslOYXGtOEWcY5EZPWqLyWbp4P9vz2g0Fe0gZxoIOvPmNQc89
QnfXpCt7vm/DGkyO255myu08GOLeMkisVqUIzLDB9avlym5mri7T7vk9abBa2YyO
CRpTL5gl1/qKPWuH1UI5mvhT+sbbBE2SUHSuy84btns39ZKKKynwCtdu+hSQkKLE
h9pV30Gf1cLTD4JAE0RWlUgOmbBLVp34loTOexQj4MrLM1noOnw=
=vtCN
-----END PGP SIGNATURE-----
Merge tag 'asm-generic-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic updates from Arnd Bergmann:
"There are three sets of updates for 5.18 in the asm-generic tree:
- The set_fs()/get_fs() infrastructure gets removed for good.
This was already gone from all major architectures, but now we can
finally remove it everywhere, which loses some particularly tricky
and error-prone code. There is a small merge conflict against a
parisc cleanup, the solution is to use their new version.
- The nds32 architecture ends its tenure in the Linux kernel.
The hardware is still used and the code is in reasonable shape, but
the mainline port is not actively maintained any more, as all
remaining users are thought to run vendor kernels that would never
be updated to a future release.
- A series from Masahiro Yamada cleans up some of the uapi header
files to pass the compile-time checks"
* tag 'asm-generic-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: (27 commits)
nds32: Remove the architecture
uaccess: remove CONFIG_SET_FS
ia64: remove CONFIG_SET_FS support
sh: remove CONFIG_SET_FS support
sparc64: remove CONFIG_SET_FS support
lib/test_lockup: fix kernel pointer check for separate address spaces
uaccess: generalize access_ok()
uaccess: fix type mismatch warnings from access_ok()
arm64: simplify access_ok()
m68k: fix access_ok for coldfire
MIPS: use simpler access_ok()
MIPS: Handle address errors for accesses above CPU max virtual user address
uaccess: add generic __{get,put}_kernel_nofault
nios2: drop access_ok() check from __put_user()
x86: use more conventional access_ok() definition
x86: remove __range_not_ok()
sparc64: add __{get,put}_kernel_nofault()
nds32: fix access_ok() checks in get/put_user
uaccess: fix nios2 and microblaze get_user_8()
sparc64: fix building assembly files
...
Pull cgroup updates from Tejun Heo:
"All trivial cleanups without meaningful behavior changes"
* 'for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: cleanup comments
cgroup: Fix cgroup_can_fork() and cgroup_post_fork() kernel-doc comment
cgroup: rstat: retrieve current bstat to delta directly
cgroup: rstat: use same convention to assign cgroup_base_stat
Pull workqueue updates from Tejun Heo:
"Nothing major. Just follow-up cleanups from Lai after the earlier
synchronization simplification"
* 'for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Convert the type of pool->nr_running to int
workqueue: Use wake_up_worker() in wq_worker_sleeping() instead of open code
workqueue: Change the comments of the synchronization about the idle_list
workqueue: Remove the mb() pair between wq_worker_sleeping() and insert_work()
- New user_events interface. User space can register an event with the kernel
describing the format of the event. Then it will receive a byte in a page
mapping that it can check against. A privileged task can then enable that
event like any other event, which will change the mapped byte to true,
telling the user space application to start writing the event to the
tracing buffer.
- Add new "ftrace_boot_snapshot" kernel command line parameter. When set,
the tracing buffer will be saved in the snapshot buffer at boot up when
the kernel hands things over to user space. This will keep the traces that
happened at boot up available even if user space boot up has tracing as
well.
- Have TRACE_EVENT_ENUM() also update trace event field type descriptions.
Thus if a static array defines its size with an enum, the user space trace
event parsers can still know how to parse that array.
- Add new TRACE_CUSTOM_EVENT() macro. This acts the same as the
TRACE_EVENT() macro, but will attach to an existing tracepoint. This will
make one tracepoint be able to trace different content and not be stuck at
only what the original TRACE_EVENT() macro exports.
- Fixes to tracing error logging.
- Better saving of cmdlines to PIDs when tracing (use the wakeup events for
mapping).
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYjiO3RQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qhQzAQDtek5p80p/zkMGymm14wSH6qq0NdgN
Kv7fTBwEewUa0gD/UCOVLw4Oj+JtHQhCa3sCGZopmRv0BT1+4UQANqosKQY=
=Au08
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- New user_events interface. User space can register an event with the
kernel describing the format of the event. Then it will receive a
byte in a page mapping that it can check against. A privileged task
can then enable that event like any other event, which will change
the mapped byte to true, telling the user space application to start
writing the event to the tracing buffer.
- Add new "ftrace_boot_snapshot" kernel command line parameter. When
set, the tracing buffer will be saved in the snapshot buffer at boot
up when the kernel hands things over to user space. This will keep
the traces that happened at boot up available even if user space boot
up has tracing as well.
- Have TRACE_EVENT_ENUM() also update trace event field type
descriptions. Thus if a static array defines its size with an enum,
the user space trace event parsers can still know how to parse that
array.
- Add new TRACE_CUSTOM_EVENT() macro. This acts the same as the
TRACE_EVENT() macro, but will attach to an existing tracepoint. This
will make one tracepoint be able to trace different content and not
be stuck at only what the original TRACE_EVENT() macro exports.
- Fixes to tracing error logging.
- Better saving of cmdlines to PIDs when tracing (use the wakeup events
for mapping).
* tag 'trace-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (30 commits)
tracing: Have type enum modifications copy the strings
user_events: Add trace event call as root for low permission cases
tracing/user_events: Use alloc_pages instead of kzalloc() for register pages
tracing: Add snapshot at end of kernel boot up
tracing: Have TRACE_DEFINE_ENUM affect trace event types as well
tracing: Fix strncpy warning in trace_events_synth.c
user_events: Prevent dyn_event delete racing with ioctl add/delete
tracing: Add TRACE_CUSTOM_EVENT() macro
tracing: Move the defines to create TRACE_EVENTS into their own files
tracing: Add sample code for custom trace events
tracing: Allow custom events to be added to the tracefs directory
tracing: Fix last_cmd_set() string management in histogram code
user_events: Fix potential uninitialized pointer while parsing field
tracing: Fix allocation of last_cmd in last_cmd_set()
user_events: Add documentation file
user_events: Add sample code for typical usage
user_events: Add self-test for validator boundaries
user_events: Add self-test for perf_event integration
user_events: Add self-test for dynamic_events integration
user_events: Add self-test for ftrace integration
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmI4ggsACgkQUqAMR0iA
lPLrlA//R12HGCGSzdpdyynl+5wByIqcHe8RANHOAj9f9qxBtmYv2ZK69mzSvhHO
6kAGdb3vBtxo1NCHeqxlXpds9GP/zOGEWmEJP2P7pIZ8ci8QtwrXCtQ8XIW9UGhJ
WHzXpXkfzcIDsRZs6B1pxN5cRXuW2VVzfgxyu6L+hvNV0o0PPO4A48ptzNBZh8rj
URAid+n/aGs9SOXM0h8SRjjBYEqjiB2RZ3gLg5XGZmcATtitBO135LGZnBR2fwnX
RZKckbdA/fBzqS4Njsp2rV5Rqldwj7mHzQbcsQm4YDrxSdl8d78XxQdAA5sNyaCD
ToDw6/DeegXzgtPJpuBH/ymF9RczIu4l3eawO1FBMCB5EPq56zVHWErxry8qaTgi
yQFqhBgifNN5NqfQCn7dyF10usmsvImFczre7ZxJvL7vmzqDsYYqdZG5oouLudR4
iOphFwX71v4X+RsxbOXqEt+mS3AwqEJc1SZl5rrDc4TSUOE1qCd+ncLTAuAf3Wfm
1xaZ+siomahcZAKrgmSw6AcD5bU+JJpr6FktKAddiO7J1+nIdT1lYEbpUsfWZ/p8
Kx8A2M2ula+whJ6CgtnGTsbsacsFi+j/MioMTGZIU+Fubkig3XEeIp3QUO6sEN+9
/sUQ6Wj6c95miWdttff9o6ap8py9NbfuKIw/HMOesfLVKP82rmw=
=EUpJ
-----END PGP SIGNATURE-----
Merge tag 'printk-for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk updates from Petr Mladek:
- Make %pK behave the same as %p for kptr_restrict == 0 also with
no_hash_pointers parameter
- Ignore the default console in the device tree also when console=null
or console="" is used on the command line
- Document console=null and console="" behavior
- Prevent a deadlock and a livelock caused by console_lock in panic()
- Make console_lock available for panicking CPU
- Fast query for the next to-be-used sequence number
- Use the expected return values in printk.devkmsg __setup handler
- Use the correct atomic operations in wake_up_klogd() irq_work handler
- Avoid possible unaligned access when handling %4cc printing format
* tag 'printk-for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
printk: fix return value of printk.devkmsg __setup handler
vsprintf: Fix %pK with kptr_restrict == 0
printk: make suppress_panic_printk static
printk: Set console_set_on_cmdline=1 when __add_preferred_console() is called with user_specified == true
Docs: printk: add 'console=null|""' to admin/kernel-parameters
printk: use atomic updates for klogd work
printk: Drop console_sem during panic
printk: Avoid livelock with heavy printk during panic
printk: disable optimistic spin during panic
printk: Add panic_in_progress helper
vsprintf: Move space out of string literals in fourcc_string()
vsprintf: Fix potential unaligned access
printk: ringbuffer: Improve prb_next_seq() performance
If a trace event has in its TP_printk():
"%*.s", len, len ? __get_str(string) : NULL
It is perfectly valid if len is zero and passing in the NULL.
Unfortunately, the runtime string check at time of reading the trace sees
the NULL and flags it as a bad string and produces a WARN_ON().
Handle this case by passing into the test function if the format has an
asterisk (star) and if so, if the length is zero, then mark it as safe.
Link: https://lore.kernel.org/all/YjsWzuw5FbWPrdqq@bfoster/
Cc: stable@vger.kernel.org
Reported-by: Brian Foster <bfoster@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Fixes: 9a6944fee6 ("tracing: Add a verifier to check string pointers for trace events")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
- Rewrite how munlock works to massively reduce the contention
on i_mmap_rwsem (Hugh Dickins):
https://lore.kernel.org/linux-mm/8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com/
- Sort out the page refcount mess for ZONE_DEVICE pages (Christoph Hellwig):
https://lore.kernel.org/linux-mm/20220210072828.2930359-1-hch@lst.de/
- Convert GUP to use folios and make pincount available for order-1
pages. (Matthew Wilcox)
- Convert a few more truncation functions to use folios (Matthew Wilcox)
- Convert page_vma_mapped_walk to use PFNs instead of pages (Matthew Wilcox)
- Convert rmap_walk to use folios (Matthew Wilcox)
- Convert most of shrink_page_list() to use a folio (Matthew Wilcox)
- Add support for creating large folios in readahead (Matthew Wilcox)
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmI4ucgACgkQDpNsjXcp
gj69Wgf6AwqwmO5Tmy+fLScDPqWxmXJofbocae1kyoGHf7Ui91OK4U2j6IpvAr+g
P/vLIK+JAAcTQcrSCjymuEkf4HkGZOR03QQn7maPIEe4eLrZRQDEsmHC1L9gpeJp
s/GMvDWiGE0Tnxu0EOzfVi/yT+qjIl/S8VvqtCoJv1HdzxitZ7+1RDuqImaMC5MM
Qi3uHag78vLmCltLXpIOdpgZhdZexCdL2Y/1npf+b6FVkAJRRNUnA0gRbS7YpoVp
CbxEJcmAl9cpJLuj5i5kIfS9trr+/QcvbUlzRxh4ggC58iqnmF2V09l2MJ7YU3XL
v1O/Elq4lRhXninZFQEm9zjrri7LDQ==
=n9Ad
-----END PGP SIGNATURE-----
Merge tag 'folio-5.18c' of git://git.infradead.org/users/willy/pagecache
Pull folio updates from Matthew Wilcox:
- Rewrite how munlock works to massively reduce the contention on
i_mmap_rwsem (Hugh Dickins):
https://lore.kernel.org/linux-mm/8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com/
- Sort out the page refcount mess for ZONE_DEVICE pages (Christoph
Hellwig):
https://lore.kernel.org/linux-mm/20220210072828.2930359-1-hch@lst.de/
- Convert GUP to use folios and make pincount available for order-1
pages. (Matthew Wilcox)
- Convert a few more truncation functions to use folios (Matthew
Wilcox)
- Convert page_vma_mapped_walk to use PFNs instead of pages (Matthew
Wilcox)
- Convert rmap_walk to use folios (Matthew Wilcox)
- Convert most of shrink_page_list() to use a folio (Matthew Wilcox)
- Add support for creating large folios in readahead (Matthew Wilcox)
* tag 'folio-5.18c' of git://git.infradead.org/users/willy/pagecache: (114 commits)
mm/damon: minor cleanup for damon_pa_young
selftests/vm/transhuge-stress: Support file-backed PMD folios
mm/filemap: Support VM_HUGEPAGE for file mappings
mm/readahead: Switch to page_cache_ra_order
mm/readahead: Align file mappings for non-DAX
mm/readahead: Add large folio readahead
mm: Support arbitrary THP sizes
mm: Make large folios depend on THP
mm: Fix READ_ONLY_THP warning
mm/filemap: Allow large folios to be added to the page cache
mm: Turn can_split_huge_page() into can_split_folio()
mm/vmscan: Convert pageout() to take a folio
mm/vmscan: Turn page_check_references() into folio_check_references()
mm/vmscan: Account large folios correctly
mm/vmscan: Optimise shrink_page_list for non-PMD-sized folios
mm/vmscan: Free non-shmem folios without splitting them
mm/rmap: Constify the rmap_walk_control argument
mm/rmap: Convert rmap_walk() to take a folio
mm: Turn page_anon_vma() into folio_anon_vma()
mm/rmap: Turn page_lock_anon_vma_read() into folio_lock_anon_vma_read()
...
With the advent of various new memory types, some machines will have
multiple types of memory, e.g. DRAM and PMEM (persistent memory). The
memory subsystem of these machines can be called memory tiering system,
because the performance of the different types of memory are usually
different.
In such system, because of the memory accessing pattern changing etc,
some pages in the slow memory may become hot globally. So in this
patch, the NUMA balancing mechanism is enhanced to optimize the page
placement among the different memory types according to hot/cold
dynamically.
In a typical memory tiering system, there are CPUs, fast memory and slow
memory in each physical NUMA node. The CPUs and the fast memory will be
put in one logical node (called fast memory node), while the slow memory
will be put in another (faked) logical node (called slow memory node).
That is, the fast memory is regarded as local while the slow memory is
regarded as remote. So it's possible for the recently accessed pages in
the slow memory node to be promoted to the fast memory node via the
existing NUMA balancing mechanism.
The original NUMA balancing mechanism will stop to migrate pages if the
free memory of the target node becomes below the high watermark. This
is a reasonable policy if there's only one memory type. But this makes
the original NUMA balancing mechanism almost do not work to optimize
page placement among different memory types. Details are as follows.
It's the common cases that the working-set size of the workload is
larger than the size of the fast memory nodes. Otherwise, it's
unnecessary to use the slow memory at all. So, there are almost always
no enough free pages in the fast memory nodes, so that the globally hot
pages in the slow memory node cannot be promoted to the fast memory
node. To solve the issue, we have 2 choices as follows,
a. Ignore the free pages watermark checking when promoting hot pages
from the slow memory node to the fast memory node. This will
create some memory pressure in the fast memory node, thus trigger
the memory reclaiming. So that, the cold pages in the fast memory
node will be demoted to the slow memory node.
b. Define a new watermark called wmark_promo which is higher than
wmark_high, and have kswapd reclaiming pages until free pages reach
such watermark. The scenario is as follows: when we want to promote
hot-pages from a slow memory to a fast memory, but fast memory's free
pages would go lower than high watermark with such promotion, we wake
up kswapd with wmark_promo watermark in order to demote cold pages and
free us up some space. So, next time we want to promote hot-pages we
might have a chance of doing so.
The choice "a" may create high memory pressure in the fast memory node.
If the memory pressure of the workload is high, the memory pressure
may become so high that the memory allocation latency of the workload
is influenced, e.g. the direct reclaiming may be triggered.
The choice "b" works much better at this aspect. If the memory
pressure of the workload is high, the hot pages promotion will stop
earlier because its allocation watermark is higher than that of the
normal memory allocation. So in this patch, choice "b" is implemented.
A new zone watermark (WMARK_PROMO) is added. Which is larger than the
high watermark and can be controlled via watermark_scale_factor.
In addition to the original page placement optimization among sockets,
the NUMA balancing mechanism is extended to be used to optimize page
placement according to hot/cold among different memory types. So the
sysctl user space interface (numa_balancing) is extended in a backward
compatible way as follow, so that the users can enable/disable these
functionality individually.
The sysctl is converted from a Boolean value to a bits field. The
definition of the flags is,
- 0: NUMA_BALANCING_DISABLED
- 1: NUMA_BALANCING_NORMAL
- 2: NUMA_BALANCING_MEMORY_TIERING
We have tested the patch with the pmbench memory accessing benchmark
with the 80:20 read/write ratio and the Gauss access address
distribution on a 2 socket Intel server with Optane DC Persistent
Memory Model. The test results shows that the pmbench score can
improve up to 95.9%.
Thanks Andrew Morton to help fix the document format error.
Link: https://lkml.kernel.org/r/20220221084529.1052339-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Feng Tang <feng.tang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: enforce pageblock_order < MAX_ORDER".
Having pageblock_order >= MAX_ORDER seems to be able to happen in corner
cases and some parts of the kernel are not prepared for it.
For example, Aneesh has shown [1] that such kernels can be compiled on
ppc64 with 64k base pages by setting FORCE_MAX_ZONEORDER=8, which will
run into a WARN_ON_ONCE(order >= MAX_ORDER) in comapction code right
during boot.
We can get pageblock_order >= MAX_ORDER when the default hugetlb size is
bigger than the maximum allocation granularity of the buddy, in which
case we are no longer talking about huge pages but instead gigantic
pages.
Having pageblock_order >= MAX_ORDER can only make alloc_contig_range()
of such gigantic pages more likely to succeed.
Reliable use of gigantic pages either requires boot time allcoation or
CMA, no need to overcomplicate some places in the kernel to optimize for
corner cases that are broken in other areas of the kernel.
This patch (of 2):
Let's enforce pageblock_order < MAX_ORDER and simplify.
Especially patch #1 can be regarded a cleanup before:
[PATCH v5 0/6] Use pageblock_order for cma and alloc_contig_range
alignment. [2]
[1] https://lkml.kernel.org/r/87r189a2ks.fsf@linux.ibm.com
[2] https://lkml.kernel.org/r/20220211164135.1803616-1-zi.yan@sent.com
Link: https://lkml.kernel.org/r/20220214174132.219303-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Rob Herring <robh@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: John Garry via iommu <iommu@lists.linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Cleanups for SCHED_DEADLINE
- Tracing updates/fixes
- CPU Accounting fixes
- First wave of changes to optimize the overhead of the scheduler build,
from the fast-headers tree - including placeholder *_api.h headers for
later header split-ups.
- Preempt-dynamic using static_branch() for ARM64
- Isolation housekeeping mask rework; preperatory for further changes
- NUMA-balancing: deal with CPU-less nodes
- NUMA-balancing: tune systems that have multiple LLC cache domains per node (eg. AMD)
- Updates to RSEQ UAPI in preparation for glibc usage
- Lots of RSEQ/selftests, for same
- Add Suren as PSI co-maintainer
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmI5rg8RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hGrw/+M3QOk6fH7G48wjlNnBvcOife6ls+Ni4k
ixOAcF4JKoixO8HieU5vv0A7yf/83tAa6fpeXeMf1hkCGc0NSlmLtuIux+WOmoAL
LzCyDEYfiP8KnVh0A1Tui/lK0+AkGo21O6ADhQE2gh8o2LpslOHQMzvtyekSzeeb
mVxMYQN+QH0m518xdO2D8IQv9ctOYK0eGjmkqdNfntOlytypPZHeNel/tCzwklP/
dElJUjNiSKDlUgTBPtL3DfpoLOI/0mHF2p6NEXvNyULxSOqJTu8pv9Z2ADb2kKo1
0D56iXBDngMi9MHIJLgvzsA8gKzHLFSuPbpODDqkTZCa28vaMB9NYGhJ643NtEie
IXTJEvF1rmNkcLcZlZxo0yjL0fjvPkczjw4Vj27gbrUQeEBfb4mfuI4BRmij63Ep
qEkgQTJhduCqqrQP1rVyhwWZRk1JNcVug+F6N42qWW3fg1xhj0YSrLai2c9nPez6
3Zt98H8YGS1Z/JQomSw48iGXVqfTp/ETI7uU7jqHK8QcjzQ4lFK5H4GZpwuqGBZi
NJJ1l97XMEas+rPHiwMEN7Z1DVhzJLCp8omEj12QU+tGLofxxwAuuOVat3CQWLRk
f80Oya3TLEgd22hGIKDRmHa22vdWnNQyS0S15wJotawBzQf+n3auS9Q3/rh979+t
ES/qvlGxTIs=
=Z8uT
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2022-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- Cleanups for SCHED_DEADLINE
- Tracing updates/fixes
- CPU Accounting fixes
- First wave of changes to optimize the overhead of the scheduler
build, from the fast-headers tree - including placeholder *_api.h
headers for later header split-ups.
- Preempt-dynamic using static_branch() for ARM64
- Isolation housekeeping mask rework; preperatory for further changes
- NUMA-balancing: deal with CPU-less nodes
- NUMA-balancing: tune systems that have multiple LLC cache domains per
node (eg. AMD)
- Updates to RSEQ UAPI in preparation for glibc usage
- Lots of RSEQ/selftests, for same
- Add Suren as PSI co-maintainer
* tag 'sched-core-2022-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (81 commits)
sched/headers: ARM needs asm/paravirt_api_clock.h too
sched/numa: Fix boot crash on arm64 systems
headers/prep: Fix header to build standalone: <linux/psi.h>
sched/headers: Only include <linux/entry-common.h> when CONFIG_GENERIC_ENTRY=y
cgroup: Fix suspicious rcu_dereference_check() usage warning
sched/preempt: Tell about PREEMPT_DYNAMIC on kernel headers
sched/topology: Remove redundant variable and fix incorrect type in build_sched_domains
sched/deadline,rt: Remove unused parameter from pick_next_[rt|dl]_entity()
sched/deadline,rt: Remove unused functions for !CONFIG_SMP
sched/deadline: Use __node_2_[pdl|dle]() and rb_first_cached() consistently
sched/deadline: Merge dl_task_can_attach() and dl_cpu_busy()
sched/deadline: Move bandwidth mgmt and reclaim functions into sched class source file
sched/deadline: Remove unused def_dl_bandwidth
sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE
sched/tracing: Don't re-read p->state when emitting sched_switch event
sched/rt: Plug rt_mutex_setprio() vs push_rt_task() race
sched/cpuacct: Remove redundant RCU read lock
sched/cpuacct: Optimize away RCU read lock
sched/cpuacct: Fix charge percpu cpuusage
sched/headers: Reorganize, clean up and optimize kernel/sched/sched.h dependencies
...
- bitops & cpumask:
- Always inline various generic helpers, to improve code generation,
but also for instrumentation, found by noinstr validation.
- Add a x86-specific cpumask_clear_cpu() helper to improve code generation
- atomics:
- Fix atomic64_{read_acquire,set_release} fallbacks
- lockdep:
- Fix /proc/lockdep output loop iteration for classes
- Fix /proc/lockdep potential access to invalid memory
- minor cleanups
- Add Mark Rutland as reviewer for atomic primitives
- jump labels:
- Clean up the code a bit
- misc:
- Add __sched annotations to percpu rwsem primitives
- Enable RT_MUTEXES on PREEMPT_RT by default
- Stray v8086_mode() inlining fix, result of noinstr objtool validation
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmI4XQgRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1imLg//SusL4SW7xWprktpltACjjOk2UDB6x26A
GfG3vOxjdqZ1qCrVQqNHialOTj3Wci2HxAarKui9of9o7ueEQNGsyvMQte8xJUhw
osWDFbTlzr2WmkH8I5FPtPq30P7ulcOa6eZNO/1M2IIvXYQkGYgTosXRPmD/fIKA
qJgw2V7B8QME9rHT/0kLSlhTzHjvu0y1dK9rTr5oVocZER1e/cXVFkSUz/uGL/XH
/mpWzD/dwGXvrbgGbewvzZ0L7jO/EH3/ZAUDgsksebRSqa3+Ln3Gm8mMA5Hx1Vpm
a4CMi7hrCJ1ZWSnleDRtxDAgHG20BDKFMLxsTPAySoy4dQ+KT2KieAlo7U3L1ABJ
G7xQfS/OUd/mRptXUQYTfv5wfGt/xqZAyV31RTQJElKetWBcL1du4uc4g4fITgVN
8zpIOBK7AyeiSLCG4LLN3ROa5oYPoCawsUkokeaewiasacvDKquDEj/ZtUH7eNCm
1AGM2RCJim2YpWyGzX3jrCMK9/ERZjw0MJUDUXpUIUE1NBuoWhkWpuYbu+P0JQ+D
0Z3Hxo/4JYnF1nEH7a87q0QBr7QnHFW8fUgxuR5o5c5ks+kc4ym3tUT6Wi9mzDug
PfFbTiP1AAWv65fvCVjZP/P+tL8019hRGhCWH9tkXNTxwSJJi2Ca7CGKH+4UI7bR
uAkFrWht4K0=
=04kk
-----END PGP SIGNATURE-----
Merge tag 'locking-core-2022-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"Changes in this cycle were:
Bitops & cpumask:
- Always inline various generic helpers, to improve code generation,
but also for instrumentation, found by noinstr validation.
- Add a x86-specific cpumask_clear_cpu() helper to improve code
generation
Atomics:
- Fix atomic64_{read_acquire,set_release} fallbacks
Lockdep:
- Fix /proc/lockdep output loop iteration for classes
- Fix /proc/lockdep potential access to invalid memory
- Add Mark Rutland as reviewer for atomic primitives
- Minor cleanups
Jump labels:
- Clean up the code a bit
Misc:
- Add __sched annotations to percpu rwsem primitives
- Enable RT_MUTEXES on PREEMPT_RT by default
- Stray v8086_mode() inlining fix, result of noinstr objtool
validation"
* tag 'locking-core-2022-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
jump_label: Refactor #ifdef of struct static_key
jump_label: Avoid unneeded casts in STATIC_KEY_INIT_{TRUE,FALSE}
locking/lockdep: Iterate lock_classes directly when reading lockdep files
x86/ptrace: Always inline v8086_mode() for instrumentation
cpumask: Add a x86-specific cpumask_clear_cpu() helper
locking: Enable RT_MUTEXES by default on PREEMPT_RT.
locking/local_lock: Make the empty local_lock_*() function a macro.
atomics: Fix atomic64_{read_acquire,set_release} fallbacks
locking: Add missing __sched attributes
cpumask: Always inline helpers which use bit manipulation functions
asm-generic/bitops: Always inline all bit manipulation helpers
locking/lockdep: Avoid potential access of invalid memory in lock_class
lockdep: Use memset_startat() helper in reinit_class()
MAINTAINERS: add myself as reviewer for atomics
- Fix address filtering for Intel/PT,ARM/CoreSight
- Enable Intel/PEBS format 5
- Allow more fixed-function counters for x86
- Intel/PT: Enable not recording Taken-Not-Taken packets
- Add a few branch-types
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmI4WdIRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jdTA/7BADTYzFCbdwPzHt2mR8osv7k+pDvYxs9
wxNjyi1X7N8cPkhqgIg9CfdhdyDOqo7+J4fG17f2qbwjNK7b2Fb1/U6ZoZaf+f8F
W0e2LX5KZTXUhkA+TEjrXvYD9FmJaCPM/l2RQg8U7okBs2kb0H6QT2Yn21wd1roC
WwI5KFiWSVS1IzpVLaXjDh+FJfJHd75ReMqJeus+QoVQ9NHeuI+t4DglSB1IBi54
d/zeVXE/Y4dFTQOrU06S2HxcOEptvXZsPmVLvKab/veeGGyWiGPxQpvu6bXm6u3x
0sV+dn67zut2m2pQlUZUucgGTSYIZTpOe+rNukTB9hJ4XeN4/1ohOOCrOuYM+63P
lGFbN1v+LD7Wc6C2eEhw8G5GEL0qbwzFNQ06O3EOFi7C7GKn7WS/ET6XuuMOERFk
uxEPb4pFtbBlJ0SriCprFJSd5NL3PORZlLIhv4hGH5hilLR1TFeKDuwZaM4noQxU
dL3rKGLi9H+P46Eni9H28+0gDISbv1xL+WivHOFQNmhBqAZO52ZcF3J+dgBaR1B5
pBxVTycFpZMjxSZnqTE0gMsFaLIpVGc+75Chns1rajR0mEtRtJUQUbYz4tK4zb0E
dZR1p+VF6+DYmSRhiqeaTi9uz9oE8kMa8o/EcbFIg/9BgEnUwJXU20bjnar30xQ7
9OIn7r9hjHI=
=XPuo
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2022-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 perf event updates from Ingo Molnar:
- Fix address filtering for Intel/PT,ARM/CoreSight
- Enable Intel/PEBS format 5
- Allow more fixed-function counters for x86
- Intel/PT: Enable not recording Taken-Not-Taken packets
- Add a few branch-types
* tag 'perf-core-2022-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/uncore: Fix the build on !CONFIG_PHYS_ADDR_T_64BIT
perf: Add irq and exception return branch types
perf/x86/intel/uncore: Make uncore_discovery clean for 64 bit addresses
perf/x86/intel/pt: Add a capability and config bit for disabling TNTs
perf/x86/intel/pt: Add a capability and config bit for event tracing
perf/x86/intel: Increase max number of the fixed counters
KVM: x86: use the KVM side max supported fixed counter
perf/x86/intel: Enable PEBS format 5
perf/core: Allow kernel address filter when not filtering the kernel
perf/x86/intel/pt: Fix address filter config for 32-bit kernel
perf/core: Fix address filter parser for multiple filters
x86: Share definition of __is_canonical_address()
perf/x86/intel/pt: Relax address filter validation
Alexei Starovoitov says:
====================
pull-request: bpf-next 2022-03-21 v2
We've added 137 non-merge commits during the last 17 day(s) which contain
a total of 143 files changed, 7123 insertions(+), 1092 deletions(-).
The main changes are:
1) Custom SEC() handling in libbpf, from Andrii.
2) subskeleton support, from Delyan.
3) Use btf_tag to recognize __percpu pointers in the verifier, from Hao.
4) Fix net.core.bpf_jit_harden race, from Hou.
5) Fix bpf_sk_lookup remote_port on big-endian, from Jakub.
6) Introduce fprobe (multi kprobe) _without_ arch bits, from Masami.
The arch specific bits will come later.
7) Introduce multi_kprobe bpf programs on top of fprobe, from Jiri.
8) Enable non-atomic allocations in local storage, from Joanne.
9) Various var_off ptr_to_btf_id fixed, from Kumar.
10) bpf_ima_file_hash helper, from Roberto.
11) Add "live packet" mode for XDP in BPF_PROG_RUN, from Toke.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (137 commits)
selftests/bpf: Fix kprobe_multi test.
Revert "rethook: x86: Add rethook x86 implementation"
Revert "arm64: rethook: Add arm64 rethook implementation"
Revert "powerpc: Add rethook support"
Revert "ARM: rethook: Add rethook arm implementation"
bpftool: Fix a bug in subskeleton code generation
bpf: Fix bpf_prog_pack when PMU_SIZE is not defined
bpf: Fix bpf_prog_pack for multi-node setup
bpf: Fix warning for cast from restricted gfp_t in verifier
bpf, arm: Fix various typos in comments
libbpf: Close fd in bpf_object__reuse_map
bpftool: Fix print error when show bpf map
bpf: Fix kprobe_multi return probe backtrace
Revert "bpf: Add support to inline bpf_get_func_ip helper on x86"
bpf: Simplify check in btf_parse_hdr()
selftests/bpf/test_lirc_mode2.sh: Exit with proper code
bpf: Check for NULL return from bpf_get_btf_vmlinux
selftests/bpf: Test skipping stacktrace
bpf: Adjust BPF stack helper functions to accommodate skip > 0
bpf: Select proper size for bpf_prog_pack
...
====================
Link: https://lore.kernel.org/r/20220322050159.5507-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Setting PTRACE_O_SUSPEND_SECCOMP is supposed to be a highly privileged
operation because it allows the tracee to completely bypass all seccomp
filters on kernels with CONFIG_CHECKPOINT_RESTORE=y. It is only supposed to
be settable by a process with global CAP_SYS_ADMIN, and only if that
process is not subject to any seccomp filters at all.
However, while these permission checks were done on the PTRACE_SETOPTIONS
path, they were missing on the PTRACE_SEIZE path, which also sets
user-specified ptrace flags.
Move the permissions checks out into a helper function and let both
ptrace_attach() and ptrace_setoptions() call it.
Cc: stable@kernel.org
Fixes: 13c4a90119 ("seccomp: add ptrace options for suspend/resume")
Signed-off-by: Jann Horn <jannh@google.com>
Link: https://lkml.kernel.org/r/20220319010838.1386861-1-jannh@google.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
- NFSv3 support in NFSD is now always built
- Added NFSD support for the NFSv4 birth-time file attribute
- Added support for storing and displaying sockaddrs in trace points
- NFSD now recognizes RPC_AUTH_TLS probes
Performance improvements:
- Optimized the svc transport enqueuing mechanism
- Added micro-optimizations for the duplicate reply cache
Notable bug fixes:
- Allocation of the NFSD file cache hash table is more reliable
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmI3XNkACgkQM2qzM29m
f5dNERAAqJ/nzfVp2H5BKLszJ7p/s13wFqbW719Rzymzl6/tUUHOqIsVdBsJsa/b
BdZQLLDwa6ZB5zOAnC6FosRKYu4lwixOOf94pC6a9ZDD/glYVKF8mZG+RZXPAy16
g3JUOi/bcyHXv0ZUhbv7DqW+HHM+owPP4vlNJ9ChiiLr/Xdp8NBKj+4Qtn/wcAo+
Xuvx7fU/5Mbemh+dd5mWker4afHvpt9V6U6s04m5LiTPPnHVnxmeyekJGUCOY0QO
cm/6SPNDqyn/VEfM/SRxEnLE9QcHRhZo/4PKRGF4wYolcviIogbILE1M7Ig/r/Gv
6Du2kcRAhyZ3zgWnu799Ivn3Q6IrVjxZwqmsi7YHURTwYKyZtxYsUk0MCBcpnxrE
WyTS2onpElbMop3viKCnQdpIetbsHnUNg3udUV6ugbdCbnZuhUw5B/d6x0o8ZWDE
C0f+jnX+GnBstn0vkcj2H0+VQTd5hUJtXMrooI42ODJoboQRZcmePwoXjqCmw3sy
PXTxLZC/5+4zNHGUuz4Pq4V7FKr4nHhDzaW4ZDO3mILx4ahceotulY1B/yoBUu8o
/LAhu2kJ6nFQkmpzdrGzPeOstgJYHm9CaitRvMzg+NJxEAJdebypdQDbX5iNpgfW
MDXH4n8eIqroTlQ/mQYEV0EbC7BaTqSCL6rQdcrcFUPu9n4Fcno=
=5nac
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"New features:
- NFSv3 support in NFSD is now always built
- Added NFSD support for the NFSv4 birth-time file attribute
- Added support for storing and displaying sockaddrs in trace points
- NFSD now recognizes RPC_AUTH_TLS probes
Performance improvements:
- Optimized the svc transport enqueuing mechanism
- Added micro-optimizations for the duplicate reply cache
Notable bug fixes:
- Allocation of the NFSD file cache hash table is more reliable"
* tag 'nfsd-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (30 commits)
nfsd: fix using the correct variable for sizeof()
nfsd: use correct format characters
NFSD: prevent integer overflow on 32 bit systems
NFSD: prevent underflow in nfssvc_decode_writeargs()
fs/lock: documentation cleanup. Replace inode->i_lock with flc_lock.
NFSD: Fix nfsd_breaker_owns_lease() return values
NFSD: Clean up _lm_ operation names
arch: Remove references to CONFIG_NFSD_V3 in the default configs
NFSD: Remove CONFIG_NFSD_V3
nfsd: more robust allocation failure handling in nfsd_file_cache_init
SUNRPC: Teach server to recognize RPC_AUTH_TLS
NFSD: Move svc_serv_ops::svo_function into struct svc_serv
NFSD: Remove svc_serv_ops::svo_module
SUNRPC: Remove svc_shutdown_net()
SUNRPC: Rename svc_close_xprt()
SUNRPC: Rename svc_create_xprt()
SUNRPC: Remove svo_shutdown method
SUNRPC: Merge svc_do_enqueue_xprt() into svc_enqueue_xprt()
SUNRPC: Remove the .svo_enqueue_xprt method
SUNRPC: Record endpoint information in trace log
...
Qian Cai reported a boot crash on arm64 systems, caused by:
0fb3978b0a ("sched/numa: Fix NUMA topology for systems with CPU-less nodes")
The bug is that node_state() must be supplied a valid node_states[] array index,
but in task_numa_placement() the max_nid search can fail with NUMA_NO_NODE,
which is not a valid index.
Fix it by checking that max_nid is a valid index.
[ mingo: Added changelog. ]
Fixes: 0fb3978b0a ("sched/numa: Fix NUMA topology for systems with CPU-less nodes")
Reported-by: Qian Cai <quic_qiancai@quicinc.com>
Tested-by: Qian Cai <quic_qiancai@quicinc.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmI474wUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXPiAQ/8DM7Za/Fef1dPq2Vl/hWa3I0qfVNt
JxlP2kuyzItkmvjoF+eXeRq0Zv32WDSCoXhKwlAaonchXuKbM7NDkyvPNXQlvQng
Bfs5u2gPJvGWPmkrwGJNZn1I6nlujwsnVk9td4Un8OzLT9RsB4godH8eqCQEZC/M
vj5JI+YYMo3iaciCNcKm+H6MCDr7X+rN6b8e8DK8JzYpfIwRsHmRIHix0BIVk28W
mh8ZtFRgOY3DXdYTYMbatOpZDvtSbhKYiIBiEeHygrviUABOe/pewXGoP3myvl7L
S7qr8msOC9c/auugeGhukrT17bGmkloTPjS84LYPm9WySN8FwkwlqD3d4kqPKghD
fj6br/nSgV5bqa1HSh3cwyIHauC0sGOfGIoVvVZt1gIViuHBsBYM1Y2wau3Hg9y3
BfIXiHckFmWSfzlJDj4fsS+lv9BdJwjeiMbepJQJ1btyMIUMu2V3MJwAXn2SzfuO
91feKrn3Fbkx7Xgg6dbZbt4BZhNURWRf6ZCZXR0oiDxUNfE+tI6s8wjMKfDzaUuu
Gj+BlvC+hOqgLczuSKQ1rK3D38uOl2qc2HwXlTuFdtCWNmF9AJ464YBT6UzrHGdB
8OIqMp+zaTxk7Mrx2AYqnddB/tAga1F0jVaIqkpW1s3EdSrDTEgHJniW0skMz6hw
/FouCKV8IhP7SNI=
=0YvJ
-----END PGP SIGNATURE-----
Merge tag 'audit-pr-20220321' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit update from Paul Moore:
"Just one audit patch queued for v5.18:
- Change the AUDIT_TIME_* record generation so that they are
generated at syscall exit time and subject to all of the normal
syscall exit filtering.
This should help reduce noise and ensure those records which are
most relevant to the admin's audit configuration are recorded in
the audit log"
* tag 'audit-pr-20220321' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: log AUDIT_TIME_* records only from rules
Pull watch_queue fixes from David Howells:
"Here are fixes for a couple more watch_queue bugs, both found by syzbot:
- Fix error cleanup in watch_queue_set_size() where it tries to clean
up all the pointers in the page list, even if they've not been
allocated yet[1]. Unfortunately, __free_page() doesn't treat a NULL
pointer as being "do nothing".
A second report[2] looks like it's probably the same bug, but on
arm64 rather than x86_64, but there's no reproducer.
- Fix a missing kfree in free_watch() to actually free the watch[3]"
Link: https://lore.kernel.org/r/000000000000b1807c05daad8f98@google.com/ [1]
Link: https://lore.kernel.org/r/000000000000035b9c05daae8a5e@google.com/ [2]
Link: https://lore.kernel.org/r/000000000000bc8eaf05dab91c63@google.com/ [3]
* 'keys-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
watch_queue: Actually free the watch
watch_queue: Fix NULL dereference in error cleanup
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmIzwtEACgkQSfxwEqXe
A67NCBAA1+U01HXx4ethmmy1m2pXHAIwngI7PP0QzyZtmoloWockdN1lRfQ1C0uJ
Whk/9Hc9G7iujznsxOnCS+LeNwRzd7CjtFbTgK+yGIRKwL9GFcVwA5nrifP9TjqZ
FWmTIomjjmA06YRYsNOdNSQdN6DdpQz8xLw0EqVOZerI4ITFErYlW8lLqOOKY99N
f9glQK75kh41SUgo+K3JSn46fhB95HldL6dYSZzjQ6QsVKBQuQTDE9ryfrH2XZDw
xI2nf/ycXPUBv7Bb+0op+7ES++CoDigM2nIyxapEj3ZkpplxL4M+cCIHq3Juzfwm
jDdbZbs5SqDszOQM/dvCJSR+S/D3QIKdv3fwwWHDTigByZdgpudT3rr9k7dY60Z8
aNvOzNWOzGH9/0boLl55WysF6cBQnazbgtzeWpzeuWFhAyfxN/DJx2sf8U+TmN6n
3bDUafamAvmkkIOoHUzOXfjo2lhXxlmRZ40rWVNX5JvcJj5+5jRmTawrQj+9fn8/
MhiIZ6KBDV1OxPwJzG6jm++JP6rgXfXsxduomO7cIEWs10itf/cE8WD9qJrtZTtg
kfjYUguFOd/QyzY0A1w6FD865vy8YhATk71Ywgwj9AI+cfH8QUajpDkXOutjop8x
8HBxIGx6Itgzilfuo5jpJxlVhNO3G6v1fX/A+mUMAfHufkmnfiQ=
=cyDR
-----END PGP SIGNATURE-----
Merge tag 'random-5.18-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random
Pull random number generator updates from Jason Donenfeld:
"There have been a few important changes to the RNG's crypto, but the
intent for 5.18 has been to shore up the existing design as much as
possible with modern cryptographic functions and proven constructions,
rather than actually changing up anything fundamental to the RNG's
design.
So it's still the same old RNG at its core as before: it still counts
entropy bits, and collects from the various sources with the same
heuristics as before, and so forth. However, the cryptographic
algorithms that transform that entropic data into safe random numbers
have been modernized.
Just as important, if not more, is that the code has been cleaned up
and re-documented. As one of the first drivers in Linux, going back to
1.3.30, its general style and organization was showing its age and
becoming both a maintenance burden and an auditability impediment.
Hopefully this provides a more solid foundation to build on for the
future. I encourage you to open up the file in full, and maybe you'll
remark, "oh, that's what it's doing," and enjoy reading it. That, at
least, is the eventual goal, which this pull begins working toward.
Here's a summary of the various patches in this pull:
- /dev/urandom and /dev/random now do the same thing, per the patch
we discussed on the list. I think this is worth trying out. If it
does appear problematic, I've made sure to keep it standalone and
revertible without any conflicts.
- Fixes and cleanups for numerous integer type problems, locking
issues, and general code quality concerns.
- The input pool's LFSR has been replaced with a cryptographically
secure hash function, which has security and performance benefits
alike, and consequently allows us to count entropy bits linearly.
- The pre-init injection now uses a real hash function too, instead
of an LFSR or vanilla xor.
- The interrupt handler's fast_mix() function now uses one round of
SipHash, rather than the fake crypto that was there before.
- All additions of RDRAND and RDSEED now go through the input pool's
hash function, in part to mitigate ridiculous hypothetical CPU
backdoors, but more so to have a consistent interface for ingesting
entropy that's easy to analyze, making everything happen one way,
instead of a potpourri of different ways.
- The crng now works on per-cpu data, while also being in accordance
with the actual "fast key erasure RNG" design. This allows us to
fix several boot-time race complications associated with the prior
dynamically allocated model, eliminates much locking, and makes our
backtrack protection more robust.
- Batched entropy now erases doled out values so that it's backtrack
resistant.
- Working closely with Sebastian, the interrupt handler no longer
needs to take any locks at all, as we punt the
synchronized/expensive operations to a workqueue. This is
especially nice for PREEMPT_RT, where taking spinlocks in irq
context is problematic. It also makes the handler faster for the
rest of us.
- Also working with Sebastian, we now do the right thing on CPU
hotplug, so that we don't use stale entropy or fail to accumulate
new entropy when CPUs come back online.
- We handle virtual machines that fork / clone / snapshot, using the
"vmgenid" ACPI specification for retrieving a unique new RNG seed,
which we can use to also make WireGuard (and in the future, other
things) safe across VM forks.
- Around boot time, we now try to reseed more often if enough entropy
is available, before settling on the usual 5 minute schedule.
- Last, but certainly not least, the documentation in the file has
been updated considerably"
* tag 'random-5.18-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (60 commits)
random: check for signal and try earlier when generating entropy
random: reseed more often immediately after booting
random: make consistent usage of crng_ready()
random: use SipHash as interrupt entropy accumulator
wireguard: device: clear keys on VM fork
random: provide notifier for VM fork
random: replace custom notifier chain with standard one
random: do not export add_vmfork_randomness() unless needed
virt: vmgenid: notify RNG of VM fork and supply generation ID
ACPI: allow longer device IDs
random: add mechanism for VM forks to reinitialize crng
random: don't let 644 read-only sysctls be written to
random: give sysctl_random_min_urandom_seed a more sensible value
random: block in /dev/urandom
random: do crng pre-init loading in worker rather than irq
random: unify cycles_t and jiffies usage and types
random: cleanup UUID handling
random: only wake up writers after zap if threshold was passed
random: round-robin registers as ulong, not u32
random: clear fast pool, crng, and batches in cpuhp bring up
...
- Allow device_pm_check_callbacks() to be called from interrupt
context without issues (Dmitry Baryshkov).
- Modify devm_pm_runtime_enable() to automatically handle
pm_runtime_dont_use_autosuspend() at driver exit time (Douglas
Anderson).
- Make the schedutil cpufreq governor use to_gov_attr_set() instead
of open coding it (Kevin Hao).
- Replace acpi_bus_get_device() with acpi_fetch_acpi_dev() in the
cpufreq longhaul driver (Rafael Wysocki).
- Unify show() and store() naming in cpufreq and make it use
__ATTR_XX (Lianjie Zhang).
- Make the intel_pstate driver use the EPP value set by the firmware
by default (Srinivas Pandruvada).
- Re-order the init checks in the powernow-k8 cpufreq driver (Mario
Limonciello).
- Make the ACPI processor idle driver check for architectural
support for LPI to avoid using it on x86 by mistake (Mario
Limonciello).
- Add Sapphire Rapids Xeon support to the intel_idle driver (Artem
Bityutskiy).
- Add 'preferred_cstates' module argument to the intel_idle driver
to work around C1 and C1E handling issue on Sapphire Rapids (Artem
Bityutskiy).
- Add core C6 optimization on Sapphire Rapids to the intel_idle
driver (Artem Bityutskiy).
- Optimize the haltpoll cpuidle driver a bit (Li RongQing).
- Remove leftover text from intel_idle() kerneldoc comment and fix
up white space in intel_idle (Rafael Wysocki).
- Fix load_image_and_restore() error path (Ye Bin).
- Fix typos in comments in the system wakeup hadling code (Tom Rix).
- Clean up non-kernel-doc comments in hibernation code (Jiapeng
Chong).
- Fix __setup handler error handling in system-wide suspend and
hibernation core code (Randy Dunlap).
- Add device name to suspend_report_result() (Youngjin Jang).
- Make virtual guests honour ACPI S4 hardware signature by
default (David Woodhouse).
- Block power off of a parent PM domain unless child is in deepest
state (Ulf Hansson).
- Use dev_err_probe() to simplify error handling for generic PM
domains (Ahmad Fatoum).
- Fix sleep-in-atomic bug caused by genpd_debug_remove() (Shawn Guo).
- Document Intel uncore frequency scaling (Srinivas Pandruvada).
- Add DTPM hierarchy description (Daniel Lezcano).
- Change the locking scheme in DTPM (Daniel Lezcano).
- Fix dtpm_cpu cleanup at exit time and missing virtual DTPM pointer
release (Daniel Lezcano).
- Make dtpm_node_callback[] static (kernel test robot).
- Fix spelling mistake "initialze" -> "initialize" in
dtpm_create_hierarchy() (Colin Ian King).
- Add tracer tool for the amd-pstate driver (Jinzhou Su).
- Fix PC6 displaying in turbostat on some systems (Artem Bityutskiy).
- Add AMD P-State support to the cpupower utility (Huang Rui).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmI4pM4SHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxh5wQAJEz3u55wIHzeov30obtXaD3SxxnvRzR
p96gRcmNoR2so/Q9D+h+JHZKQkVklbnbqExMXQn1qarceAUN7KPjVMRvagjZsC/f
J3LtQmx96yqGTCzOTu5n+Ol2ojKLMCMo++no/2873BYhd60TV6oQxRzkNiZx215n
tT6MKY5ZMX448VKWAWh9vt5rdvbBj9z6cfvpchK/3bziE21lfLz/1iXeFnwqjPGU
XuA7NYbVAHOfsdHZk19+4qAgm8EYkmjd4/J8HDlb7XouyLuUGy8KJZYhSrJKiQ1C
f9f2Zw0925/YpBmFXOwxuYWP9KjFKlq7Cdr3SSgVGDOvgyRtpeV4fU8Y6WPFCtEV
fQdKr9/4KQP6hwUpxJZucSf49wcnyh7hFDMxrwVVcL96yXZef1OqG3ITihJY/n4J
+wDnpR2VqBeiG5NyECjk3mPROZGFfUlHRsqMd3JOswMpGF5phpEI9nNFcayB262S
Rkgcb3MacFVsuo/ZBdzCUTZ6ECvjxZn4FGZPxumkp65SJO18gOPbqs8qfGCZ3Tgb
GDy0CWEOv/KuGnks1CkBGok2Z4q8s2GcZmaOp9BiPjxKJD71i4uPtiGA/5Ahb6cm
Cu0G7Ub/t2Vc93E7mnTE4hh2IuiAN73yB5teM4YNllHw6f+aqVGlvJktIMpShajo
eEBNFlkwljyz
=WlR9
-----END PGP SIGNATURE-----
Merge tag 'pm-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"These are mostly fixes and cleanups all over the code and a new piece
of documentation for Intel uncore frequency scaling.
Functionality-wise, the intel_idle driver will support Sapphire Rapids
Xeons natively now (with some extra facilities for controlling
C-states more precisely on those systems), virtual guests will take
the ACPI S4 hardware signature into account by default, the
intel_pstate driver will take the defualt EPP value from the firmware,
cpupower utility will support the AMD P-state driver added in the
previous cycle, and there is a new tracer utility for that driver.
Specifics:
- Allow device_pm_check_callbacks() to be called from interrupt
context without issues (Dmitry Baryshkov).
- Modify devm_pm_runtime_enable() to automatically handle
pm_runtime_dont_use_autosuspend() at driver exit time (Douglas
Anderson).
- Make the schedutil cpufreq governor use to_gov_attr_set() instead
of open coding it (Kevin Hao).
- Replace acpi_bus_get_device() with acpi_fetch_acpi_dev() in the
cpufreq longhaul driver (Rafael Wysocki).
- Unify show() and store() naming in cpufreq and make it use
__ATTR_XX (Lianjie Zhang).
- Make the intel_pstate driver use the EPP value set by the firmware
by default (Srinivas Pandruvada).
- Re-order the init checks in the powernow-k8 cpufreq driver (Mario
Limonciello).
- Make the ACPI processor idle driver check for architectural support
for LPI to avoid using it on x86 by mistake (Mario Limonciello).
- Add Sapphire Rapids Xeon support to the intel_idle driver (Artem
Bityutskiy).
- Add 'preferred_cstates' module argument to the intel_idle driver to
work around C1 and C1E handling issue on Sapphire Rapids (Artem
Bityutskiy).
- Add core C6 optimization on Sapphire Rapids to the intel_idle
driver (Artem Bityutskiy).
- Optimize the haltpoll cpuidle driver a bit (Li RongQing).
- Remove leftover text from intel_idle() kerneldoc comment and fix up
white space in intel_idle (Rafael Wysocki).
- Fix load_image_and_restore() error path (Ye Bin).
- Fix typos in comments in the system wakeup hadling code (Tom Rix).
- Clean up non-kernel-doc comments in hibernation code (Jiapeng
Chong).
- Fix __setup handler error handling in system-wide suspend and
hibernation core code (Randy Dunlap).
- Add device name to suspend_report_result() (Youngjin Jang).
- Make virtual guests honour ACPI S4 hardware signature by default
(David Woodhouse).
- Block power off of a parent PM domain unless child is in deepest
state (Ulf Hansson).
- Use dev_err_probe() to simplify error handling for generic PM
domains (Ahmad Fatoum).
- Fix sleep-in-atomic bug caused by genpd_debug_remove() (Shawn Guo).
- Document Intel uncore frequency scaling (Srinivas Pandruvada).
- Add DTPM hierarchy description (Daniel Lezcano).
- Change the locking scheme in DTPM (Daniel Lezcano).
- Fix dtpm_cpu cleanup at exit time and missing virtual DTPM pointer
release (Daniel Lezcano).
- Make dtpm_node_callback[] static (kernel test robot).
- Fix spelling mistake "initialze" -> "initialize" in
dtpm_create_hierarchy() (Colin Ian King).
- Add tracer tool for the amd-pstate driver (Jinzhou Su).
- Fix PC6 displaying in turbostat on some systems (Artem Bityutskiy).
- Add AMD P-State support to the cpupower utility (Huang Rui)"
* tag 'pm-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (58 commits)
cpufreq: powernow-k8: Re-order the init checks
cpuidle: intel_idle: Drop redundant backslash at line end
cpuidle: intel_idle: Update intel_idle() kerneldoc comment
PM: hibernate: Honour ACPI hardware signature by default for virtual guests
cpufreq: intel_pstate: Use firmware default EPP
cpufreq: unify show() and store() naming and use __ATTR_XX
PM: core: keep irq flags in device_pm_check_callbacks()
cpuidle: haltpoll: Call cpuidle_poll_state_init() later
Documentation: amd-pstate: add tracer tool introduction
tools/power/x86/amd_pstate_tracer: Add tracer tool for AMD P-state
tools/power/x86/intel_pstate_tracer: make tracer as a module
cpufreq: amd-pstate: Add more tracepoint for AMD P-State module
PM: sleep: Add device name to suspend_report_result()
turbostat: fix PC6 displaying on some systems
intel_idle: add core C6 optimization for SPR
intel_idle: add 'preferred_cstates' module argument
intel_idle: add SPR support
PM: runtime: Have devm_pm_runtime_enable() handle pm_runtime_dont_use_autosuspend()
ACPI: processor idle: Check for architectural support for LPI
cpuidle: PSCI: Move the `has_lpi` check to the beginning of the function
...
This pull request contains the following branches:
exp.2022.02.24a: Contains a fix for idle detection from Neeraj Upadhyay
and missing access marking detected by KCSAN.
fixes.2022.02.14a: Miscellaneous fixes.
rcu_barrier.2022.02.08a: Reduces coupling between rcu_barrier() and
CPU-hotplug operations, so that rcu_barrier() no longer needs
to do cpus_read_lock(). This may also someday allow system
boot to bring CPUs online concurrently.
rcu-tasks.2022.02.08a: Enable more aggressive movement to per-CPU
queueing when reacting to excessive lock contention due
to workloads placing heavy update-side stress on RCU tasks.
rt.2022.02.01b: Improvements to RCU priority boosting, including
changes from Neeraj Upadhyay, Zqiang, and Alison Chaiken.
torture.2022.02.01b: Various fixes improving test robustness and
debug information.
torturescript.2022.02.08a: Add tests for SRCU size transitions, further
compress torture.sh build products, and improve debug output.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmIusb0THHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jAklD/9VXLK7crcg2YeRXUIg1IOdnancsVCV
MNtTfxNYqYIis+W2UfuHKuQu2yEXF5fihdY0J9TQv0byHsprp6FIZT+i1An4Ukgd
0vyHjd/DaIKgs2txsB1DjhlatWlJUfQuBwhtNUkpYFLFwKdCI1l813bPbNlL+GiL
p0ZejVMpBC5HgE6sDOtaaQSAB+AEUp+Lgr+yaG/On8hfzwWFKO8KldxhiKY9n07v
SNDfKDgXB+80hx4RBVGbkuogV3s9brFULoNRXJy7Uf79DtiY09uazhhA3G0TjO34
zGwmF91dqsXDF/Uz8g4aZO0xYRXUchOrsQ5lgO/GhTVbM9I0wWlMHEk/8WHyBJkU
vlXOMuwzBc9/5uwZE3rnkA4a3nkXhPQjLlCr+/I7A/7Vsv9IBW9WSlgMvUN0Qf4S
XAwTnIqfErnR60a+L0+HRr5kIV5VoXcxqI/Nv0/4/BMLRubS/c7cYjOTxXNJL9SU
50pv5vty9xk3HSpuz0JAOyLf+PUT773uUQhFr5xCBSCVqbAm5WFg6hWPAgrN/tUS
wstBc0wlA73rKVJxeLDQwHc/oT1zTUEzswVZITQ5zLHK0t0GbeR6QHccsdeaJyTe
DisX+66A6YQrEuJmx5xUZqjYHqtYLDOBTbHA3ZwQmvjKu8ibWZ8Fg9ioURLCS4bF
+FVkp/5KdcAN9w==
=ljVY
-----END PGP SIGNATURE-----
Merge tag 'rcu.2022.03.13a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU updates from Paul McKenney:
- Fix idle detection (Neeraj Upadhyay) and missing access marking
detected by KCSAN.
- Reduce coupling between rcu_barrier() and CPU-hotplug operations, so
that rcu_barrier() no longer needs to do cpus_read_lock(). This may
also someday allow system boot to bring CPUs online concurrently.
- Enable more aggressive movement to per-CPU queueing when reacting to
excessive lock contention due to workloads placing heavy update-side
stress on RCU tasks.
- Improvements to RCU priority boosting, including changes from Neeraj
Upadhyay, Zqiang, and Alison Chaiken.
- Various fixes improving test robustness and debug information.
- Add tests for SRCU size transitions, further compress torture.sh
build products, and improve debug output.
- Miscellaneous fixes.
* tag 'rcu.2022.03.13a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (49 commits)
rcu: Replace cpumask_weight with cpumask_empty where appropriate
rcu: Remove __read_mostly annotations from rcu_scheduler_active externs
rcu: Uninline multi-use function: finish_rcuwait()
rcu: Mark writes to the rcu_segcblist structure's ->flags field
kasan: Record work creation stack trace with interrupts enabled
rcu: Inline __call_rcu() into call_rcu()
rcu: Add mutex for rcu boost kthread spawning and affinity setting
rcu: Fix description of kvfree_rcu()
MAINTAINERS: Add Frederic and Neeraj to their RCU files
rcutorture: Provide non-power-of-two Tasks RCU scenarios
rcutorture: Test SRCU size transitions
torture: Make torture.sh help message match reality
rcu-tasks: Set ->percpu_enqueue_shift to zero upon contention
rcu-tasks: Use order_base_2() instead of ilog2()
rcu: Create and use an rcu_rdp_cpu_online()
rcu: Make rcu_barrier() no longer block CPU-hotplug operations
rcu: Rework rcu_barrier() and callback-migration logic
rcu: Refactor rcu_barrier() empty-list handling
rcu: Kill rnp->ofl_seq and use only rcu_state.ofl_lock for exclusion
torture: Change KVM environment variable to RCUTORTURE
...
PMD_SIZE is not available in some special config, e.g. ARCH=arm with
CONFIG_MMU=n. Use bpf_prog_pack of PAGE_SIZE in these cases.
Fixes: ef078600ee ("bpf: Select proper size for bpf_prog_pack")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220321180009.1944482-3-song@kernel.org
module_alloc requires num_online_nodes * PMD_SIZE to allocate huge pages.
bpf_prog_pack uses pack of size num_online_nodes * PMD_SIZE.
OTOH, module_alloc returns addresses that are PMD_SIZE aligned (instead of
num_online_nodes * PMD_SIZE aligned). Therefore, PMD_MASK should be used
to calculate pack_ptr in bpf_prog_pack_free().
Fixes: ef078600ee ("bpf: Select proper size for bpf_prog_pack")
Reported-by: syzbot+c946805b5ce6ab87df0b@syzkaller.appspotmail.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220321180009.1944482-2-song@kernel.org
Core code:
- Provide generic_handle_irq_safe() which can be invoked from any
context (hard interrupt or threaded). This allows to remove ugly
workarounds in drivers all over the place.
- Use generic_handle_irq_safe() in the affected drivers.
- The usual cleanups and improvements.
Interrupt chip drivers:
- Support for new interrupt chips or not yet supported variants:
STM32MP14, Meson GPIO, Apple M1 PMU, Apple M1 AICv2, Qualcomm MPM
- Convert the Xilinx driver to generic interrupt domains
- Cleanup the irq_chip::name handling
- The usual cleanups and improvements all over the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmI4VmETHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoYkQEAC8Lo4tGLVKqcDewYLrLLtUk2GrbXW0
j/L0Su6xcdyCV6h2i8oB7SHBDP6nJ9dHOqR+Rk9/XZrDV5e32cnsY4TuipZ271B7
H+6VNykO60cbhRox+hHqe4bHO3qOdPUJvZ5DBt3+TC5T4P3/Il9DUIxWrDliqAzI
KewgNwGFXLmuvoTGpGoEKRbAHEXwYy6WMwDSmDhMl0CQOE6i5JM5Fkt25hQppq+T
4LZ3Peb4LoBtQJnT+VTgb3ZTlSJSVb82qR0mkldLmIGtdOD9sG8DhCYNzoB09hKz
Wiz+pQWrVcfX4H3lHwhInlaWU2Li/3LQ9hBqR1Kdm9NvT1e2eiVkPJGZc6sDsrg9
tAUutG6OXJpw0x9zvO98rfUiUg1Z9Ofnb5eR9NZi8/IMLqlxQ/HTpP1RBaDPmZ+V
srU16y7tX7wyqv/Le0yZlRY9y4TNHQPF+ffcu6dw4PFl4niXTNBzuVvkyCGNZl9C
5q6yu+0l/0scxoirzVEmWJOMQTY3TufObpTEY4j5e9Ji6xfM1JvL8rZkVV2DKZAw
NgJrkM+lFFdMMOfDCjqm5/b38LteU3etChwSnrgsSe/W26vIZP5eljKKAcxGDTzX
tdE5sq+EpjjHjFPOwjJNE6i37XucRtoGWGVL3xjbcsbLb2AVo4V7EbxVT3GBCsxL
oEGTL2hDt2UksQ==
=4eWv
-----END PGP SIGNATURE-----
Merge tag 'irq-core-2022-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull interrupt updates from Thomas Gleixner:
"Core code:
- Provide generic_handle_irq_safe() which can be invoked from any
context (hard interrupt or threaded). This allows to remove ugly
workarounds in drivers all over the place.
- Use generic_handle_irq_safe() in the affected drivers.
- The usual cleanups and improvements.
Interrupt chip drivers:
- Support for new interrupt chips or not yet supported variants:
STM32MP14, Meson GPIO, Apple M1 PMU, Apple M1 AICv2, Qualcomm MPM
- Convert the Xilinx driver to generic interrupt domains
- Cleanup the irq_chip::name handling
- The usual cleanups and improvements all over the place"
* tag 'irq-core-2022-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
irqchip: Add Qualcomm MPM controller driver
dt-bindings: interrupt-controller: Add Qualcomm MPM support
irqchip/apple-aic: Add support for AICv2
irqchip/apple-aic: Support multiple dies
irqchip/apple-aic: Dynamically compute register offsets
irqchip/apple-aic: Switch to irq_domain_create_tree and sparse hwirqs
irqchip/apple-aic: Add Fast IPI support
dt-bindings: interrupt-controller: apple,aic2: New binding for AICv2
PCI: apple: Change MSI handling to handle 4-cell AIC fwspec form
irqchip/apple-aic: Fix cpumask allocation for FIQs
irqchip/meson-gpio: Add support for meson s4 SoCs
irqchip/meson-gpio: add select trigger type callback
irqchip/meson-gpio: support more than 8 channels gpio irq
dt-bindings: interrupt-controller: New binding for Meson-S4 SoCs
irqchip/xilinx: Switch to GENERIC_IRQ_MULTI_HANDLER
staging: greybus: gpio: Use generic_handle_irq_safe().
net: usb: lan78xx: Use generic_handle_irq_safe().
mfd: ezx-pcap: Use generic_handle_irq_safe().
misc: hi6421-spmi-pmic: Use generic_handle_irq_safe().
irqchip/sifive-plic: Disable S-mode IRQs if running in M-mode
...
Core code:
- Make the NOHZ handling of the timekeeping/tick core more robust to
prevent a rare jiffies update stall.
- Handle softirqs in the NOHZ/idle case correctly
Drivers:
- Add support for event stream scaling of the 1GHz counter on ARM(64)
- Correct an error code check in the timer-of layer
- The usual cleanups and improvements all over the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmI4V8ITHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoV+PEACQHWflNPejoqcgyf0iwRlwo1s02a5M
1bMFrSqkFvMqkpnOgz29Ux9NvDK7qR1h+j00Nx0whiNZfigitkDuChj7gdiL7zBW
LpD0DNS2N1p3YJa2hnF/+3Lp4oDDI86SXJ8nza6aaXmy2ecm5EknislAgi4sZmE7
rE1h/jh6+s94ZOZrS2zjqFjuHVVaoJuqDafVSWOOVfiy3wEraDZ+KVvb+e9cXouZ
LfYzHDBw91lmVo6uXLRpEB2CRMBMj4f6ydFmfgKNsutB+i/3VOKyxuLdOjUBpOUx
P2UmeRXvmIegNRpyAEa+Y3N35qtpDoeHq8q0kJ2CLxiHrU4RiEmZAT1dNO+h/q8O
RjA4+cKphciZQRvySV1p9WTpbMEP5ZHAM7NG0k2VPT13aNki7wcs9CVOea/SE2nK
LnCEX1dP9UXT2JzBVrrtQp94zNa+tgKkmtdyuMES/dgPZPXT2kQ14/OUMTVb/28S
KiQjNVFYUKr3oMoTal7pyfybBb7TJzTYT93Klk2bNJ47XCVJ1GuNtLoUsabvMCD8
2weqCE43drj19IAsQbwzxUh0OySJF7ahzllJ+dpuCi4ghal8x+vkD57p+0eMb5mT
Ct2s0FSboOq7bBQM33kiFNBKdA1fJx51/TfZaq1zlwWYhbzHKM8rBVCvMeloK2k+
GlkVb4H2FJXSNA==
=sdyh
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2022-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer and timekeeping updates from Thomas Gleixner:
"Core code:
- Make the NOHZ handling of the timekeeping/tick core more robust to
prevent a rare jiffies update stall.
- Handle softirqs in the NOHZ/idle case correctly
Drivers:
- Add support for event stream scaling of the 1GHz counter on ARM(64)
- Correct an error code check in the timer-of layer
- The usual cleanups and improvements all over the place"
* tag 'timers-core-2022-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
lib/irq_poll: Declare IRQ_POLL softirq vector as ksoftirqd-parking safe
tick/rcu: Stop allowing RCU_SOFTIRQ in idle
tick/rcu: Remove obsolete rcu_needs_cpu() parameters
tick: Detect and fix jiffies update stall
clocksource/drivers/timer-of: Check return value of of_iomap in timer_of_base_init()
clocksource/drivers/timer-microchip-pit64b: Use 5MHz for clockevent
clocksource/drivers/timer-microchip-pit64b: Use notrace
clocksource/drivers/timer-microchip-pit64b: Remove mmio selection
dt-bindings: timer: Tegra: Convert text bindings to yaml
clocksource/drivers/imx-tpm: Move tpm_read_sched_clock() under CONFIG_ARM
clocksource/drivers/arm_arch_timer: Use event stream scaling when available
clocksource/drivers/exynos_mct: Increase the size of name array
clocksource/drivers/exynos_mct: Bump up mct max irq number
clocksource/drivers/exynos_mct: Remove mct interrupt index enum
clocksource/drivers/exynos_mct: Handle DTS with higher number of interrupts
clocksource/drivers/timer-ti-dm: Fix regression from errata i940 fix
clocksource/drivers/imx-tpm: Exclude sched clock for ARM64
clocksource: Add a Kconfig option for WATCHDOG_MAX_SKEW
clocksource/drivers/imx-tpm: Update name of clkevt
clocksource/drivers/imx-tpm: Add CLOCK_EVT_FEAT_DYNIRQ
...
- Reduce the amount of work to release a task stack in context
switch. There is no real reason to do cgroup accounting and memory
freeing in this performance sensitive context. Aside of this the
invoked functions cannot be called from this preemption disabled
context on PREEMPT_RT enabled kernels. Solve this by moving the
accounting into do_exit() and delaying the freeing of the stack unless
the vmap stack can be cached.
- Provide a mechanism to delay raising signals from atomic context on
PREEMPT_RT enabled kernels as sighand::lock cannot be acquired. Store
the information in the task struct and raise it in the exit path.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmI4U6gTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoSpkEACwgaaQUbqVrpw5yb6LbwzUPnjEdFNN
uUQCv0XZD8LWbfhcQQVSPWGho7S/w2Mkpdhi0DkVb2K0dkB7EvITSNEC4KoS/yez
8iQBpv6Lm00quHdNLjkQySSZ4NYB8M1GasBI7zSBjROK/+sRqioTPQsM0oDemGmD
uMvw0dgDJRlB8X4LZv0xuJbYLdSzu2VOlWd5aJG9BUgHkd7PfUWMlHsa29FP0hkP
A5yziOnr9kMsmCAsgmiyDW/GmefrEealby5M/jgnxTruF/OLnDsP+PYMlws47fPx
g6xpHkT5H0zQJ/nMJtK2JAlxpnbIl4cLuUnpn7wX316yjBpP2s3Pw04AVdzPPoBa
ufAoOLFtnrKN6enIqLWaJHGAsBHEULw6d3/7HoAEQOVWChnQSuWOob8z0QDbvM14
kKtz+LTrO+P5a15fd4g5+9lFBXJUTnF74SYQNwxIm2cV9hxrf15NhAr8yg+RtUvF
/ilNNAFtXkASLqs9moEi7U+GyBYwemG+gduVZ3Dw8FBxK/vHmDrhlItcZdKom+UJ
k4VFDVhzd2GYRHMrcaLfkCYew6ou+LD/rjdPhIU9OQHgILIMLY5aLqxDuyPtHqDz
TEyF5qsL4wYLIUdsWlqyHISqQQ6LfnpIyko5kb2Zt56sYtrcZr8swDy+yimiEOdL
G4BzQu0nVbCLhw==
=uGTc
-----END PGP SIGNATURE-----
Merge tag 'core-core-2022-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core process handling RT latency updates from Thomas Gleixner:
- Reduce the amount of work to release a task stack in context switch.
There is no real reason to do cgroup accounting and memory freeing in
this performance sensitive context.
Aside of this the invoked functions cannot be called from this
preemption disabled context on PREEMPT_RT enabled kernels. Solve this
by moving the accounting into do_exit() and delaying the freeing of
the stack unless the vmap stack can be cached.
- Provide a mechanism to delay raising signals from atomic context on
PREEMPT_RT enabled kernels as sighand::lock cannot be acquired. Store
the information in the task struct and raise it in the exit path.
* tag 'core-core-2022-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
signal, x86: Delay calling signals in atomic on RT enabled kernels
fork: Use IS_ENABLED() in account_kernel_stack()
fork: Only cache the VMAP stack in finish_task_switch()
fork: Move task stack accounting to do_exit()
fork: Move memcg_charge_kernel_stack() into CONFIG_VMAP_STACK
fork: Don't assign the stack pointer in dup_task_struct()
fork, IA64: Provide alloc_thread_stack_node() for IA64
fork: Duplicate task_struct before stack allocation
fork: Redo ifdefs around task stack handling
- Simplify the PASID handling to allocate the PASID once, associate it to
the mm of a process and free it on mm_exit(). The previous attempt of
refcounted PASIDs and dynamic alloc()/free() turned out to be error
prone and too complex. The PASID space is 20bits, so the case of
resource exhaustion is a pure academic concern.
- Populate the PASID MSR on demand via #GP to avoid racy updates via IPIs.
- Reenable ENQCMD and let objtool check for the forbidden usage of ENQCMD
in the kernel.
- Update the documentation for Shared Virtual Addressing accordingly.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmI4WpETHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoUfnD/0bY94rgEX4Uuy/mFQ1W8X8XlcyKrha
0/cRATb+4QV/pwJgGr2nClKhGlFMYPdJLvKMC1TCUPCVrLD1RNmluIZoFzeqXwhm
jDdCcFOuGZ2D4ujDPWwOOpKBT1ytovnQa7+lH6QJyKkEqdcC2ncOvGJQoiRxRQIG
8wTVs/OUvQJ5ZhSZQMKQN4uMWMyHEjhbroYS30/uNi/598jTPgzlEoa14XocQ9Os
nS6ALvjuc9MsJ34F61etMaJU1ZMI3Wx75u9QjEvX6hmJs87YdvgwE7lzJUKFDEuh
gewM0wp2fTa8/azzP0eMiHTin56PqFdmllzRqXmilbZMEPOeI29dZVArCdpKcAn0
r9p1kJUT3Xl2G3Oir/OdCaaQHcznD1Y5ZFOyh12wgEucZ/rdeSr7nq7n5HoOL5Bw
Q2o6YvTkE9DOL0nTN1lSXGiPspou7fzX0uUcRBrbJUS3sBv4zGIlaJXUaTVnSdAt
VZj4LeOK7v2BjyeiOY0iaaIQd3xjmLUF0UjozXS5M13SoVcToZRbyWqhDzPvNuKA
imQb/dnFpXhABgmuqAiJLeqM0VtGMFNc780OURkcsBSPng+iSEdV4DzuhK0jpU8x
Uk1RuGMd/vgmrlDFBrw+orQQiiKR1ixpI0LiHfcOBycfJhqTwcnrNZvAN5/do28Z
E23+QzlUbZF0cw==
=Dy8V
-----END PGP SIGNATURE-----
Merge tag 'x86-pasid-2022-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 PASID support from Thomas Gleixner:
"Reenable ENQCMD/PASID support:
- Simplify the PASID handling to allocate the PASID once, associate
it to the mm of a process and free it on mm_exit().
The previous attempt of refcounted PASIDs and dynamic
alloc()/free() turned out to be error prone and too complex. The
PASID space is 20bits, so the case of resource exhaustion is a pure
academic concern.
- Populate the PASID MSR on demand via #GP to avoid racy updates via
IPIs.
- Reenable ENQCMD and let objtool check for the forbidden usage of
ENQCMD in the kernel.
- Update the documentation for Shared Virtual Addressing accordingly"
* tag 'x86-pasid-2022-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Documentation/x86: Update documentation for SVA (Shared Virtual Addressing)
tools/objtool: Check for use of the ENQCMD instruction in the kernel
x86/cpufeatures: Re-enable ENQCMD
x86/traps: Demand-populate PASID MSR via #GP
sched: Define and initialize a flag to identify valid PASID in the task
x86/fpu: Clear PASID when copying fpstate
iommu/sva: Assign a PASID to mm on PASID allocation and free it on mm exit
kernel/fork: Initialize mm's PASID
iommu/ioasid: Introduce a helper to check for valid PASIDs
mm: Change CONFIG option for mm->pasid field
iommu/sva: Rename CONFIG_IOMMU_SVA_LIB to CONFIG_IOMMU_SVA
Instead of declaring a struct page_vma_mapped_walk directly,
use these helpers to allow us to transition to a PFN approach in the
following patches.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
invalidate_inode_page() is the only caller of invalidate_complete_page()
and inlining it reveals that the first check is unnecessary (because we
hold the page locked, and we just retrieved the mapping from the page).
Actually, it does make a difference, in that tail pages no longer fail
at this check, so it's now possible to remove a tail page from a mapping.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Andrii reported that backtraces from kprobe_multi program attached
as return probes are not complete and showing just initial entry [1].
It's caused by changing registers to have original function ip address
as instruction pointer even for return probe, which will screw backtrace
from return probe.
This change keeps registers intact and store original entry ip and
link address on the stack in bpf_kprobe_multi_run_ctx struct, where
bpf_get_func_ip and bpf_get_attach_cookie helpers for kprobe_multi
programs can find it.
[1] https://lore.kernel.org/bpf/CAEf4BzZDDqK24rSKwXNp7XL3ErGD4bZa1M6c_c4EvDSt3jrZcg@mail.gmail.com/T/#m8d1301c0ea0892ddf9dc6fba57a57b8cf11b8c51
Fixes: ca74823c6e ("bpf: Add cookie support to programs attached with kprobe multi link")
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220321070113.1449167-3-jolsa@kernel.org
This reverts commit 97ee4d20ee.
Following change is adding more complexity to bpf_get_func_ip
helper for kprobe_multi programs, which can't be inlined easily.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220321070113.1449167-2-jolsa@kernel.org
Replace offsetof(hdr_len) + sizeof(hdr_len) with offsetofend(hdr_len) to
simplify the check for correctness of btf_data_size in btf_parse_hdr()
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220320075240.1001728-1-ytcoode@gmail.com
In watch_queue_set_size(), the error cleanup code doesn't take account of
the fact that __free_page() can't handle a NULL pointer when trying to free
up buffer pages that did get allocated.
Fix this by only calling __free_page() on the pages actually allocated.
Without the fix, this can lead to something like the following:
BUG: KASAN: null-ptr-deref in __free_pages+0x1f/0x1b0 mm/page_alloc.c:5473
Read of size 4 at addr 0000000000000034 by task syz-executor168/3599
...
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
__kasan_report mm/kasan/report.c:446 [inline]
kasan_report.cold+0x66/0xdf mm/kasan/report.c:459
check_region_inline mm/kasan/generic.c:183 [inline]
kasan_check_range+0x13d/0x180 mm/kasan/generic.c:189
instrument_atomic_read include/linux/instrumented.h:71 [inline]
atomic_read include/linux/atomic/atomic-instrumented.h:27 [inline]
page_ref_count include/linux/page_ref.h:67 [inline]
put_page_testzero include/linux/mm.h:717 [inline]
__free_pages+0x1f/0x1b0 mm/page_alloc.c:5473
watch_queue_set_size+0x499/0x630 kernel/watch_queue.c:275
pipe_ioctl+0xac/0x2b0 fs/pipe.c:632
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Fixes: c73be61ced ("pipe: Add general notification queue support")
Reported-and-tested-by: syzbot+d55757faa9b80590767b@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
When CONFIG_DEBUG_INFO_BTF is disabled, bpf_get_btf_vmlinux can return a
NULL pointer. Check for it in btf_get_module_btf to prevent a NULL pointer
dereference.
While kernel test robot only complained about this specific case, let's
also check for NULL in other call sites of bpf_get_btf_vmlinux.
Fixes: 9492450fd2 ("bpf: Always raise reference in btf_get_module_btf")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220320143003.589540-1-memxor@gmail.com
Let's say that the caller has storage for num_elem stack frames. Then,
the BPF stack helper functions walk the stack for only num_elem frames.
This means that if skip > 0, one keeps only 'num_elem - skip' frames.
This is because it sets init_nr in the perf_callchain_entry to the end
of the buffer to save num_elem entries only. I believe it was because
the perf callchain code unwound the stack frames until it reached the
global max size (sysctl_perf_event_max_stack).
However it now has perf_callchain_entry_ctx.max_stack to limit the
iteration locally. This simplifies the code to handle init_nr in the
BPF callstack entries and removes the confusion with the perf_event's
__PERF_SAMPLE_CALLCHAIN_EARLY which sets init_nr to 0.
Also change the comment on bpf_get_stack() in the header file to be
more explicit what the return value means.
Fixes: c195651e56 ("bpf: add bpf_get_stack helper")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/30a7b5d5-6726-1cc2-eaee-8da2828a9a9c@oracle.com
Link: https://lore.kernel.org/bpf/20220314182042.71025-1-namhyung@kernel.org
Based-on-patch-by: Eugene Loh <eugene.loh@oracle.com>
Using HPAGE_PMD_SIZE as the size for bpf_prog_pack is not ideal in some
cases. Specifically, for NUMA systems, __vmalloc_node_range requires
PMD_SIZE * num_online_nodes() to allocate huge pages. Also, if the system
does not support huge pages (i.e., with cmdline option nohugevmalloc), it
is better to use PAGE_SIZE packs.
Add logic to select proper size for bpf_prog_pack. This solution is not
ideal, as it makes assumption about the behavior of module_alloc and
__vmalloc_node_range. However, it appears to be the easiest solution as
it doesn't require changes in module_alloc and vmalloc code.
Fixes: 57631054fa ("bpf: Introduce bpf_prog_pack allocator")
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220311201135.3573610-1-song@kernel.org
Currently, local storage memory can only be allocated atomically
(GFP_ATOMIC). This restriction is too strict for sleepable bpf
programs.
In this patch, the verifier detects whether the program is sleepable,
and passes the corresponding GFP_KERNEL or GFP_ATOMIC flag as a
5th argument to bpf_task/sk/inode_storage_get. This flag will propagate
down to the local storage functions that allocate memory.
Please note that bpf_task/sk/inode_storage_update_elem functions are
invoked by userspace applications through syscalls. Preemption is
disabled before bpf_task/sk/inode_storage_update_elem is called, which
means they will always have to allocate memory atomically.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: KP Singh <kpsingh@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20220318045553.3091807-2-joannekoong@fb.com
When an enum is used in the visible parts of a trace event that is
exported to user space, the user space applications like perf and
trace-cmd do not have a way to know what the value of the enum is. To
solve this, at boot up (or module load) the printk formats are modified to
replace the enum with their numeric value in the string output.
Array fields of the event are defined by [<nr-elements>] in the type
portion of the format file so that the user space parsers can correctly
parse the array into the appropriate size chunks. But in some trace
events, an enum is used in defining the size of the array, which once
again breaks the parsing of user space tooling.
This was solved the same way as the print formats were, but it modified
the type strings of the trace event. This caused crashes in some
architectures because, as supposed to the print string, is a const string
value. This was not detected on x86, as it appears that const strings are
still writable (at least in boot up), but other architectures this is not
the case, and writing to a const string will cause a kernel fault.
To fix this, use kstrdup() to copy the type before modifying it. If the
trace event is for the core kernel there's no need to free it because the
string will be in use for the life of the machine being on line. For
modules, create a link list to store all the strings being allocated for
modules and when the module is removed, free them.
Link: https://lore.kernel.org/all/yt9dr1706b4i.fsf@linux.ibm.com/
Link: https://lkml.kernel.org/r/20220318153432.3984b871@gandalf.local.home
Tested-by: Marc Zyngier <maz@kernel.org>
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Reported-by: Sven Schnelle <svens@linux.ibm.com>
Fixes: b3bc8547d3 ("tracing: Have TRACE_DEFINE_ENUM affect trace event types as well")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Align it with helpers like bpf_find_btf_id, so all functions returning
BTF in out parameter follow the same rule of raising reference
consistently, regardless of module or vmlinux BTF.
Adjust existing callers to handle the change accordinly.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220317115957.3193097-10-memxor@gmail.com
In next few patches, we need a helper that searches all kernel BTFs
(vmlinux and module BTFs), and finds the type denoted by 'name' and
'kind'. Turns out bpf_btf_find_by_name_kind already does the same thing,
but it instead returns a BTF ID and optionally fd (if module BTF). This
is used for relocating ksyms in BPF loader code (bpftool gen skel -L).
We extract the core code out into a new helper bpf_find_btf_id, which
returns the BTF ID in the return value, and BTF pointer in an out
parameter. The reference for the returned BTF pointer is always raised,
hence user must either transfer it (e.g. to a fd), or release it after
use.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220317115957.3193097-2-memxor@gmail.com
Merge changes related to system sleep, PM domains changes and power
management documentation changes for 5.18-rc1:
- Fix load_image_and_restore() error path (Ye Bin).
- Fix typos in comments in the system wakeup hadling code (Tom Rix).
- Clean up non-kernel-doc comments in hibernation code (Jiapeng
Chong).
- Fix __setup handler error handling in system-wide suspend and
hibernation core code (Randy Dunlap).
- Add device name to suspend_report_result() (Youngjin Jang).
- Make virtual guests honour ACPI S4 hardware signature by
default (David Woodhouse).
- Block power off of a parent PM domain unless child is in deepest
state (Ulf Hansson).
- Use dev_err_probe() to simplify error handling for generic PM
domains (Ahmad Fatoum).
- Fix sleep-in-atomic bug caused by genpd_debug_remove() (Shawn Guo).
- Document Intel uncore frequency scaling (Srinivas Pandruvada).
* pm-sleep:
PM: hibernate: Honour ACPI hardware signature by default for virtual guests
PM: sleep: Add device name to suspend_report_result()
PM: suspend: fix return value of __setup handler
PM: hibernate: fix __setup handler error handling
PM: hibernate: Clean up non-kernel-doc comments
PM: sleep: wakeup: Fix typos in comments
PM: hibernate: fix load_image_and_restore() error path
* pm-domains:
PM: domains: Fix sleep-in-atomic bug caused by genpd_debug_remove()
PM: domains: use dev_err_probe() to simplify error handling
PM: domains: Prevent power off for parent unless child is in deepest state
* pm-docs:
Documentation: admin-guide: pm: Document uncore frequency scaling
Merge cpufreq and cpuidle changes for 5.18-rc1:
- Make the schedutil cpufreq governor use to_gov_attr_set() instead
of open coding it (Kevin Hao).
- Replace acpi_bus_get_device() with acpi_fetch_acpi_dev() in the
cpufreq longhaul driver (Rafael Wysocki).
- Unify show() and store() naming in cpufreq and make it use
__ATTR_XX (Lianjie Zhang).
- Make the intel_pstate driver use the EPP value set by the firmware
by default (Srinivas Pandruvada).
- Re-order the init checks in the powernow-k8 cpufreq driver (Mario
Limonciello).
- Make the ACPI processor idle driver check for architectural
support for LPI to avoid using it on x86 by mistake (Mario
Limonciello).
- Add Sapphire Rapids Xeon support to the intel_idle driver (Artem
Bityutskiy).
- Add 'preferred_cstates' module argument to the intel_idle driver
to work around C1 and C1E handling issue on Sapphire Rapids (Artem
Bityutskiy).
- Add core C6 optimization on Sapphire Rapids to the intel_idle
driver (Artem Bityutskiy).
- Optimize the haltpoll cpuidle driver a bit (Li RongQing).
- Remove leftover text from intel_idle() kerneldoc comment and fix
up white space in intel_idle (Rafael Wysocki).
* pm-cpufreq:
cpufreq: powernow-k8: Re-order the init checks
cpufreq: intel_pstate: Use firmware default EPP
cpufreq: unify show() and store() naming and use __ATTR_XX
cpufreq: longhaul: Replace acpi_bus_get_device()
cpufreq: schedutil: Use to_gov_attr_set() to get the gov_attr_set
cpufreq: Move to_gov_attr_set() to cpufreq.h
* pm-cpuidle:
cpuidle: intel_idle: Drop redundant backslash at line end
cpuidle: intel_idle: Update intel_idle() kerneldoc comment
cpuidle: haltpoll: Call cpuidle_poll_state_init() later
intel_idle: add core C6 optimization for SPR
intel_idle: add 'preferred_cstates' module argument
intel_idle: add SPR support
ACPI: processor idle: Check for architectural support for LPI
cpuidle: PSCI: Move the `has_lpi` check to the beginning of the function
The signal a task should continue with after a ptrace stop is
inconsistently read, cleared, and sent. Solve this by reading and
clearing the signal to be sent in ptrace_stop.
In an ideal world everything except ptrace_signal would share a common
implementation of continuing with the signal, so ptracers could count
on the signal they ask to continue with actually being delivered. For
now retain bug compatibility and just return with the signal number
the ptracer requested the code continue with.
Link: https://lkml.kernel.org/r/875yoe7qdp.fsf_-_@email.froward.int.ebiederm.org
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Today ptrace_message is easy to overlook as it not a core part of
ptrace_stop. It has been overlooked so much that there are places
that set ptrace_message and don't clear it, and places that never set
it. So if you get an unlucky sequence of events the ptracer may be
able to read a ptrace_message that does not apply to the current
ptrace stop.
Move setting of ptrace_message into ptrace_stop so that it always gets
set before the stop, and always gets cleared after the stop. This
prevents non-sense from being reported to userspace and makes
ptrace_message more visible in the ptrace helper functions so that
kernel developers can see it.
Link: https://lkml.kernel.org/r/87bky67qfv.fsf_-_@email.froward.int.ebiederm.org
Acked-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Adding support to call bpf_get_attach_cookie helper from
kprobe programs attached with kprobe multi link.
The cookie is provided by array of u64 values, where each
value is paired with provided function address or symbol
with the same array index.
When cookie array is provided it's sorted together with
addresses (check bpf_kprobe_multi_cookie_swap). This way
we can find cookie based on the address in
bpf_get_attach_cookie helper.
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220316122419.933957-7-jolsa@kernel.org
Adding support to call bpf_get_func_ip helper from kprobe
programs attached by multi kprobe link.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220316122419.933957-5-jolsa@kernel.org
Adding new link type BPF_LINK_TYPE_KPROBE_MULTI that attaches kprobe
program through fprobe API.
The fprobe API allows to attach probe on multiple functions at once
very fast, because it works on top of ftrace. On the other hand this
limits the probe point to the function entry or return.
The kprobe program gets the same pt_regs input ctx as when it's attached
through the perf API.
Adding new attach type BPF_TRACE_KPROBE_MULTI that allows attachment
kprobe to multiple function with new link.
User provides array of addresses or symbols with count to attach the
kprobe program to. The new link_create uapi interface looks like:
struct {
__u32 flags;
__u32 cnt;
__aligned_u64 syms;
__aligned_u64 addrs;
} kprobe_multi;
The flags field allows single BPF_TRACE_KPROBE_MULTI bit to create
return multi kprobe.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220316122419.933957-4-jolsa@kernel.org
When kallsyms_lookup_name is called with empty string,
it will do futile search for it through all the symbols.
Skipping the search for empty string.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220316122419.933957-3-jolsa@kernel.org
Introduce FPROBE_FL_KPROBE_SHARED flag for sharing fprobe callback with
kprobes safely from the viewpoint of recursion.
Since the recursion safety of the fprobe (and ftrace) is a bit different
from the kprobes, this may cause an issue if user wants to run the same
code from the fprobe and the kprobes.
The kprobes has per-cpu 'current_kprobe' variable which protects the
kprobe handler from recursion in any case. On the other hand, the fprobe
uses only ftrace_test_recursion_trylock(), which will allow interrupt
context calls another (or same) fprobe during the fprobe user handler is
running.
This is not a matter in cases if the common callback shared among the
kprobes and the fprobe has its own recursion detection, or it can handle
the recursion in the different contexts (normal/interrupt/NMI.)
But if it relies on the 'current_kprobe' recursion lock, it has to check
kprobe_running() and use kprobe_busy_*() APIs.
Fprobe has FPROBE_FL_KPROBE_SHARED flag to do this. If your common callback
code will be shared with kprobes, please set FPROBE_FL_KPROBE_SHARED
*before* registering the fprobe, like;
fprobe.flags = FPROBE_FL_KPROBE_SHARED;
register_fprobe(&fprobe, "func*", NULL);
This will protect your common callback from the nested call.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/164735293127.1084943.15687374237275817599.stgit@devnote2
Add exit_handler to fprobe. fprobe + rethook allows us to hook the kernel
function return. The rethook will be enabled only if the
fprobe::exit_handler is set.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/164735290790.1084943.10601965782208052202.stgit@devnote2
Add a return hook framework which hooks the function return. Most of the
logic came from the kretprobe, but this is independent from kretprobe.
Note that this is expected to be used with other function entry hooking
feature, like ftrace, fprobe, adn kprobes. Eventually this will replace
the kretprobe (e.g. kprobe + rethook = kretprobe), but at this moment,
this is just an additional hook.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/164735285066.1084943.9259661137330166643.stgit@devnote2
The fprobe is a wrapper API for ftrace function tracer.
Unlike kprobes, this probes only supports the function entry, but this
can probe multiple functions by one fprobe. The usage is similar, user
will set their callback to fprobe::entry_handler and call
register_fprobe*() with probed functions.
There are 3 registration interfaces,
- register_fprobe() takes filtering patterns of the functin names.
- register_fprobe_ips() takes an array of ftrace-location addresses.
- register_fprobe_syms() takes an array of function names.
The registered fprobes can be unregistered with unregister_fprobe().
e.g.
struct fprobe fp = { .entry_handler = user_handler };
const char *targets[] = { "func1", "func2", "func3"};
...
ret = register_fprobe_syms(&fp, targets, ARRAY_SIZE(targets));
...
unregister_fprobe(&fp);
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/164735283857.1084943.1154436951479395551.stgit@devnote2
Adding ftrace_set_filter_ips function to be able to set filter on
multiple ip addresses at once.
With the kprobe multi attach interface we have cases where we need to
initialize ftrace_ops object with thousands of functions, so having
single function diving into ftrace_hash_move_and_update_ops with
ftrace_lock is faster.
The functions ips are passed as unsigned long array with count.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/164735282673.1084943.18310504594134769804.stgit@devnote2
module_put() is not called for a patch with "forced" flag. It should
block the removal of the livepatch module when the code might still
be in use after forced transition.
klp_force_transition() currently sets "forced" flag for all patches on
the list.
In fact, any patch can be safely unloaded when it passed through
the consistency model in KLP_UNPATCHED transition.
In other words, the "forced" flag must be set only for livepatches
that are being removed. In particular, set the "forced" flag:
+ only for klp_transition_patch when the transition to KLP_UNPATCHED
state was forced.
+ all replaced patches when the transition to KLP_PATCHED state was
forced and the patch was replacing the existing patches.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>
[mbenes@suse.cz: wording improvements]
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220312152220.88127-1-zhouchengming@bytedance.com
Previously, I failed to realize that Kees' patch [1] has not been merged
into the mainline yet, and dropped DEBUG_INFO=y too eagerly from the
mainline. As the results, "make debug.config" won't be able to flip
DEBUG_INFO=n from the existing .config. This should close the gaps of a
few weeks before Kees' patch is there, and work regardless of their
merging status anyway.
Link: https://lore.kernel.org/all/20220125075126.891825-1-keescook@chromium.org/ [1]
Link: https://lkml.kernel.org/r/20220308153524.8618-1-quic_qiancai@quicinc.com
Signed-off-by: Qian Cai <quic_qiancai@quicinc.com>
Reported-by: Daniel Thompson <daniel.thompson@linaro.org>
Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is the bpf_jit_harden counterpart to commit 60b58afc96 ("bpf: fix
net.core.bpf_jit_enable race"). bpf_jit_harden will be tested twice
for each subprog if there are subprogs in bpf program and constant
blinding may increase the length of program, so when running
"./test_progs -t subprogs" and toggling bpf_jit_harden between 0 and 2,
jit_subprogs may fail because constant blinding increases the length
of subprog instructions during extra passs.
So cache the value of bpf_jit_blinding_enabled() during program
allocation, and use the cached value during constant blinding, subprog
JITing and args tracking of tail call.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220309123321.2400262-4-houtao1@huawei.com
Tracefs by default is locked down heavily. System operators can open up
some files, such as user_events to a broader set of users. These users
do not have access within tracefs beyond just the user_event files. Due
to this restriction the trace_add_event_call/remove calls will silently
fail since the caller does not have permissions to create directories.
To fix this trace_add_event_call/remove calls will be issued with
override creds of the global root UID. Creds are reverted immediately
afterward.
Link: https://lkml.kernel.org/r/20220308222807.2040-1-beaub@linux.microsoft.com
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
In order to allow kprobes to skip the ENDBR instructions at sym+0 for
X86_KERNEL_IBT builds, change _kprobe_addr() to take an architecture
callback to inspect the function at hand and modify the offset if
needed.
This streamlines the existing interface to cover more cases and
require less hooks. Once PowerPC gets fully converted there will only
be the one arch hook.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20220308154318.405947704@infradead.org
Currently livepatch assumes __fentry__ lives at func+0, which is most
likely untrue with IBT on. Instead make it use ftrace_location() by
default which both validates and finds the actual ip if there is any
in the same symbol.
Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20220308154318.285971256@infradead.org
Currently a lot of ftrace code assumes __fentry__ is at sym+0. However
with Intel IBT enabled the first instruction of a function will most
likely be ENDBR.
Change ftrace_location() to not only return the __fentry__ location
when called for the __fentry__ location, but also when called for the
sym+0 location.
Then audit/update all callsites of this function to consistently use
these new semantics.
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20220308154318.227581603@infradead.org
- Add support for the STM32MP13 variant
- Move parent device away from struct irq_chip
- Remove all instances of non-const strings assigned to
struct irq_chip::name, enabling a nice cleanup for VIC and GIC)
- Simplify the Qualcomm PDC driver
- A bunch of SiFive PLIC cleanups
- Add support for a new variant of the Meson GPIO block
- Add support for the irqchip side of the Apple M1 PMU
- Add support for the Apple M1 Pro/Max AICv2 irqchip
- Add support for the Qualcomm MPM wakeup gadget
- Move the Xilinx driver over to the generic irqdomain handling
- Tiny speedup for IPIs on GICv3 systems
- The usual odd cleanups
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmItxJMPHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpD7ccQAIkkNoC6yQ+9lhbdRrlo6KUtUT2apDheIF+5
Yfo7dTeKMUb4NpQs+b4v01A0B3KSLPuwTulWfGXhsLRXVcfEEnkBCQzy/IQnkYTQ
DDvxENRz40SS0WJF1G74a7KsqHt+epyHZkB6KJQV4BYrZKxt2h0tWNSiNf1IDN/e
9mZq2kLgEk0kfRCR9u6NYGMugbrgbdtiLgwBARKdRtAAkjBlGEtC2slp0a3WTsyg
QfnMWMOK22wa34eZzFG8VrJMVwGyeqMP/ZW30EoClBzPyLUM5aZWRr+LSvLYQC4n
ho6ua1+a2726TBT6vtWNi0KDNcXwhL6JheO4m2bCoWPvu4YengfKQ5QllAFvSR3W
e4oT/xwkBcf+n5ehXEfxqTRRxG398oWYI60kX586dIcr9qN9WBsw1S5aPkDeZ+nT
6THbQ5uZrIqkeWOoJmvg+iwKkE/NQY/xUENW0zeG2f4/YLIGeKK7e1/XCl1jqzlk
vIvf/bYr64TgOvvHhIeh1G5iXQnk1TWoCzW0DQ8BIXhjlbVRG39QuvwjXKok4AhK
QgKMi6N1ge4nKO1gcYbR174gDz+MylZP41ddDACVXT/5hzsfyxLF36ixdyMLKwtr
Lybb4PGB5Pf0Zgxu6cVWeVsEZEwtlMCmIi1XUW4YRv2saypTPD5V78Ug6jbyPMXE
G7J5dxwS
=cf1B
-----END PGP SIGNATURE-----
Merge tag 'irqchip-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core
Pull irqchip updates from Marc Zyngier:
- Add support for the STM32MP13 variant
- Move parent device away from struct irq_chip
- Remove all instances of non-const strings assigned to
struct irq_chip::name, enabling a nice cleanup for VIC and GIC)
- Simplify the Qualcomm PDC driver
- A bunch of SiFive PLIC cleanups
- Add support for a new variant of the Meson GPIO block
- Add support for the irqchip side of the Apple M1 PMU
- Add support for the Apple M1 Pro/Max AICv2 irqchip
- Add support for the Qualcomm MPM wakeup gadget
- Move the Xilinx driver over to the generic irqdomain handling
- Tiny speedup for IPIs on GICv3 systems
- The usual odd cleanups
Link: https://lore.kernel.org/all/20220313105142.704579-1-maz@kernel.org
for spdx, add a space before //
replacements
judgement to judgment
transofrmed to transformed
partitition to partition
histrical to historical
migratecd to migrated
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Merge misc fixes from David Howells:
"A set of patches for watch_queue filter issues noted by Jann. I've
added in a cleanup patch from Christophe Jaillet to convert to using
formal bitmap specifiers for the note allocation bitmap.
Also two filesystem fixes (afs and cachefiles)"
* emailed patches from David Howells <dhowells@redhat.com>:
cachefiles: Fix volume coherency attribute
afs: Fix potential thrashing in afs writeback
watch_queue: Make comment about setting ->defunct more accurate
watch_queue: Fix lack of barrier/sync/lock between post and read
watch_queue: Free the alloc bitmap when the watch_queue is torn down
watch_queue: Fix the alloc bitmap size to reflect notes allocated
watch_queue: Use the bitmap API when applicable
watch_queue: Fix to always request a pow-of-2 pipe ring size
watch_queue: Fix to release page in ->release()
watch_queue, pipe: Free watchqueue state after clearing pipe ring
watch_queue: Fix filter limit check
watch_queue_clear() has a comment stating that setting ->defunct to true
preventing new additions as well as preventing notifications. Whilst
the latter is true, the first bit is superfluous since at the time this
function is called, the pipe cannot be accessed to add new event
sources.
Remove the "new additions" bit from the comment.
Fixes: c73be61ced ("pipe: Add general notification queue support")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's nothing to synchronise post_one_notification() versus
pipe_read(). Whilst posting is done under pipe->rd_wait.lock, the
reader only takes pipe->mutex which cannot bar notification posting as
that may need to be made from contexts that cannot sleep.
Fix this by setting pipe->head with a barrier in post_one_notification()
and reading pipe->head with a barrier in pipe_read().
If that's not sufficient, the rd_wait.lock will need to be taken,
possibly in a ->confirm() op so that it only applies to notifications.
The lock would, however, have to be dropped before copy_page_to_iter()
is invoked.
Fixes: c73be61ced ("pipe: Add general notification queue support")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, watch_queue_set_size() sets the number of notes available in
wqueue->nr_notes according to the number of notes allocated, but sets
the size of the bitmap to the unrounded number of notes originally asked
for.
Fix this by setting the bitmap size to the number of notes we're
actually going to make available (ie. the number allocated).
Fixes: c73be61ced ("pipe: Add general notification queue support")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use bitmap_alloc() to simplify code, improve the semantic and reduce
some open-coded arithmetic in allocator arguments.
Also change a memset(0xff) into an equivalent bitmap_fill() to keep
consistency.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The pipe ring size must always be a power of 2 as the head and tail
pointers are masked off by AND'ing with the size of the ring - 1.
watch_queue_set_size(), however, lets you specify any number of notes
between 1 and 511. This number is passed through to pipe_resize_ring()
without checking/forcing its alignment.
Fix this by rounding the number of slots required up to the nearest
power of two. The request is meant to guarantee that at least that many
notifications can be generated before the queue is full, so rounding
down isn't an option, but, alternatively, it may be better to give an
error if we aren't allowed to allocate that much ring space.
Fixes: c73be61ced ("pipe: Add general notification queue support")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a pipe ring descriptor points to a notification message, the
refcount on the backing page is incremented by the generic get function,
but the release function, which marks the bitmap, doesn't drop the page
ref.
Fix this by calling generic_pipe_buf_release() at the end of
watch_queue_pipe_buf_release().
Fixes: c73be61ced ("pipe: Add general notification queue support")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In watch_queue_set_filter(), there are a couple of places where we check
that the filter type value does not exceed what the type_filter bitmap
can hold. One place calculates the number of bits by:
if (tf[i].type >= sizeof(wfilter->type_filter) * 8)
which is fine, but the second does:
if (tf[i].type >= sizeof(wfilter->type_filter) * BITS_PER_LONG)
which is not. This can lead to a couple of out-of-bounds writes due to
a too-large type:
(1) __set_bit() on wfilter->type_filter
(2) Writing more elements in wfilter->filters[] than we allocated.
Fix this by just using the proper WATCH_TYPE__NR instead, which is the
number of types we actually know about.
The bug may cause an oops looking something like:
BUG: KASAN: slab-out-of-bounds in watch_queue_set_filter+0x659/0x740
Write of size 4 at addr ffff88800d2c66bc by task watch_queue_oob/611
...
Call Trace:
<TASK>
dump_stack_lvl+0x45/0x59
print_address_description.constprop.0+0x1f/0x150
...
kasan_report.cold+0x7f/0x11b
...
watch_queue_set_filter+0x659/0x740
...
__x64_sys_ioctl+0x127/0x190
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
Allocated by task 611:
kasan_save_stack+0x1e/0x40
__kasan_kmalloc+0x81/0xa0
watch_queue_set_filter+0x23a/0x740
__x64_sys_ioctl+0x127/0x190
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
The buggy address belongs to the object at ffff88800d2c66a0
which belongs to the cache kmalloc-32 of size 32
The buggy address is located 28 bytes inside of
32-byte region [ffff88800d2c66a0, ffff88800d2c66c0)
Fixes: c73be61ced ("pipe: Add general notification queue support")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add ftrace_boot_snapshot kernel parameter that will take a snapshot at the
end of boot up just before switching over to user space (it happens during
the kernel freeing of init memory).
This is useful when there's interesting data that can be collected from
kernel start up, but gets overridden by user space start up code. With
this option, the ring buffer content from the boot up traces gets saved in
the snapshot at the end of boot up. This trace can be read from:
/sys/kernel/tracing/snapshot
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The macro TRACE_DEFINE_ENUM is used to convert enums in the kernel to
their actual value when they are exported to user space via the trace
event format file.
Currently only the enums in the "print fmt" (TP_printk in the TRACE_EVENT
macro) have the enums converted. But the enums can be used to denote array
size:
field:unsigned int fc_ineligible_rc[EXT4_FC_REASON_MAX]; offset:12; size:36; signed:0;
The EXT4_FC_REASON_MAX has no meaning to userspace but it needs to know
that information to know how to parse the array.
Have the array indexes also be parsed as well.
Link: https://lore.kernel.org/all/cover.1646922487.git.riteshh@linux.ibm.com/
Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
0-day reported the strncpy error below:
../kernel/trace/trace_events_synth.c: In function 'last_cmd_set':
../kernel/trace/trace_events_synth.c:65:9: warning: 'strncpy' specified bound depends on the length o\
f the source argument [-Wstringop-truncation]
65 | strncpy(last_cmd, str, strlen(str) + 1);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../kernel/trace/trace_events_synth.c:65:32: note: length computed here
65 | strncpy(last_cmd, str, strlen(str) + 1);
| ^~~~~~~~~~~
There's no reason to use strncpy here, in fact there's no reason to do
anything but a simple kstrdup() (note we don't even need to check for
failure since last_cmod is expected to be either the last cmd string
or NULL, and the containing function is a void return).
Link: https://lkml.kernel.org/r/77deca8cbfd226981b3f1eab203967381e9b5bd9.camel@kernel.org
Fixes: 27c888da98 ("tracing: Remove size restriction on synthetic event cmd error logging")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Find user_events always while under the event_mutex and before leaving
the lock, add a ref count to the user_event. This ensures that all paths
under the event_mutex that check the ref counts will be synchronized.
The ioctl add/delete paths are protected by the reg_mutex. However,
dyn_event is only protected by the event_mutex. The dyn_event delete
path cannot acquire reg_mutex, since that could cause a deadlock between
the ioctl delete case acquiring event_mutex after acquiring the reg_mutex.
Link: https://lkml.kernel.org/r/20220310001141.1660-1-beaub@linux.microsoft.com
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Make bpf_lsm_kernel_read_file() as sleepable, so that bpf_ima_inode_hash()
or bpf_ima_file_hash() can be called inside the implementation of this
hook.
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220302111404.193900-8-roberto.sassu@huawei.com
ima_file_hash() has been modified to calculate the measurement of a file on
demand, if it has not been already performed by IMA or the measurement is
not fresh. For compatibility reasons, ima_inode_hash() remains unchanged.
Keep the same approach in eBPF and introduce the new helper
bpf_ima_file_hash() to take advantage of the modified behavior of
ima_file_hash().
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220302111404.193900-4-roberto.sassu@huawei.com
- Fix unregistering the same event twice a the user could disable
the event osnoise will disable on unregistering.
- Inform RCU of a quiescent state in the osnoise testing thread.
- Fix some kerneldoc comments.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYioe0BQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qohMAQCcnN4Iy9LcKX02A5HcuUxIdH3ktOqt
lbc24+nb6AO+/wD+Ilnlx/Ir09HgaL7NreSAc0tvLMwRgaM+grLA82IRTwc=
=quYh
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Minor tracing fixes:
- Fix unregistering the same event twice. A user could disable the
same event that osnoise will disable on unregistering.
- Inform RCU of a quiescent state in the osnoise testing thread.
- Fix some kerneldoc comments"
* tag 'trace-v5.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Fix some W=1 warnings in kernel doc comments
tracing/osnoise: Force quiescent states while tracing
tracing/osnoise: Do not unregister events twice
Now that all of the definitions have moved out of tracehook.h into
ptrace.h, sched/signal.h, resume_user_mode.h there is nothing left in
tracehook.h so remove it.
Update the few files that were depending upon tracehook.h to bring in
definitions to use the headers they need directly.
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20220309162454.123006-13-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Move set_notify_resume and tracehook_notify_resume into resume_user_mode.h.
While doing that rename tracehook_notify_resume to resume_user_mode_work.
Update all of the places that included tracehook.h for these functions to
include resume_user_mode.h instead.
Update all of the callers of tracehook_notify_resume to call
resume_user_mode_work.
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20220309162454.123006-12-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
There are a small handful of reasons besides pending signals that the
kernel might want to break out of interruptible sleeps. The flag
TIF_NOTIFY_SIGNAL and the helpers that set and clear TIF_NOTIFY_SIGNAL
provide that the infrastructure for breaking out of interruptible
sleeps and entering the return to user space slow path for those
cases.
Expand tracehook_notify_signal inline in it's callers and remove it,
which makes clear that TIF_NOTIFY_SIGNAL and task_work are separate
concepts.
Update the comment on set_notify_signal to more accurately describe
it's purpose.
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20220309162454.123006-9-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>