Add a folio equivalent for page_is_pfmemalloc. This removes two instances
of page_is_pfmemalloc(folio_page(folio, 0)) so the folio can be used
directly.
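For reference, a minimal sketch of what the new helper can look like,
mirroring the existing page_is_pfmemalloc() test of the lru.next bit
(treat the exact comment and placement as an assumption rather than the
merged code):

static inline bool folio_is_pfmemalloc(const struct folio *folio)
{
	/*
	 * lru.next has bit 1 set if the folio was allocated from the
	 * pfmemalloc reserves; callers that don't need this information
	 * may simply overwrite it.
	 */
	return (uintptr_t)folio->lru.next & BIT(1);
}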
Link: https://lkml.kernel.org/r/20230106215251.599222-1-sidhartha.kumar@oracle.com
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Merge tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- More userfaultfd work from Peter Xu
- Several convert-to-folios series from Sidhartha Kumar and Huang Ying
- Some filemap cleanups from Vishal Moola
- David Hildenbrand added the ability to selftest anon memory COW
handling
- Some cpuset simplifications from Liu Shixin
- Addition of vmalloc tracing support by Uladzislau Rezki
- Some pagecache folioifications and simplifications from Matthew
Wilcox
- A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use
it
- Miguel Ojeda contributed some cleanups for our use of the
__no_sanitize_thread__ gcc keyword.
This series should have been in the non-MM tree, my bad
- Naoya Horiguchi improved the interaction between memory poisoning and
memory section removal for huge pages
- DAMON cleanups and tuneups from SeongJae Park
- Tony Luck fixed the handling of COW faults against poisoned pages
- Peter Xu utilized the PTE marker code for handling swapin errors
- Hugh Dickins reworked compound page mapcount handling, simplifying it
and making it more efficient
- Removal of the autonuma savedwrite infrastructure from Nadav Amit and
David Hildenbrand
- zram support for multiple compression streams from Sergey Senozhatsky
- David Hildenbrand reworked the GUP code's R/O long-term pinning so
that drivers no longer need to use the FOLL_FORCE workaround which
didn't work very well anyway
- Mel Gorman altered the page allocator so that local IRQs can remain
enabled during per-cpu page allocations
- Vishal Moola removed the try_to_release_page() wrapper
- Stefan Roesch added some per-BDI sysfs tunables which are used to
prevent network block devices from dirtying excessive amounts of
pagecache
- David Hildenbrand did some cleanup and repair work on KSM COW
breaking
- Nhat Pham and Johannes Weiner have implemented writeback in zswap's
zsmalloc backend
- Brian Foster has fixed a longstanding corner-case oddity in
file[map]_write_and_wait_range()
- sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang
Chen
- Shiyang Ruan has done some work on fsdax, to make its reflink mode
work better under xfstests. Better, but still not perfect
- Christoph Hellwig has removed the .writepage() method from several
filesystems. They only need .writepages()
- Yosry Ahmed wrote a series which fixes the memcg reclaim target
beancounting
- David Hildenbrand has fixed some of our MM selftests for 32-bit
machines
- Many singleton patches, as usual
* tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (313 commits)
mm/hugetlb: set head flag before setting compound_order in __prep_compound_gigantic_folio
mm: mmu_gather: allow more than one batch of delayed rmaps
mm: fix typo in struct pglist_data code comment
kmsan: fix memcpy tests
mm: add cond_resched() in swapin_walk_pmd_entry()
mm: do not show fs mm pc for VM_LOCKONFAULT pages
selftests/vm: ksm_functional_tests: fixes for 32bit
selftests/vm: cow: fix compile warning on 32bit
selftests/vm: madv_populate: fix missing MADV_POPULATE_(READ|WRITE) definitions
mm/gup_test: fix PIN_LONGTERM_TEST_READ with highmem
mm,thp,rmap: fix races between updates of subpages_mapcount
mm: memcg: fix swapcached stat accounting
mm: add nodes= arg to memory.reclaim
mm: disable top-tier fallback to reclaim on proactive reclaim
selftests: cgroup: make sure reclaim target memcg is unprotected
selftests: cgroup: refactor proactive reclaim code to reclaim_until()
mm: memcg: fix stale protection of reclaim target memcg
mm/mmap: properly unaccount memory on mas_preallocate() failure
omfs: remove ->writepage
jfs: remove ->writepage
...
Merge tag 'linux-kselftest-kunit-next-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull KUnit updates from Shuah Khan:
"Several enhancements, fixes, clean-ups, documentation updates,
improvements to logging and KTAP compliance of KUnit test output:
- log numbers in decimal and hex
- parse KTAP compliant test output
- allow conditionally exposing static symbols to tests when KUNIT is
enabled
- make static symbols visible during kunit testing
- clean-ups to remove unused structure definition"
* tag 'linux-kselftest-kunit-next-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (29 commits)
Documentation: dev-tools: Clarify requirements for result description
apparmor: test: make static symbols visible during kunit testing
kunit: add macro to allow conditionally exposing static symbols to tests
kunit: tool: make parser preserve whitespace when printing test log
Documentation: kunit: Fix "How Do I Use This" / "Next Steps" sections
kunit: tool: don't include KTAP headers and the like in the test log
kunit: improve KTAP compliance of KUnit test output
kunit: tool: parse KTAP compliant test output
mm: slub: test: Use the kunit_get_current_test() function
kunit: Use the static key when retrieving the current test
kunit: Provide a static key to check if KUnit is actively running tests
kunit: tool: make --json do nothing if --raw_ouput is set
kunit: tool: tweak error message when no KTAP found
kunit: remove KUNIT_INIT_MEM_ASSERTION macro
Documentation: kunit: Remove redundant 'tips.rst' page
Documentation: KUnit: reword description of assertions
Documentation: KUnit: make usage.rst a superset of tips.rst, remove duplication
kunit: eliminate KUNIT_INIT_*_ASSERT_STRUCT macros
kunit: tool: remove redundant file.close() call in unit test
kunit: tool: unit tests all check parser errors, standardize formatting a bit
...
Merge tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random
Pull random number generator updates from Jason Donenfeld:
- Replace prandom_u32_max() and various open-coded variants of it,
there is now a new family of functions that uses fast rejection
sampling to choose properly uniformly random numbers within an
interval:
get_random_u32_below(ceil) - [0, ceil)
get_random_u32_above(floor) - (floor, U32_MAX]
get_random_u32_inclusive(floor, ceil) - [floor, ceil]
Coccinelle was used to convert all current users of
prandom_u32_max(), as well as many open-coded patterns, resulting in
improvements throughout the tree. (A short usage sketch of the new
helpers follows after this list.)
I'll have a "late" 6.1-rc1 pull for you that removes the now unused
prandom_u32_max() function, just in case any other trees add a new
use case of it that needs to be converted. According to linux-next,
there may be two trivial cases of prandom_u32_max() reintroductions
that are fixable with a 's/.../.../'. So I'll have a final
conversion patch for you doing that alongside the removal patch
during the second week.
This is a treewide change that touches many files throughout.
- More consistent use of get_random_canary().
- Updates to comments, documentation, tests, headers, and
simplification in configuration.
- The arch_get_random*_early() abstraction was only used by arm64 and
wasn't entirely useful, so this has been replaced by code that works
in all relevant contexts.
- The kernel will use and manage random seeds in non-volatile EFI
variables, refreshing a variable with a fresh seed when the RNG is
initialized. The RNG GUID namespace is then hidden from efivarfs to
prevent accidental leakage.
These changes are split into random.c infrastructure code used in the
EFI subsystem, in this pull request, and related support inside of
EFISTUB, in Ard's EFI tree. These are co-dependent for full
functionality, but the order of merging doesn't matter.
- Part of the infrastructure added for the EFI support is also used for
an improvement to the way vsprintf initializes its siphash key,
replacing a sleep loop wart.
- The hardware RNG framework now always calls its correct random.c
input function, add_hwgenerator_randomness(), rather than sometimes
going through helpers better suited for other cases.
- The add_latent_entropy() function has long been called from the fork
handler, but is a no-op when the latent entropy gcc plugin isn't
used, which is fine for the purposes of latent entropy.
But it was missing out on the cycle counter that was also being mixed
in beside the latent entropy variable. So now, if the latent entropy
gcc plugin isn't enabled, add_latent_entropy() will expand to a call
to add_device_randomness(NULL, 0), which adds a cycle counter,
without the absent latent entropy variable.
- The RNG is now reseeded from a delayed worker, rather than on demand
when used. Always running from a worker allows it to make use of the
CPU RNG on platforms like S390x, whose instructions are too slow to
do so from interrupts. It also has the effect of adding in new inputs
more frequently with more regularity, amounting to a long term
transcript of random values. Plus, it helps a bit with the upcoming
vDSO implementation (which isn't yet ready for 6.2).
- The jitter entropy algorithm now tries to execute on many different
CPUs, round-robining, in hopes of hitting even more memory latencies
and other unpredictable effects. It also will mix in a cycle counter
when the entropy timer fires, in addition to being mixed in from the
main loop, to account more explicitly for fluctuations in that timer
firing. And the state it touches is now kept within the same cache
line, so that it's assured that the different execution contexts will
cause latencies.
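A brief usage sketch of the three new interval helpers; the variable
names below are illustrative assumptions, not taken from any converted
caller:

#include <linux/printk.h>
#include <linux/random.h>

static void random_interval_examples(void)
{
	/* uniform in [0, 10): e.g. a valid index into a 10-entry array */
	u32 idx = get_random_u32_below(10);

	/* uniform in (100, U32_MAX]: strictly greater than 100 */
	u32 above = get_random_u32_above(100);

	/* uniform in [1024, 65535], both endpoints included */
	u32 port = get_random_u32_inclusive(1024, 65535);

	pr_info("idx=%u above=%u port=%u\n", idx, above, port);
}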
* tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (23 commits)
random: include <linux/once.h> in the right header
random: align entropy_timer_state to cache line
random: mix in cycle counter when jitter timer fires
random: spread out jitter callback to different CPUs
random: remove extraneous period and add a missing one in comments
efi: random: refresh non-volatile random seed when RNG is initialized
vsprintf: initialize siphash key using notifier
random: add back async readiness notifier
random: reseed in delayed work rather than on-demand
random: always mix cycle counter in add_latent_entropy()
hw_random: use add_hwgenerator_randomness() for early entropy
random: modernize documentation comment on get_random_bytes()
random: adjust comment to account for removed function
random: remove early archrandom abstraction
random: use random.trust_{bootloader,cpu} command line option only
stackprotector: actually use get_random_canary()
stackprotector: move get_random_canary() into stackprotector.h
treewide: use get_random_u32_inclusive() when possible
treewide: use get_random_u32_{above,below}() instead of manual loop
treewide: use get_random_u32_below() instead of deprecated function
...
Use the newly-added function kunit_get_current_test() instead of
accessing current->kunit_test directly. This function uses a static key
to return more quickly when KUnit is enabled, but no tests are actively
running. There should therefore be a negligible performance impact to
enabling the slub KUnit tests.
Other than the performance improvement, this should be a no-op.
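A hedged sketch of the pattern this enables; the helper below is
illustrative and not the exact slub test hook:

#include <kunit/test-bug.h>

static bool slab_in_kunit_test(void)
{
	/*
	 * kunit_get_current_test() is a static-key-guarded read of
	 * current->kunit_test, so this is nearly free when KUnit is
	 * built in but no test is running.
	 */
	return kunit_get_current_test() != NULL;
}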
Cc: Oliver Glitta <glittao@gmail.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Gow <davidgow@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Merge my series [1] to deprecate the SLOB allocator.
- Renames CONFIG_SLOB to CONFIG_SLOB_DEPRECATED with deprecation notice.
- The recommended replacement is CONFIG_SLUB, optionally with the new
CONFIG_SLUB_TINY tweaks for systems with 16MB or less RAM.
- Use cases that stopped working with CONFIG_SLUB_TINY instead of SLOB
  should be reported to linux-mm@kvack.org and the slab maintainers,
  otherwise SLOB will be removed in a few cycles.
[1] https://lore.kernel.org/all/20221121171202.22080-1-vbabka@suse.cz/
SLUB gets most of its scalability by percpu slabs. However for
CONFIG_SLUB_TINY the goal is minimal memory overhead, not scalability.
Thus, #ifdef out the whole kmem_cache_cpu percpu structure and
associated code. In addition to the slab page savings, this reduces
percpu allocator usage and code size.
This change builds on recent commit c7323a5ad0 ("mm/slub: restrict
sysfs validation to debug caches and make it safe"), as caches with
debugging enabled also avoid percpu slabs, so all allocation and
freeing ends up working with the partial list. With a bit more
refactoring by the preceding patches, CONFIG_SLUB_TINY can use the
same code paths.
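The shape of the change, as a hedged sketch with the field list
abbreviated (the real structure has more members and config-dependent
fields):

#ifndef CONFIG_SLUB_TINY
/*
 * Per-CPU slab state; compiled out entirely for CONFIG_SLUB_TINY,
 * where every allocation goes through the partial-list slow paths.
 */
struct kmem_cache_cpu {
	void **freelist;	/* Pointer to next available object */
	unsigned long tid;	/* Globally unique transaction id */
	struct slab *slab;	/* The slab from which we are allocating */
	local_lock_t lock;	/* Protects the fields above */
};
#endif /* CONFIG_SLUB_TINY */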
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
In the following patch we want to introduce CONFIG_SLUB_TINY allocation
paths that don't use the percpu slab. To prepare, refactor the
allocation functions:
Split out __slab_alloc_node() from slab_alloc_node() where the former
does the actual allocation and the latter calls the pre/post hooks.
Analogously, split out __kmem_cache_alloc_bulk() from
kmem_cache_alloc_bulk().
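A hedged sketch of the resulting split; the prototypes are simplified
and the hook comments are stand-ins for the real pre/post hook calls:

/* does the actual allocation (fastpath or slowpath), no hooks */
static void *__slab_alloc_node(struct kmem_cache *s, gfp_t gfpflags,
			       int node, unsigned long addr, size_t orig_size);

static __always_inline void *slab_alloc_node(struct kmem_cache *s,
		gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)
{
	void *object;

	/* pre-alloc hooks (memcg charging, fault injection, ...) go here */

	object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);

	/* post-alloc hooks (init, kasan/kmsan, tracing, ...) go here */

	return object;
}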
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Since commit c7323a5ad0 ("mm/slub: restrict sysfs validation to debug
caches and make it safe"), caches with debugging enabled use the
free_debug_processing() function to do both freeing checks and actual
freeing to partial list under list_lock, bypassing the fast paths.
We will want to use the same path for CONFIG_SLUB_TINY, but without the
debugging checks, so refactor the code so that free_debug_processing()
does only the checks, while the freeing is handled by a new function
free_to_partial_list().
For consistency, change the return type of alloc_debug_processing() from
int to bool and correct the !SLUB_DEBUG variant to return true and not
false. This didn't matter until now, but will in the following changes.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
With CONFIG_SLUB_TINY we want to minimize memory overhead. By lowering
the default slub_max_order we can make slab allocations use smaller
pages. However depending on object sizes, order-0 might not be the best
due to increased fragmentation. When testing on an 8MB RAM k210 system by
Damien Le Moal [1], slub_max_order=1 had the best results, so use that
as the default for CONFIG_SLUB_TINY.
[1] https://lore.kernel.org/all/6a1883c4-4c3f-545a-90e8-2cd805bcf4ae@opensource.wdc.com/
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
SLUB will leave a number of slabs on the partial list even if they are
empty, to avoid some slab freeing and reallocation. The goal of
CONFIG_SLUB_TINY is to minimize memory overhead, so set the limits to 0
for immediate slab page freeing.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Currently SLUB enables its sysfs support depending unconditionally on
the general CONFIG_SYSFS setting. To reduce the configuration
combination space, make CONFIG_SLUB_TINY disable SLUB's sysfs support by
reusing the existing SLAB_SUPPORTS_SYSFS define. It is unlikely that
real tiny systems would combine CONFIG_SLUB_TINY with CONFIG_SYSFS, but
a randconfig might.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
With CONFIG_HARDENED_USERCOPY not enabled, there are no
__check_heap_object() checks happening that would use the struct
kmem_cache useroffset and usersize fields. Yet the fields are still
initialized, preventing merging of otherwise compatible caches.
Also the fields contribute to struct kmem_cache size unnecessarily when
unused. Thus #ifdef them out completely when CONFIG_HARDENED_USERCOPY is
disabled. In kmem_dump_obj() print object_size instead of usersize, as
that's actually the intention.
In a quick virtme boot test, this has reduced the number of caches in
/proc/slabinfo from 131 to 111.
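A heavily trimmed sketch of the change to struct kmem_cache (other
fields omitted):

struct kmem_cache {
	/* ... many other fields ... */
#ifdef CONFIG_HARDENED_USERCOPY
	unsigned int useroffset;	/* Usercopy region offset */
	unsigned int usersize;		/* Usercopy region size */
#endif
	/* ... */
};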
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
kmalloc() redzone improvements by Feng Tang
From cover letter [1]:
kmalloc's API family is critical for mm, and one of its characteristics
is that it rounds up the request size to a fixed one (mostly a power
of 2). When a user requests memory for '2^n + 1' bytes, 2^(n+1) bytes
can actually be allocated, so there is extra space beyond what was
originally requested.
This patchset tries to extend the redzone sanity check to that extra
kmalloc'ed space beyond the request, to better detect illegitimate
access to it. (depends on SLAB_STORE_USER & SLAB_RED_ZONE)
[1] https://lore.kernel.org/all/20221021032405.1825078-1-feng.tang@intel.com/
A series by myself to reorder fields in struct slab to allow the
embedded rcu_head to grow (for debugging purposes). Requires changes to
isolate_movable_page() to skip slab pages which can otherwise become
false-positive __PageMovable due to the slab's use of low bits in
page->mapping.
- Two patches for SLUB's sysfs by Rasmus Villemoes to remove dead code
and optimize boot time with late initialization.
- Allow SLUB's sysfs 'failslab' parameter to be runtime-controllable
again as it can be both useful and safe, by Alexander Atanasov.
In the next commit we want to rearrange struct slab fields to allow a larger
rcu_head. Afterwards, the page->mapping field will overlap with SLUB's "struct
list_head slab_list", where the value of the prev pointer can become LIST_POISON2,
which is 0x122 + POISON_POINTER_DELTA. Unfortunately, bit 1 being set can make
PageMovable() return a false positive and cause a GPF, as reported by lkp [1].
To fix this, make isolate_movable_page() skip pages with the PageSlab flag set.
This is a bit tricky as we need to add memory barriers to SLAB and SLUB's page
allocation and freeing, and their counterparts to isolate_movable_page().
Based on my RFC from [2]. Added a comment update from Matthew's variant in [3]
and, as done there, moved the PageSlab checks to happen before trying to take
the page lock.
[1] https://lore.kernel.org/all/208c1757-5edd-fd42-67d4-1940cc43b50f@intel.com/
[2] https://lore.kernel.org/all/aec59f53-0e53-1736-5932-25407125d4d4@suse.cz/
[3] https://lore.kernel.org/all/YzsVM8eToHUeTP75@casper.infradead.org/
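An excerpt-style sketch of the ordering in isolate_movable_page();
treat the exact barriers and labels as approximate rather than the
merged diff:

	/* before trying to take the page lock */
	if (unlikely(PageSlab(page)))
		goto out;
	/* pairs with a write barrier in the slab page allocation path */
	smp_rmb();
	if (unlikely(!__PageMovable(page)))
		goto out;
	/* the PageSlab check is repeated after get_page_unless_zero() */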
Reported-by: kernel test robot <yujie.liu@intel.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
The SLUB allocator relies on the percpu allocator to initialize its
->cpu_slab during early boot. For that, the dynamic chunk of percpu
memory which serves the early allocation needs to be large enough to
satisfy the kmalloc cache creation.
However, the current BUILD_BUG_ON() in alloc_kmem_cache_cpus() doesn't
take the NR_KMALLOC_TYPES dimension of the kmalloc array into account.
Fix that with the correct calculation.
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
This is a simple mechanical transformation done by:
@@
expression E;
@@
- prandom_u32_max
+ get_random_u32_below
(E)
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Reviewed-by: SeongJae Park <sj@kernel.org> # for damon
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
kmalloc will round up the request size to a fixed size (mostly a power
of 2), so there can be extra space beyond what is requested, whose
size is the actual buffer size minus the original request size.
To better detect out-of-bound access or abuse of this space, add a
redzone sanity check for it.
In the current kernel, some kmalloc users already know about the
existence of this space and utilize it after calling 'ksize()' to learn
the real size of the allocated buffer. So skip the sanity check for
objects on which ksize() has been called, treating them as legitimate
users. Kees Cook is working on sanitizing all these use cases by using
kmalloc_size_roundup() to avoid ambiguous usages. And after that is
done, this special handling for ksize() can be removed.
In some cases, the free pointer can be saved inside the latter part of
the object data area, which may overlap the redzone part (for small
kmalloc objects). As suggested by Hyeonggon Yoo, force the free
pointer into the metadata area when kmalloc redzone debugging is
enabled, so that all kmalloc objects are covered by the redzone check.
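A hedged, conceptual sketch of the new check; the series actually
folds this into the existing check_object() reporting path, and the
helper below is only illustrative:

#include <linux/poison.h>

/* verify the bytes between the requested and the rounded-up size */
static bool kmalloc_redzone_ok(struct kmem_cache *s, void *object,
			       unsigned int orig_size)
{
	u8 *p = object;
	unsigned int i;

	/*
	 * Everything past the original request, up to object_size, is
	 * expected to still hold the SLUB_RED_ACTIVE pattern written
	 * at allocation time.
	 */
	for (i = orig_size; i < s->object_size; i++)
		if (p[i] != SLUB_RED_ACTIVE)
			return false;

	return true;
}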
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
When kasan is enabled for slab/slub, it may save kasan's free_meta
data in the first part of the slab object data area on the object's
free path, which works fine.
There is an ongoing effort to extend slub's debug functionality to
redzone the latter part of the kmalloc object area, and when both
debug features are enabled there is a possible conflict, especially
when the kmalloc object is small, as caught by the 0Day bot [1].
To solve it, the slub code needs to know the size of kasan's in-object
metadata. The existing kasan_metadata_size() returns kasan's metadata
size inside slub's metadata area, so extend it to also cover the
in-object metadata size by adding a boolean flag 'in_object'.
There is no functional change to existing code logic.
[1]. https://lore.kernel.org/lkml/YuYm3dWwpZwH58Hu@xsang-OptiPlex-9020/
Reported-by: kernel test robot <oliver.sang@intel.com>
Suggested-by: Andrey Konovalov <andreyknvl@gmail.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
kzalloc/kmalloc will round up the request size to a fixed size
(mostly a power of 2), so the allocated memory can be more than
requested. Currently the kzalloc family APIs will zero all of the
allocated memory.
To detect out-of-bound usage of the extra allocated memory, only
zero the requested part, so that a redzone sanity check can be
added to the extra space later.
For kzalloc users who will call ksize() later and utilize this
extra space, please be aware that the space is no longer zeroed
when debugging is enabled. (Thanks to Kees Cook's effort to
sanitize all ksize() use cases [1], this won't be a big issue.)
[1]. https://lore.kernel.org/all/20220922031013.2150682-1-keescook@chromium.org/#r
Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Commit 76ae847497 ("Documentation: raise minimum supported version of
GCC to 5.1") updated the minimum gcc version to 5.1. So the problem
mentioned in f02c696800 ("include/linux/memory.h: implement
register_hotmemory_notifier()") no longer exists, so we can now switch
to using hotplug_memory_notifier() directly rather than
register_hotmemory_notifier().
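The conversion pattern for one caller, as a hedged sketch (the callback
and init function names here are placeholders, and 0 stands in for
whatever priority a given subsystem uses):

#include <linux/memory.h>
#include <linux/notifier.h>

static int example_memory_callback(struct notifier_block *self,
				   unsigned long action, void *arg)
{
	return NOTIFY_OK;
}

static int __init example_init(void)
{
	/* replaces a notifier_block + register_hotmemory_notifier() pair */
	return hotplug_memory_notifier(example_memory_callback, 0);
}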
Link: https://lkml.kernel.org/r/20220923033347.3935160-4-liushixin2@huawei.com
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Waiman Long <longman@redhat.com>
Cc: zefan li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
For SLAB_TYPESAFE_BY_RCU caches we use call_rcu to perform empty slab
freeing. The rcu callback rcu_free_slab() calls __free_slab(), which
currently includes checking the slab consistency for caches with the
SLAB_CONSISTENCY_CHECKS flag. This check needs the slab->objects field
to be intact.
Because in the next patch we want to allow rcu_head in struct slab to
become larger in debug configurations and thus potentially overwrite
more fields through a union than slab_list, we want to limit the fields
used in rcu_free_slab(). Thus move the consistency checks to
free_slab() before call_rcu(). This can be done safely even for
SLAB_TYPESAFE_BY_RCU caches where accesses to the objects can still
occur after freeing them.
As a result, only the slab->slab_cache field has to be physically
separate from rcu_head for the freeing callback to work. We also save
some cycles in the rcu callback for caches with consistency checks
enabled.
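A hedged sketch of the resulting free_slab(); details may differ
slightly from the merged patch:

static void free_slab(struct kmem_cache *s, struct slab *slab)
{
	if (kmem_cache_debug_flags(s, SLAB_CONSISTENCY_CHECKS)) {
		void *p;

		slab_pad_check(s, slab);
		for_each_object(p, s, slab_address(slab), slab->objects)
			check_object(s, slab, p, SLUB_RED_INACTIVE);
	}

	if (unlikely(s->flags & SLAB_TYPESAFE_BY_RCU))
		call_rcu(&slab->rcu_head, rcu_free_slab);
	else
		__free_slab(s, slab);
}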
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
After commit c7323a5ad0 ("mm/slub: restrict sysfs validation to debug
caches and make it safe"), SLUB never installs percpu slab for debug caches
and thus never deactivates percpu slab for them.
Since only debug caches use the full list, SLUB no longer deactivates
slabs to the full list. Remove the dead code in deactivate_slab().
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
In commit 060807f841 ("mm, slub: make remaining slub_debug related
attributes read-only"), failslab was made read-only.
I think it became a collateral victim to the two other options for which
the reasons are perfectly valid.
Here is why:
- sanity_checks and trace are slab internal debug options,
failslab is used for fault injection.
- for fault injections, which by presumption are random, it
does not matter if it is not set atomically. And you need to
set at least one more option to trigger fault injection.
- in a testing scenario you may need to change it at runtime.
Example: module loading - you test all allocations limited
by the space option, then you move to test only your module's
own slabs.
- when set by command line flags it effectively disables all
cache merges.
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Jann Horn <jannh@google.com>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Link: http://lkml.kernel.org/r/20200610163135.17364-5-vbabka@suse.cz
Signed-off-by: Alexander Atanasov <alexander.atanasov@virtuozzo.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Currently, slab_sysfs_init() is an __initcall aka device_initcall. It
is rather time-consuming; on my board it takes around 11ms. That's
about 1% of the time budget I have from when U-Boot lets go until
Linux must assume responsibility for keeping the external watchdog
happy.
There's no particular reason this would need to run at device_initcall
time, so instead make it a late_initcall to allow vital functionality
to get started a bit sooner.
This actually ends up winning more than just those 11ms, because the
slab caches that get created during other device_initcalls (and before
my watchdog device gets probed) now don't end up doing the somewhat
expensive sysfs_slab_add() themselves. Some example lines (with
initcall_debug set) before/after:
initcall ext4_init_fs+0x0/0x1ac returned 0 after 1386 usecs
initcall journal_init+0x0/0x138 returned 0 after 517 usecs
initcall init_fat_fs+0x0/0x68 returned 0 after 294 usecs
initcall ext4_init_fs+0x0/0x1ac returned 0 after 240 usecs
initcall journal_init+0x0/0x138 returned 0 after 32 usecs
initcall init_fat_fs+0x0/0x68 returned 0 after 18 usecs
Altogether, this means I now get to pet the watchdog around 17ms
sooner. [Of course, the time the other initcalls save is instead spent
in slab_sysfs_init(), which goes from 11ms to 16ms, so there's no
overall change in boot time.]
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
The function sysfs_slab_add() has two callers:
One is slab_sysfs_init(), which first initializes slab_kset, and only
when that succeeds sets slab_state to FULL, and then proceeds to call
sysfs_slab_add() for all previously created slabs.
The other is __kmem_cache_create(), but only after a
if (slab_state <= UP)
return 0;
check.
So in other words, sysfs_slab_add() is never called without
slab_kset (aka the return value of cache_kset()) being non-NULL.
And this is just as well, because if we ever did take this path and
called kobject_init(&s->kobj), and then later, when called again from
slab_sysfs_init(), ended up calling kobject_init_and_add(), we would
hit
if (kobj->state_initialized) {
	/* do not error out as sometimes we can recover */
	pr_err("kobject (%p): tried to init an initialized object, something is seriously wrong.\n",
	       kobj);
	dump_stack();
}
in kobject.c.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Rather than incurring a division or requesting too many random bytes for
the given range, use the prandom_u32_max() function, which only takes
the minimum required bytes from the RNG and avoids divisions. This was
done mechanically with this coccinelle script:
@basic@
expression E;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
typedef u64;
@@
(
- ((T)get_random_u32() % (E))
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ((E) - 1))
+ prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2)
|
- ((u64)(E) * get_random_u32() >> 32)
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ~PAGE_MASK)
+ prandom_u32_max(PAGE_SIZE)
)
@multi_line@
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
identifier RAND;
expression E;
@@
- RAND = get_random_u32();
... when != RAND
- RAND %= (E);
+ RAND = prandom_u32_max(E);
// Find a potential literal
@literal_mask@
expression LITERAL;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
position p;
@@
((T)get_random_u32()@p & (LITERAL))
// Add one to the literal.
@script:python add_one@
literal << literal_mask.LITERAL;
RESULT;
@@
value = None
if literal.startswith('0x'):
	value = int(literal, 16)
elif literal[0] in '123456789':
	value = int(literal, 10)
if value is None:
	print("I don't know how to handle %s" % (literal))
	cocci.include_match(False)
elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1:
	print("Skipping 0x%x for cleanup elsewhere" % (value))
	cocci.include_match(False)
elif value & (value + 1) != 0:
	print("Skipping 0x%x because it's not a power of two minus one" % (value))
	cocci.include_match(False)
elif literal.startswith('0x'):
	coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1))
else:
	coccinelle.RESULT = cocci.make_expr("%d" % (value + 1))
// Replace the literal mask with the calculated result.
@plus_one@
expression literal_mask.LITERAL;
position literal_mask.p;
expression add_one.RESULT;
identifier FUNC;
@@
- (FUNC()@p & (LITERAL))
+ prandom_u32_max(RESULT)
@collapse_ret@
type T;
identifier VAR;
expression E;
@@
{
- T VAR;
- VAR = (E);
- return VAR;
+ return E;
}
@drop_var@
type T;
identifier VAR;
@@
{
- T VAR;
... when != VAR
}
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: KP Singh <kpsingh@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap
Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
linux-next for a couple of months without, to my knowledge, any
negative reports (or any positive ones, come to that).
- Also the Maple Tree from Liam Howlett. An overlapping range-based
tree for vmas. It is apparently slightly more efficient in its own
right, but is mainly targeted at enabling work to reduce mmap_lock
contention.
Liam has identified a number of other tree users in the kernel which
could be beneficially converted to maple trees.
Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
at [1]. This has yet to be addressed due to Liam's unfortunately
timed vacation. He is now back and we'll get this fixed up.
- Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
clang-generated instrumentation to detect use-of-uninitialized-memory
bugs down to the single bit level.
KMSAN keeps finding bugs. New ones, as well as the legacy ones.
- Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
memory into THPs.
- Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to
support file/shmem-backed pages.
- userfaultfd updates from Axel Rasmussen
- zsmalloc cleanups from Alexey Romanov
- cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and
memory-failure
- Huang Ying adds enhancements to NUMA balancing memory tiering mode's
page promotion, with a new way of detecting hot pages.
- memcg updates from Shakeel Butt: charging optimizations and reduced
memory consumption.
- memcg cleanups from Kairui Song.
- memcg fixes and cleanups from Johannes Weiner.
- Vishal Moola provides more folio conversions
- Zhang Yi removed ll_rw_block() :(
- migration enhancements from Peter Xu
- migration error-path bugfixes from Huang Ying
- Aneesh Kumar added ability for a device driver to alter the memory
tiering promotion paths. For optimizations by PMEM drivers, DRM
drivers, etc.
- vma merging improvements from Jakub Matěn.
- NUMA hinting cleanups from David Hildenbrand.
- xu xin added additional userspace visibility into KSM merging
activity.
- THP & KSM code consolidation from Qi Zheng.
- more folio work from Matthew Wilcox.
- KASAN updates from Andrey Konovalov.
- DAMON cleanups from Kaixu Xia.
- DAMON work from SeongJae Park: fixes, cleanups.
- hugetlb sysfs cleanups from Muchun Song.
- Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.
Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1]
* tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits)
hugetlb: allocate vma lock for all sharable vmas
hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer
hugetlb: fix vma lock handling during split vma and range unmapping
mglru: mm/vmscan.c: fix imprecise comments
mm/mglru: don't sync disk for each aging cycle
mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol
mm: memcontrol: use do_memsw_account() in a few more places
mm: memcontrol: deprecate swapaccounting=0 mode
mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled
mm/secretmem: remove reduntant return value
mm/hugetlb: add available_huge_pages() func
mm: remove unused inline functions from include/linux/mm_inline.h
selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
selftests/vm: add thp collapse shmem testing
selftests/vm: add thp collapse file and tmpfs testing
selftests/vm: modularize thp collapse memory operations
selftests/vm: dedup THP helpers
mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
mm/madvise: add file and shmem support to MADV_COLLAPSE
...
In order to report uninitialized memory coming from heap allocations, KMSAN
has to poison them unless they're created with __GFP_ZERO.
Conveniently, the KMSAN hooks are needed in the same places where the
init_on_alloc/init_on_free initialization is performed.
In addition, we apply __no_kmsan_checks to get_freepointer_safe() to
suppress reports when accessing freelist pointers that reside in freed
objects.
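A hedged, excerpt-style sketch of where the hooks land; the real hook
call sites carry more parameters and surrounding logic:

	/* in the common post-allocation hook, next to init_on_alloc zeroing */
	kmsan_slab_alloc(s, object, flags);

	/* in the common pre-free hook, next to init_on_free poisoning */
	kmsan_slab_free(s, object);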
Link: https://lkml.kernel.org/r/20220915150417.722975-16-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When enabling kasan's and kfence's in-kernel kunit tests with slub_debug
on, a problem was caught (in the linux-next tree):
------------[ cut here ]------------
kmem_cache_destroy test: Slab cache still has objects when called from test_exit+0x1a/0x30
WARNING: CPU: 3 PID: 240 at mm/slab_common.c:492 kmem_cache_destroy+0x16c/0x170
Modules linked in:
CPU: 3 PID: 240 Comm: kunit_try_catch Tainted: G B N 6.0.0-rc7-next-20220929 #52
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:kmem_cache_destroy+0x16c/0x170
Code: 41 5c 41 5d e9 a5 04 0b 00 c3 cc cc cc cc 48 8b 55 60 48 8b 4c 24 20 48 c7 c6 40 37 d2 82 48 c7 c7 e8 a0 33 83 e8 4e d7 14 01 <0f> 0b eb a7 41 56 41 89 d6 41 55 49 89 f5 41 54 49 89 fc 55 48 89
RSP: 0000:ffff88800775fea0 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffffffff83bdec48 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 1ffff11000eebf9e RDI: ffffed1000eebfc6
RBP: ffff88804362fa00 R08: ffffffff81182e58 R09: ffff88800775fbdf
R10: ffffed1000eebf7b R11: 0000000000000001 R12: 000000008c800d00
R13: ffff888005e78040 R14: 0000000000000000 R15: ffff888005cdfad0
FS: 0000000000000000(0000) GS:ffff88807ed00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000000360e001 CR4: 0000000000370ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
test_exit+0x1a/0x30
kunit_try_run_case+0xad/0xc0
kunit_generic_run_threadfn_adapter+0x26/0x50
kthread+0x17b/0x1b0
It was bisected to commit c7323a5ad0 ("mm/slub: restrict sysfs
validation to debug caches and make it safe")
The problem is inside free_debug_processing(): under certain
circumstances the slab can be removed from the partial list but not
freed by discard_slab(), and thus n->nr_slabs is not decreased
accordingly. During shutdown, this non-zero n->nr_slabs is detected and
reported.
Specifically, the problem is that there are two checks for detecting a
full partial list by comparing n->nr_partial >= s->min_partial, where
the latter check is affected by remove_partial() decreasing
n->nr_partial between the checks. Reorganize the code so there is a
single check upfront.
Link: https://lore.kernel.org/all/20220930100730.250248-1-feng.tang@intel.com/
Fixes: c7323a5ad0 ("mm/slub: restrict sysfs validation to debug caches and make it safe")
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
A patch from Feng Tang that enhances the existing debugfs alloc_traces
file for kmalloc caches with information about how much space is wasted
by allocations that needs less space than the particular kmalloc cache
provides.
As Christophe JAILLET suggested [1]
In create_unique_id(),
"looks that ID_STR_LENGTH could even be reduced to 32 or 16.
The 2nd BUG_ON at the end of the function could certainly be just
removed as well or remplaced by a:
if (p > name + ID_STR_LENGTH - 1) {
kfree(name);
return -E<something>;
}
"
According to the above suggestion, let's do the below cleanups:
1. reduce ID_STR_LENGTH to 32, as the buffer size should be enough;
2. use WARN_ON() instead of BUG_ON() and return an error if the check
condition is true;
3. use snprintf() instead of sprintf() to avoid overflow.
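A hedged sketch of the reworked tail of create_unique_id(); the exact
error value is an assumption:

	p += snprintf(p, ID_STR_LENGTH - (p - name), "%07u", s->size);

	/* was: BUG_ON(p > name + ID_STR_LENGTH - 1); */
	if (WARN_ON(p > name + ID_STR_LENGTH - 1)) {
		kfree(name);
		return ERR_PTR(-EINVAL);	/* assumed error code */
	}

	return name;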
[1] https://lore.kernel.org/linux-mm/2025305d-16db-abdf-6cd3-1fb93371c2b4@wanadoo.fr/
Suggested-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
kmalloc's API family is critical for mm, and one of its characteristics
is that it will round up the request size to a fixed one (mostly a
power of 2). Say a user requests memory for '2^n + 1' bytes; then
2^(n+1) bytes can actually be allocated, so in the worst case around
50% of the memory space is wasted.
The wastage is not a big issue for requests that get allocated/freed
quickly, but may cause problems for objects that have a longer
lifetime.
We've met a kernel boot OOM panic (v5.10), and from the dumped slab
info:
[ 26.062145] kmalloc-2k 814056KB 814056KB
From debugging we found a huge number of 'struct iova_magazine' objects,
whose size is 1032 bytes (1024 + 8), so each allocation wastes
1016 bytes. Though the issue was solved by providing the right (bigger)
amount of RAM, it is still nice to optimize the size (either use a
kmalloc-friendly size or create a dedicated slab for it).
And from the lkml archive, there was another crash kernel OOM case [1]
back in 2019 which seems to be related to a similar slab waste
situation, as the log is similar:
[ 4.332648] iommu: Adding device 0000:20:02.0 to group 16
[ 4.338946] swapper/0 invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, oom_score_adj=0
...
[ 4.857565] kmalloc-2048 59164KB 59164KB
The crash kernel only has 256M memory, and 59M is pretty big here.
(Note: the related code has been changed and optimised in recent
kernel [2], these logs are just picked to demo the problem, also
a patch changing its size to 1024 bytes has been merged)
So add a way to track each kmalloc's memory waste info, and
leverage the existing SLUB debug framework (specifically
SLAB_STORE_USER) to show the call stack of the original allocation,
so that users can evaluate the waste situation, identify hot
spots and optimize accordingly, for better utilization of memory.
The waste info is integrated into existing interface:
'/sys/kernel/debug/slab/kmalloc-xx/alloc_traces', one example of
'kmalloc-4k' after boot is:
126 ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe] waste=233856/1856 age=280763/281414/282065 pid=1330 cpus=32 nodes=1
__kmem_cache_alloc_node+0x11f/0x4e0
__kmalloc_node+0x4e/0x140
ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe]
ixgbe_init_interrupt_scheme+0x2ae/0xc90 [ixgbe]
ixgbe_probe+0x165f/0x1d20 [ixgbe]
local_pci_probe+0x78/0xc0
work_for_cpu_fn+0x26/0x40
...
which means in 'kmalloc-4k' slab, there are 126 requests of
2240 bytes which got a 4KB space (wasting 1856 bytes each
and 233856 bytes in total), from ixgbe_alloc_q_vector().
And when the system starts some real workload like multiple docker
instances, there could be more severe waste.
[1]. https://lkml.org/lkml/2019/8/12/266
[2]. https://lore.kernel.org/lkml/2920df89-9975-5785-f79b-257d3052dfaf@huawei.com/
[Thanks Hyeonggon for pointing out several bugs about sorting/format]
[Thanks Vlastimil for suggesting way to reduce memory usage of
orig_size and keep it only for kmalloc objects]
Signed-off-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
My series [1] to fix validation races for caches with debugging enabled.
By decoupling the debug cache operations more from the non-debug
fastpaths, additional locking simplifications were possible and were
done afterwards.
Additional cleanup of PREEMPT_RT specific code on top, by Thomas Gleixner.
[1] https://lore.kernel.org/all/20220823170400.26546-1-vbabka@suse.cz/
The "common kmalloc v4" series [1] by Hyeonggon Yoo.
- Improves the mm/slab_common.c wrappers to allow deleting duplicated
code between SLAB and SLUB.
- Large kmalloc() allocations in SLAB are passed to the page allocator
  like in SLUB, reducing the number of kmalloc caches.
- Removes the {kmem_cache_alloc,kmalloc}_node variants of tracepoints,
  with the node id parameter added to the non-_node variants.
- 8 files changed, 341 insertions(+), 651 deletions(-)
[1] https://lore.kernel.org/all/20220817101826.236819-1-42.hyeyoo@gmail.com/
--
Merge resolves trivial conflict in mm/slub.c with commit 5373b8a09d
("kasan: call kasan_malloc() from __kmalloc_*track_caller()")
Commit 5a836bf6b0 ("mm: slub: move flush_cpu_slab() invocations
__free_slab() invocations out of IRQ context") moved all flush_cpu_slab()
invocations to the global workqueue to avoid a problem related
to deactivate_slab()/__free_slab() being called from an IRQ context
on PREEMPT_RT kernels.
When the flush_all_cpus_locked() function is called from a task context,
it may happen that a workqueue with the WQ_MEM_RECLAIM bit set ends up
flushing the global workqueue, which causes a dependency issue.
workqueue: WQ_MEM_RECLAIM nvme-delete-wq:nvme_delete_ctrl_work [nvme_core]
is flushing !WQ_MEM_RECLAIM events:flush_cpu_slab
WARNING: CPU: 37 PID: 410 at kernel/workqueue.c:2637
check_flush_dependency+0x10a/0x120
Workqueue: nvme-delete-wq nvme_delete_ctrl_work [nvme_core]
RIP: 0010:check_flush_dependency+0x10a/0x120
[ 453.262125] Call Trace:
__flush_work.isra.0+0xbf/0x220
? __queue_work+0x1dc/0x420
flush_all_cpus_locked+0xfb/0x120
__kmem_cache_shutdown+0x2b/0x320
kmem_cache_destroy+0x49/0x100
bioset_exit+0x143/0x190
blk_release_queue+0xb9/0x100
kobject_cleanup+0x37/0x130
nvme_fc_ctrl_free+0xc6/0x150 [nvme_fc]
nvme_free_ctrl+0x1ac/0x2b0 [nvme_core]
Fix this bug by creating a workqueue for the flush operation with
the WQ_MEM_RECLAIM bit set.
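One possible shape of the fix, as a hedged sketch (the workqueue name
is an assumption); flush_all_cpus_locked() then queues its per-cpu work
on this queue via queue_work_on() instead of using schedule_work_on():

static struct workqueue_struct *flushwq;

void __init kmem_cache_init_late(void)
{
	/*
	 * A dedicated WQ_MEM_RECLAIM workqueue: work queued here may be
	 * flushed from other WQ_MEM_RECLAIM workqueues without tripping
	 * check_flush_dependency().
	 */
	flushwq = alloc_workqueue("slub_flushwq", WQ_MEM_RECLAIM, 0);
	WARN_ON(!flushwq);
}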
Fixes: 5a836bf6b0 ("mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context")
Cc: <stable@vger.kernel.org>
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
The slub code already has a few helpers depending on PREEMPT_RT. Add a few
more and get rid of the CONFIG_PREEMPT_RT conditionals all over the place.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-mm@kvack.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
The PREEMPT_RT specific disabling of irqs in __cmpxchg_double_slab()
(through slab_[un]lock()) is unnecessary as bit_spin_lock() disables
preemption and that's sufficient on PREEMPT_RT where no allocation/free
operation is performed in hardirq context and so can't interrupt the
current operation.
That means we no longer need the slab_[un]lock() wrappers, so delete
them and rename the current __slab_[un]lock() to slab_[un]lock().
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
The only remaining user of object_map_lock is list_slab_objects().
Obtaining the lock there used to happen under slab_lock() which implied
disabling irqs on PREEMPT_RT, thus it's a raw_spinlock. With the
slab_lock() removed, we can convert it to a normal spinlock.
Also remove the get_map()/put_map() wrappers as list_slab_objects()
became their only remaining user.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
All alloc and free operations on debug caches are now serialized by
n->list_lock, so we can remove slab_lock() usage in validate_slab()
and list_slab_objects() as those also happen under n->list_lock.
Note the usage in list_slab_objects() could happen even on non-debug
caches, but only during cache shutdown time, so there should not be any
parallel freeing activity anymore. Except for buggy slab users, but in
that case the slab_lock() would not help against the common cmpxchg
based fast paths (in non-debug caches) anyway.
Also adjust documentation comments accordingly.
Suggested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Rongwei Wang reports [1] that cache validation triggered by writing to
/sys/kernel/slab/<cache>/validate is racy against normal cache
operations (e.g. freeing) in a way that can cause false positive
inconsistency reports for caches with debugging enabled. The problem is
that debugging actions that mark object free or active and actual
freelist operations are not atomic, and the validation can see an
inconsistent state.
For caches that do or don't have debugging enabled, additional races
involving n->nr_slabs are possible that result in false reports of wrong
slab counts.
This patch attempts to solve these issues while not adding overhead to
normal (especially fastpath) operations for caches that do not have
debugging enabled. Such overhead would not be justified to make possible
userspace-triggered validation safe. Instead, disable the validation for
caches that don't have debugging enabled and make their sysfs validate
handler return -EINVAL.
For caches that do have debugging enabled, we can instead extend the
existing approach of not using percpu freelists to force all alloc/free
operations to the slow paths, where the debugging flags are checked and
acted upon. There we can adjust the debug-specific paths to increase
n->list_lock coverage against concurrent validation as necessary.
The processing on free in free_debug_processing() already happens under
n->list_lock so we can extend it to actually do the freeing as well and
thus make it atomic against concurrent validation. As observed by
Hyeonggon Yoo, we do not really need to take slab_lock() anymore here
because all paths we could race with are protected by n->list_lock under
the new scheme, so drop its usage here.
The processing on alloc in alloc_debug_processing() currently doesn't
take any locks, but we have to first allocate the object from a slab on
the partial list (as debugging caches have no percpu slabs) and thus
take the n->list_lock anyway. Add a function alloc_single_from_partial()
that grabs just the allocated object instead of the whole freelist, and
does the debug processing. The n->list_lock coverage again makes it
atomic against validation and it is also ultimately more efficient than
the current grabbing of freelist immediately followed by slab
deactivation.
To prevent races on n->nr_slabs updates, make sure that for caches with
debugging enabled, inc_slabs_node() or dec_slabs_node() is called under
n->list_lock. When allocating a new slab for a debug cache, handle the
allocation by a new function alloc_single_from_new_slab() instead of the
current forced deactivation path.
Neither of these changes affect the fast paths at all. The changes in
slow paths are negligible for non-debug caches.
[1] https://lore.kernel.org/all/20220529081535.69275-1-rongwei.wang@linux.alibaba.com/
Reported-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
We were failing to call kasan_malloc() from __kmalloc_*track_caller()
which was causing us to sometimes fail to produce KASAN error reports
for allocations made using e.g. devm_kcalloc(), as the KASAN poison was
not being initialized. Fix it.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Cc: <stable@vger.kernel.org> # 5.15
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
By default a kfence allocation can happen for any slab object whose size
is up to PAGE_SIZE, as long as that allocation is the first allocation
after expiration of the kfence sample interval. But in certain debugging
scenarios we may be interested in debugging corruptions involving some
specific slub objects like dentry or ext4_* etc. In such cases, limiting
kfence to allocations involving only specific slub objects will increase
the probability of catching the issue, since the kfence pool will not be
consumed by other slab objects.
This patch introduces a sysfs interface
'/sys/kernel/slab/<name>/skip_kfence' to disable kfence for specific
slabs. Having the interface work in this way does not impact
current/default behavior of kfence and allows us to use kfence for
specific slabs (when needed) as well. The decision to skip/use kfence is
taken depending on whether kmem_cache.flags has the (newly introduced)
SLAB_SKIP_KFENCE flag set or not.
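A hedged sketch of how the flag gates KFENCE and how the sysfs store
handler flips it (attribute boilerplate trimmed, details approximate):

	/* in kfence's allocation entry point */
	if (s->flags & SLAB_SKIP_KFENCE)
		return NULL;

static ssize_t skip_kfence_store(struct kmem_cache *s,
				 const char *buf, size_t length)
{
	if (buf[0] == '0')
		s->flags &= ~SLAB_SKIP_KFENCE;
	else if (buf[0] == '1')
		s->flags |= SLAB_SKIP_KFENCE;
	else
		return -EINVAL;

	return length;
}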
Link: https://lkml.kernel.org/r/20220814195353.2540848-1-imran.f.khan@oracle.com
Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Marco Elver <elver@google.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>