The commit 4c39529663 adds a warning about duplicate cache names if
CONFIG_DEBUG_VM is selected. These warnings are triggered by the dm-bufio
code. The dm-bufio code allocates a slab cache with each client. It is
not possible to preallocate the caches in the module init function
because the size of auxiliary per-buffer data is not known at this point.
So, this commit changes dm-bufio so that it appends a unique atomic value
to the cache name, to avoid the warnings.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 4c39529663 ("slab: Warn on duplicate cache names when DEBUG_VM=y")
list_entry() will never return a NULL pointer, thus remove the
check.
Signed-off-by: Yuesong Li <liyuesong@vivo.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Some IO will dispatch from kworker with different io_context settings
than the submitting task, we may need to specify a priority to avoid
losing priority.
Add dm_bufio_read_with_ioprio() and dm_bufio_prefetch_with_ioprio()
for use by bufio users to pass an ioprio other than IOPRIO_DEFAULT.
Co-developed-by: Yibin Ding <yibin.ding@unisoc.com>
Signed-off-by: Yibin Ding <yibin.ding@unisoc.com>
Signed-off-by: Hongyu Jin <hongyu.jin@unisoc.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
[snitzer: introduced _with_ioprio() wrappers to reduce churn]
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Some IO will dispatch from kworker with different io_context settings
than the submitting task, we may need to specify a priority to avoid
losing priority.
Add IO priority parameter to dm_io() and update all callers.
Co-developed-by: Yibin Ding <yibin.ding@unisoc.com>
Signed-off-by: Yibin Ding <yibin.ding@unisoc.com>
Signed-off-by: Hongyu Jin <hongyu.jin@unisoc.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
commit 23baf831a3 ("mm, treewide: redefine MAX_ORDER sanely") has
changed the definition of MAX_ORDER to be inclusive. This has caused
issues with code that was not yet upstream and depended on the previous
definition.
To draw attention to the altered meaning of the define, rename MAX_ORDER
to MAX_PAGE_ORDER.
Link: https://lkml.kernel.org/r/20231228144704.14033-2-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
dm-bufio has a no-sleep mode. When activated (with the
DM_BUFIO_CLIENT_NO_SLEEP flag), the bufio client is read-only and we
could call dm_bufio_get from tasklets. This is used by dm-verity.
Unfortunately, commit 450e8dee51 ("dm bufio: improve concurrent IO
performance") broke this and the kernel would warn that cache_get()
was calling down_read() from no-sleeping context. The bug can be
reproduced by using "veritysetup open" with the "--use-tasklets"
flag.
This commit fixes dm-bufio, so that the tasklet mode works again, by
expanding use of the 'no_sleep_enabled' static_key to conditionally
use either a rw_semaphore or rwlock_t (which are colocated in the
buffer_tree structure using a union).
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org # v6.4
Fixes: 450e8dee51 ("dm bufio: improve concurrent IO performance")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
In preparation for implementing lockless slab shrink, use new APIs to
dynamically allocate the dm-bufio shrinker, so that it can be freed
asynchronously via RCU. Then it doesn't need to wait for RCU read-side
critical section when releasing the struct dm_bufio_client.
Link: https://lkml.kernel.org/r/20230911094444.68966-24-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Chandan Babu R <chandan.babu@oracle.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Chuck Lever <cel@kernel.org>
Cc: Coly Li <colyli@suse.de>
Cc: Dai Ngo <Dai.Ngo@oracle.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Airlie <airlied@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Marijn Suijten <marijn.suijten@somainline.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nadav Amit <namit@vmware.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Olga Kornievskaia <kolga@netapp.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sean Paul <sean@poorly.run>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Song Liu <song@kernel.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Yue Hu <huyue2@coolpad.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In the past, the function __vmalloc didn't respect the GFP flags - it
allocated memory with the provided gfp flags, but it allocated page tables
with GFP_KERNEL. This was fixed in commit 451769ebb7 ("mm/vmalloc:
alloc GFP_NO{FS,IO} for vmalloc") so the memalloc_noio_{save,restore}
workaround is no longer needed.
The function kvmalloc didn't like flags different from GFP_KERNEL. This
was fixed in commit a421ef3030 ("mm: allow !GFP_KERNEL allocations
for kvmalloc"), so kvmalloc can now be called with GFP_NOIO.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
As described in commit 8111964f1b ("dm thin: Fix ABBA deadlock between
shrink_slab and dm_pool_abort_metadata"), ABBA deadlocks will be
triggered because shrinker_rwsem currently needs to held by
dm_pool_abort_metadata() as a side-effect of thin-pool metadata
operation failure.
The following three problem scenarios have been noticed:
1) Described by commit 8111964f1b ("dm thin: Fix ABBA deadlock between
shrink_slab and dm_pool_abort_metadata")
2) shrinker_rwsem and throttle->lock
P1(drop cache) P2(kworker)
drop_caches_sysctl_handler
drop_slab
shrink_slab
down_read(&shrinker_rwsem) - LOCK A
do_shrink_slab
super_cache_scan
prune_icache_sb
dispose_list
evict
ext4_evict_inode
ext4_clear_inode
ext4_discard_preallocations
ext4_mb_load_buddy_gfp
ext4_mb_init_cache
ext4_wait_block_bitmap
__ext4_error
ext4_handle_error
ext4_commit_super
...
dm_submit_bio
do_worker
throttle_work_update
down_write(&t->lock) -- LOCK B
process_deferred_bios
commit
metadata_operation_failed
dm_pool_abort_metadata
dm_block_manager_create
dm_bufio_client_create
register_shrinker
down_write(&shrinker_rwsem)
-- LOCK A
thin_map
thin_bio_map
thin_defer_bio_with_throttle
throttle_lock
down_read(&t->lock) - LOCK B
3) shrinker_rwsem and wait_on_buffer
P1(drop cache) P2(kworker)
drop_caches_sysctl_handler
drop_slab
shrink_slab
down_read(&shrinker_rwsem) - LOCK A
do_shrink_slab
...
ext4_wait_block_bitmap
__ext4_error
ext4_handle_error
jbd2_journal_abort
jbd2_journal_update_sb_errno
jbd2_write_superblock
submit_bh
// LOCK B
// RELEASE B
do_worker
throttle_work_update
down_write(&t->lock) - LOCK B
process_deferred_bios
process_bio
commit
metadata_operation_failed
dm_pool_abort_metadata
dm_block_manager_create
dm_bufio_client_create
register_shrinker
register_shrinker_prepared
down_write(&shrinker_rwsem) - LOCK A
bio_endio
wait_on_buffer
__wait_on_buffer
Fix these by resetting dm_bufio_client without holding shrinker_rwsem.
Fixes: 8111964f1b ("dm thin: Fix ABBA deadlock between shrink_slab and dm_pool_abort_metadata")
Cc: stable@vger.kernel.org
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
switching from a user process to a kernel thread.
- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav.
- zsmalloc performance improvements from Sergey Senozhatsky.
- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.
- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page().
- make __filemap_get_folio()'s return value more useful
- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap'.
- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.
- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).
- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
permitting userspace to wr-protect anon memory unpopulated ptes.
- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather
than exclusive, and has fixed a bunch of errors which were caused by its
unintuitive meaning.
- Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.
- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.
- Cleanups to do_fault_around() by Lorenzo Stoakes.
- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.
- Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.
- Lorenzo has also coverted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.
- Lorenzo has also contributed further cleanups of vma_merge().
- Chaitanya Prakash provides some fixes to the mmap selftesting code.
- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_page(), a step towards RCUification of pagefaults.
- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.
- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.
- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.
- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.
- Yosry Ahmed has improved the performance of memcg statistics flushing.
- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.
- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.
- David Hildenbrand has fixed up some issues in the selftest code's
testing of our pte state changing.
- Pankaj Raghav has made page_endio() unneeded and has removed it.
- Peter Xu contributed some rationalizations of the userfaultfd
selftests.
- Yosry Ahmed has fixed an issue around memcg's page recalim accounting.
- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.
- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.
- Peter Xu fixes a few issues with uffd-wp at fork() time.
- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZEr3zQAKCRDdBJ7gKXxA
jlLoAP0fpQBipwFxED0Us4SKQfupV6z4caXNJGPeay7Aj11/kQD/aMRC2uPfgr96
eMG3kwn2pqkB9ST2QpkaRbxA//eMbQY=
=J+Dj
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
switching from a user process to a kernel thread.
- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj
Raghav.
- zsmalloc performance improvements from Sergey Senozhatsky.
- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.
- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page()
- make __filemap_get_folio()'s return value more useful
- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap'.
- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.
- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).
- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
permitting userspace to wr-protect anon memory unpopulated ptes.
- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
rather than exclusive, and has fixed a bunch of errors which were
caused by its unintuitive meaning.
- Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.
- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.
- Cleanups to do_fault_around() by Lorenzo Stoakes.
- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.
- Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.
- Lorenzo has also coverted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.
- Lorenzo has also contributed further cleanups of vma_merge().
- Chaitanya Prakash provides some fixes to the mmap selftesting code.
- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_page(), a step towards RCUification of pagefaults.
- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.
- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.
- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.
- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.
- Yosry Ahmed has improved the performance of memcg statistics
flushing.
- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.
- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.
- David Hildenbrand has fixed up some issues in the selftest code's
testing of our pte state changing.
- Pankaj Raghav has made page_endio() unneeded and has removed it.
- Peter Xu contributed some rationalizations of the userfaultfd
selftests.
- Yosry Ahmed has fixed an issue around memcg's page recalim
accounting.
- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.
- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.
- Peter Xu fixes a few issues with uffd-wp at fork() time.
- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis.
* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
shmem: restrict noswap option to initial user namespace
mm/khugepaged: fix conflicting mods to collapse_file()
sparse: remove unnecessary 0 values from rc
mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
maple_tree: fix allocation in mas_sparse_area()
mm: do not increment pgfault stats when page fault handler retries
zsmalloc: allow only one active pool compaction context
selftests/mm: add new selftests for KSM
mm: add new KSM process and sysfs knobs
mm: add new api to enable ksm per process
mm: shrinkers: fix debugfs file permissions
mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
migrate_pages_batch: fix statistics for longterm pin retry
userfaultfd: use helper function range_in_vma()
lib/show_mem.c: use for_each_populated_zone() simplify code
mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
fs/buffer: convert create_page_buffers to folio_create_buffers
fs/buffer: add folio_create_empty_buffers helper
...
Both bufio and bio-prison-v1 use the identical model for splitting
their respective locks and rbtrees. Improve dm_num_hash_locks() to
distribute across more rbtrees to improve overall performance -- but
the maximum number of locks/rbtrees is still 64.
Also factor out a common hash function named dm_hash_locks_index(),
the magic numbers used were determined to be best using this program:
https://gist.github.com/jthornber/e05c47daa7b500c56dc339269c5467fc
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Add num_locks member to dm_buffer_cache struct and use it rather than
the NR_LOCKS magic value (64).
Next commit will size the dm_buffer_cache's buffer_trees according to
dm_num_hash_locks().
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
The kernel supports multi page bio vector entries, so we can use them
in dm-bufio as an optimization.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Save one spinlock by using waitqueue_active. We hold the bufio lock at
this place, so no one can add entries to the waitqueue at this point.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Movement also consolidates holes in dm_bufio_client struct. But the
overall size of the struct isn't changed.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Sometimes it is beneficial to repeatedly get and drop locks as part of
an iteration. Introduce lock_history struct to help avoid redundant
drop and gets of the same lock.
Optimizes cache_iterate, cache_mark_many and cache_evict.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
When multiple threads perform IO to a thin device, the underlying
dm_bufio object can become a bottleneck; slowing down access to btree
nodes that store the thin metadata. Prior to this commit, each bufio
instance had a single mutex that was taken for every bufio operation.
This commit concentrates on improving the common case where: a user of
dm_bufio wishes to access, but not modify, a buffer which is already
within the dm_bufio cache.
Implementation::
The code has been refactored; pulling out an 'lru' abstraction and a
'buffer cache' abstraction (see 2 previous commits). This commit
updates higher level bufio code (that performs allocation of buffers,
IO and eviction/cache sizing) to leverage both abstractions. It also
deals with the delicate locking requirements of both abstractions to
provide finer grained locking. The result is significantly better
concurrent IO performance.
Before this commit, bufio has a global lru list it used to evict the
oldest, clean buffers from _all_ clients. With the new locking we
don’t want different ways to access the same buffer, so instead
do_global_cleanup() loops around the clients asking them to free
buffers older than a certain time.
This commit also converts many old BUG_ONs to WARN_ON_ONCE, see the
lru_evict and cache_evict code in particular. They will return
ER_DONT_EVICT if a given buffer somehow meets the invariants that
should _never_ happen. [Aside from revising this commit's header and
fixing coding style and whitespace nits: this switching to
WARN_ON_ONCE is Mike Snitzer's lone contribution to this commit]
Testing::
Some of the low level functions have been unit tested using dm-unit:
https://github.com/jthornber/dm-unit/blob/main/src/tests/bufio.rs
Higher level concurrency and IO is tested via a test only target
found here:
https://github.com/jthornber/linux/blob/2023-03-24-thin-concurrency-9/drivers/md/dm-bufio-test.c
The associated userland side of these tests is here:
https://github.com/jthornber/dmtest-python/blob/main/src/dmtest/bufio/bufio_tests.py
In addition the full dmtest suite of tests (dm-thin, dm-cache, etc)
has been run (~450 tests).
Performance::
Most bufio operations have unchanged performance. But if multiple
threads are attempting to get buffers concurrently, and these
buffers are already in the cache then there's a big speed up. Eg,
one test has 16 'hotspot' threads simulating btree lookups while
another thread dirties the whole device. In this case the hotspot
threads acquire the buffers about 25 times faster.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
The buffer cache is responsible for managing the holder count,
tracking clean/dirty state, and choosing buffers via predicates.
Higher level code is responsible for allocation of buffers, IO and
eviction/cache sizing.
The buffer cache has thread safe methods for acquiring a reference
to an existing buffer. All other methods in buffer cache are _not_
threadsafe, and only contain enough locking to guarantee the safe
methods.
Rather than a single mutex, sharded rw_semaphores are used to allow
concurrent threads to 'get' buffers. Each rw_semaphore protects its
own rbtree of buffer entries.
Code that uses this new dm_buffer_cache abstraction will be introduced
in a following commit.
This commit moves the dm_buffer struct in preparation for finer grained
dm_buffer changes, in the next commit, to be more easily seen. It also
introduces temporary dm_buffer struct members to allow compilation of
this intermediate commit (they will be elided in the next commit).
This commit will cause "defined but not used" compiler warnings that
will be resolved by the next commit.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
A CLOCK algorithm is used in this LRU abstraction. This avoids
relinking list nodes, which would require a write lock protecting it.
None of the LRU methods are threadsafe; locking must be done at a
higher level.
Code that uses this new LRU will be introduced in the next 2 commits.
As such, this commit will cause "defined but not used" compiler warnings
that will be resolved by the next 2 commits.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reasonable to relax to WARN_ON because these are easily avoided but do
offer some assurance future coding mistakes won't occur (if changes
tested properly).
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
All these instances are entirely avoidable given that they speak to
coding mistakes that result in inappropriate use. Proper testing during
development will catch any such coding bug so its best to relax all of
these from BUG_ON to WARN_ON_ONCE.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Was used by multi-snapshot DM target that never went upstream.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
'GPL-2.0-only' is used instead of 'GPL-2.0' because SPDX has
deprecated its use.
Suggested-by: John Wiele <jwiele@redhat.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Commit e33c267ab7 ("mm: shrinkers: provide shrinkers with names")
chose some fairly bad names for DM's shrinkers.
Fixes: e33c267ab7 ("mm: shrinkers: provide shrinkers with names")
Signed-off-by : Mike Snitzer <snitzer@kernel.org>
The 'no_sleep_enabled' should be decreased in error handling path
in dm_bufio_client_create() when the DM_BUFIO_CLIENT_NO_SLEEP flag
is set, otherwise static_branch_unlikely() will always return true
even if no dm_bufio_client instances have DM_BUFIO_CLIENT_NO_SLEEP
flag set.
Cc: stable@vger.kernel.org
Fixes: 3c1c875d05 ("dm bufio: conditionally enable branching for DM_BUFIO_CLIENT_NO_SLEEP")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
The function test_bit doesn't provide any memory barrier. It may be
possible that the read requests that follow test_bit(B_READING, &b->state)
are reordered before the test, reading invalid data that existed before
B_READING was cleared.
Fix this bug by changing test_bit to test_bit_acquire. This is
particularly important on arches with weak(er) memory ordering
(e.g. arm64).
Depends-On: 8238b45798 ("wait_on_bit: add an acquire memory barrier")
Depends-On: d6ffe6067a ("provide arch_test_bit_acquire for architectures that define test_bit")
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
- A smatch warning fix for DM writecache locking in writecache_map.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmL1q20ACgkQxSPxCi2d
A1rZvggAxFFaj2V12TmJZ/D75ptDFfqsfEomsLKGqjq0pIYfhawWBnz0IgHd34vC
6Qy8bgoUGFQaexruFw6AjJQ3goaTJfFMy1/Nrf5Mvs7URlk7ckWgSz5ng9BRvePx
Qyp03BKjtWpu++uKJpKq1DrHXTor0J73dVkHCnAcqHHKmaiZdy9gf+5OdUMYcBX6
rNwLqlSqGGMG2TQp6/tUdytUsB2GyhAPs/uhTtTDa0glTwRvYWU0dhacuBqngLwr
G+UkPbE23s3NWMhvCK9qbTPT79prEgLrC/WXioeFBxJaV4LbXY/6hZ3817JpwjRv
aDxFXr8V6lWLJEeYY0MxImAGE2DUiA==
=yCb1
-----END PGP SIGNATURE-----
Merge tag 'for-6.0/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- A few fixes for the DM verity and bufio changes in this merge window
- A smatch warning fix for DM writecache locking in writecache_map
* tag 'for-6.0/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm bufio: fix some cases where the code sleeps with spinlock held
dm writecache: fix smatch warning about invalid return from writecache_map
dm verity: fix verity_parse_opt_args parsing
dm verity: fix DM_VERITY_OPTS_MAX value yet again
dm bufio: simplify DM_BUFIO_CLIENT_NO_SLEEP locking
Commit b32d45824a ("dm bufio: Add DM_BUFIO_CLIENT_NO_SLEEP flag")
added a "NO_SLEEP" mode, it replaces a mutex with a spinlock, and it
is only usable when the device is in read-only mode (because the write
path may be sleeping while holding the dm_bufio_client lock).
However, there are still two points where the code could sleep even in
read-only mode. One is in __get_unclaimed_buffer -> __make_buffer_clean.
The other is in __try_evict_buffer -> __make_buffer_clean. These functions
will call __make_buffer_clean which sleeps if the buffer is being read.
Fix these cases so that if c->no_sleep is set __make_buffer_clean
will not be called and the buffer will be skipped instead.
Fixes: b32d45824a ("dm bufio: Add DM_BUFIO_CLIENT_NO_SLEEP flag")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Historically none of the bufio code runs in interrupt context but with
the use of DM_BUFIO_CLIENT_NO_SLEEP a bufio client can, see: commit
5721d4e5a9 ("dm verity: Add optional "try_verify_in_tasklet" feature")
That said, the new tasklet usecase still doesn't require interrupts be
disabled by bufio (let alone conditionally restore them).
Yet with PREEMPT_RT, and falling back from tasklet to workqueue, care
must be taken to properly synchronize between softirq and process
context, otherwise ABBA deadlock may occur. While it is unnecessary to
disable bottom-half preemption within a tasklet, we must consistently do
so in process context to ensure locking is in the proper order.
Fix these issues by switching from spin_lock_irq{save,restore} to using
spin_{lock,unlock}_bh instead. Also remove the 'spinlock_flags' member
in dm_bufio_client struct (that can be used unsafely if bufio must
recurse on behalf of some caller, e.g. block layer's submit_bio).
Fixes: 5721d4e5a9 ("dm verity: Add optional "try_verify_in_tasklet" feature")
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
DM_BUFIO_CLIENT_NO_SLEEP flag to have dm-bufio use spinlock rather
than mutex for its locking.
- Add optional "try_verify_in_tasklet" feature to DM verity target.
This feature gives users the option to improve IO latency by using a
tasklet to verify, using hashes in bufio's cache, rather than wait
to schedule a work item via workqueue. But if there is a bufio cache
miss, or an error, then the tasklet will fallback to using workqueue.
- Incremental changes to both dm-bufio and the DM verity target to use
jump_label to minimize cost of branching associated with the niche
"try_verify_in_tasklet" feature. DM-bufio in particular is used by
quite a few other DM targets so it doesn't make sense to incur
additional bufio cost in those targets purely for the benefit of
this niche verity feature if the feature isn't ever used.
- Optimize verity_verify_io, which is used by both workqueue and
tasklet based verification, if FEC is not configured or tasklet
based verification isn't used.
- Remove DM verity target's verify_wq's use of the WQ_CPU_INTENSIVE
flag since it uses WQ_UNBOUND. Also, use the WQ_HIGHPRI flag if
"try_verify_in_tasklet" is specified.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmLtYU0ACgkQxSPxCi2d
A1pIDwgAjQi7jSxN7n+Fb4sJLL5x3WvuVGcockIkucj+Pvr3nvijwkf27+kbCWhn
d4bDhA60gCebd87lf2PZTf8LL2+h9SLzFDTrgBVg5eC4O8aoQNrgwMMKVvYn+MmK
OShurwHXS/7iqCETFaUA7hVtH/NwSWzP7WL5+QIDVOWVGaTLnqdvA4TYSZnljEg2
c02bL2KK+ndsYYshDq7HnVuqr4hIBWKF6y0lApU42mfTCnghX8ZnUMG9pO9K+20X
qVfQH58CjOTP0MaHsddyR1sTKKZ1qY1HdoDhnlMVfZD5XqnCMhzefKoMxbxJKmJ3
7hS5w2tNxSx4yYWGj3dXHKhEZi0buA==
=ZBi4
-----END PGP SIGNATURE-----
Merge tag 'for-6.0/dm-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull more device mapper updates from Mike Snitzer:
- Add flags argument to dm_bufio_client_create and introduce
DM_BUFIO_CLIENT_NO_SLEEP flag to have dm-bufio use spinlock rather
than mutex for its locking.
- Add optional "try_verify_in_tasklet" feature to DM verity target.
This feature gives users the option to improve IO latency by using a
tasklet to verify, using hashes in bufio's cache, rather than wait to
schedule a work item via workqueue. But if there is a bufio cache
miss, or an error, then the tasklet will fallback to using workqueue.
- Incremental changes to both dm-bufio and the DM verity target to use
jump_label to minimize cost of branching associated with the niche
"try_verify_in_tasklet" feature. DM-bufio in particular is used by
quite a few other DM targets so it doesn't make sense to incur
additional bufio cost in those targets purely for the benefit of this
niche verity feature if the feature isn't ever used.
- Optimize verity_verify_io, which is used by both workqueue and
tasklet based verification, if FEC is not configured or tasklet based
verification isn't used.
- Remove DM verity target's verify_wq's use of the WQ_CPU_INTENSIVE
flag since it uses WQ_UNBOUND. Also, use the WQ_HIGHPRI flag if
"try_verify_in_tasklet" is specified.
* tag 'for-6.0/dm-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm verity: have verify_wq use WQ_HIGHPRI if "try_verify_in_tasklet"
dm verity: remove WQ_CPU_INTENSIVE flag since using WQ_UNBOUND
dm verity: only copy bvec_iter in verity_verify_io if in_tasklet
dm verity: optimize verity_verify_io if FEC not configured
dm verity: conditionally enable branching for "try_verify_in_tasklet"
dm bufio: conditionally enable branching for DM_BUFIO_CLIENT_NO_SLEEP
dm verity: allow optional args to alter primary args handling
dm verity: Add optional "try_verify_in_tasklet" feature
dm bufio: Add DM_BUFIO_CLIENT_NO_SLEEP flag
dm bufio: Add flags argument to dm_bufio_client_create
Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
- Some kmemleak fixes from Patrick Wang and Waiman Long
- DAMON updates from SeongJae Park
- memcg debug/visibility work from Roman Gushchin
- vmalloc speedup from Uladzislau Rezki
- more folio conversion work from Matthew Wilcox
- enhancements for coherent device memory mapping from Alex Sierra
- addition of shared pages tracking and CoW support for fsdax, from
Shiyang Ruan
- hugetlb optimizations from Mike Kravetz
- Mel Gorman has contributed some pagealloc changes to improve latency
and realtime behaviour.
- mprotect soft-dirty checking has been improved by Peter Xu
- Many other singleton patches all over the place
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYuravgAKCRDdBJ7gKXxA
jpqSAQDrXSdII+ht9kSHlaCVYjqRFQz/rRvURQrWQV74f6aeiAD+NHHeDPwZn11/
SPktqEUrF1pxnGQxqLh1kUFUhsVZQgE=
=w/UH
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Most of the MM queue. A few things are still pending.
Liam's maple tree rework didn't make it. This has resulted in a few
other minor patch series being held over for next time.
Multi-gen LRU still isn't merged as we were waiting for mapletree to
stabilize. The current plan is to merge MGLRU into -mm soon and to
later reintroduce mapletree, with a view to hopefully getting both
into 6.1-rc1.
Summary:
- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
- Some kmemleak fixes from Patrick Wang and Waiman Long
- DAMON updates from SeongJae Park
- memcg debug/visibility work from Roman Gushchin
- vmalloc speedup from Uladzislau Rezki
- more folio conversion work from Matthew Wilcox
- enhancements for coherent device memory mapping from Alex Sierra
- addition of shared pages tracking and CoW support for fsdax, from
Shiyang Ruan
- hugetlb optimizations from Mike Kravetz
- Mel Gorman has contributed some pagealloc changes to improve
latency and realtime behaviour.
- mprotect soft-dirty checking has been improved by Peter Xu
- Many other singleton patches all over the place"
[ XFS merge from hell as per Darrick Wong in
https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ]
* tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits)
tools/testing/selftests/vm/hmm-tests.c: fix build
mm: Kconfig: fix typo
mm: memory-failure: convert to pr_fmt()
mm: use is_zone_movable_page() helper
hugetlbfs: fix inaccurate comment in hugetlbfs_statfs()
hugetlbfs: cleanup some comments in inode.c
hugetlbfs: remove unneeded header file
hugetlbfs: remove unneeded hugetlbfs_ops forward declaration
hugetlbfs: use helper macro SZ_1{K,M}
mm: cleanup is_highmem()
mm/hmm: add a test for cross device private faults
selftests: add soft-dirty into run_vmtests.sh
selftests: soft-dirty: add test for mprotect
mm/mprotect: fix soft-dirty check in can_change_pte_writable()
mm: memcontrol: fix potential oom_lock recursion deadlock
mm/gup.c: fix formatting in check_and_migrate_movable_page()
xfs: fail dax mount if reflink is enabled on a partition
mm/memcontrol.c: remove the redundant updating of stats_flush_threshold
userfaultfd: don't fail on unrecognized features
hugetlb_cgroup: fix wrong hugetlb cgroup numa stat
...
Add an optional flag that ensures dm_bufio_client does not sleep
(primary focus is to service dm_bufio_get without sleeping). This
allows the dm-bufio cache to be queried from interrupt context.
To ensure that dm-bufio does not sleep, dm-bufio must use a spinlock
instead of a mutex. Additionally, to avoid deadlocks, special care
must be taken so that dm-bufio does not sleep while holding the
spinlock.
But again: the scope of this no_sleep is initially confined to
dm_bufio_get, so __alloc_buffer_wait_no_callback is _not_ changed to
avoid sleeping because __bufio_new avoids allocation for NF_GET.
Signed-off-by: Nathan Huckleberry <nhuck@google.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Add a flags argument to dm_bufio_client_create and update all the
callers. This is in preparation to add the DM_BUFIO_NO_SLEEP flag.
Signed-off-by: Nathan Huckleberry <nhuck@google.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Improve kernel code uniformity by combining the request operation type and
flags into a single variable. Change 'int rw' into 'enum req_op op' because
the name 'op' is what is used in the block layer to hold a request type.
Use the blk_opf_t and enum req_op types where appropriate to improve static
type checking.
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-24-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Combine the bi_op and bi_op_flags into the bi_opf member. Use the new
blk_opf_t type to improve static type checking. This patch does not
change any functionality.
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-22-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently shrinkers are anonymous objects. For debugging purposes they
can be identified by count/scan function names, but it's not always
useful: e.g. for superblock's shrinkers it's nice to have at least an
idea of to which superblock the shrinker belongs.
This commit adds names to shrinkers. register_shrinker() and
prealloc_shrinker() functions are extended to take a format and arguments
to master a name.
In some cases it's not possible to determine a good name at the time when
a shrinker is allocated. For such cases shrinker_debugfs_rename() is
provided.
The expected format is:
<subsystem>-<shrinker_type>[:<instance>]-<id>
For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair.
After this change the shrinker debugfs directory looks like:
$ cd /sys/kernel/debug/shrinker/
$ ls
dquota-cache-16 sb-devpts-28 sb-proc-47 sb-tmpfs-42
mm-shadow-18 sb-devtmpfs-5 sb-proc-48 sb-tmpfs-43
mm-zspool:zram0-34 sb-hugetlbfs-17 sb-pstore-31 sb-tmpfs-44
rcu-kfree-0 sb-hugetlbfs-33 sb-rootfs-2 sb-tmpfs-49
sb-aio-20 sb-iomem-12 sb-securityfs-6 sb-tracefs-13
sb-anon_inodefs-15 sb-mqueue-21 sb-selinuxfs-22 sb-xfs:vda1-36
sb-bdev-3 sb-nsfs-4 sb-sockfs-8 sb-zsmalloc-19
sb-bpf-32 sb-pipefs-14 sb-sysfs-26 thp-deferred_split-10
sb-btrfs:vda2-24 sb-proc-25 sb-tmpfs-1 thp-zero-9
sb-cgroup2-30 sb-proc-39 sb-tmpfs-27 xfs-buf:vda1-37
sb-configfs-23 sb-proc-41 sb-tmpfs-29 xfs-inodegc:vda1-38
sb-dax-11 sb-proc-45 sb-tmpfs-35
sb-debugfs-7 sb-proc-46 sb-tmpfs-40
[roman.gushchin@linux.dev: fix build warnings]
Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Remove the magic autofree semantics and require the callers to explicitly
call bio_init to initialize the bio.
This allows bio_free to catch accidental bio_put calls on bio_init()ed
bios as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Coly Li <colyli@suse.de>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20220406061228.410163-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
subsystem. Also enhance both the integrity and crypt targets to emit
events to via dm-audit.
- Various other simple code improvements and cleanups.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmGJlFkACgkQxSPxCi2d
A1pqwwf/YZ6kNKRQaKF1mbkkHOxa/ULf7qIhi/R0epwJu4j1RGsCACS34EqzLc4c
x15h6flCNj1IBVAqTvMUETYTjTLtyrcfD0yBRWYw2RL0ksHMHyMvd1r/7aE64+pj
EeZk9Xzcx3Gsq9GOzKfYA2AX0PrypkKSjgHK7hgv+Jh5heqkFcnMXSl3l7BQ6vbr
ue9joPSI7+6eVFMDn32KxyHzfm6zZo1nmKZ6tQBBHD1D9yBqWTAhXiyXhRA+BOYH
Tg5wE1fvZ/htyZNEc1cMRArzLF6q9pEU4r8j472N6IcJbhIJzSu0V60zVvexNWG3
fJSIWqlta1KFK8SQttmDmfFnJiFcyw==
=t097
-----END PGP SIGNATURE-----
Merge tag 'for-5.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- Add DM core support for emitting audit events through the audit
subsystem. Also enhance both the integrity and crypt targets to emit
events to via dm-audit.
- Various other simple code improvements and cleanups.
* tag 'for-5.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm table: log table creation error code
dm: make workqueue names device-specific
dm writecache: Make use of the helper macro kthread_run()
dm crypt: Make use of the helper macro kthread_run()
dm verity: use bvec_kmap_local in verity_for_bv_block
dm log writes: use memcpy_from_bvec in log_writes_map
dm integrity: use bvec_kmap_local in __journal_read_write
dm integrity: use bvec_kmap_local in integrity_metadata
dm: add add_disk() error handling
dm: Remove redundant flush_workqueue() calls
dm crypt: log aead integrity violations to audit subsystem
dm integrity: log audit events for dm-integrity target
dm: introduce audit event module for device mapper