linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-10 14:11:52 +00:00

Author	SHA1	Message	Date
Shakeel Butt	70a64b7919	memcg: dynamically allocate lruvec_stats To decouple the dependency of lruvec_stats on NR_VM_NODE_STAT_ITEMS, we need to dynamically allocate lruvec_stats in the mem_cgroup_per_node structure. Also move the definition of lruvec_stats_percpu and lruvec_stats and related functions to the memcontrol.c to facilitate later patches. No functional changes in the patch. Link: https://lkml.kernel.org/r/20240501172617.678560-3-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: T.J. Mercier <tjmercier@google.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-07 10:36:59 -07:00
Shakeel Butt	59142d87ab	memcg: reduce memory size of mem_cgroup_events_index Patch series "memcg: reduce memory consumption by memcg stats", v4. Most of the memory overhead of a memcg object is due to memcg stats maintained by the kernel. Since stats updates happen in performance critical codepaths, the stats are maintained per-cpu and numa specific stats are maintained per-node * per-cpu. This drastically increase the overhead on large machines i.e. large of CPUs and multiple numa nodes. This patch series tries to reduce the overhead by at least not allocating the memory for stats which are not memcg specific. This patch (of 8): mem_cgroup_events_index is a translation table to get the right index of the memcg relevant entry for the general vm_event_item. At the moment, it is defined as integer array. However on a typical system the max entry of vm_event_item (NR_VM_EVENT_ITEMS) is 113, so we don't need to use int as storage type of the array. For now just use int8_t as type and add a BUILD_BUG_ON(). Another benefit of this change is that the translation table fits in 2 cachelines while previously it would require 8 cachelines (assuming 64 bytes cacheline). Link: https://lkml.kernel.org/r/20240501172617.678560-1-shakeel.butt@linux.dev Link: https://lkml.kernel.org/r/20240501172617.678560-2-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: T.J. Mercier <tjmercier@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-07 10:36:59 -07:00
David Hildenbrand	b8a2528835	mm/hugetlb: document why hugetlb uses folio_mapcount() for COW reuse decisions Let's document why hugetlb still uses folio_mapcount() and is prone to leaking memory between processes, for example using vmsplice() that still uses FOLL_GET. More details can be found in [1], especially around how hugetlb pages cannot really be overcommitted, and why we don't particularly care about these vmsplice() leaks for hugetlb -- in contrast to ordinary memory. [1] https://lore.kernel.org/all/8b42a24d-caf0-46ef-9e15-0f88d47d2f21@redhat.com/ Link: https://lkml.kernel.org/r/20240502085259.103784-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Suggested-by: Peter Xu <peterx@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-07 10:36:58 -07:00
linke li	5ee9562c58	mm/swapfile: mark racy access on si->highest_bit In scan_swap_map_slots(), si->highest_bit can by changed by swap_range_alloc() concurrently. All reads on si->highest_bit except one is either protected by lock or read using READ_ONCE. So mark the one racy read on si->highest_bit as benign using READ_ONCE. This patch is aimed at reducing the number of benign races reported by KCSAN in order to focus future debugging effort on harmful races. Link: https://lkml.kernel.org/r/tencent_912BC3E8B0291DA4A0028AB424076375DA07@qq.com Signed-off-by: linke li <lilinke99@qq.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:57 -07:00
Hao Ge	637a900b08	mm/rmap: change the type of we_locked from int to bool Change the type of we_locked from int to bool because folio_trylock return bool Link: https://lkml.kernel.org/r/20240428012049.8182-1-gehao@kylinos.cn Signed-off-by: Hao Ge <gehao@kylinos.cn> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:56 -07:00
Zi Yan	7491f3f348	mm/rmap: do not add fully unmapped large folio to deferred split list In __folio_remove_rmap(), a large folio is added to deferred split list if any page in a folio loses its final mapping. But it is possible that the folio is fully unmapped and adding it to deferred split list is unnecessary. For PMD-mapped THPs, that was not really an issue, because removing the last PMD mapping in the absence of PTE mappings would not have added the folio to the deferred split queue. However, for PTE-mapped THPs, which are now more prominent due to mTHP, they are always added to the deferred split queue. One side effect is that the THP_DEFERRED_SPLIT_PAGE stat for a PTE-mapped folio can be unintentionally increased, making it look like there are many partially mapped folios -- although the whole folio is fully unmapped stepwise. Core-mm now tries batch-unmapping consecutive PTEs of PTE-mapped THPs where possible starting from commit `b06dc281aa` ("mm/rmap: introduce folio_remove_rmap_[pte\|ptes\|pmd]()"). When it happens, a whole PTE-mapped folio is unmapped in one go and can avoid being added to deferred split list, reducing the THP_DEFERRED_SPLIT_PAGE noise. But there will still be noise when we cannot batch-unmap a complete PTE-mapped folio in one go -- or where this type of batching is not implemented yet, e.g., migration. To avoid the unnecessary addition, folio->_nr_pages_mapped is checked to tell if the whole folio is unmapped. If the folio is already on deferred split list, it will be skipped, too. Note: commit 98046944a159 ("mm: huge_memory: add the missing folio_test_pmd_mappable() for THP split statistics") tried to exclude mTHP deferred split stats from THP_DEFERRED_SPLIT_PAGE, but it does not fix the above issue. A fully unmapped PTE-mapped order-9 THP was still added to deferred split list and counted as THP_DEFERRED_SPLIT_PAGE, since nr is 512 (non zero), level is RMAP_LEVEL_PTE, and inside deferred_split_folio() the order-9 folio is folio_test_pmd_mappable(). Link: https://lkml.kernel.org/r/20240502132852.862138-1-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Lance Yang <ioworker0@gmail.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:56 -07:00
SeongJae Park	ade414bdf6	mm/damon/paddr: implement DAMOS filter type YOUNG DAMOS filter of type YOUNG is defined, but not yet implemented by any DAMON operations set. Add the implementation on 'paddr', the DAMON operations set for the physical address space. Link: https://lkml.kernel.org/r/20240426195247.100306-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Tested-by: Honggyu Kim <honggyu.kim@sk.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:55 -07:00
SeongJae Park	2d8b24654f	mm/damon: add DAMOS filter type YOUNG Define yet another DAMOS filter type, YOUNG. Like anon and memcg, the type of filter will be applied to each page in the memory region, and see if the page is accessed since the last check. Based on the 'matching' parameter, the page is filtered out or in. Note that this commit is adding only the type definition. The implementation should be made by DAMON operations sets. A commit for the implementation on 'paddr' DAMON operations set will follow. Link: https://lkml.kernel.org/r/20240426195247.100306-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Tested-by: Honggyu Kim <honggyu.kim@sk.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:55 -07:00
SeongJae Park	6daea38215	mm/damon/paddr: implement damon_folio_mkold() damon_pa_mkold() receives physical address, get the folio covering the address, and makes the folio as old. A following commit will reuse the internal logic for marking a given folio as old. To avoid duplication of the code, split the internal logic. Also, change the rmap walker function's name from __damon_pa_mkold() to damon_folio_mkold_one(), following the change of the caller's name and the naming rule that more commonly used by other rmap walkers. Link: https://lkml.kernel.org/r/20240426195247.100306-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Tested-by: Honggyu Kim <honggyu.kim@sk.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:54 -07:00
SeongJae Park	180d928e55	mm/damon/paddr: implement damon_folio_young() Patch series "mm/damon: add a DAMOS filter type for page granularity access recheck". DAMON provides its best-effort accuracy-overhead tradeoff under the user-defined ranges of acceptable level of the monitoring accuracy and overhead. A recent discussion for tiered memory management support from DAMON[1] concluded that finding memory regions of specific access pattern with low overhead despite of low accuracy via DAMON first, and then double checking the access of the region again in a finer (e.g., page) granularity could be a useful strategy for some DAMOS schemes. Add a new type of DAMOS filter, namely 'young' for such a case. It checks each page of DAMOS target region is accessed since the last check, and filters it out or in if 'matching' parameter is 'true' or 'false', respectively. Because this is a filter type that applied in page granularity, the support depends on DAMON operations set, similar to 'anon' and 'memcg' DAMOS filter types. Implement the support on the DAMON operations set for the physical address space, 'paddr', since one of the expected usages[1] is based on the physical address space. [1] https://lore.kernel.org/r/20240227235121.153277-1-sj@kernel.org This patch (of 7): damon_pa_young() receives physical address, get the folio covering the address, and show if the folio is accessed since the last check. A following commit will reuse the internal logic for checking access to a given folio. To avoid duplication of the code, split the internal logic. Also, change the rmap walker function's name from __damon_pa_young() to damon_folio_young_one(), following the change of the caller's name and the naming rule that more commonly used by other rmap walkers. Link: https://lkml.kernel.org/r/20240426195247.100306-1-sj@kernel.org Link: https://lkml.kernel.org/r/20240426195247.100306-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Tested-by: Honggyu Kim <honggyu.kim@sk.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:54 -07:00
Matthew Wilcox (Oracle)	737019cf6a	mm: optimise vmf_anon_prepare() for VMAs without an anon_vma If the mmap_lock can be taken for read, we can call __anon_vma_prepare() while holding it, saving ourselves a trip back through the fault handler. Link: https://lkml.kernel.org/r/20240426144506.1290619-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:54 -07:00
Matthew Wilcox (Oracle)	73b4a0cd82	mm: fix some minor per-VMA lock issues in userfaultfd Rename lock_vma() to uffd_lock_vma() because it really is uffd specific. Remove comment referencing unlock_vma() which doesn't exist. Fix the comment about lock_vma_under_rcu() which I just made incorrect. Link: https://lkml.kernel.org/r/20240426144506.1290619-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:54 -07:00
Matthew Wilcox (Oracle)	a373baed5a	mm: delay the check for a NULL anon_vma Instead of checking the anon_vma early in the fault path where all page faults pay the cost, delay it until we know we're going to need the anon_vma to be filled in. This will have a slight negative effect on the first fault in an anonymous VMA, but it shortens every other page fault. It also makes the code slightly cleaner as the anon and file backed fault handling look more similar. The Intel kernel test bot reports a 3x improvement in vm-scalability throughput with the small-allocs-mt test. This is clearly an extreme situation that won't be replicated in any real-world workload, but it's a nice win. https://lore.kernel.org/all/202404261055.c5e24608-oliver.sang@intel.com/ Link: https://lkml.kernel.org/r/20240426144506.1290619-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:53 -07:00
Matthew Wilcox (Oracle)	3be5106059	mm: assert the mmap_lock is held in __anon_vma_prepare() Patch series "Improve anon_vma scalability for anon VMAs". We have a 3x throughput improvement reported by Intel's kernel test robot: https://lore.kernel.org/all/202404261055.c5e24608-oliver.sang@intel.com/ This is from delaying taking the mmap_lock for page faults until we actually need the mmap_lock in order to assign an anon_vma to the vma. It cleans up the page fault path a little by making the anon fault handler more similar to the file fault handler. This patch (of 4): Convert the comment into an assertion. Link: https://lkml.kernel.org/r/20240426144506.1290619-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240426144506.1290619-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:53 -07:00
Matthew Wilcox	e0ffb29bc5	mm: simplify thp_vma_allowable_order Combine the three boolean arguments into one flags argument for readability. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:53 -07:00
Kemeng Shi	dc6e0ae5b1	mm: remove stale comment __folio_mark_dirty The __folio_mark_dirty will not mark inode dirty any longer. Remove the stale comment of it. Link: https://lkml.kernel.org/r/20240425131724.36778-5-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Howard Cochran <hcochran@kernelspring.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:53 -07:00
Kemeng Shi	3b3e412e5f	mm: call __wb_calc_thresh instead of wb_calc_thresh in wb_over_bg_thresh Call __wb_calc_thresh to calculate wb bg_thresh of gdtc in wb_over_bg_thresh to remove unnecessary wrap in wb_calc_thresh. Link: https://lkml.kernel.org/r/20240425131724.36778-4-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Howard Cochran <hcochran@kernelspring.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:52 -07:00
Kemeng Shi	fabd2e42bc	mm: correct calculation of wb's bg_thresh in cgroup domain wb_calc_thresh() is calculating wb's share of bg_thresh in the global domain. However in case of cgroup writeback this is not the right thing to do. Consider the following domain hierarchy: global domain (> 20G) / \ cgroup1 (10G) cgroup2 (10G) \| \| bdi wb1 wb2 and assume wb1 and wb2 have the same bandwidth and the background threshold is set at 10%. The bg_thresh of cgroup1 and cgroup2 is going to be 1G. Now because wb_calc_thresh(mdtc->wb, mdtc->bg_thresh) calculates per-wb threshold in the global domain as (wb bandwidth) / (domain bandwidth) it returns bg_thresh for wb1 as 0.5G although it has nobody to compete against in cgroup1. Fix the problem by calculating wb's share of bg_thresh in the cgroup domain. Test as following: /* make it easier to observe the issue / echo 300000 > /proc/sys/vm/dirty_expire_centisecs echo 100 > /proc/sys/vm/dirty_writeback_centisecs / run fio in wb1 / cd /sys/fs/cgroup echo "+memory +io" > cgroup.subtree_control mkdir group1 cd group1 echo 10G > memory.high echo 10G > memory.max echo $$ > cgroup.procs mkfs.ext4 -F /dev/vdb mount /dev/vdb /bdi1/ fio -name test -filename=/bdi1/file -size=600M -ioengine=libaio -bs=4K \ -iodepth=1 -rw=write -direct=0 --time_based -runtime=600 -invalidate=0 / run fio in wb2 with a new shell */ cd /sys/fs/cgroup mkdir group2 cd group2 echo 10G > memory.high echo 10G > memory.max echo $$ > cgroup.procs mkfs.ext4 -F /dev/vdc mount /dev/vdc /bdi2/ fio -name test -filename=/bdi2/file -size=600M -ioengine=libaio -bs=4K \ -iodepth=1 -rw=write -direct=0 --time_based -runtime=600 -invalidate=0 Before fix, the wrttien pages of wb1 and wb2 reported from toos/writeback/wb_monitor.py keep growing. After fix, rare written pages are accumulated. There is no obvious change in fio result. [jack@suse.cz: changelog rewording] Link: https://lkml.kernel.org/r/20240425131724.36778-3-shikemeng@huaweicloud.com Fixes: `74d3694433` ("writeback: Fix performance regression in wb_over_bg_thresh()") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Howard Cochran <hcochran@kernelspring.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:52 -07:00
Kemeng Shi	13fc441284	mm: enable __wb_calc_thresh to calculate dirty background threshold Patch series "Fix and cleanups to page-writeback", v2. This series contains some random cleanups and a fix to correct calculation of wb's bg_thresh in cgroup domain. More details can be found respective patches. This patch (of 4): Originally, __wb_calc_thresh always calculate wb's share of dirty throttling threshold. By getting thresh of wb_domain from caller, __wb_calc_thresh could be used for both dirty throttling and dirty background threshold. This is a preparation to correct threshold calculation of wb in cgroup. Link: https://lkml.kernel.org/r/20240425131724.36778-1-shikemeng@huaweicloud.com Link: https://lkml.kernel.org/r/20240425131724.36778-2-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Howard Cochran <hcochran@kernelspring.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:52 -07:00
Kemeng Shi	826881a7f6	writeback: rename nr_reclaimable to nr_dirty in balance_dirty_pages Commit `8d92890bd6` ("mm/writeback: discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead") removed NR_UNSTABLE_NFS and nr_reclaimable only contains dirty page now. Rename nr_reclaimable to nr_dirty properly. Link: https://lkml.kernel.org/r/20240423034643.141219-6-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Brian Foster <bfoster@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: David Sterba <dsterba@suse.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:52 -07:00
Kemeng Shi	4b5bbc39d7	writeback: support retrieving per group debug writeback stats of bdi Add /sys/kernel/debug/bdi/xxx/wb_stats to show per group writeback stats of bdi. Following domain hierarchy is tested: global domain (320G) / \ cgroup domain1(10G) cgroup domain2(10G) \| \| bdi wb1 wb2 /* per wb writeback info of bdi is collected */ cat wb_stats WbCgIno: 1 WbWriteback: 0 kB WbReclaimable: 0 kB WbDirtyThresh: 0 kB WbDirtied: 0 kB WbWritten: 0 kB WbWriteBandwidth: 102400 kBps b_dirty: 0 b_io: 0 b_more_io: 0 b_dirty_time: 0 state: 1 WbCgIno: 4091 WbWriteback: 1792 kB WbReclaimable: 820512 kB WbDirtyThresh: 6004692 kB WbDirtied: `1820448` kB WbWritten: 999488 kB WbWriteBandwidth: 169020 kBps b_dirty: 0 b_io: 0 b_more_io: 1 b_dirty_time: 0 state: 5 WbCgIno: 4131 WbWriteback: 1120 kB WbReclaimable: 820064 kB WbDirtyThresh: 6004728 kB WbDirtied: 1822688 kB WbWritten: 1002400 kB WbWriteBandwidth: 153520 kBps b_dirty: 0 b_io: 0 b_more_io: 1 b_dirty_time: 0 state: 5 [shikemeng@huaweicloud.com: fix build problems] Link: https://lkml.kernel.org/r/20240423034643.141219-4-shikemeng@huaweicloud.com Link: https://lkml.kernel.org/r/20240423034643.141219-3-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Brian Foster <bfoster@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: David Sterba <dsterba@suse.com> Cc: Jan Kara <jack@suse.cz> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:51 -07:00
Kemeng Shi	e32e27009f	writeback: collect stats of all wb of bdi in bdi_debug_stats_show Patch series "Improve visibility of writeback", v5. This series tries to improve visilibity of writeback. Patch 1 make /sys/kernel/debug/bdi/xxx/stats show writeback info of whole bdi instead of only writeback info in root cgroup. Patch 2 add a new debug file /sys/kernel/debug/bdi/xxx/wb_stats to show per wb writeback info. Patch 3 add wb_monitor.py to monitor basic writeback info of running system, more info could be added on demand. Patch 4 is a random cleanup. More details can be found in respective patches. Following domain hierarchy is tested: global domain (320G) / \ cgroup domain1(10G) cgroup domain2(10G) \| \| bdi wb1 wb2 /* all writeback info of bdi is successfully collected / cat stats BdiWriteback: 4704 kB BdiReclaimable: 1294496 kB BdiDirtyThresh: 204208088 kB DirtyThresh: 195259944 kB BackgroundThresh: 32503588 kB BdiDirtied: 48519296 kB BdiWritten: 47225696 kB BdiWriteBandwidth: 1173892 kBps b_dirty: 1 b_io: 0 b_more_io: 1 b_dirty_time: 0 bdi_list: 1 state: 1 / per wb writeback info of bdi is collected / cat /sys/kernel/debug/bdi/252:16/wb_stats WbCgIno: 1 WbWriteback: 0 kB WbReclaimable: 0 kB WbDirtyThresh: 0 kB WbDirtied: 0 kB WbWritten: 0 kB WbWriteBandwidth: 102400 kBps b_dirty: 0 b_io: 0 b_more_io: 0 b_dirty_time: 0 state: 1 WbCgIno: 4208 WbWriteback: 59808 kB WbReclaimable: 676480 kB WbDirtyThresh: 6004624 kB WbDirtied: 23348192 kB WbWritten: 22614592 kB WbWriteBandwidth: 593204 kBps b_dirty: 1 b_io: 1 b_more_io: 0 b_dirty_time: 0 state: 7 WbCgIno: 4249 WbWriteback: 144256 kB WbReclaimable: 432096 kB WbDirtyThresh: 6004344 kB WbDirtied: 25727744 kB WbWritten: 25154752 kB WbWriteBandwidth: 577904 kBps b_dirty: 0 b_io: 1 b_more_io: 0 b_dirty_time: 0 state: 7 The wb_monitor.py script output is as following: ./wb_monitor.py 252:16 -c writeback reclaimable dirtied written avg_bw 252:16_1 0 0 0 0 102400 252:16_4284 672 820064 9230368 8410304 685612 252:16_4325 896 819840 10491264 9671648 652348 252:16 1568 1639904 19721632 18081952 1440360 writeback reclaimable dirtied written avg_bw 252:16_1 0 0 0 0 102400 252:16_4284 672 820064 9230368 8410304 685612 252:16_4325 896 819840 10491264 9671648 652348 252:16 1568 1639904 19721632 18081952 1440360 ... This patch (of 5): /sys/kernel/debug/bdi/xxx/stats is supposed to show writeback information of whole bdi, but only writeback information of bdi in root cgroup is collected. So writeback information in non-root cgroup are missing now. To be more specific, considering following case: / create writeback cgroup / cd /sys/fs/cgroup echo "+memory +io" > cgroup.subtree_control mkdir group1 cd group1 echo $$ > cgroup.procs / do writeback in cgroup / fio -name test -filename=/dev/vdb ... / get writeback info of bdi / cat /sys/kernel/debug/bdi/xxx/stats The cat result unexpectedly implies that there is no writeback on target bdi. Fix this by collecting stats of all wb in bdi instead of only wb in root cgroup. Following domain hierarchy is tested: global domain (320G) / \ cgroup domain1(10G) cgroup domain2(10G) \| \| bdi wb1 wb2 / all writeback info of bdi is successfully collected */ cat stats BdiWriteback: 2912 kB BdiReclaimable: 1598464 kB BdiDirtyThresh: 167479028 kB DirtyThresh: 195038532 kB BackgroundThresh: 32466728 kB BdiDirtied: 19141696 kB BdiWritten: 17543456 kB BdiWriteBandwidth: 1136172 kBps b_dirty: 2 b_io: 0 b_more_io: 1 b_dirty_time: 0 bdi_list: 1 state: 1 Link: https://lkml.kernel.org/r/20240423034643.141219-1-shikemeng@huaweicloud.com Link: https://lkml.kernel.org/r/20240423034643.141219-2-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Brian Foster <bfoster@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: David Sterba <dsterba@suse.com> Cc: Jan Kara <jack@suse.cz> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:51 -07:00
Hariom Panthi	21e516b913	mm: vmalloc: dump page owner info if page is already mapped In vmap_pte_range, BUG_ON is called when page is already mapped, It doesn't give enough information to debug further. Dumping page owner information alongwith BUG_ON will be more useful in case of multiple page mapping. Example: [ 14.552875] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10b923 [ 14.553440] flags: 0xbffff0000000000(node=0\|zone=2\|lastcpupid=0x3ffff) [ 14.554001] page_type: 0xffffffff() [ 14.554783] raw: 0bffff0000000000 0000000000000000 dead000000000122 0000000000000000 [ 14.555230] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 [ 14.555768] page dumped because: remapping already mapped page [ 14.556172] page_owner tracks the page as allocated [ 14.556482] page last allocated via order 0, migratetype Unmovable, gfp_mask 0xcc0(GFP_KERNEL), pid 80, tgid 80 (insmod), ts 14552004992, free_ts 0 [ 14.557286] prep_new_page+0xa8/0x10c [ 14.558052] get_page_from_freelist+0x7f8/0x1248 [ 14.558298] __alloc_pages+0x164/0x2b4 [ 14.558514] alloc_pages_mpol+0x88/0x230 [ 14.558904] alloc_pages+0x4c/0x7c [ 14.559157] load_module+0x74/0x1af4 [ 14.559361] __do_sys_init_module+0x190/0x1fc [ 14.559615] __arm64_sys_init_module+0x1c/0x28 [ 14.559883] invoke_syscall+0x44/0x108 [ 14.560109] el0_svc_common.constprop.0+0x40/0xe0 [ 14.560371] do_el0_svc_compat+0x1c/0x34 [ 14.560600] el0_svc_compat+0x2c/0x80 [ 14.560820] el0t_32_sync_handler+0x90/0x140 [ 14.561040] el0t_32_sync+0x194/0x198 [ 14.561329] page_owner free stack trace missing [ 14.562049] ------------[ cut here ]------------ [ 14.562314] kernel BUG at mm/vmalloc.c:113! Link: https://lkml.kernel.org/r/20240424111838.3782931-2-hariom1.p@samsung.com Signed-off-by: Hariom Panthi <hariom1.p@samsung.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Maninder Singh <maninder1.s@samsung.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Rohit Thapliyal <r.thapliyal@samsung.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:51 -07:00
David Hildenbrand	1bafe96e89	mm/khugepaged: replace page_mapcount() check by folio_likely_mapped_shared() We want to limit the use of page_mapcount() to places where absolutely required, to prepare for kernel configs where we won't keep track of per-page mapcounts in large folios. khugepaged is one of the remaining "more challenging" page_mapcount() users, but we might be able to move away from page_mapcount() without resulting in a significant behavior change that would warrant special-casing based on kernel configs. In 2020, we first added support to khugepaged for collapsing COW-shared pages via commit `9445689f3b` ("khugepaged: allow to collapse a page shared across fork"), followed by support for collapsing PTE-mapped THP in commit `5503fbf2b0` ("khugepaged: allow to collapse PTE-mapped compound pages") and limiting the memory waste via the "page_count() > 1" check in commit `71a2c112a0` ("khugepaged: introduce 'max_ptes_shared' tunable"). As a default, khugepaged will allow up to half of the PTEs to map shared pages: where page_mapcount() > 1. MADV_COLLAPSE ignores the khugepaged setting. khugepaged does currently not care about swapcache page references, and does not check under folio lock: so in some corner cases the "shared vs. exclusive" detection might be a bit off, making us detect "exclusive" when it's actually "shared". Most of our anonymous folios in the system are usually exclusive. We frequently see sharing of anonymous folios for a short period of time, after which our short-lived suprocesses either quit or exec(). There are some famous examples, though, where child processes exist for a long time, and where memory is COW-shared with a lot of processes (webservers, webbrowsers, sshd, ...) and COW-sharing is crucial for reducing the memory footprint. We don't want to suddenly change the behavior to result in a significant increase in memory waste. Interestingly, khugepaged will only collapse an anonymous THP if at least one PTE is writable. After fork(), that means that something (usually a page fault) populated at least a single exclusive anonymous THP in that PMD range. So ... what happens when we switch to "is this folio mapped shared" instead of "is this page mapped shared" by using folio_likely_mapped_shared()? For "not-COW-shared" folios, small folios and for THPs (large folios) that are completely mapped into at least one process, switching to folio_likely_mapped_shared() will not result in a change. We'll only see a change for COW-shared PTE-mapped THPs that are partially mapped into all involved processes. There are two cases to consider: (A) folio_likely_mapped_shared() returns "false" for a PTE-mapped THP If the folio is detected as exclusive, and it actually is exclusive, there is no change: page_mapcount() == 1. This is the common case without fork() or with short-lived child processes. folio_likely_mapped_shared() might currently still detect a folio as exclusive although it is shared (false negatives): if the first page is not mapped multiple times and if the average per-page mapcount is smaller than 1, implying that (1) the folio is partially mapped and (2) if we are responsible for many mapcounts by mapping many pages others can't ("mostly exclusive") (3) if we are not responsible for many mapcounts by mapping little pages ("mostly shared") it won't make a big impact on the end result. So while we might now detect a page as "exclusive" although it isn't, it's not expected to make a big difference in common cases. (B) folio_likely_mapped_shared() returns "true" for a PTE-mapped THP folio_likely_mapped_shared() will never detect a large anonymous folio as shared although it is exclusive: there are no false positives. If we detect a THP as shared, at least one page of the THP is mapped by another process. It could well be that some pages are actually exclusive. For example, our child processes could have unmapped/COW'ed some pages such that they would now be exclusive to out process, which we now would treat as still-shared. Examples: (1) Parent maps all pages of a THP, child maps some pages. We detect all pages in the parent as shared although some are actually exclusive. (2) Parent maps all but some page of a THP, child maps the remainder. We detect all pages of the THP that the parent maps as shared although they are all exclusive. In (1) we wouldn't collapse a THP right now already: no PTE is writable, because a write fault would have resulted in COW of a single page and the parent would no longer map all pages of that THP. For (2) we would have collapsed a THP in the parent so far, now we wouldn't as long as the child process is still alive: unless the child process unmaps the remaining THP pages or we decide to split that THP. Possibly, the child COW'ed many pages, meaning that it's likely that we can populate a THP for our child first, and then for our parent. For (2), we are making really bad use of the THP in the first place (not even mapped completely in at least one process). If the THP would be completely partially mapped, it would be on the deferred split queue where we would split it lazily later. For short-running child processes, we don't particularly care. For long-running processes, the expectation is that such scenarios are rather rare: further, a THP might be best placed if most data in the PMD range is actually written, implying that we'll have to COW more pages first before khugepaged would collapse it. To summarize, in the common case, this change is not expected to matter much. The more common application of khugepaged operates on exclusive pages, either before fork() or after a child quit. Can we improve (A)? Yes, if we implement more precise tracking of "mapped shared" vs. "mapped exclusively", we could get rid of the false negatives completely. Can we improve (B)? We could count how many pages of a large folio we map inside the current page table and detect that we are responsible for most of the folio mapcount and conclude "as good as exclusive", which might help in some cases. ... but likely, some other mechanism should detect that the THP is not a good use in the scenario (not even mapped completely in a single process) and try splitting that folio lazily etc. We'll move the folio_test_anon() check before our "shared" check, so we might get more expressive results for SCAN_EXCEED_SHARED_PTE: this order of checks now matches the one in __collapse_huge_page_isolate(). Extend documentation. Link: https://lkml.kernel.org/r/20240424122630.495788-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:50 -07:00
Breno Leitao	78ec6f9df6	memcg: fix data-race KCSAN bug in rstats A data-race issue in memcg rstat occurs when two distinct code paths access the same 4-byte region concurrently. KCSAN detection triggers the following BUG as a result. BUG: KCSAN: data-race in __count_memcg_events / mem_cgroup_css_rstat_flush write to 0xffffe8ffff98e300 of 4 bytes by task 5274 on cpu 17: mem_cgroup_css_rstat_flush (mm/memcontrol.c:5850) cgroup_rstat_flush_locked (kernel/cgroup/rstat.c:243 (discriminator 7)) cgroup_rstat_flush (./include/linux/spinlock.h:401 kernel/cgroup/rstat.c:278) mem_cgroup_flush_stats.part.0 (mm/memcontrol.c:767) memory_numa_stat_show (mm/memcontrol.c:6911) <snip> read to 0xffffe8ffff98e300 of 4 bytes by task 410848 on cpu 27: __count_memcg_events (mm/memcontrol.c:725 mm/memcontrol.c:962) count_memcg_event_mm.part.0 (./include/linux/memcontrol.h:1097 ./include/linux/memcontrol.h:1120) handle_mm_fault (mm/memory.c:5483 mm/memory.c:5622) <snip> value changed: 0x00000029 -> 0x00000000 The race occurs because two code paths access the same "stats_updates" location. Although "stats_updates" is a per-CPU variable, it is remotely accessed by another CPU at cgroup_rstat_flush_locked()->mem_cgroup_css_rstat_flush(), leading to the data race mentioned. Considering that memcg_rstat_updated() is in the hot code path, adding a lock to protect it may not be desirable, especially since this variable pertains solely to statistics. Therefore, annotating accesses to stats_updates with READ/WRITE_ONCE() can prevent KCSAN splats and potential partial reads/writes. Link: https://lkml.kernel.org/r/20240424125940.2410718-1-leitao@debian.org Fixes: `9cee7e8ef3` ("mm: memcg: optimize parent iteration in memcg_rstat_updated()") Signed-off-by: Breno Leitao <leitao@debian.org> Suggested-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:50 -07:00
Matthew Wilcox (Oracle)	21db296aaf	mm: add kernel-doc for folio_mark_accessed() Convert the existing documentation to kernel-doc and remove references to pages. Link: https://lkml.kernel.org/r/20240424191914.361554-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:50 -07:00
Matthew Wilcox (Oracle)	9cbe4954c6	gup: use folios for gup_devmap Use try_grab_folio() instead of try_grab_page() so we get the folio back that we calculated, and then use folio_set_referenced() instead of SetPageReferenced(). Correspondingly, use gup_put_folio() to put any unneeded references. Link: https://lkml.kernel.org/r/20240424191914.361554-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:49 -07:00
Matthew Wilcox (Oracle)	53e45c4f6d	mm: convert put_devmap_managed_page_refs() to put_devmap_managed_folio_refs() All callers have a folio so we can remove this use of page_ref_sub_return(). Link: https://lkml.kernel.org/r/20240424191914.361554-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:49 -07:00
Matthew Wilcox (Oracle)	a568b4126b	userfault; expand folio use in mfill_atomic_install_pte() Call page_folio() a little earlier so we can use folio_mapping() instead of page_mapping(), saving a call to compound_head(). Link: https://lkml.kernel.org/r/20240423225552.4113447-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:48 -07:00
Matthew Wilcox (Oracle)	e18a9faf06	migrate: expand the use of folio in __migrate_device_pages() Removes a few calls to compound_head() and a call to page_mapping(). Link: https://lkml.kernel.org/r/20240423225552.4113447-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:48 -07:00
Matthew Wilcox (Oracle)	89f5c54b22	memory-failure: remove calls to page_mapping() This is mostly just inlining page_mapping() into the two callers. Link: https://lkml.kernel.org/r/20240423225552.4113447-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Eric Biggers <ebiggers@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:48 -07:00
Matthew Wilcox (Oracle)	b650e1d2ae	mm/memory-failure: pass the folio to collect_procs_ksm() We've already calculated it, so pass it in instead of recalculating it in collect_procs_ksm(). Link: https://lkml.kernel.org/r/20240412193510.2356957-12-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:47 -07:00
Matthew Wilcox (Oracle)	0edb5b282a	mm/memory-failure: use folio functions throughout collect_procs() Saves a couple of calls to compound_head(). Link: https://lkml.kernel.org/r/20240412193510.2356957-11-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jane Chu <jane.chu@oracle.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:47 -07:00
Matthew Wilcox (Oracle)	ee299e9849	mm/memory-failure: add some folio conversions to unpoison_memory Some of these folio APIs didn't exist when the unpoison_memory() conversion was done originally. Link: https://lkml.kernel.org/r/20240412193510.2356957-10-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:46 -07:00
Matthew Wilcox (Oracle)	03468a0f52	mm/memory-failure: convert hwpoison_user_mappings to take a folio Pass the folio from the callers, and use it throughout instead of hpage. Saves dozens of calls to compound_head(). Link: https://lkml.kernel.org/r/20240412193510.2356957-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:46 -07:00
Matthew Wilcox (Oracle)	5dba5c356a	mm/memory-failure: convert memory_failure() to use a folio Saves dozens of calls to compound_head(). Link: https://lkml.kernel.org/r/20240412193510.2356957-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:46 -07:00
Matthew Wilcox (Oracle)	6e8cda4c2c	mm: convert hugetlb_page_mapping_lock_write to folio The page is only used to get the mapping, so the folio will do just as well. Both callers already have a folio available, so this saves a call to compound_head(). Link: https://lkml.kernel.org/r/20240412193510.2356957-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:46 -07:00
Matthew Wilcox (Oracle)	fed5348ee2	mm/memory-failure: convert shake_page() to shake_folio() Removes two calls to compound_head(). Move the prototype to internal.h; we definitely don't want code outside mm using it. Link: https://lkml.kernel.org/r/20240412193510.2356957-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jane Chu <jane.chu@oracle.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:45 -07:00
Matthew Wilcox (Oracle)	b87f978dc7	mm: make page_mapped_in_vma conditional on CONFIG_MEMORY_FAILURE This function is only currently used by the memory-failure code, so we can omit it if we're not compiling in the memory-failure code. Link: https://lkml.kernel.org/r/20240412193510.2356957-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Suggested-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:45 -07:00
Matthew Wilcox (Oracle)	37bc2ff506	mm: return the address from page_mapped_in_vma() The only user of this function calls page_address_in_vma() immediately after page_mapped_in_vma() calculates it and uses it to return true/false. Return the address instead, allowing memory-failure to skip the call to page_address_in_vma(). Link: https://lkml.kernel.org/r/20240412193510.2356957-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:45 -07:00
Matthew Wilcox (Oracle)	f2b37197c2	mm/memory-failure: pass addr to __add_to_kill() Handle anon/file folios the same way as KSM & DAX folios by passing in the address. Link: https://lkml.kernel.org/r/20240412193510.2356957-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:45 -07:00
Matthew Wilcox (Oracle)	1c0501e831	mm/memory-failure: remove fsdax_pgoff argument from __add_to_kill Patch series "Some cleanups for memory-failure", v3. A lot of folio conversions, plus some other simplifications. This patch (of 11): Unify the KSM and DAX codepaths by calculating the addr in add_to_kill_fsdax() instead of telling __add_to_kill() to calculate it. Link: https://lkml.kernel.org/r/20240412193510.2356957-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240412193510.2356957-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:44 -07:00
Shakeel Butt	91882c1617	memcg: simple cleanup of stats update functions mod_memcg_lruvec_state() is never called from outside of memcontrol.c and with always irq disabled. So, replace it with the irq disabled version and add an assert that irq is disabled in the caller. Similarly mod_objcg_state() is not called from outside of memcontrol.c, so simply make it static and change it's name to __mod_objcg_state(). Link: https://lkml.kernel.org/r/20240420232505.2768428-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: T.J. Mercier <tjmercier@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:44 -07:00
Kefeng Wang	6ed31ba392	mm: memory: check userfaultfd_wp() in vmf_orig_pte_uffd_wp() Add userfaultfd_wp() check in vmf_orig_pte_uffd_wp() to avoid the unnecessary FAULT_FLAG_ORIG_PTE_VALID check/pte_marker_entry_uffd_wp() in most pagefault, note, the function vmf_orig_pte_uffd_wp() is not inlined in the two kernel versions, the difference is shown below, perf date, perf report -i perf.data.before \| grep vmf 0.17% 0.13% lat_pagefault [kernel.kallsyms] [k] vmf_orig_pte_uffd_wp.part.0.isra.0 perf report -i perf.data.after \| grep vmf lat_pagefault -W 5 -N 5 /tmp/XXX latency before after diff average(8 tests) 0.262675 0.2600375 -0.0026375 Although it's a small, but the uffd_wp is a new feature than previous kernel, when the vma is not registered with UFFD_WP, let's avoid to execute the new logical, also adding __always_inline attribute to vmf_orig_pte_uffd_wp(), which make set_pte_range() only check VM_UFFD_WP flags without the function call. In addition, directly call the vmf_orig_pte_uffd_wp() in do_anonymous_page() and set_pte_range() to save an uffd_wp variable. Link: https://lkml.kernel.org/r/20240422030039.3293568-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:43 -07:00
Lance Yang	dce7d10be4	mm/madvise: optimize lazyfreeing with mTHP in madvise_free This patch optimizes lazyfreeing with PTE-mapped mTHP[1] (Inspired by David Hildenbrand[2]). We aim to avoid unnecessary folio splitting if the large folio is fully mapped within the target range. If a large folio is locked or shared, or if we fail to split it, we just leave it in place and advance to the next PTE in the range. But note that the behavior is changed; previously, any failure of this sort would cause the entire operation to give up. As large folios become more common, sticking to the old way could result in wasted opportunities. On an Intel I5 CPU, lazyfreeing a 1GiB VMA backed by PTE-mapped folios of the same size results in the following runtimes for madvise(MADV_FREE) in seconds (shorter is better): Folio Size \| Old \| New \| Change ------------------------------------------ 4KiB \| 0.590251 \| 0.590259 \| 0% 16KiB \| 2.990447 \| 0.185655 \| -94% 32KiB \| 2.547831 \| 0.104870 \| -95% 64KiB \| 2.457796 \| 0.052812 \| -97% 128KiB \| 2.281034 \| 0.032777 \| -99% 256KiB \| 2.230387 \| 0.017496 \| -99% 512KiB \| 2.189106 \| 0.010781 \| -99% 1024KiB \| 2.183949 \| 0.007753 \| -99% 2048KiB \| 0.002799 \| 0.002804 \| 0% [1] https://lkml.kernel.org/r/20231207161211.2374093-5-ryan.roberts@arm.com [2] https://lore.kernel.org/linux-mm/20240214204435.167852-1-david@redhat.com Link: https://lkml.kernel.org/r/20240418134435.6092-5-ioworker0@gmail.com Signed-off-by: Lance Yang <ioworker0@gmail.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Jeff Xie <xiehuan09@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:43 -07:00
Lance Yang	96ebdb0320	mm/memory: add any_dirty optional pointer to folio_pte_batch() This commit adds the any_dirty pointer as an optional parameter to folio_pte_batch() function. By using both the any_young and any_dirty pointers, madvise_free can make smarter decisions about whether to clear the PTEs when marking large folios as lazyfree. Link: https://lkml.kernel.org/r/20240418134435.6092-4-ioworker0@gmail.com Signed-off-by: Lance Yang <ioworker0@gmail.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Jeff Xie <xiehuan09@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:43 -07:00
Lance Yang	1b68112c40	mm/madvise: introduce clear_young_dirty_ptes() batch helper Patch series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free", v10. This patchset adds support for lazyfreeing multi-size THP (mTHP) without needing to first split the large folio via split_folio(). However, we still need to split a large folio that is not fully mapped within the target range. If a large folio is locked or shared, or if we fail to split it, we just leave it in place and advance to the next PTE in the range. But note that the behavior is changed; previously, any failure of this sort would cause the entire operation to give up. As large folios become more common, sticking to the old way could result in wasted opportunities. Performance Testing =================== On an Intel I5 CPU, lazyfreeing a 1GiB VMA backed by PTE-mapped folios of the same size results in the following runtimes for madvise(MADV_FREE) in seconds (shorter is better): Folio Size \| Old \| New \| Change ------------------------------------------ 4KiB \| 0.590251 \| 0.590259 \| 0% 16KiB \| 2.990447 \| 0.185655 \| -94% 32KiB \| 2.547831 \| 0.104870 \| -95% 64KiB \| 2.457796 \| 0.052812 \| -97% 128KiB \| 2.281034 \| 0.032777 \| -99% 256KiB \| 2.230387 \| 0.017496 \| -99% 512KiB \| 2.189106 \| 0.010781 \| -99% 1024KiB \| 2.183949 \| 0.007753 \| -99% 2048KiB \| 0.002799 \| 0.002804 \| 0% This patch (of 4): This commit introduces clear_young_dirty_ptes() to replace mkold_ptes(). By doing so, we can use the same function for both use cases (madvise_pageout and madvise_free), and it also provides the flexibility to only clear the dirty flag in the future if needed. Link: https://lkml.kernel.org/r/20240418134435.6092-1-ioworker0@gmail.com Link: https://lkml.kernel.org/r/20240418134435.6092-2-ioworker0@gmail.com Signed-off-by: Lance Yang <ioworker0@gmail.com> Suggested-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Jeff Xie <xiehuan09@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:42 -07:00
Kefeng Wang	80e7502148	mm: swapfile: check usable swap device in __folio_throttle_swaprate() Skip blk_cgroup_congested() if there is no usable swap device since no swapin/out will occur, Thereby avoid taking swap_lock. The difference is shown below from perf date of CoW pagefault, perf report -g -i perf.data.swapon \| egrep "blk_cgroup_congested\|__folio_throttle_swaprate" 1.01% 0.16% page_fault2_pro [kernel.kallsyms] [k] __folio_throttle_swaprate 0.83% 0.80% page_fault2_pro [kernel.kallsyms] [k] blk_cgroup_congested perf report -g -i perf.data.swapoff \| egrep "blk_cgroup_congested\|__folio_throttle_swaprate" 0.15% 0.15% page_fault2_pro [kernel.kallsyms] [k] __folio_throttle_swaprate Link: https://lkml.kernel.org/r/20240418135644.2736748-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:42 -07:00
David Hildenbrand	d21f996b02	mm/huge_memory: improve split_huge_page_to_list_to_order() return value documentation The documentation is wrong and relying on it almost resulted in BUGs in new callers: ever since `fd4a7ac329` ("mm: migrate: try again if THP split is failed due to page refcnt") we return -EAGAIN on unexpected folio references, not -EBUSY. Let's fix that and also document which other return values we can currently see and why they could happen. [david@redhat.com: v2] Link: https://lkml.kernel.org/r/20240422194217.442933-1-david@redhat.com Link: https://lkml.kernel.org/r/20240418151834.216557-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:42 -07:00
Peter Xu	8430557fc5	mm/page_table_check: support userfault wr-protect entries Allow page_table_check hooks to check over userfaultfd wr-protect criteria upon pgtable updates. The rule is no co-existance allowed for any writable flag against userfault wr-protect flag. This should be better than `c2da319c2e`, where we used to only sanitize such issues during a pgtable walk, but when hitting such issue we don't have a good chance to know where does that writable bit came from [1], so that even the pgtable walk exposes a kernel bug (which is still helpful on triaging) but not easy to track and debug. Now we switch to track the source. It's much easier too with the recent introduction of page table check. There are some limitations with using the page table check here for userfaultfd wr-protect purpose: - It is only enabled with explicit enablement of page table check configs and/or boot parameters, but should be good enough to track at least syzbot issues, as syzbot should enable PAGE_TABLE_CHECK[_ENFORCED] for x86 [1]. We used to have DEBUG_VM but it's now off for most distros, while distros also normally not enable PAGE_TABLE_CHECK[_ENFORCED], which is similar. - It conditionally works with the ptep_modify_prot API. It will be bypassed when e.g. XEN PV is enabled, however still work for most of the rest scenarios, which should be the common cases so should be good enough. - Hugetlb check is a bit hairy, as the page table check cannot identify hugetlb pte or normal pte via trapping at set_pte_at(), because of the current design where hugetlb maps every layers to pte_t... For example, the default set_huge_pte_at() can invoke set_pte_at() directly and lose the hugetlb context, treating it the same as a normal pte_t. So far it's fine because we have huge_pte_uffd_wp() always equals to pte_uffd_wp() as long as supported (x86 only). It'll be a bigger problem when we'll define _PAGE_UFFD_WP differently at various pgtable levels, because then one huge_pte_uffd_wp() per-arch will stop making sense first.. as of now we can leave this for later too. This patch also removes commit `c2da319c2e` altogether, as we have something better now. [1] https://lore.kernel.org/all/000000000000dce0530615c89210@google.com/ Link: https://lkml.kernel.org/r/20240417212549.2766883-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:41 -07:00
Peter Xu	3ccae1dc84	mm/hugetlb: assert hugetlb_lock in __hugetlb_cgroup_commit_charge This is similar to __hugetlb_cgroup_uncharge_folio() where it relies on holding hugetlb_lock. Add the similar assertion like the other one, since it looks like such things may help some day. Link: https://lkml.kernel.org/r/20240417211836.2742593-4-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:41 -07:00
Wei Yang	122ff80e12	mm/sparse: guard the size of mem_section is power of 2 We usually have this check, while commit `2a3cb8baef` ("mm/sparse: delete old sparse_init and enable new one") missed to take it. Link: https://lkml.kernel.org/r/20240416012559.4536-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:40 -07:00
Matthew Wilcox (Oracle)	3d84d89792	doc: improve the description of __folio_mark_dirty Patch series "Improve buffer head documentation", v3. Turn buffer head documentation into its own document, and make many general improvements to the docs. Obviously there is much more that could be done. Tested with make htmldocs. This patch (of 8): I've learned why it's safe to call __folio_mark_dirty() from mark_buffer_dirty() without holding the folio lock, so update the description to explain why. Link: https://lkml.kernel.org/r/20240416031754.4076917-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240416031754.4076917-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:39 -07:00
David Hildenbrand	2aa339120c	mm/ksm: remove page_mapcount() usage in stable_tree_search() We want to limit the use of page_mapcount() to the places where it is absolutely necessary. If our folio has a stable node, it is a (small) KSM folio -- see folio_stable_node(). Let's use folio_mapcount() in stable_tree_search() instead, which results in no functional change. The mapcount > 1 check is a bit confusing, because that's usually a check for page sharing. Looks like the reason is that we are guaranteed to not exceed ksm_max_page_sharing for the tree KSM folio when merging with that. Let's update the documentation to make that clearer. Link: https://lkml.kernel.org/r/20240416172533.663418-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Alex Shi <alexs@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:38 -07:00
Yosry Ahmed	c074e1467f	mm: zswap: remove same_filled module params These knobs offer more fine-grained control to userspace than needed and directly expose/influence kernel implementation; remove them. For disabling same_filled handling, there is no logical reason to refuse storing same-filled pages more efficiently and opt for compression. Scanning pages for patterns may be an argument, but the page contents will be read into the CPU cache anyway during compression. Also, removing the same_filled handling code does not move the needle significantly in terms of performance anyway [1]. For disabling non_same_filled handling, it was added when the compressed pages in zswap were not being properly charged to memcgs, as workloads could escape the accounting with compression [2]. This is no longer the case after commit `f4840ccfca` ("zswap: memcg accounting"), and using zswap without compression does not make much sense. [1]https://lore.kernel.org/lkml/CAJD7tkaySFP2hBQw4pnZHJJwe3bMdjJ1t9VC2VJd=khn1_TXvA@mail.gmail.com/ [2]https://lore.kernel.org/lkml/19d5cdee-2868-41bd-83d5-6da75d72e940@maciej.szmigiero.name/ [yosryahmed@google.com: remove same_filled_pages from docs] Link: https://lkml.kernel.org/r/ZhxFVggdyvCo79jc@google.com Link: https://lkml.kernel.org/r/20240413022407.785696-5-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Cc: "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:38 -07:00
Yosry Ahmed	e87b881489	mm: zswap: move more same-filled pages checks outside of zswap_store() Currently, zswap_store() checks zswap_same_filled_pages_enabled, kmaps the folio, then calls zswap_is_page_same_filled() to check the folio contents. Move this logic into zswap_is_page_same_filled() as well (and rename it to use 'folio' while we are at it). This makes zswap_store() cleaner, and makes following changes to that logic contained within the helper. While we are at it: - Rename the insert_entry label to store_entry to match xa_store(). - Add comment headers for same-filled functions and the main API functions (load, store, invalidate, swapon, swapoff). No functional change intended. Link: https://lkml.kernel.org/r/20240413022407.785696-4-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:38 -07:00
Yosry Ahmed	82e0f8e47b	mm: zswap: refactor limit checking from zswap_store() Refactor limit and acceptance threshold checking outside of zswap_store(). This code will be moved around in a following patch, so it would be cleaner to move a function call around. Link: https://lkml.kernel.org/r/20240413022407.785696-3-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:37 -07:00
Yosry Ahmed	4ea3fa9dd2	mm: zswap: always shrink in zswap_store() if zswap_pool_reached_full Patch series "zswap same-filled and limit checking cleanups", v3. Miscellaneous cleanups for limit checking and same-filled handling in the store path. This series was broken out of the "zswap: store zero-filled pages more efficiently" series [1]. It contains the cleanups and drops the main functional changes. [1]https://lore.kernel.org/lkml/20240325235018.2028408-1-yosryahmed@google.com/ This patch (of 4): The cleanup code in zswap_store() is not pretty, particularly the 'shrink' label at the bottom that ends up jumping between cleanup labels. Instead of having a dedicated label to shrink the pool, just use zswap_pool_reached_full directly to figure out if the pool needs shrinking. zswap_pool_reached_full should be true if and only if the pool needs shrinking. The only caveat is that the value of zswap_pool_reached_full may be changed by concurrent zswap_store() calls between checking the limit and testing zswap_pool_reached_full in the cleanup code. This is fine because: - If zswap_pool_reached_full was true during limit checking then became false during the cleanup code, then someone else already took care of shrinking the pool and there is no need to queue the worker. That would be a good change. - If zswap_pool_reached_full was false during limit checking then became true during the cleanup code, then someone else hit the limit meanwhile. In this case, both threads will try to queue the worker, but it never gets queued more than once anyway. Also, calling queue_work() multiple times when the limit is hit could already happen today, so this isn't a significant change in any way. Link: https://lkml.kernel.org/r/20240413022407.785696-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20240413022407.785696-2-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:37 -07:00
Suren Baghdasaryan	b5ba3a6427	userfaultfd: remove WRITE_ONCE when setting folio->index during UFFDIO_MOVE When folio is moved with UFFDIO_MOVE it gets locked before the rmap and index are modified. Due to the folio lock being already held, WRITE_ONCE() is not needed when setting the folio index. Remove it. Link: https://lkml.kernel.org/r/20240415020821.1152951-1-surenb@google.com Reported-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:37 -07:00
Baolin Wang	231f8c7127	mm: page_alloc: allowing mTHP compaction to capture the freed page directly Currently, compaction_capture() does not allow lower-order allocations to directly capture the movable free pages, even though lower-order allocations might also be requesting movable pages, that can lead to more compaction scanning. And, with the enablement of mTHP, such situations will become more common. Thus allowing lower-order (mTHP) allocations of movable page types directly capture the movable free pages can avoid unnecessary compaction scanning, meanwhile that won't pollute the movable pageblock. With testing 1M mTHP compaction, it can be seen that compaction scanning is significantly reduced. mm-unstable patched Ops Compaction pages isolated 116598741.00 120946702.00 Ops Compaction migrate scanned 1764870054.00 1488621550.00 Ops Compaction free scanned 7707879039.00 4986299318.00 Ops Compact scan efficiency 22.90 29.85 Ops Compaction cost 73797.69 72933.48 Link: https://lkml.kernel.org/r/8118a5d66a034736a48433beddaca60ed78577c4.1712892329.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:37 -07:00
Kefeng Wang	ceca44991f	mm: filemap: batch mm counter updating in filemap_map_pages() Like copy_pte_range()/zap_pte_range(), make mm counter batch updating in filemap_map_pages(), since folios type are same(MM_SHMEMPAGES or MM_FILEPAGES) in filemap_map_pages(), only check the first folio type is enough, the 'lat_pagefault -P 1 file' test from lmbench shows 12% improvement, and the percpu_counter_add_batch() is gone from perf flame graph. Link: https://lkml.kernel.org/r/20240412064751.119015-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:36 -07:00
Kefeng Wang	1f2d8b4421	mm: move mm counter updating out of set_pte_range() Patch series "mm: batch mm counter updating in filemap_map_pages()", v3. Let's batch mm counter updating to accelerate filemap_map_pages(). This patch (of 2): In order to support batch mm counter updating in filemap_map_pages(), move mm counter updating out of set_pte_range(), the folios are file from filemap, and distinguish folios by vmf->flags and vma->vm_flags from another caller finish_fault(). Link: https://lkml.kernel.org/r/20240412064751.119015-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20240412064751.119015-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:36 -07:00
Barry Song	d0f048ac39	mm: add per-order mTHP anon_swpout and anon_swpout_fallback counters This helps to display the fragmentation situation of the swapfile, knowing the proportion of how much we haven't split large folios. So far, we only support non-split swapout for anon memory, with the possibility of expanding to shmem in the future. So, we add the "anon" prefix to the counter names. Link: https://lkml.kernel.org/r/20240412114858.407208-3-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Kairui Song <kasong@tencent.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:35 -07:00
Barry Song	ec33687c67	mm: add per-order mTHP anon_fault_alloc and anon_fault_fallback counters Patch series "mm: add per-order mTHP alloc and swpout counters", v6. The patchset introduces a framework to facilitate mTHP counters, starting with the allocation and swap-out counters. Currently, only four new nodes are appended to the stats directory for each mTHP size. /sys/kernel/mm/transparent_hugepage/hugepages-<size>/stats anon_fault_alloc anon_fault_fallback anon_fault_fallback_charge anon_swpout anon_swpout_fallback These nodes are crucial for us to monitor the fragmentation levels of both the buddy system and the swap partitions. In the future, we may consider adding additional nodes for further insights. This patch (of 4): Profiling a system blindly with mTHP has become challenging due to the lack of visibility into its operations. Presenting the success rate of mTHP allocations appears to be pressing need. Recently, I've been experiencing significant difficulty debugging performance improvements and regressions without these figures. It's crucial for us to understand the true effectiveness of mTHP in real-world scenarios, especially in systems with fragmented memory. This patch establishes the framework for per-order mTHP counters. It begins by introducing the anon_fault_alloc and anon_fault_fallback counters. Additionally, to maintain consistency with thp_fault_fallback_charge in /proc/vmstat, this patch also tracks anon_fault_fallback_charge when mem_cgroup_charge fails for mTHP. Incorporating additional counters should now be straightforward as well. Link: https://lkml.kernel.org/r/20240412114858.407208-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240412114858.407208-2-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Kairui Song <kasong@tencent.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:35 -07:00
Sidhartha Kumar	d199483c2b	mm/hugetlb: rename dissolve_free_huge_pages() to dissolve_free_hugetlb_folios() dissolve_free_huge_pages() only uses folios internally, rename it to dissolve_free_hugetlb_folios() and change the comments which reference it. [akpm@linux-foundation.org: remove unneeded `extern'] Link: https://lkml.kernel.org/r/20240412182139.120871-2-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:35 -07:00
Sidhartha Kumar	54fa49b2e0	mm/hugetlb: convert dissolve_free_huge_pages() to folios Allows us to rename dissolve_free_huge_pages() to dissolve_free_hugetlb_folio(). Convert one caller to pass in a folio directly and use page_folio() to convert the caller in mm/memory-failure. [sidhartha.kumar@oracle.com: remove unneeded `extern'] Link: https://lkml.kernel.org/r/71760ed4-e80d-493a-95ea-2545414b1aba@oracle.com [sidhartha.kumar@oracle.com: v2] Link: https://lkml.kernel.org/r/20240412182139.120871-1-sidhartha.kumar@oracle.com Link: https://lkml.kernel.org/r/20240411164756.261178-1-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:35 -07:00
Alex Shi (tencent)	452e862f43	mm/ksm: replace set_page_stable_node by folio_set_stable_node Only single page could be reached where we set stable node after write protect, so use folio converted func to replace page's. And remove the unused func set_page_stable_node(). Link: https://lkml.kernel.org/r/20240411061713.1847574-11-alexs@kernel.org Signed-off-by: Alex Shi (tencent) <alexs@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:34 -07:00
David Hildenbrand	85b67b0104	mm/ksm: rename get_ksm_page_flags to ksm_get_folio_flags As we are removing get_ksm_page_flags(), make the flags match the new function name. Link: https://lkml.kernel.org/r/20240411061713.1847574-10-alexs@kernel.org Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Alex Shi <alexs@kernel.org> Reviewed-by: Alex Shi <alexs@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Hugh Dickins <hughd@google.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:34 -07:00
Alex Shi (tencent)	79899cce33	mm/ksm: convert chain series funcs and replace get_ksm_page In ksm stable tree all page are single, let's convert them to use and folios as well as stable_tree_insert/stable_tree_search funcs. And replace get_ksm_page() by ksm_get_folio() since there is no more needs. It could save a few compound_head calls. Link: https://lkml.kernel.org/r/20240411061713.1847574-9-alexs@kernel.org Signed-off-by: Alex Shi (tencent) <alexs@kernel.org> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Wright <chrisw@sous-sol.org> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:34 -07:00
Alex Shi (tencent)	40d707f33d	mm/ksm: use folio in write_protect_page Compound page is checked and skipped before write_protect_page() called, use folio to save a few compound_head checks. Link: https://lkml.kernel.org/r/20240411061713.1847574-8-alexs@kernel.org Signed-off-by: Alex Shi (tencent) <alexs@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:34 -07:00
Alex Shi (tencent)	72556a4c06	mm/ksm: use ksm_get_folio in scan_get_next_rmap_item Save a compound_head call. Link: https://lkml.kernel.org/r/20240411061713.1847574-7-alexs@kernel.org Signed-off-by: Alex Shi (tencent) <alexs@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:33 -07:00
Alex Shi (tencent)	6f528de298	mm/ksm: use folio in stable_node_dup Use ksm_get_folio() and save 2 compound_head calls. Link: https://lkml.kernel.org/r/20240411061713.1847574-6-alexs@kernel.org Signed-off-by: Alex Shi (tencent) <alexs@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:33 -07:00
Alex Shi (tencent)	9d5cc14093	mm/ksm: use folio in remove_stable_node Pages in stable tree are all single normal page, so uses ksm_get_folio() and folio_set_stable_node(), also saves 3 calls to compound_head(). Link: https://lkml.kernel.org/r/20240411061713.1847574-5-alexs@kernel.org Signed-off-by: Alex Shi (tencent) <alexs@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:33 -07:00
Alex Shi (tencent)	b8b0ff244d	mm/ksm: add folio_set_stable_node Turn set_page_stable_node() into a wrapper folio_set_stable_node, and then use it to replace the former. we will merge them together after all place converted to folio. Link: https://lkml.kernel.org/r/20240411061713.1847574-4-alexs@kernel.org Signed-off-by: Alex Shi (tencent) <alexs@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:33 -07:00
Alex Shi (tencent)	f39b6e2dc1	mm/ksm: use folio in remove_rmap_item_from_tree To save 2 compound_head calls. Link: https://lkml.kernel.org/r/20240411061713.1847574-3-alexs@kernel.org Signed-off-by: Alex Shi (tencent) <alexs@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:32 -07:00
Alex Shi (tencent)	b91f94729d	mm/ksm: add ksm_get_folio Patch series "transfer page to folio in KSM". This is the first part of page to folio transfer on KSM. Since only single page could be stored in KSM, we could safely transfer stable tree pages to folios. This patchset could reduce ksm.o 57kbytes from 2541776 bytes on latest akpm/mm-stable branch with CONFIG_DEBUG_VM enabled. It pass the KSM testing in LTP and kernel selftest. Thanks for Matthew Wilcox and David Hildenbrand's suggestions and comments! This patch (of 10): The ksm only contains single pages, so we could add a new func ksm_get_folio for get_ksm_page to use folio instead of pages to save a couple of compound_head calls. After all caller replaced, get_ksm_page will be removed. Link: https://lkml.kernel.org/r/20240411061713.1847574-1-alexs@kernel.org Link: https://lkml.kernel.org/r/20240411061713.1847574-2-alexs@kernel.org Signed-off-by: Alex Shi (tencent) <alexs@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:32 -07:00
David Hildenbrand	7441d34922	mm/debug: print only page mapcount (excluding folio entire mapcount) in __dump_folio() Let's simplify and only print the page mapcount: we already print the large folio mapcount and the entire folio mapcount for large folios separately; that should be sufficient to figure out what's happening. While at it, print the page mapcount also if it had an underflow, filtering out only typed pages. Link: https://lkml.kernel.org/r/20240409192301.907377-18-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:31 -07:00
David Hildenbrand	f2f8a7a006	mm/migrate_device: use folio_mapcount() in migrate_vma_check_page() We want to limit the use of page_mapcount() to the places where it is absolutely necessary. Let's convert migrate_vma_check_page() to work on a folio internally so we can remove the page_mapcount() usage. Note that we reject any large folios. There is a lot more folio conversion to be had, but that has to wait for another day. No functional change intended. Link: https://lkml.kernel.org/r/20240409192301.907377-15-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:31 -07:00
David Hildenbrand	f0376c7109	mm/filemap: use folio_mapcount() in filemap_unaccount_folio() We want to limit the use of page_mapcount() to the places where it is absolutely necessary. Let's use folio_mapcount() instead of filemap_unaccount_folio(). No functional change intended, because we're only dealing with small folios. Link: https://lkml.kernel.org/r/20240409192301.907377-14-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:30 -07:00
David Hildenbrand	31ce0d7ef8	mm/migrate: use folio_likely_mapped_shared() in add_page_for_migration() We want to limit the use of page_mapcount() to the places where it is absolutely necessary. In add_page_for_migration(), we actually want to check if the folio is mapped shared, to reject such folios. So let's use folio_likely_mapped_shared() instead. For small folios, fully mapped THP, and hugetlb folios, there is no change. For partially mapped, shared THP, we should now do a better job at rejecting such folios. Link: https://lkml.kernel.org/r/20240409192301.907377-12-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:30 -07:00
David Hildenbrand	7115936ac1	mm/page_alloc: use folio_mapped() in __alloc_contig_migrate_range() We want to limit the use of page_mapcount() to the places where it is absolutely necessary. For tracing purposes, we use page_mapcount() in __alloc_contig_migrate_range(). Adding that mapcount to total_mapped sounds strange: total_migrated and total_reclaimed would count each page only once, not multiple times. But then, isolate_migratepages_range() adds each folio only once to the list. So for large folios, we would query the mapcount of the first page of the folio, which doesn't make too much sense for large folios. Let's simply use folio_mapped() * folio_nr_pages(), which makes more sense as nr_migratepages is also incremented by the number of pages in the folio in case of successful migration. Link: https://lkml.kernel.org/r/20240409192301.907377-11-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:30 -07:00
David Hildenbrand	33d844bb84	mm/memory-failure: use folio_mapcount() in hwpoison_user_mappings() We want to limit the use of page_mapcount() to the places where it is absolutely necessary. We can only unmap full folios; page_mapped(), which we check here, is translated to folio_mapped() -- based on folio_mapcount(). So let's print the folio mapcount instead. Link: https://lkml.kernel.org/r/20240409192301.907377-10-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:29 -07:00
David Hildenbrand	0a7bda4801	mm/huge_memory: use folio_mapcount() in zap_huge_pmd() sanity check We want to limit the use of page_mapcount() to the places where it is absolutely necessary. Let's similarly check for folio_mapcount() underflows instead of page_mapcount() underflows like we do in zap_present_folio_ptes() now. Instead of the VM_BUG_ON(), we should actually be doing something like print_bad_pte(). For now, let's keep it simple and use WARN_ON_ONCE(), performing that check independently of DEBUG_VM. Link: https://lkml.kernel.org/r/20240409192301.907377-9-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:29 -07:00
David Hildenbrand	3aeea4fc83	mm/memory: use folio_mapcount() in zap_present_folio_ptes() We want to limit the use of page_mapcount() to the places where it is absolutely necessary. In zap_present_folio_ptes(), let's simply check the folio mapcount(). If there is some issue, it will underflow at some point either way when unmapping. As indicated already in commit `10ebac4f95` ("mm/memory: optimize unmap/zap with PTE-mapped THP"), we already documented "If we ever have a cheap folio_mapcount(), we might just want to check for underflows there.". There is no change for small folios. For large folios, we'll now catch more underflows when batch-unmapping, because instead of only testing the mapcount of the first subpage, we'll test if the folio mapcount underflows. Link: https://lkml.kernel.org/r/20240409192301.907377-8-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:29 -07:00
David Hildenbrand	05c5323b2a	mm: track mapcount of large folios in single value Let's track the mapcount of large folios in a single value. The mapcount of a large folio currently corresponds to the sum of the entire mapcount and all page mapcounts. This sum is what we actually want to know in folio_mapcount() and it is also sufficient for implementing folio_mapped(). With PTE-mapped THP becoming more important and more widely used, we want to avoid looping over all pages of a folio just to obtain the mapcount of large folios. The comment "In the common case, avoid the loop when no pages mapped by PTE" in folio_total_mapcount() does no longer hold for mTHP that are always mapped by PTE. Further, we are planning on using folio_mapcount() more frequently, and might even want to remove page mapcounts for large folios in some kernel configs. Therefore, allow for reading the mapcount of large folios efficiently and atomically without looping over any pages. Maintain the mapcount also for hugetlb pages for simplicity. Use the new mapcount to implement folio_mapcount() and folio_mapped(). Make page_mapped() simply call folio_mapped(). We can now get rid of folio_large_is_mapped(). _nr_pages_mapped is now only used in rmap code and for debugging purposes. Keep folio_nr_pages_mapped() around, but document that its use should be limited to rmap internals and debugging purposes. This change implies one additional atomic add/sub whenever mapping/unmapping (parts of) a large folio. As we now batch RMAP operations for PTE-mapped THP during fork(), during unmap/zap, and when PTE-remapping a PMD-mapped THP, and we adjust the large mapcount for a PTE batch only once, the added overhead in the common case is small. Only when unmapping individual pages of a large folio (e.g., during COW), the overhead might be bigger in comparison, but it's essentially one additional atomic operation. Note that before the new mapcount would overflow, already our refcount would overflow: each mapping requires a folio reference. Extend the focumentation of folio_mapcount(). Link: https://lkml.kernel.org/r/20240409192301.907377-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:28 -07:00
David Hildenbrand	46d62de7ad	mm/rmap: add fast-path for small folios when adding/removing/duplicating Let's add a fast-path for small folios to all relevant rmap functions. Note that only RMAP_LEVEL_PTE applies. This is a preparation for tracking the mapcount of large folios in a single value. Link: https://lkml.kernel.org/r/20240409192301.907377-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:28 -07:00
David Hildenbrand	c5541ba378	mm: follow_pte() improvements follow_pte() is now our main function to lookup PTEs in VM_PFNMAP/VM_IO VMAs. Let's perform some more sanity checks to make this exported function harder to abuse. Further, extend the doc a bit, it still focuses on the KVM use case with MMU notifiers. Drop the KVM+follow_pfn() comment, follow_pfn() is no more, and we have other users nowadays. Also extend the doc regarding refcounted pages and the interaction with MMU notifiers. KVM is one example that uses MMU notifiers and can deal with refcounted pages properly. VFIO is one example that doesn't use MMU notifiers, and to prevent use-after-free, rejects refcounted pages: pfn_valid(pfn) && !PageReserved(pfn_to_page(pfn)). Protection changes are less of a concern for users like VFIO: the behavior is similar to longterm-pinning a page, and getting the PTE protection changed afterwards. The primary concern with refcounted pages is use-after-free, which callers should be aware of. Link: https://lkml.kernel.org/r/20240410155527.474777-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Fei Li <fei1.li@intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Yonghua Huang <yonghua.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:27 -07:00
David Hildenbrand	29ae7d96d1	mm: pass VMA instead of MM to follow_pte() ... and centralize the VM_IO/VM_PFNMAP sanity check in there. We'll now also perform these sanity checks for direct follow_pte() invocations. For generic_access_phys(), we might now check multiple times: nothing to worry about, really. Link: https://lkml.kernel.org/r/20240410155527.474777-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Sean Christopherson <seanjc@google.com> [KVM] Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Fei Li <fei1.li@intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Yonghua Huang <yonghua.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:27 -07:00
Huang Ying	d4a34d7fb4	mm,swap: add document about RCU read lock and swapoff interaction During reviewing a patch to fix the race condition between free_swap_and_cache() and swapoff() [1], it was found that the document about how to prevent racing with swapoff isn't clear enough. Especially RCU read lock can prevent swapoff from freeing data structures. So, the document is added as comments. [1] https://lore.kernel.org/linux-mm/c8fe62d0-78b8-527a-5bef-ee663ccdc37a@huawei.com/ Link: https://lkml.kernel.org/r/20240407065450.498821-1-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:26 -07:00
Hao Ge	2bd9e6ee99	mm/mmap: make accountable_mapping return bool accountable_mapping() can return bool, so change it. Link: https://lkml.kernel.org/r/20240407063843.804274-1-gehao@kylinos.cn Signed-off-by: Hao Ge <gehao@kylinos.cn> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:26 -07:00
Hao Ge	38bc9c28c3	mm/mmap: make vma_wants_writenotify return bool vma_wants_writenotify() should return bool, so change it. Link: https://lkml.kernel.org/r/20240407062653.803142-1-gehao@kylinos.cn Signed-off-by: Hao Ge <gehao@kylinos.cn> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:26 -07:00
Ho-Ren (Jack) Chuang	cf93be18fa	memory tier: create CPUless memory tiers after obtaining HMAT info The current implementation treats emulated memory devices, such as CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory (E820_TYPE_RAM). However, these emulated devices have different characteristics than traditional DRAM, making it important to distinguish them. Thus, we modify the tiered memory initialization process to introduce a delay specifically for CPUless NUMA nodes. This delay ensures that the memory tier initialization for these nodes is deferred until HMAT information is obtained during the boot process. Finally, demotion tables are recalculated at the end. * late_initcall(memory_tier_late_init); Some device drivers may have initialized memory tiers between `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing online memory nodes and configuring memory tiers. They should be excluded in the late init. * Handle cases where there is no HMAT when creating memory tiers There is a scenario where a CPUless node does not provide HMAT information. If no HMAT is specified, it falls back to using the default DRAM tier. * Introduce another new lock `default_dram_perf_lock` for adist calculation In the current implementation, iterating through CPUlist nodes requires holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up trying to acquire the same lock, leading to a potential deadlock. Therefore, we propose introducing a standalone `default_dram_perf_lock` to protect `default_dram_perf_`. This approach not only avoids deadlock but also prevents holding a large lock simultaneously. Upgrade `set_node_memory_tier` to support additional cases, including default DRAM, late CPUless, and hot-plugged initializations. To cover hot-plugged memory nodes, `mt_calc_adistance()` and `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to handle cases where memtype is not initialized and where HMAT information is available. * Introduce `default_memory_types` for those memory types that are not initialized by device drivers. Because late initialized memory and default DRAM memory need to be managed, a default memory type is created for storing all memory types that are not initialized by device drivers and as a fallback. Link: https://lkml.kernel.org/r/20240405000707.2670063-3-horenchuang@bytedance.com Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com> Signed-off-by: Hao Xiang <hao.xiang@bytedance.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Gregory Price <gourry.memverge@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Ravi Jonnalagadda <ravis.opensrc@micron.com> Cc: SeongJae Park <sj@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawie.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:26 -07:00
Ho-Ren (Jack) Chuang	a72a30af55	memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types Patch series "Improved Memory Tier Creation for CPUless NUMA Nodes", v11. When a memory device, such as CXL1.1 type3 memory, is emulated as normal memory (E820_TYPE_RAM), the memory device is indistinguishable from normal DRAM in terms of memory tiering with the current implementation. The current memory tiering assigns all detected normal memory nodes to the same DRAM tier. This results in normal memory devices with different attributions being unable to be assigned to the correct memory tier, leading to the inability to migrate pages between different types of memory. https://lore.kernel.org/linux-mm/PH0PR08MB7955E9F08CCB64F23963B5C3A860A@PH0PR08MB7955.namprd08.prod.outlook.com/T/ This patchset automatically resolves the issues. It delays the initialization of memory tiers for CPUless NUMA nodes until they obtain HMAT information and after all devices are initialized at boot time, eliminating the need for user intervention. If no HMAT is specified, it falls back to using `default_dram_type`. Example usecase: We have CXL memory on the host, and we create VMs with a new system memory device backed by host CXL memory. We inject CXL memory performance attributes through QEMU, and the guest now sees memory nodes with performance attributes in HMAT. With this change, we enable the guest kernel to construct the correct memory tiering for the memory nodes. This patch (of 2): Since different memory devices require finding, allocating, and putting memory types, these common steps are abstracted in this patch, enhancing the scalability and conciseness of the code. Link: https://lkml.kernel.org/r/20240405000707.2670063-1-horenchuang@bytedance.com Link: https://lkml.kernel.org/r/20240405000707.2670063-2-horenchuang@bytedance.com Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawie.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Gregory Price <gourry.memverge@gmail.com> Cc: Hao Xiang <hao.xiang@bytedance.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Ravi Jonnalagadda <ravis.opensrc@micron.com> Cc: SeongJae Park <sj@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:25 -07:00
Christoph Hellwig	77ddd726f9	mm,page_owner: don't remove __GFP_NOLOCKDEP in add_stack_record_to_list Otherwise we'll generate false lockdep positives. Link: https://lkml.kernel.org/r/20240429082828.1615986-1-hch@lst.de Fixes: `217b2119b9` ("mm,page_owner: implement the tracking of the stacks count") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:28:07 -07:00
Hailong.Liu	ac0476e8ca	mm/vmalloc: fix return value of vb_alloc if size is 0 vm_map_ram() uses IS_ERR() to validate the return value of vb_alloc(). If vm_map_ram(page, 0, 0) is executed, vb_alloc(0, GFP_KERNEL) would return NULL. In such a case, IS_ERR() cannot handle the return value and lead to kernel panic by vmap_pages_range_noflush() at last. To resolve this issue, return ERR_PTR(-EINVAL) if the size is 0. Link: https://lkml.kernel.org/r/20240426024149.21176-1-hailong.liu@oppo.com Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Hailong.Liu <hailong.liu@oppo.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:28:06 -07:00
Kefeng Wang	30153e4466	mm: use memalloc_nofs_save() in page_cache_ra_order() See commit `f2c817bed5` ("mm: use memalloc_nofs_save in readahead path"), ensure that page_cache_ra_order() do not attempt to reclaim file-backed pages too, or it leads to a deadlock, found issue when test ext4 large folio. INFO: task DataXceiver for:7494 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:DataXceiver for state:D stack:0 pid:7494 ppid:1 flags:0x00000200 Call trace: __switch_to+0x14c/0x240 __schedule+0x82c/0xdd0 schedule+0x58/0xf0 io_schedule+0x24/0xa0 __folio_lock+0x130/0x300 migrate_pages_batch+0x378/0x918 migrate_pages+0x350/0x700 compact_zone+0x63c/0xb38 compact_zone_order+0xc0/0x118 try_to_compact_pages+0xb0/0x280 __alloc_pages_direct_compact+0x98/0x248 __alloc_pages+0x510/0x1110 alloc_pages+0x9c/0x130 folio_alloc+0x20/0x78 filemap_alloc_folio+0x8c/0x1b0 page_cache_ra_order+0x174/0x308 ondemand_readahead+0x1c8/0x2b8 page_cache_async_ra+0x68/0xb8 filemap_readahead.isra.0+0x64/0xa8 filemap_get_pages+0x3fc/0x5b0 filemap_splice_read+0xf4/0x280 ext4_file_splice_read+0x2c/0x48 [ext4] vfs_splice_read.part.0+0xa8/0x118 splice_direct_to_actor+0xbc/0x288 do_splice_direct+0x9c/0x108 do_sendfile+0x328/0x468 __arm64_sys_sendfile64+0x8c/0x148 invoke_syscall+0x4c/0x118 el0_svc_common.constprop.0+0xc8/0xf0 do_el0_svc+0x24/0x38 el0_svc+0x4c/0x1f8 el0t_64_sync_handler+0xc0/0xc8 el0t_64_sync+0x188/0x190 Link: https://lkml.kernel.org/r/20240426112938.124740-1-wangkefeng.wang@huawei.com Fixes: `793917d997` ("mm/readahead: Add large folio readahead") Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Zhang Yi <yi.zhang@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:28:06 -07:00
Maninder Singh	e7af4014b4	mm: page_owner: fix wrong information in dump_page_owner With commit `ea4b5b33bf` ("mm,page_owner: update metadata for tail pages"), new API __update_page_owner_handle was introduced and arguemnt was passed in wrong order from __set_page_owner and thus page_owner is giving wrong data. [ 15.982420] page last allocated via order 0, migratetype Unmovable, gfp_mask 0xcc0(GFP_KERNEL), pid 80, tgid -1210279584 (insmod), ts 80, free_ts 0 Fixing the same. Correct output: [ 14.556482] page last allocated via order 0, migratetype Unmovable, gfp_mask 0xcc0(GFP_KERNEL), pid 80, tgid 80 (insmod), ts 14552004992, free_ts 0 Link: https://lkml.kernel.org/r/20240424111838.3782931-1-hariom1.p@samsung.com Fixes: `ea4b5b33bf` ("mm,page_owner: update metadata for tail pages") Signed-off-by: Maninder Singh <maninder1.s@samsung.com> Signed-off-by: Hariom Panthi <hariom1.p@samsung.com> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Rohit Thapliyal <r.thapliyal@samsung.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:28:05 -07:00
Al Viro	51d908b3db	swapon(2): open swap with O_EXCL ... eliminating the need to reopen block devices so they could be exclusively held. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-05-02 17:23:30 -04:00
Al Viro	798cb7f9ae	swapon(2)/swapoff(2): don't bother with block size once upon a time that used to matter; these days we do swap IO for swap devices at the level that doesn't give a damn about block size, buffer_head or anything of that sort - just attach the page to bio, set the location and size (the latter to PAGE_SIZE) and feed into queue. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-05-02 17:23:30 -04:00
Hyunmin Lee	7338999ca3	mm/slub: remove the check for NULL kmalloc_caches If the same size kmalloc cache already exists, it should not be created again. So there is the check for NULL kmalloc_caches before calling the kmalloc creation function. However, new_kmalloc_cache() itself checks NULL kmalloc_cahces before cache creation. Therefore, the NULL check is not necessary in this function. Signed-off-by: Hyunmin Lee <hyunminlr@gmail.com> Co-developed-by: Jeungwoo Yoo <casionwoo@gmail.com> Signed-off-by: Jeungwoo Yoo <casionwoo@gmail.com> Co-developed-by: Sangyun Kim <sangyun.kim@snu.ac.kr> Signed-off-by: Sangyun Kim <sangyun.kim@snu.ac.kr> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com> Reviewed-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2024-05-02 16:09:40 +02:00

1 2 3 4 5 ...

22438 Commits