Instead of doing multiple tree walks, do one optimism range check with
lock hold, and exit if raced with another insertion. If a shadow exists,
check it with a new xas_get_order helper before releasing the lock to
avoid redundant tree walks for getting its order.
Drop the lock and do the allocation only if a split is needed.
In the best case, it only need to walk the tree once. If it needs to
alloc and split, 3 walks are issued (One for first ranged conflict check
and order retrieving, one for the second check after allocation, one for
the insert after split).
Testing with 4K pages, in an 8G cgroup, with 16G brd as block device:
echo 3 > /proc/sys/vm/drop_caches
fio -name=cached --numjobs=16 --filename=/mnt/test.img \
--buffered=1 --ioengine=mmap --rw=randread --time_based \
--ramp_time=30s --runtime=5m --group_reporting
Before:
bw ( MiB/s): min= 1027, max= 3520, per=100.00%, avg=2445.02, stdev=18.90, samples=8691
iops : min=263001, max=901288, avg=625924.36, stdev=4837.28, samples=8691
After (+7.3%):
bw ( MiB/s): min= 493, max= 3947, per=100.00%, avg=2625.56, stdev=25.74, samples=8651
iops : min=126454, max=1010681, avg=672142.61, stdev=6590.48, samples=8651
Test result with THP (do a THP randread then switch to 4K page in hope it
issues a lot of splitting):
echo 3 > /proc/sys/vm/drop_caches
fio -name=cached --numjobs=16 --filename=/mnt/test.img \
--buffered=1 --ioengine=mmap -thp=1 --readonly \
--rw=randread --time_based --ramp_time=30s --runtime=10m \
--group_reporting
fio -name=cached --numjobs=16 --filename=/mnt/test.img \
--buffered=1 --ioengine=mmap \
--rw=randread --time_based --runtime=5s --group_reporting
Before:
bw ( KiB/s): min= 4141, max=14202, per=100.00%, avg=7935.51, stdev=96.85, samples=18976
iops : min= 1029, max= 3548, avg=1979.52, stdev=24.23, samples=18976·
READ: bw=4545B/s (4545B/s), 4545B/s-4545B/s (4545B/s-4545B/s), io=64.0KiB (65.5kB), run=14419-14419msec
After (+12.5%):
bw ( KiB/s): min= 4611, max=15370, per=100.00%, avg=8928.74, stdev=105.17, samples=19146
iops : min= 1151, max= 3842, avg=2231.27, stdev=26.29, samples=19146
READ: bw=4635B/s (4635B/s), 4635B/s-4635B/s (4635B/s-4635B/s), io=64.0KiB (65.5kB), run=14137-14137msec
The performance is better for both 4K (+7.5%) and THP (+12.5%) cached read.
Link: https://lkml.kernel.org/r/20240415171857.19244-5-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
It can be used after xas_load to check the order of loaded entries.
Compared to xa_get_order, it saves an XA_STATE and avoid a rewalk.
Added new test for xas_get_order, to make the test work, we have to export
xas_get_order with EXPORT_SYMBOL_GPL.
Also fix a sparse warning by checking the slot value with xa_entry instead
of accessing it directly, as suggested by Matthew Wilcox.
[kasong@tencent.com: simplify comment, sparse warning fix, per Matthew Wilcox]
Link: https://lkml.kernel.org/r/20240416071722.45997-4-ryncsn@gmail.com
Link: https://lkml.kernel.org/r/20240415171857.19244-4-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
__filemap_add_folio only has two callers, one never passes hugetlb folio
and one always passes in hugetlb folio. So move the hugetlb related
cgroup charging out of it to make the code cleaner.
Link: https://lkml.kernel.org/r/20240415171857.19244-3-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/filemap: optimize folio adding and splitting", v4.
Currently, at least 3 tree walks are needed for filemap folio adding if
the folio is previously evicted. One for getting the order of current
slot, one for ranged conflict check, and one for another order retrieving.
If a split is needed, more walks are needed.
This series is trying to merge these walks, and speed up
filemap_add_folio, I see a 7.5% - 12.5% performance gain for fio stress
test.
So instead of doing multiple tree walks, do one optimism range check with
lock hold, and exit if raced with another insertion. If a shadow exists,
check it with a new xas_get_order helper before releasing the lock to
avoid redundant tree walks for getting its order.
Drop the lock and do the allocation only if a split is needed.
In the best case, it only need to walk the tree once. If it needs to
alloc and split, 3 walks are issued (One for first ranged conflict check
and order retrieving, one for the second check after allocation, one for
the insert after split).
Testing with 4K pages, in an 8G cgroup, with 16G brd as block device:
echo 3 > /proc/sys/vm/drop_caches
fio -name=cached --numjobs=16 --filename=/mnt/test.img \
--buffered=1 --ioengine=mmap --rw=randread --time_based \
--ramp_time=30s --runtime=5m --group_reporting
Before:
bw ( MiB/s): min= 1027, max= 3520, per=100.00%, avg=2445.02, stdev=18.90, samples=8691
iops : min=263001, max=901288, avg=625924.36, stdev=4837.28, samples=8691
After (+7.3%):
bw ( MiB/s): min= 493, max= 3947, per=100.00%, avg=2625.56, stdev=25.74, samples=8651
iops : min=126454, max=1010681, avg=672142.61, stdev=6590.48, samples=8651
Test result with THP (do a THP randread then switch to 4K page in hope it
issues a lot of splitting):
echo 3 > /proc/sys/vm/drop_caches
fio -name=cached --numjobs=16 --filename=/mnt/test.img \
--buffered=1 --ioengine=mmap -thp=1 --readonly \
--rw=randread --time_based --ramp_time=30s --runtime=10m \
--group_reporting
fio -name=cached --numjobs=16 --filename=/mnt/test.img \
--buffered=1 --ioengine=mmap \
--rw=randread --time_based --runtime=5s --group_reporting
Before:
bw ( KiB/s): min= 4141, max=14202, per=100.00%, avg=7935.51, stdev=96.85, samples=18976
iops : min= 1029, max= 3548, avg=1979.52, stdev=24.23, samples=18976·
READ: bw=4545B/s (4545B/s), 4545B/s-4545B/s (4545B/s-4545B/s), io=64.0KiB (65.5kB), run=14419-14419msec
After (+10.4%):
bw ( KiB/s): min= 4611, max=15370, per=100.00%, avg=8928.74, stdev=105.17, samples=19146
iops : min= 1151, max= 3842, avg=2231.27, stdev=26.29, samples=19146
READ: bw=4635B/s (4635B/s), 4635B/s-4635B/s (4635B/s-4635B/s), io=64.0KiB (65.5kB), run=14137-14137msec
The performance is better for both 4K (+7.5%) and THP (+12.5%) cached read.
This patch (of 4):
xas_split_alloc could fail with NOMEM, and in such case, it should abort
early instead of keep going and fail the xas_split below.
Link: https://lkml.kernel.org/r/20240416071722.45997-1-ryncsn@gmail.com
Link: https://lkml.kernel.org/r/20240415171857.19244-1-ryncsn@gmail.com
Link: https://lkml.kernel.org/r/20240415171857.19244-2-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Callers of folio_estimated_sharers() only care about "mapped shared vs.
mapped exclusively", not the exact estimate of sharers. Let's consolidate
and unify the condition users are checking. While at it clarify the
semantics and extend the discussion on the fuzziness.
Use the "likely mapped shared" terminology to better express what the
(adjusted) function actually checks.
Whether a partially-mappable folio is more likely to not be partially
mapped than partially mapped is debatable. In the future, we might be
able to improve our estimate for partially-mappable folios, though.
Note that we will now consistently detect "mapped shared" only if the
first subpage is actually mapped multiple times. When the first subpage
is not mapped, we will consistently detect it as "mapped exclusively".
This change should currently only affect the usage in
madvise_free_pte_range() and queue_folios_pte_range() for large folios: if
the first page was already unmapped, we would have skipped the folio.
[david@redhat.com: folio_likely_mapped_shared() kerneldoc fixup]
Link: https://lkml.kernel.org/r/dd0ad9f2-2d7a-45f3-9ba3-979488c7dd27@redhat.com
Link: https://lkml.kernel.org/r/20240227201548.857831-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Acked-by: Barry Song <v-songbaohua@oppo.com>
Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If the source folio is on deferred split list, it is likely some subpages
are not used. Split it before migration to avoid migrating unused
subpages.
Commit 616b837153 ("mm: thp: enable thp migration in generic path") did
not check if a THP is on deferred split list before migration, thus, the
destination THP is never put on deferred split list even if the source THP
might be. The opportunity of reclaiming free pages in a partially mapped
THP during deferred list scanning is lost, but no other harmful
consequence is present[1].
[1]: https://lore.kernel.org/linux-mm/03CE3A00-917C-48CC-8E1C-6A98713C817C@nvidia.com/
[zi.yan@sent.com: fix an error in migrate_misplaced_folio()]
Link: https://lkml.kernel.org/r/20240326150031.569387-1-zi.yan@sent.com
Link: https://lkml.kernel.org/r/20240322193304.522496-1-zi.yan@sent.com
Fixes: 616b837153 ("mm: thp: enable thp migration in generic path")
Signed-off-by: Zi Yan <ziy@nvidia.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yin Fengwei <fengwei.yin@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Within try_to_unmap_one(), page_vma_mapped_walk() races with other PTE
modifications preceded by pte clear. While iterating over PTEs of a large
folio, it only starts acquiring PTL from the first valid (present) PTE.
PTE modifications can temporarily set PTEs to pte_none. Consequently, the
initial PTEs of a large folio might be skipped in try_to_unmap_one().
For example, for an anon folio, if we skip PTE0, we may have PTE0 which is
still present, while PTE1 ~ PTE(nr_pages - 1) are swap entries after
try_to_unmap_one().
So folio will be still mapped, the folio fails to be reclaimed and is put
back to LRU in this round.
This also breaks up PTEs optimization such as CONT-PTE on this large folio
and may lead to accident folio_split() afterwards. And since a part of
PTEs are now swap entries, accessing those parts will introduce overhead -
do_swap_page. Although the kernel can withstand all of the above issues,
the situation still seems quite awkward and warrants making it more ideal.
The same race also occurs with small folios, but they have only one PTE,
thus, it won't be possible for them to be partially unmapped.
This patch holds PTL from PTE0, allowing us to avoid reading PTE values
that are in the process of being transformed. With stable PTE values, we
can ensure that this large folio is either completely reclaimed or that
all PTEs remain untouched in this round.
A corner case is that if we hold PTL from PTE0 and most initial PTEs have
been really unmapped before that, we may increase the duration of holding
PTL. Thus we only apply this optimization to folios which are still
entirely mapped (not in deferred_split list).
[akpm@linux-foundation.org: rewrap comment, per Matthew]
Link: https://lkml.kernel.org/r/20240306095219.71086-1-21cnbao@gmail.com
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Chuanhua Han <hanchuanhua@oppo.com>
Cc: Gao Xiang <xiang@kernel.org>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If called by __get_vm_area_node(), by open coding the field assignments of
'struct vm_struct *vm', and move the vm->flags and vm->caller assignments
into __get_vm_area_node(), the passed in arguments 'flags' and 'caller'
can be removed.
This alleviates overloaded arguments passed in for alloc_vmap_area().
Link: https://lkml.kernel.org/r/20240309044454.648888-1-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If there are too many folios that are recently evicted in a file, then
they will probably continue to be evicted. In such situation, there is no
positive effect to read-ahead this file since it is only a waste of IO.
The mmap_miss is increased in do_sync_mmap_readahead() and decreased in
both do_async_mmap_readahead() and filemap_map_pages(). In order to skip
read-ahead in above scenario, the mmap_miss have to increased exceed
MMAP_LOTSAMISS. This can be done by stop decreased mmap_miss when folio
has workingset flag. The async path is not to care because in above
scenario, it's hard to run into the async path.
[liushixin2@huawei.com: add comments]
Link: https://lkml.kernel.org/r/20240326065026.1910584-1-liushixin2@huawei.com
Link: https://lkml.kernel.org/r/20240322093555.226789-3-liushixin2@huawei.com
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jinjiang Tu <tujinjiang@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Fix I/O high when memory almost met memcg limit", v2.
Recently, when install package in a docker which almost reached its memory
limit, the installer has no respond severely for more than 15 minutes.
During this period, I/O stays high(~1G/s) and influence the whole machine.
I've constructed a use case as follows:
1. create a docker:
$ cat test.sh
#!/bin/bash
docker rm centos7 --force
docker create --name centos7 --memory 4G --memory-swap 6G centos:7 /usr/sbin/init
docker start centos7
sleep 1
docker cp ./alloc_page centos7:/
docker cp ./reproduce.sh centos7:/
docker exec -it centos7 /bin/bash
2. try reproduce the problem in docker:
$ cat reproduce.sh
#!/bin/bash
while true; do
flag=$(ps -ef | grep -v grep | grep alloc_page| wc -l)
if [ "$flag" -eq 0 ]; then
/alloc_page &
fi
sleep 30
start_time=$(date +%s)
yum install -y expect > /dev/null 2>&1
end_time=$(date +%s)
elapsed_time=$((end_time - start_time))
echo "$elapsed_time seconds"
yum remove -y expect > /dev/null 2>&1
done
$ cat alloc_page.c:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#define SIZE 1*1024*1024 //1M
int main()
{
void *addr = NULL;
int i;
for (i = 0; i < 1024 * 6 - 50;i++) {
addr = (void *)malloc(SIZE);
if (!addr)
return -1;
memset(addr, 0, SIZE);
}
sleep(99999);
return 0;
}
We found that this problem is caused by a lot ot meaningless read-ahead.
Since the docker is almost met memory limit, the page will be reclaimed
immediately after read-ahead and will read-ahead again immediately. The
program is executed slowly and waste a lot of I/O resource.
These two patch aim to break the read-ahead in above scenario.
[1] https://lore.kernel.org/linux-mm/c2f4a2fa-3bde-72ce-66f5-db81a373fdbc@huawei.com/T/
[2] https://lore.kernel.org/all/20240201100835.1626685-1-liushixin2@huawei.com/
[3] https://lore.kernel.org/all/20240201173130.frpaqpy7iyzias5j@quack3/
This patch (of 2):
When filemap_add_folio() return -ENOMEM, break read-ahead loop like what
filemap_alloc_folio() does.
Link: https://lkml.kernel.org/r/20240322093555.226789-1-liushixin2@huawei.com
Link: https://lkml.kernel.org/r/20240322093555.226789-2-liushixin2@huawei.com
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit d0637c505f ("arm64: enable THP_SWAP for arm64") brings up
THP_SWAP on ARM64, but it doesn't enable THP_SWP on hardware with MTE as
the MTE code works with the assumption tags save/restore is always
handling a folio with only one page.
The limitation should be removed as more and more ARM64 SoCs have this
feature. Co-existence of MTE and THP_SWAP becomes more and more
important.
This patch makes MTE tags saving support large folios, then we don't need
to split large folios into base pages for swapping out on ARM64 SoCs with
MTE any more.
arch_prepare_to_swap() should take folio rather than page as parameter
because we support THP swap-out as a whole. It saves tags for all pages
in a large folio.
As now we are restoring tags based-on folio, in arch_swap_restore(), we
may increase some extra loops and early-exitings while refaulting a large
folio which is still in swapcache in do_swap_page(). In case a large
folio has nr pages, do_swap_page() will only set the PTE of the particular
page which is causing the page fault. Thus do_swap_page() runs nr times,
and each time, arch_swap_restore() will loop nr times for those subpages
in the folio. So right now the algorithmic complexity becomes O(nr^2).
Once we support mapping large folios in do_swap_page(), extra loops and
early-exitings will decrease while not being completely removed as a large
folio might get partially tagged in corner cases such as, 1. a large
folio in swapcache can be partially unmapped, thus, MTE tags for the
unmapped pages will be invalidated; 2. users might use mprotect() to set
MTEs on a part of a large folio.
arch_thp_swp_supported() is dropped since ARM64 MTE was the only one who
needed it.
Link: https://lkml.kernel.org/r/20240322114136.61386-2-21cnbao@gmail.com
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Acked-by: Chris Li <chrisl@kernel.org>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use sscanf() to directly parse the VMA range. No functional change is intended.
Link: https://lkml.kernel.org/r/20240322120551.818764-1-dev.jain@arm.com
Signed-off-by: Dev Jain <dev.jain@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add some description of the hugetlb migration strategy.
Link: https://lkml.kernel.org/r/63fb16e7a4ebc5cb69ce655af86e29b2d8e9ba34.1709719720.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
As discussed in previous thread [1], there is an inconsistency when
handing hugetlb migration. When handling the migration of freed hugetlb,
it prevents fallback to other NUMA nodes in
alloc_and_dissolve_hugetlb_folio(). However, when dealing with in-use
hugetlb, it allows fallback to other NUMA nodes in
alloc_hugetlb_folio_nodemask(), which can break the per-node hugetlb pool
and might result in unexpected failures when node bound workloads doesn't
get what is asssumed available.
To make hugetlb migration strategy more clear, we should list all the scenarios
of hugetlb migration and analyze whether allocation fallback is permitted:
1) Memory offline: will call dissolve_free_huge_pages() to free the
freed hugetlb, and call do_migrate_range() to migrate the in-use
hugetlb. Both can break the per-node hugetlb pool, but as this is an
explicit offlining operation, no better choice. So should allow the
hugetlb allocation fallback.
2) Memory failure: same as memory offline. Should allow fallback to a
different node might be the only option to handle it, otherwise the
impact of poisoned memory can be amplified.
3) Longterm pinning: will call migrate_longterm_unpinnable_pages() to
migrate in-use and not-longterm-pinnable hugetlb, which can break the
per-node pool. But we should fail to longterm pinning if can not
allocate on current node to avoid breaking the per-node pool.
4) Syscalls (mbind, migrate_pages, move_pages): these are explicit
users operation to move pages to other nodes, so fallback to other
nodes should not be prohibited.
5) alloc_contig_range: used by CMA allocation and virtio-mem
fake-offline to allocate given range of pages. Now the freed hugetlb
migration is not allowed to fallback, to keep consistency, the in-use
hugetlb migration should be also not allowed to fallback.
6) alloc_contig_pages: used by kfence, pgtable_debug etc. The strategy
should be consistent with that of alloc_contig_range().
Based on the analysis of the various scenarios above, introducing a new
helper to determine whether fallback is permitted according to the
migration reason..
[1] https://lore.kernel.org/all/6f26ce22d2fcd523418a085f2c588fe0776d46e7.1706794035.git.baolin.wang@linux.alibaba.com/
Link: https://lkml.kernel.org/r/3519fcd41522817307a05b40fb551e2e17e68101.1709719720.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "make the hugetlb migration strategy consistent", v2.
As discussed in previous thread [1], there is an inconsistency when
handling hugetlb migration. When handling the migration of freed hugetlb,
it prevents fallback to other NUMA nodes in
alloc_and_dissolve_hugetlb_folio(). However, when dealing with in-use
hugetlb, it allows fallback to other NUMA nodes in
alloc_hugetlb_folio_nodemask(), which can break the per-node hugetlb pool
and might result in unexpected failures when node bound workloads doesn't
get what is asssumed available.
This patchset tries to make the hugetlb migration strategy more clear
and consistent. Please find details in each patch.
[1]
https://lore.kernel.org/all/6f26ce22d2fcd523418a085f2c588fe0776d46e7.1706794035.git.baolin.wang@linux.alibaba.com/
This patch (of 2):
To support different hugetlb allocation strategies during hugetlb
migration based on various migration reasons, record the migration reason
in the migration_target_control structure as a preparation.
Link: https://lkml.kernel.org/r/cover.1709719720.git.baolin.wang@linux.alibaba.com
Link: https://lkml.kernel.org/r/7b95d4981e07211f57139fc5b1f7ce91b920cee4.1709719720.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When allocating a new memory area where the mapping address range is
known, it is observed that the vmap_node->busy.lock is acquired twice.
The first acquisition occurs in the alloc_vmap_area() function when
inserting the vm area into the vm mapping red-black tree. The second
acquisition occurs in the setup_vmalloc_vm() function when updating the
properties of the vm, such as flags and address, etc.
Combine these two operations together in alloc_vmap_area(), which improves
scalability when the vmap_node->busy.lock is contended. By doing so, the
need to acquire the lock twice can also be eliminated to once.
With the above change, tested on intel sapphire rapids platform(224 vcpu),
a 4% performance improvement is gained on
stress-ng/pthread(https://github.com/ColinIanKing/stress-ng), which is the
stress test of thread creations.
Link: https://lkml.kernel.org/r/20240307021440.64967-1-rulin.huang@intel.com
Co-developed-by: "Chen, Tim C" <tim.c.chen@intel.com>
Signed-off-by: "Chen, Tim C" <tim.c.chen@intel.com>
Co-developed-by: "King, Colin" <colin.king@intel.com>
Signed-off-by: "King, Colin" <colin.king@intel.com>
Signed-off-by: rulinhuang <rulin.huang@intel.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Wangyang Guo <wangyang.guo@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kmemleak ia a memory leak checker. KASAN is also a memory checker but it
focuses more on finding out-of-bounds and use-after-free bugs. Since
kmemleak is inherently slow especially on systems with large number of
CPUs, adding KASAN instrumentation will make it slower even more. As
kmemleak is not for production use, the utility of enabling KASAN there is
questionable.
This patch disables KASAN instrumentation for configurations that enable
both of them to slightly reduce performance overhead.
Link: https://lkml.kernel.org/r/20240307190548.963626-3-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/kmemleak: Minor cleanup & performance tuning".
This series contains 2 simple cleanup patches to slightly reduce memory
and performance overhead.
This patch (of 2):
With commit 56a61617dd ("mm: use stack_depot for recording kmemleak's
backtrace"), the size of kmemleak_object has been reduced by 128 bytes for
64-bit arches. The replacement "depot_stack_handle_t trace_handle" is
actually just 4 bytes long leaving a hole of 4 bytes. By moving up
trace_handle to another existing 4-byte hold, we can save 8 more bytes
from kmemleak_object reducing its overall size from 248 to 240 bytes.
Link: https://lkml.kernel.org/r/20240307190548.963626-1-longman@redhat.com
Link: https://lkml.kernel.org/r/20240307190548.963626-2-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
nr_stored was introduced by commit b5ba474f3f ("zswap: shrink zswap pool
based on memory pressure") as a per zswap_pool counter of the number of
stored pages that are not same-filled pages. It is used in
zswap_shrinker_count() to scale the number of freeable compressed pages by
the compression ratio. That is, to reduce the amount of writeback from
zswap with higher compression ratios as the ROI from IO diminishes.
Later on, commit bf9b7df23c ("mm/zswap: global lru and shrinker shared
by all zswap_pools") made the shrinker global (not per zswap_pool), and
replaced nr_stored with nr_zswap_stored (initially introduced as
zswap.nr_stored), which is now a global counter.
The counter is now awfully close to zswap_stored_pages. The only
difference is that the latter also includes same-filled pages. Also, when
memcgs are enabled, we use memcg_page_state(memcg, MEMCG_ZSWAPPED), which
includes same-filled pages anyway (i.e. equivalent to
zswap_stored_pages).
Use zswap_stored_pages instead in zswap_shrinker_count() to keep things
consistent whether memcgs are enabled or not, and add a comment about the
number of freeable pages possibly being scaled down more than it should if
we have lots of same-filled pages (i.e. inflated compression ratio).
Remove nr_zswap_stored and one atomic operation in the store and free
paths.
Link: https://lkml.kernel.org/r/20240322001001.1562517-1-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
expand() currently updates vmstat for every subpage. This is unnecessary,
since they're all of the same zone and migratetype.
Count added pages locally, then do a single vmstat update.
Link: https://lkml.kernel.org/r/20240327190111.GC7597@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The function is now supposed to be called only on a single pageblock and
checks start_pfn and end_pfn accordingly. Rename it to make this more
obvious and drop the end_pfn parameter which can be determined trivially
and none of the callers use it for anything else.
Also make the (now internal) end_pfn exclusive, which is more common.
Link: https://lkml.kernel.org/r/81b1d642-2ec0-49f5-89fc-19a3828419ff@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Free page accounting currently happens a bit too high up the call stack,
where it has to deal with guard pages, compaction capturing, block
stealing and even page isolation. This is subtle and fragile, and makes
it difficult to hack on the code.
Now that type violations on the freelists have been fixed, push the
accounting down to where pages enter and leave the freelist.
[hannes@cmpxchg.org: undo unrelated drive-by line wrap]
Link: https://lkml.kernel.org/r/20240327185736.GA7597@cmpxchg.org
[hannes@cmpxchg.org: remove unused page parameter from account_freepages()]
Link: https://lkml.kernel.org/r/20240327185831.GB7597@cmpxchg.org
[baolin.wang@linux.alibaba.com: fix free page accounting]
Link: https://lkml.kernel.org/r/a2a48baca69f103aa431fd201f8a06e3b95e203d.1712648441.git.baolin.wang@linux.alibaba.com
[andriy.shevchenko@linux.intel.com: avoid defining unused function]
Link: https://lkml.kernel.org/r/20240423161506.2637177-1-andriy.shevchenko@linux.intel.com
Link: https://lkml.kernel.org/r/20240320180429.678181-11-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Page isolation currently sets MIGRATE_ISOLATE on a block, then drops
zone->lock and scans the block for straddling buddies to split up.
Because this happens non-atomically wrt the page allocator, it's possible
for allocations to get a buddy whose first block is a regular pcp
migratetype but whose tail is isolated. This means that in certain cases
memory can still be allocated after isolation. It will also trigger the
freelist type hygiene warnings in subsequent patches.
start_isolate_page_range()
isolate_single_pageblock()
set_migratetype_isolate(tail)
lock zone->lock
move_freepages_block(tail) // nop
set_pageblock_migratetype(tail)
unlock zone->lock
__rmqueue_smallest()
del_page_from_freelist(head)
expand(head, head_mt)
WARN(head_mt != tail_mt)
start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
for (pfn = start_pfn, pfn < end_pfn)
if (PageBuddy())
split_free_page(head)
Introduce a variant of move_freepages_block() provided by the allocator
specifically for page isolation; it moves free pages, converts the block,
and handles the splitting of straddling buddies while holding zone->lock.
The allocator knows that pageblocks and buddies are always naturally
aligned, which means that buddies can only straddle blocks if they're
actually >pageblock_order. This means the search-and-split part can be
simplified compared to what page isolation used to do.
Also tighten up the page isolation code around the expectations of which
pages can be large, and how they are freed.
Based on extensive discussions with and invaluable input from Zi Yan.
[hannes@cmpxchg.org: work around older gcc warning]
Link: https://lkml.kernel.org/r/20240321142426.GB777580@cmpxchg.org
Link: https://lkml.kernel.org/r/20240320180429.678181-10-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This avoids changing migratetype after move_freepages() or
move_freepages_block(), which is error prone. It also prepares for
upcoming changes to fix move_freepages() not moving free pages partially
in the range.
Link: https://lkml.kernel.org/r/20240320180429.678181-9-hannes@cmpxchg.org
Signed-off-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
There are three freeing paths that read the page's migratetype
optimistically before grabbing the zone lock. When this races with block
stealing, those pages go on the wrong freelist.
The paths in question are:
- when freeing >costly orders that aren't THP
- when freeing pages to the buddy upon pcp lock contention
- when freeing pages that are isolated
- when freeing pages initially during boot
- when freeing the remainder in alloc_pages_exact()
- when "accepting" unaccepted VM host memory before first use
- when freeing pages during unpoisoning
None of these are so hot that they would need this optimization at the
cost of hampering defrag efforts. Especially when contrasted with the
fact that the most common buddy freeing path - free_pcppages_bulk - is
checking the migratetype under the zone->lock just fine.
In addition, isolated pages need to look up the migratetype under the lock
anyway, which adds branches to the locked section, and results in a double
lookup when the pages are in fact isolated.
Move the lookups into the lock.
Link: https://lkml.kernel.org/r/20240320180429.678181-8-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, page block type conversion during fallbacks, atomic
reservations and isolation can strand various amounts of free pages on
incorrect freelists.
For example, fallback stealing moves free pages in the block to the new
type's freelists, but then may not actually claim the block for that type
if there aren't enough compatible pages already allocated.
In all cases, free page moving might fail if the block straddles more than
one zone, in which case no free pages are moved at all, but the block type
is changed anyway.
This is detrimental to type hygiene on the freelists. It encourages
incompatible page mixing down the line (ask for one type, get another) and
thus contributes to long-term fragmentation.
Split the process into a proper transaction: check first if conversion
will happen, then try to move the free pages, and only if that was
successful convert the block to the new type.
[baolin.wang@linux.alibaba.com: fix allocation failures with CONFIG_CMA]
Link: https://lkml.kernel.org/r/a97697e0-45b0-4f71-b087-fdc7a1d43c0e@linux.alibaba.com
Link: https://lkml.kernel.org/r/20240320180429.678181-7-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When a block is partially outside the zone of the cursor page, the
function cuts the range to the pivot page instead of the zone start. This
can leave large parts of the block behind, which encourages incompatible
page mixing down the line (ask for one type, get another), and thus
long-term fragmentation.
This triggers reliably on the first block in the DMA zone, whose start_pfn
is 1. The block is stolen, but everything before the pivot page (which
was often hundreds of pages) is left on the old list.
Link: https://lkml.kernel.org/r/20240320180429.678181-6-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When claiming a block during compaction isolation, move any remaining free
pages to the correct freelists as well, instead of stranding them on the
wrong list. Otherwise, this encourages incompatible page mixing down the
line, and thus long-term fragmentation.
Link: https://lkml.kernel.org/r/20240320180429.678181-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: "Huang, Ying" <ying.huang@intel.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The buddy allocator coalesces compatible blocks during freeing, but it
doesn't update the types of the subblocks to match. When an allocation
later breaks the chunk down again, its pieces will be put on freelists of
the wrong type. This encourages incompatible page mixing (ask for one
type, get another), and thus long-term fragmentation.
Update the subblocks when merging a larger chunk, such that a later
expand() will maintain freelist type hygiene.
Link: https://lkml.kernel.org/r/20240320180429.678181-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: "Huang, Ying" <ying.huang@intel.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Move direct freeing of isolated pages to the lock-breaking block in the
second loop. This saves an unnecessary migratetype reassessment.
Minor comment and local variable scoping cleanups.
Link: https://lkml.kernel.org/r/20240320180429.678181-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: page_alloc: freelist migratetype hygiene", v4.
The page allocator's mobility grouping is intended to keep unmovable pages
separate from reclaimable/compactable ones to allow on-demand
defragmentation for higher-order allocations and huge pages.
Currently, there are several places where accidental type mixing occurs:
an allocation asks for a page of a certain migratetype and receives
another. This ruins pageblocks for compaction, which in turn makes
allocating huge pages more expensive and less reliable.
The series addresses those causes. The last patch adds type checks on all
freelist movements to prevent new violations being introduced.
The benefits can be seen in a mixed workload that stresses the machine
with a memcache-type workload and a kernel build job while periodically
attempting to allocate batches of THP. The following data is aggregated
over 50 consecutive defconfig builds:
VANILLA PATCHED
Hugealloc Time mean 165843.93 ( +0.00%) 113025.88 ( -31.85%)
Hugealloc Time stddev 158957.35 ( +0.00%) 114716.07 ( -27.83%)
Kbuild Real time 310.24 ( +0.00%) 300.73 ( -3.06%)
Kbuild User time 1271.13 ( +0.00%) 1259.42 ( -0.92%)
Kbuild System time 582.02 ( +0.00%) 559.79 ( -3.81%)
THP fault alloc 30585.14 ( +0.00%) 40853.62 ( +33.57%)
THP fault fallback 36626.46 ( +0.00%) 26357.62 ( -28.04%)
THP fault fail rate % 54.49 ( +0.00%) 39.22 ( -27.53%)
Pagealloc fallback 1328.00 ( +0.00%) 1.00 ( -99.85%)
Pagealloc type mismatch 181009.50 ( +0.00%) 0.00 ( -100.00%)
Direct compact stall 434.56 ( +0.00%) 257.66 ( -40.61%)
Direct compact fail 421.70 ( +0.00%) 249.94 ( -40.63%)
Direct compact success 12.86 ( +0.00%) 7.72 ( -37.09%)
Direct compact success rate % 2.86 ( +0.00%) 2.82 ( -0.96%)
Compact daemon scanned migrate 3370059.62 ( +0.00%) 3612054.76 ( +7.18%)
Compact daemon scanned free 7718439.20 ( +0.00%) 5386385.02 ( -30.21%)
Compact direct scanned migrate 309248.62 ( +0.00%) 176721.04 ( -42.85%)
Compact direct scanned free 433582.84 ( +0.00%) 315727.66 ( -27.18%)
Compact migrate scanned daemon % 91.20 ( +0.00%) 94.48 ( +3.56%)
Compact free scanned daemon % 94.58 ( +0.00%) 94.42 ( -0.16%)
Compact total migrate scanned 3679308.24 ( +0.00%) 3788775.80 ( +2.98%)
Compact total free scanned 8152022.04 ( +0.00%) 5702112.68 ( -30.05%)
Alloc stall 872.04 ( +0.00%) 5156.12 ( +490.71%)
Pages kswapd scanned 510645.86 ( +0.00%) 3394.94 ( -99.33%)
Pages kswapd reclaimed 134811.62 ( +0.00%) 2701.26 ( -98.00%)
Pages direct scanned 99546.06 ( +0.00%) 376407.52 ( +278.12%)
Pages direct reclaimed 62123.40 ( +0.00%) 289535.70 ( +366.06%)
Pages total scanned 610191.92 ( +0.00%) 379802.46 ( -37.76%)
Pages scanned kswapd % 76.36 ( +0.00%) 0.10 ( -98.58%)
Swap out 12057.54 ( +0.00%) 15022.98 ( +24.59%)
Swap in 209.16 ( +0.00%) 256.48 ( +22.52%)
File refaults 17701.64 ( +0.00%) 11765.40 ( -33.53%)
Huge page success rate is higher, allocation latencies are shorter and
more predictable.
Stealing (fallback) rate is drastically reduced. Notably, while the
vanilla kernel keeps doing fallbacks on an ongoing basis, the patched
kernel enters a steady state once the distribution of block types is
adequate for the workload. Steals over 50 runs:
VANILLA PATCHED
1504.0 227.0
1557.0 6.0
1391.0 13.0
1080.0 26.0
1057.0 40.0
1156.0 6.0
805.0 46.0
736.0 20.0
1747.0 2.0
1699.0 34.0
1269.0 13.0
1858.0 12.0
907.0 4.0
727.0 2.0
563.0 2.0
3094.0 2.0
10211.0 3.0
2621.0 1.0
5508.0 2.0
1060.0 2.0
538.0 3.0
5773.0 2.0
2199.0 0.0
3781.0 2.0
1387.0 1.0
4977.0 0.0
2865.0 1.0
1814.0 1.0
3739.0 1.0
6857.0 0.0
382.0 0.0
407.0 1.0
3784.0 0.0
297.0 0.0
298.0 0.0
6636.0 0.0
4188.0 0.0
242.0 0.0
9960.0 0.0
5816.0 0.0
354.0 0.0
287.0 0.0
261.0 0.0
140.0 1.0
2065.0 0.0
312.0 0.0
331.0 0.0
164.0 0.0
465.0 1.0
219.0 0.0
Type mismatches are down too. Those count every time an allocation
request asks for one migratetype and gets another. This can still occur
minimally in the patched kernel due to non-stealing fallbacks, but it's
quite rare and follows the pattern of overall fallbacks - once the block
type distribution settles, mismatches cease as well:
VANILLA: PATCHED:
182602.0 268.0
135794.0 20.0
88619.0 19.0
95973.0 0.0
129590.0 0.0
129298.0 0.0
147134.0 0.0
230854.0 0.0
239709.0 0.0
137670.0 0.0
132430.0 0.0
65712.0 0.0
57901.0 0.0
67506.0 0.0
63565.0 4.0
34806.0 0.0
42962.0 0.0
32406.0 0.0
38668.0 0.0
61356.0 0.0
57800.0 0.0
41435.0 0.0
83456.0 0.0
65048.0 0.0
28955.0 0.0
47597.0 0.0
75117.0 0.0
55564.0 0.0
38280.0 0.0
52404.0 0.0
26264.0 0.0
37538.0 0.0
19671.0 0.0
30936.0 0.0
26933.0 0.0
16962.0 0.0
44554.0 0.0
46352.0 0.0
24995.0 0.0
35152.0 0.0
12823.0 0.0
21583.0 0.0
18129.0 0.0
31693.0 0.0
28745.0 0.0
33308.0 0.0
31114.0 0.0
35034.0 0.0
12111.0 0.0
24885.0 0.0
Compaction work is markedly reduced despite much better THP rates.
In the vanilla kernel, reclaim seems to have been driven primarily by
watermark boosting that happens as a result of fallbacks. With those all
but eliminated, watermarks average lower and kswapd does less work. The
uptick in direct reclaim is because THP requests have to fend for
themselves more often - which is intended policy right now. Aggregate
reclaim activity is lowered significantly, though.
This patch (of 10):
The idea behind the cache is to save get_pageblock_migratetype() lookups
during bulk freeing. A microbenchmark suggests this isn't helping,
though. The pcp migratetype can get stale, which means that bulk freeing
has an extra branch to check if the pageblock was isolated while on the
pcp.
While the variance overlaps, the cache write and the branch seem to make
this a net negative. The following test allocates and frees batches of
10,000 pages (~3x the pcp high marks to trigger flushing):
Before:
8,668.48 msec task-clock # 99.735 CPUs utilized ( +- 2.90% )
19 context-switches # 4.341 /sec ( +- 3.24% )
0 cpu-migrations # 0.000 /sec
17,440 page-faults # 3.984 K/sec ( +- 2.90% )
41,758,692,473 cycles # 9.541 GHz ( +- 2.90% )
126,201,294,231 instructions # 5.98 insn per cycle ( +- 2.90% )
25,348,098,335 branches # 5.791 G/sec ( +- 2.90% )
33,436,921 branch-misses # 0.26% of all branches ( +- 2.90% )
0.0869148 +- 0.0000302 seconds time elapsed ( +- 0.03% )
After:
8,444.81 msec task-clock # 99.726 CPUs utilized ( +- 2.90% )
22 context-switches # 5.160 /sec ( +- 3.23% )
0 cpu-migrations # 0.000 /sec
17,443 page-faults # 4.091 K/sec ( +- 2.90% )
40,616,738,355 cycles # 9.527 GHz ( +- 2.90% )
126,383,351,792 instructions # 6.16 insn per cycle ( +- 2.90% )
25,224,985,153 branches # 5.917 G/sec ( +- 2.90% )
32,236,793 branch-misses # 0.25% of all branches ( +- 2.90% )
0.0846799 +- 0.0000412 seconds time elapsed ( +- 0.05% )
A side effect is that this also ensures that pages whose pageblock gets
stolen while on the pcplist end up on the right freelist and we don't
perform potentially type-incompatible buddy merges (or skip merges when we
shouldn't), which is likely beneficial to long-term fragmentation
management, although the effects would be harder to measure. Settle for
simpler and faster code as justification here.
Link: https://lkml.kernel.org/r/20240320180429.678181-1-hannes@cmpxchg.org
Link: https://lkml.kernel.org/r/20240320180429.678181-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: "Huang, Ying" <ying.huang@intel.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The script calculates a mininum required size of hugetlb memories, but
it'll stop working with <1MB huge page sizes, reporting all zeros even if
huge pages are available.
In reality, the calculation doesn't really need to be as complicated
either. Make it simpler and work for KB-level hugepages too.
[peterx@redhat.com: run_vmtests.sh: fix hugetlb mem size calculation]
Link: https://lkml.kernel.org/r/20240403200324.1603493-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20240321215047.678172-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Tested-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Nico Pache <npache@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Make PageMappingFlags() return bool like folio_mapping_flags().
Link: https://lkml.kernel.org/r/20240321030712.80618-1-gehao@kylinos.cn
Signed-off-by: Hao Ge <gehao@kylinos.cn>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Make __PageMovable() return bool like __folio_test_movable().
Link: https://lkml.kernel.org/r/20240321032256.82063-1-gehao@kylinos.cn
Signed-off-by: Hao Ge <gehao@kylinos.cn>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, VA exhaustion is being checked by passing a hint to mmap() and
expecting it to fail.
While populating the lower VA space, mmap() fails because we have
exhausted the space.
Then, in validate_lower_address_hint(), because mmap() fails, we
confirm that we have indeed exhausted the space. There is a circular
logic involved here.
Assume that there is a bug in mmap(), also assume that it exists
independent of whether you pass a hint address or not; that for some
reason it is not able to find a 1GB chunk. My idea is to assert the
exhaustion against some other method.
This patch makes a stricter test by successful
write() calls from /proc/self/maps to a dump file, confirming that a free
chunk is indeed not available.
[dev.jain@arm.com: replace SZ_1GB with MAP_CHUNK_SIZE, tidy-up]
Link: https://lkml.kernel.org/r/20240325042653.867055-1-dev.jain@arm.com
Link: https://lkml.kernel.org/r/20240321103522.516097-1-dev.jain@arm.com
Signed-off-by: Dev Jain <dev.jain@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We no longer have destructors or dtors, merely a page flag (technically a
page type flag, but that's an implementation detail). Remove
__clear_hugetlb_destructor, fix up comments and the occasional variable
name.
Link: https://lkml.kernel.org/r/20240321142448.1645400-10-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
For pages that have a page_type, set the mapcount to 0, which will reduce
the confusion in people reading page dumps ("Why does this page have a
mapcount of -128?"). Now that hugetlbfs is a page_type, read the
entire_mapcount for any large folio; this is fine for all folios as no
user reuses the entire_mapcount field.
For pages which do not have a page type, do not print it to reduce
clutter.
Link: https://lkml.kernel.org/r/20240321142448.1645400-9-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reclaim the Slab page flag by using a spare bit in PageType. We are
perennially short of page flags for various purposes, and now that the
original SLAB allocator has been retired, SLUB does not use the
mapcount/page_type field. This lets us remove a number of special cases
for ignoring mapcount on Slab pages.
[willy@infradead.org: update vmcoreinfo]
Link: https://lkml.kernel.org/r/ZgGV-O8WYQ_83kxp@casper.infradead.org
Link: https://lkml.kernel.org/r/20240321142448.1645400-8-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We can call it only once instead of twice.
Link: https://lkml.kernel.org/r/20240321142448.1645400-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Now that prep_compound_page() initialises folio->_deferred_list,
folio_prep_large_rmappable()'s only purpose is to set the large_rmappable
flag, so inline it into the two callers. Take the opportunity to convert
the large_rmappable definition from PAGEFLAG to FOLIO_FLAG and remove the
existance of PageTestLargeRmappable and friends.
Link: https://lkml.kernel.org/r/20240321142448.1645400-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Various significant MM patches".
These patches all interact in annoying ways which make it tricky to send
them out in any way other than a big batch, even though there's not really
an overarching theme to connect them.
The big effects of this patch series are:
- folio_test_hugetlb() becomes reliable, even when called without a
page reference
- We free up PG_slab, and we could always use more page flags
- We no longer need to check PageSlab before calling page_mapcount()
This patch (of 9):
For compound pages which are at least order-2 (and hence have a
deferred_list), initialise it and then we can check at free that the page
is not part of a deferred list. We recently found this useful to rule out
a source of corruption.
[peterx@redhat.com: always initialise folio->_deferred_list]
Link: https://lkml.kernel.org/r/20240417211836.2742593-2-peterx@redhat.com
Link: https://lkml.kernel.org/r/20240321142448.1645400-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20240321142448.1645400-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The system will immediate fill up stack and crash when both
CONFIG_DEBUG_KMEMLEAK and CONFIG_MEM_ALLOC_PROFILING are enabled. Avoid
allocation tagging of kmemleak caches, otherwise recursive allocation
tracking occurs.
Link: https://lkml.kernel.org/r/20240425205516.work.220-kees@kernel.org
Fixes: 279bb991b4d9 ("mm/slab: add allocation accounting into slab allocation and free paths")
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The /proc/allocinfo file exposes a tremendous about of information about
kernel build details, memory allocations (obviously), and potentially even
image layout (due to ordering). As this is intended to be consumed by
system owners (like /proc/slabinfo), use the same file permissions as
there: 0400.
Link: https://lkml.kernel.org/r/20240425200844.work.184-kees@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Main goal of memory allocation profiling patchset is to provide accounting
that is cheap enough to run in production. To achieve that we inject
counters using codetags at the allocation call sites to account every time
allocation is made. This injection allows us to perform accounting
efficiently because injected counters are immediately available as opposed
to the alternative methods, such as using _RET_IP_, which would require
counter lookup and appropriate locking that makes accounting much more
expensive. This method requires all allocation functions to inject
separate counters at their call sites so that their callers can be
individually accounted. Counter injection is implemented by allocation
hooks which should wrap all allocation functions.
Inlined functions which perform allocations but do not use allocation
hooks are directly charged for the allocations they perform. In most
cases these functions are just specialized allocation wrappers used from
multiple places to allocate objects of a specific type. It would be more
useful to do the accounting at their call sites instead. Instrument these
helpers to do accounting at the call site. Simple inlined allocation
wrappers are converted directly into macros. More complex allocators or
allocators with documentation are converted into _noprof versions and
allocation hooks are added. This allows memory allocation profiling
mechanism to charge allocations to the callers of these functions.
Link: https://lkml.kernel.org/r/20240415020731.1152108-1-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Jan Kara <jack@suse.cz> [jbd2]
Cc: Anna Schumaker <anna@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If slabobj_ext vector allocation for a slab object fails and later on it
succeeds for another object in the same slab, the slabobj_ext for the
original object will be NULL and will be flagged in case when
CONFIG_MEM_ALLOC_PROFILING_DEBUG is enabled.
Mark failed slabobj_ext vector allocations using a new objext_flags flag
stored in the lower bits of slab->obj_exts. When new allocation succeeds
it marks all tag references in the same slabobj_ext vector as empty to
avoid warnings implemented by CONFIG_MEM_ALLOC_PROFILING_DEBUG checks.
Link: https://lkml.kernel.org/r/20240321163705.3067592-36-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Kees Cook <keescook@chromium.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alex Gaynor <alex.gaynor@gmail.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andreas Hindborg <a.hindborg@samsung.com>
Cc: Benno Lossin <benno.lossin@proton.me>
Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Gary Guo <gary@garyguo.net>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
To avoid debug warnings while freeing reserved pages which were not
allocated with usual allocators, mark their codetags as empty before
freeing.
Link: https://lkml.kernel.org/r/20240321163705.3067592-35-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Kees Cook <keescook@chromium.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alex Gaynor <alex.gaynor@gmail.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andreas Hindborg <a.hindborg@samsung.com>
Cc: Benno Lossin <benno.lossin@proton.me>
Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Gary Guo <gary@garyguo.net>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
objext objects are created with __GFP_NO_OBJ_EXT flag and therefore have
no corresponding objext themselves (otherwise we would get an infinite
recursion). When freeing these objects their codetag will be empty and
when CONFIG_MEM_ALLOC_PROFILING_DEBUG is enabled this will lead to false
warnings. Introduce CODETAG_EMPTY special codetag value to mark
allocations which intentionally lack codetag to avoid these warnings.
Set objext codetags to CODETAG_EMPTY before freeing to indicate that
the codetag is expected to be empty.
Link: https://lkml.kernel.org/r/20240321163705.3067592-34-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Kees Cook <keescook@chromium.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alex Gaynor <alex.gaynor@gmail.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andreas Hindborg <a.hindborg@samsung.com>
Cc: Benno Lossin <benno.lossin@proton.me>
Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Gary Guo <gary@garyguo.net>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>