mirror of
https://github.com/torvalds/linux.git
synced 2024-11-14 16:12:02 +00:00
abfb09e2c8
1122307 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Miaohe Lin
|
abfb09e2c8 |
hugetlb_cgroup: hugetlbfs: use helper macro SZ_1{K,M,G}
Use helper macro SZ_1K, SZ_1M and SZ_1G to do the size conversion. Minor readability improvement. Link: https://lkml.kernel.org/r/20220729080106.12752-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Miaohe Lin
|
862f7f6581 |
hugetlb_cgroup: remove unneeded nr_pages > 0 check
Patch series "A few cleanup patches for hugetlb_cgroup", v2. This series contains a few cleaup patches to remove unneeded check, use helper macro, remove unneeded return value and so on. More details can be found in the respective changelogs. This patch (of 5): When code reaches here, nr_pages must be > 0. Remove unneeded nr_pages > 0 check to simplify the code. Link: https://lkml.kernel.org/r/20220729080106.12752-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220729080106.12752-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Tarun Sahu
|
6f83d6c74e |
Kselftests: remove support of libhugetlbfs from kselftests
libhugetlbfs, the user side utitlity to work with hugepages, does not have any active support. There are only 2 selftests which are part of in vm/hmm_test.c that depends on libhugetlbfs. This patch modifies the tests so that they will not require libhugetlb library. [axelrasmussen@google.com: : remove orphaned references to local_config.{h,mk}] Link: https://lkml.kernel.org/r/20220831211526.2743216-1-axelrasmussen@google.com Link: https://lkml.kernel.org/r/20220801070231.13831-1-tsahu@linux.ibm.com Signed-off-by: Tarun Sahu <tsahu@linux.ibm.com> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Tested-by: Zach O'Keefe <zokeefe@google.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Shivaprasad G Bhat <sbhat@linux.ibm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Imran Khan
|
b84e04f1ba |
kfence: add sysfs interface to disable kfence for selected slabs.
By default kfence allocation can happen for any slab object, whose size is up to PAGE_SIZE, as long as that allocation is the first allocation after expiration of kfence sample interval. But in certain debugging scenarios we may be interested in debugging corruptions involving some specific slub objects like dentry or ext4_* etc. In such cases limiting kfence for allocations involving only specific slub objects will increase the probablity of catching the issue since kfence pool will not be consumed by other slab objects. This patch introduces a sysfs interface '/sys/kernel/slab/<name>/skip_kfence' to disable kfence for specific slabs. Having the interface work in this way does not impact current/default behavior of kfence and allows us to use kfence for specific slabs (when needed) as well. The decision to skip/use kfence is taken depending on whether kmem_cache.flags has (newly introduced) SLAB_SKIP_KFENCE flag set or not. Link: https://lkml.kernel.org/r/20220814195353.2540848-1-imran.f.khan@oracle.com Signed-off-by: Imran Khan <imran.f.khan@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Haiyue Wang
|
8315682148 |
mm: migration: fix the FOLL_GET failure on following huge page
Not all huge page APIs support FOLL_GET option, so move_pages() syscall
will fail to get the page node information for some huge pages.
Like x86 on linux 5.19 with 1GB huge page API follow_huge_pud(), it will
return NULL page for FOLL_GET when calling move_pages() syscall with the
NULL 'nodes' parameter, the 'status' parameter has '-2' error in array.
Note: follow_huge_pud() now supports FOLL_GET in linux 6.0.
Link: https://lore.kernel.org/all/20220714042420.1847125-3-naoya.horiguchi@linux.dev
But these huge page APIs don't support FOLL_GET:
1. follow_huge_pud() in arch/s390/mm/hugetlbpage.c
2. follow_huge_addr() in arch/ia64/mm/hugetlbpage.c
It will cause WARN_ON_ONCE for FOLL_GET.
3. follow_huge_pgd() in mm/hugetlb.c
This is an temporary solution to mitigate the side effect of the race
condition fix by calling follow_page() with FOLL_GET set for huge pages.
After supporting follow huge page by FOLL_GET is done, this fix can be
reverted safely.
Link: https://lkml.kernel.org/r/20220823135841.934465-2-haiyue.wang@intel.com
Link: https://lkml.kernel.org/r/20220812084921.409142-1-haiyue.wang@intel.com
Fixes:
|
||
Yang Yang
|
d3629af59f |
mm/vmscan: make the annotations of refaults code at the right place
After patch "mm/workingset: prepare the workingset detection
infrastructure for anon LRU", we can handle the refaults of anonymous
pages too. So the annotations of refaults should cover both of anonymous
pages and file pages.
Link: https://lkml.kernel.org/r/20220813080757.59131-1-yang.yang29@zte.com.cn
Fixes:
|
||
Kaixu Xia
|
4ed9824346 |
mm/damon/core: simplify the parameter passing for region split operation
The parameter 'struct damon_ctx *ctx' is unnecessary in damon region split operation, so we can remove it. Link: https://lkml.kernel.org/r/1660403943-29124-1-git-send-email-kaixuxia@tencent.com Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Yixuan Cao
|
57eb60c04d |
tools/vm/page_owner_sort: fix -f option
The -f option is to filter out the information of blocks whose memory has not been released, I noticed some blocks should not be filtered out. Commit |
||
Kairui Song
|
2fd86a07c9 |
mm/util: reduce stack usage of folio_mapcount
folio_test_hugetlb() will call PageHeadHuge which is a function call, and blocks the compiler from recognizing this redundant load. After rearranging the code, stack usage is dropped from 32 to 24, and the function size is smaller (tested on GCC 12): Before: Stack usage: mm/util.c:845:5:folio_mapcount 32 static Size: 0000000000000ea0 00000000000000c7 T folio_mapcount After: Stack usage: mm/util.c:845:5:folio_mapcount 24 static Size: 0000000000000ea0 00000000000000b0 T folio_mapcount Link: https://lkml.kernel.org/r/20220801173155.92008-1-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Abel Wu
|
e933dc4a07 |
mm/page_alloc: only search higher order when fallback
It seems unnecessary to search pages with order < alloc_order in fallback allocation. This can currently happen with ALLOC_NOFRAGMENT and alloc_order > pageblock_order, so add a test to prevent it. [vbabka@suse.cz: changelog addition] Link: https://lkml.kernel.org/r/20220803025121.47018-1-wuyun.abel@bytedance.com Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Li kunyu
|
97bab178e8 |
page_alloc: remove inactive initialization
The allocation address of the table pointer variable is first performed in the function, no initialization assignment is required, and no invalid pointer will appear. Link: https://lkml.kernel.org/r/20220803064118.3664-1-kunyu@nfschina.com Signed-off-by: Li kunyu <kunyu@nfschina.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Feng Tang
|
d2226ebd54 |
mm/hugetlb: add dedicated func to get 'allowed' nodemask for current process
Muchun Song found that after MPOL_PREFERRED_MANY policy was introduced in commit |
||
Abel Wu
|
12c1dc8e74 |
mm/mempolicy: fix lock contention on mems_allowed
The mems_allowed field can be modified by other tasks, so it isn't safe to
access it with alloc_lock unlocked even in the current process context.
Say there are two tasks: A from cpusetA is performing set_mempolicy(2),
and B is changing cpusetA's cpuset.mems:
A (set_mempolicy) B (echo xx > cpuset.mems)
-------------------------------------------------------
pol = mpol_new();
update_tasks_nodemask(cpusetA) {
foreach t in cpusetA {
cpuset_change_task_nodemask(t) {
mpol_set_nodemask(pol) {
task_lock(t); // t could be A
new = f(A->mems_allowed);
update t->mems_allowed;
pol.create(pol, new);
task_unlock(t);
}
}
}
}
task_lock(A);
A->mempolicy = pol;
task_unlock(A);
In this case A's pol->nodes is computed by old mems_allowed, and could
be inconsistent with A's new mems_allowed.
While it is different when replacing vmas' policy: the pol->nodes is
gone wild only when current_cpuset_is_being_rebound():
A (mbind) B (echo xx > cpuset.mems)
-------------------------------------------------------
pol = mpol_new();
mmap_write_lock(A->mm);
cpuset_being_rebound = cpusetA;
update_tasks_nodemask(cpusetA) {
foreach t in cpusetA {
cpuset_change_task_nodemask(t) {
mpol_set_nodemask(pol) {
task_lock(t); // t could be A
mask = f(A->mems_allowed);
update t->mems_allowed;
pol.create(pol, mask);
task_unlock(t);
}
}
foreach v in A->mm {
if (cpuset_being_rebound == cpusetA)
pol.rebind(pol, cpuset.mems);
v->vma_policy = pol;
}
mmap_write_unlock(A->mm);
mmap_write_lock(t->mm);
mpol_rebind_mm(t->mm);
mmap_write_unlock(t->mm);
}
}
cpuset_being_rebound = NULL;
In this case, the cpuset.mems, which has already done updating, is finally
used for calculating pol->nodes, rather than A->mems_allowed. So it is OK
to call mpol_set_nodemask() with alloc_lock unlocked when doing mbind(2).
Link: https://lkml.kernel.org/r/20220811124157.74888-1-wuyun.abel@bytedance.com
Fixes:
|
||
Charan Teja Kalla
|
9a79443ddc |
mm/cma_debug: show complete cma name in debugfs directories
Currently only 12 characters of the cma name is being used as the debug directories where as the cma name can be of length CMA_MAX_NAME(=64) characters. One side problem with this is having 2 cma's with first common 12 characters would end up in trying to create directories with same name and fails with -EEXIST thus can limit cma debug functionality. The 'cma-' prefix is used initially where cma areas don't have any names and are represented by simple integer values. Since now each cma would be having its own name, drop 'cma-' prefix for the cma debug directories as they are clearly evident that they are for cma debug through creating them in /sys/kernel/debug/cma/ path. Link: https://lkml.kernel.org/r/1660223729-22461-1-git-send-email-quic_charante@quicinc.com Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com> Cc: David Hildenbrand <david@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Pavan Kondeti <quic_pkondeti@quicinc.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Christoph Hellwig
|
cf1e3fe497 |
mm/swap: remove the end_write_func argument to __swap_writepage
The argument is always set to end_swap_bio_write, so remove the argument and mark end_swap_bio_write static. Link: https://lkml.kernel.org/r/20220811141741.660214-1-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Seth Jennings <sjenning@redhat.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Alexey Romanov
|
f24263a5a0 |
zsmalloc: remove unnecessary size_class NULL check
pool->size_class array elements can't be NULL, so this check is not needed. In the whole code, we assign pool->size_class[i] values that are not NULL. Releasing memory for these values occurs in the zs_destroy_pool() function, which also releases and destroys the pool. In addition, in the zs_stats_size_show() and async_free_zspage(), with similar iterations over the array, we don't check it for NULL pointer. Link: https://lkml.kernel.org/r/20220811153755.16102-3-avromanov@sberdevices.ru Signed-off-by: Alexey Romanov <avromanov@sberdevices.ru> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Alexey Romanov
|
050a388b7f |
zsmalloc: zs_object_copy: add clarifying comment
Patch series "tidy up zsmalloc implementation" This patchset remove some unnecessary checks and adds a clarifying comment. While analysing zs_object_copy() function code, I spent some time to understand what the call kunmap_atomic(d_addr) is for. It seems that this point is not trivial and it is worth adding a comment. This patch (of 2): It's not obvious why kunmap_atomic(d_addr) call is needed. [akpm@linux-foundation.org: tweak comment layout] Link: https://lkml.kernel.org/r/20220811153755.16102-1-avromanov@sberdevices.ru Link: https://lkml.kernel.org/r/20220811153755.16102-2-avromanov@sberdevices.ru Signed-off-by: Alexey Romanov <avromanov@sberdevices.ru> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Yang Yang
|
e9c2dbc8bf |
mm/vmscan: define macros for refaults in struct lruvec
The magic number 0 and 1 are used in several places in vmscan.c. Define macros for them to improve code readability. Link: https://lkml.kernel.org/r/20220808005644.1721066-1-yang.yang29@zte.com.cn Signed-off-by: Yang Yang <yang.yang29@zte.com.cn> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Axel Rasmussen
|
4a7e922587 |
selftests: vm: add /dev/userfaultfd test cases to run_vmtests.sh
This new mode was recently added to the userfaultfd selftest. We want to exercise both userfaultfd(2) as well as /dev/userfaultfd, so add both test cases to the script. Link: https://lkml.kernel.org/r/20220808175614.3885028-6-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Shuah Khan <skhan@linuxfoundation.org> Acked-by: Peter Xu <peterx@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dmitry V. Levin <ldv@altlinux.org> Cc: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zhang Yi <yi.zhang@huawei.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Axel Rasmussen
|
816284a3d0 |
userfaultfd: update documentation to describe /dev/userfaultfd
Explain the different ways to create a new userfaultfd, and how access control works for each way. [axelrasmussen@google.com: improve wording in documentation, per Mike] Link: https://lkml.kernel.org/r/20220819205201.658693-5-axelrasmussen@google.com Link: https://lkml.kernel.org/r/20220808175614.3885028-5-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Acked-by: Peter Xu <peterx@redhat.com> Reviewed-by: Shuah Khan <skhan@linuxfoundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dmitry V. Levin <ldv@altlinux.org> Cc: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zhang Yi <yi.zhang@huawei.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Axel Rasmussen
|
77c07f7cca |
userfaultfd: selftests: modify selftest to use /dev/userfaultfd
We clearly want to ensure both userfaultfd(2) and /dev/userfaultfd keep working into the future, so just run the test twice, using each interface. Instead of always testing both userfaultfd(2) and /dev/userfaultfd, let the user choose which to test. As with other test features, change the behavior based on a new command line flag. Introduce the idea of "test mods", which are generic (not specific to a test type) modifications to the behavior of the test. This is sort of borrowed from this RFC patch series [1], but simplified a bit. The benefit is, in "typical" configurations this test is somewhat slow (say, 30sec or something). Testing both clearly doubles it, so it may not always be desirable, as users are likely to use one or the other, but never both, in the "real world". [1]: https://patchwork.kernel.org/project/linux-mm/patch/20201129004548.1619714-14-namit@vmware.com/ [axelrasmussen@google.com: modify selftest to exit with KSFT_SKIP *only* when features are unsupported, per Mike] Link: https://lkml.kernel.org/r/20220819205201.658693-4-axelrasmussen@google.com Link: https://lkml.kernel.org/r/20220808175614.3885028-4-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dmitry V. Levin <ldv@altlinux.org> Cc: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zhang Yi <yi.zhang@huawei.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Axel Rasmussen
|
2d5de004e0 |
userfaultfd: add /dev/userfaultfd for fine grained access control
Historically, it has been shown that intercepting kernel faults with
userfaultfd (thereby forcing the kernel to wait for an arbitrary amount of
time) can be exploited, or at least can make some kinds of exploits
easier. So, in
|
||
Axel Rasmussen
|
a722d70508 |
selftests: vm: add hugetlb_shared userfaultfd test to run_vmtests.sh
Patch series "userfaultfd: add /dev/userfaultfd for fine grained access control", v7. Why not ...? ============ - Why not /proc/[pid]/userfaultfd? Two main points (additional discussion [1]): - /proc/[pid]/* files are all owned by the user/group of the process, and they don't really support chmod/chown. So, without extending procfs it doesn't solve the problem this series is trying to solve. - The main argument *for* this was to support creating UFFDs for remote processes. But, that use case clearly calls for CAP_SYS_PTRACE, so to support this we could just use the UFFD syscall as-is. - Why not use a syscall? Access to syscalls is generally controlled by capabilities. We don't have a capability which is used for userfaultfd access without also granting more / other permissions as well, and adding a new capability was rejected [2]. - It's possible a LSM could be used to control access instead, but I have some concerns. I don't think this approach would be as easy to use, particularly if we were to try to solve this with something heavyweight like SELinux. Maybe we could pursue adding a new LSM specifically for this user case, but it may be too narrow of a case to justify that. [1]: https://patchwork.kernel.org/project/linux-mm/cover/20220719195628.3415852-1-axelrasmussen@google.com/ [2]: https://lore.kernel.org/lkml/686276b9-4530-2045-6bd8-170e5943abe4@schaufler-ca.com/T/ This patch (of 5): This not being included was just a simple oversight. There are certain features (like minor fault support) which are only enabled on shared mappings, so without including hugetlb_shared we actually lose a significant amount of test coverage. Link: https://lkml.kernel.org/r/20220808175614.3885028-1-axelrasmussen@google.com Link: https://lkml.kernel.org/r/20220808175614.3885028-2-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Shuah Khan <skhan@linuxfoundation.org> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dmitry V. Levin <ldv@altlinux.org> Cc: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zhang Yi <yi.zhang@huawei.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Kenneth Lee
|
b2d4c646d5 |
mm/damon/dbgfs: use kmalloc for allocating only one element
Use kmalloc(...) rather than kmalloc_array(1, ...) because the number of elements we are specifying in this case is 1, kmalloc would accomplish the same thing and we can simplify. Link: https://lkml.kernel.org/r/20220808220019.1680469-1-klee33@uw.edu Signed-off-by: Kenneth Lee <klee33@uw.edu> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Shaoqin Huang
|
223ce4910b |
mm/filemap.c: convert page_endio() to use a folio
Replace three calls to compound_head() with one. Link: https://lkml.kernel.org/r/20220809023256.178194-1-shaoqin.huang@intel.com Signed-off-by: Shaoqin Huang <shaoqin.huang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Kefeng Wang
|
2ace36f0f5 |
mm: memory-failure: cleanup try_to_split_thp_page()
Since commit |
||
Rik van Riel
|
f35b5d7d67 |
mm: align larger anonymous mappings on THP boundaries
Align larger anonymous memory mappings on THP boundaries by going through thp_get_unmapped_area if THPs are enabled for the current process. With this patch, larger anonymous mappings are now THP aligned. When a malloc library allocates a 2MB or larger arena, that arena can now be mapped with THPs right from the start, which can result in better TLB hit rates and execution time. Link: https://lkml.kernel.org/r/20220809142457.4751229f@imladris.surriel.com Signed-off-by: Rik van Riel <riel@surriel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Charan Teja Kalla
|
7b5a0b664e |
mm/page_ext: remove unused variable in offline_page_ext
Remove unused variable 'nid' in offline_page_ext(). This is not used since the page_ext code inception. Link: https://lkml.kernel.org/r/1659330397-11817-1-git-send-email-quic_charante@quicinc.com Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Pavan Kondeti <quic_pkondeti@quicinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Zach O'Keefe
|
9d0d946840 |
selftests/vm: add selftest to verify multi THP collapse
Add support to allocate and verify collapse of multiple hugepage-sized regions into multiple THPs. Add "nr" argument to check_huge() that instructs check_huge() to check for exactly "nr_hpages" THPs. This has the added benefit of now being able to check for exactly 0 THPs, and so callsites that previously checked the negation of exactly 1 THP are now more correct. ->collapse struct collapse_context hook has been expanded with a "nr_hpages" argument to collapse "nr_hpages" hugepages. The collapse_full() test has been repurposed to collapse 4 THPs at once. It is expected more tests will want to test multi THP collapse (e.g. file/shmem). This is of particular benefit to madvise collapse context given that it may do many THP collapses during a single syscall. Link: https://lkml.kernel.org/r/20220706235936.2197195-19-zokeefe@google.com Signed-off-by: Zach O'Keefe <zokeefe@google.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: Chris Zankel <chris@zankel.net> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Helge Deller <deller@gmx.de> Cc: Hugh Dickins <hughd@google.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Zach O'Keefe
|
1370a21fe4 |
selftests/vm: add selftest to verify recollapse of THPs
Add selftest specific to madvise collapse context that tests MADV_COLLAPSE is "successful" if a hugepage-aligned/sized region is already pmd-mapped. This test also verifies that MADV_COLLAPSE can collapse memory into THPs even in "madvise" THP mode and the memory isn't marked VM_HUGEPAGE. Link: https://lkml.kernel.org/r/20220706235936.2197195-18-zokeefe@google.com Signed-off-by: Zach O'Keefe <zokeefe@google.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: Chris Zankel <chris@zankel.net> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Helge Deller <deller@gmx.de> Cc: Hugh Dickins <hughd@google.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Zach O'Keefe
|
9330694de5 |
selftests/vm: add MADV_COLLAPSE collapse context to selftests
Add madvise collapse context to hugepage collapse selftests. This context is tested with /sys/kernel/mm/transparent_hugepage/enabled set to "never" in order to avoid unwanted interaction with khugepaged during testing. Also, refactor updates to sysfs THP settings using a stack so that the THP settings from nested callers can be restored. Link: https://lkml.kernel.org/r/20220706235936.2197195-17-zokeefe@google.com Signed-off-by: Zach O'Keefe <zokeefe@google.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: Chris Zankel <chris@zankel.net> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Helge Deller <deller@gmx.de> Cc: Hugh Dickins <hughd@google.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Zach O'Keefe
|
be6667b0db |
selftests/vm: dedup hugepage allocation logic
The code p = alloc_mapping(); printf("Allocate huge page..."); madvise(p, hpage_pmd_size, MADV_HUGEPAGE); fill_memory(p, 0, hpage_pmd_size); if (check_huge(p)) success("OK"); else fail("Fail"); Is repeated many times in different tests. Add a helper, alloc_hpage() to handle this. Link: https://lkml.kernel.org/r/20220706235936.2197195-16-zokeefe@google.com Signed-off-by: Zach O'Keefe <zokeefe@google.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: Chris Zankel <chris@zankel.net> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Helge Deller <deller@gmx.de> Cc: Hugh Dickins <hughd@google.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Zach O'Keefe
|
61c2c6764e |
selftests/vm: modularize collapse selftests
Modularize the collapse action of khugepaged collapse selftests by introducing a struct collapse_context which specifies how to collapse a given memory range and the expected semantics of the collapse. This can be reused later to test other collapse contexts. Additionally, all tests have logic that checks if a collapse occurred via reading /proc/self/smaps, and report if this is different than expected. Move this logic into the per-context ->collapse() hook instead of repeating it in every test. Link: https://lkml.kernel.org/r/20220706235936.2197195-15-zokeefe@google.com Signed-off-by: Zach O'Keefe <zokeefe@google.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: Chris Zankel <chris@zankel.net> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Helge Deller <deller@gmx.de> Cc: Hugh Dickins <hughd@google.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Zach O'Keefe
|
876b4a1896 |
mm/madvise: add MADV_COLLAPSE to process_madvise()
Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has CAP_SYS_ADMIN or is requesting collapse of it's own memory. This is useful for the development of userspace agents that seek to optimize THP utilization system-wide by using userspace signals to prioritize what memory is most deserving of being THP-backed. [zokeefe@google.com: remove CAP_SYS_ADMIN requirement for process_madvise(MADV_COLLAPSE)] Link: https://lkml.kernel.org/r/20220801210946.3069083-1-zokeefe@google.com Link: https://lkml.kernel.org/r/20220706235936.2197195-13-zokeefe@google.com Signed-off-by: Zach O'Keefe <zokeefe@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: Chris Zankel <chris@zankel.net> Cc: David Hildenbrand <david@redhat.com> Cc: Helge Deller <deller@gmx.de> Cc: Hugh Dickins <hughd@google.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Zach O'Keefe
|
7d2c4385c3 |
mm/khugepaged: rename prefix of shared collapse functions
The following functions are shared between khugepaged and madvise collapse contexts. Replace the "khugepaged_" prefix with generic "hpage_collapse_" prefix in such cases: khugepaged_test_exit() -> hpage_collapse_test_exit() khugepaged_scan_abort() -> hpage_collapse_scan_abort() khugepaged_scan_pmd() -> hpage_collapse_scan_pmd() khugepaged_find_target_node() -> hpage_collapse_find_target_node() khugepaged_alloc_page() -> hpage_collapse_alloc_page() The kerenel ABI (e.g. huge_memory:mm_khugepaged_scan_pmd tracepoint) is unaltered. Link: https://lkml.kernel.org/r/20220706235936.2197195-11-zokeefe@google.com Signed-off-by: Zach O'Keefe <zokeefe@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: Chris Zankel <chris@zankel.net> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Helge Deller <deller@gmx.de> Cc: Hugh Dickins <hughd@google.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Zach O'Keefe
|
7d8faaf155 |
mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
This idea was introduced by David Rientjes[1]. Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request a synchronous collapse of memory at their own expense. The benefits of this approach are: * CPU is charged to the process that wants to spend the cycles for the THP * Avoid unpredictable timing of khugepaged collapse Semantics This call is independent of the system-wide THP sysfs settings, but will fail for memory marked VM_NOHUGEPAGE. If the ranges provided span multiple VMAs, the semantics of the collapse over each VMA is independent from the others. This implies a hugepage cannot cross a VMA boundary. If collapse of a given hugepage-aligned/sized region fails, the operation may continue to attempt collapsing the remainder of memory specified. The memory ranges provided must be page-aligned, but are not required to be hugepage-aligned. If the memory ranges are not hugepage-aligned, the start/end of the range will be clamped to the first/last hugepage-aligned address covered by said range. The memory ranges must span at least one hugepage-sized region. All non-resident pages covered by the range will first be swapped/faulted-in, before being internally copied onto a freshly allocated hugepage. Unmapped pages will have their data directly initialized to 0 in the new hugepage. However, for every eligible hugepage aligned/sized region to-be collapsed, at least one page must currently be backed by memory (a PMD covering the address range must already exist). Allocation for the new hugepage may enter direct reclaim and/or compaction, regardless of VMA flags. When the system has multiple NUMA nodes, the hugepage will be allocated from the node providing the most native pages. This operation operates on the current state of the specified process and makes no persistent changes or guarantees on how pages will be mapped, constructed, or faulted in the future Return Value If all hugepage-sized/aligned regions covered by the provided range were either successfully collapsed, or were already PMD-mapped THPs, this operation will be deemed successful. On success, process_madvise(2) returns the number of bytes advised, and madvise(2) returns 0. Else, -1 is returned and errno is set to indicate the error for the most-recently attempted hugepage collapse. Note that many failures might have occurred, since the operation may continue to collapse in the event a single hugepage-sized/aligned region fails. ENOMEM Memory allocation failed or VMA not found EBUSY Memcg charging failed EAGAIN Required resource temporarily unavailable. Try again might succeed. EINVAL Other error: No PMD found, subpage doesn't have Present bit set, "Special" page no backed by struct page, VMA incorrectly sized, address not page-aligned, ... Most notable here is ENOMEM and EBUSY (new to madvise) which are intended to provide the caller with actionable feedback so they may take an appropriate fallback measure. Use Cases An immediate user of this new functionality are malloc() implementations that manage memory in hugepage-sized chunks, but sometimes subrelease memory back to the system in native-sized chunks via MADV_DONTNEED; zapping the pmd. Later, when the memory is hot, the implementation could madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain hugepage coverage and dTLB performance. TCMalloc is such an implementation that could benefit from this[2]. Only privately-mapped anon memory is supported for now, but additional support for file, shmem, and HugeTLB high-granularity mappings[2] is expected. File and tmpfs/shmem support would permit: * Backing executable text by THPs. Current support provided by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system which might impair services from serving at their full rated load after (re)starting. Tricks like mremap(2)'ing text onto anonymous memory to immediately realize iTLB performance prevents page sharing and demand paging, both of which increase steady state memory footprint. With MADV_COLLAPSE, we get the best of both worlds: Peak upfront performance and lower RAM footprints. * Backing guest memory by hugapages after the memory contents have been migrated in native-page-sized chunks to a new host, in a userfaultfd-based live-migration stack. [1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/ [2] https://github.com/google/tcmalloc/tree/master/tcmalloc [jrdr.linux@gmail.com: avoid possible memory leak in failure path] Link: https://lkml.kernel.org/r/20220713024109.62810-1-jrdr.linux@gmail.com [zokeefe@google.com add missing kfree() to madvise_collapse()] Link: https://lore.kernel.org/linux-mm/20220713024109.62810-1-jrdr.linux@gmail.com/ Link: https://lkml.kernel.org/r/20220713161851.1879439-1-zokeefe@google.com [zokeefe@google.com: delay computation of hpage boundaries until use]] Link: https://lkml.kernel.org/r/20220720140603.1958773-4-zokeefe@google.com Link: https://lkml.kernel.org/r/20220706235936.2197195-10-zokeefe@google.com Signed-off-by: Zach O'Keefe <zokeefe@google.com> Signed-off-by: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com> Suggested-by: David Rientjes <rientjes@google.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: Chris Zankel <chris@zankel.net> Cc: David Hildenbrand <david@redhat.com> Cc: Helge Deller <deller@gmx.de> Cc: Hugh Dickins <hughd@google.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Zach O'Keefe
|
5072280442 |
mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage
When scanning an anon pmd to see if it's eligible for collapse, return SCAN_PMD_MAPPED if the pmd already maps a hugepage. Note that SCAN_PMD_MAPPED is different from SCAN_PAGE_COMPOUND used in the file-collapse path, since the latter might identify pte-mapped compound pages. This is required by MADV_COLLAPSE which necessarily needs to know what hugepage-aligned/sized regions are already pmd-mapped. In order to determine if a pmd already maps a hugepage, refactor mm_find_pmd(): Return mm_find_pmd() to it's pre-commit |
||
Zach O'Keefe
|
a7f4e6e4c4 |
mm/thp: add flag to enforce sysfs THP in hugepage_vma_check()
MADV_COLLAPSE is not coupled to the kernel-oriented sysfs THP settings[1]. hugepage_vma_check() is the authority on determining if a VMA is eligible for THP allocation/collapse, and currently enforces the sysfs THP settings. Add a flag to disable these checks. For now, only apply this arg to anon and file, which use /sys/kernel/transparent_hugepage/enabled. We can expand this to shmem, which uses /sys/kernel/transparent_hugepage/shmem_enabled, later. Use this flag in collapse_pte_mapped_thp() where previously the VMA flags passed to hugepage_vma_check() were OR'd with VM_HUGEPAGE to elide the VM_HUGEPAGE check in "madvise" THP mode. Prior to "mm: khugepaged: check THP flag in hugepage_vma_check()", this check also didn't check "never" THP mode. As such, this restores the previous behavior of collapse_pte_mapped_thp() where sysfs THP settings are ignored. See comment in code for justification why this is OK. [1] https://lore.kernel.org/linux-mm/CAAa6QmQxay1_=Pmt8oCX2-Va18t44FV-Vs-WsQt_6+qBks4nZA@mail.gmail.com/ Link: https://lkml.kernel.org/r/20220706235936.2197195-8-zokeefe@google.com Signed-off-by: Zach O'Keefe <zokeefe@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: Chris Zankel <chris@zankel.net> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Helge Deller <deller@gmx.de> Cc: Hugh Dickins <hughd@google.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Zach O'Keefe
|
d8ea7cc854 |
mm/khugepaged: add flag to predicate khugepaged-only behavior
Add .is_khugepaged flag to struct collapse_control so khugepaged-specific behavior can be elided by MADV_COLLAPSE context. Start by protecting khugepaged-specific heuristics by this flag. In MADV_COLLAPSE, the user presumably has reason to believe the collapse will be beneficial and khugepaged heuristics shouldn't prevent the user from doing so: 1) sysfs-controlled knobs khugepaged_max_ptes_[none|swap|shared] 2) requirement that some pages in region being collapsed be young or referenced [zokeefe@google.com: consistently order cc->is_khugepaged and pte_* checks] Link: https://lkml.kernel.org/r/20220720140603.1958773-3-zokeefe@google.com Link: https://lore.kernel.org/linux-mm/Ys2qJm6FaOQcxkha@google.com/ Link: https://lkml.kernel.org/r/20220706235936.2197195-7-zokeefe@google.com Signed-off-by: Zach O'Keefe <zokeefe@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: Chris Zankel <chris@zankel.net> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Helge Deller <deller@gmx.de> Cc: Hugh Dickins <hughd@google.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Zach O'Keefe
|
50ad2f24b3 |
mm/khugepaged: propagate enum scan_result codes back to callers
Propagate enum scan_result codes back through return values of functions downstream of khugepaged_scan_file() and khugepaged_scan_pmd() to inform callers if the operation was successful, and if not, why. Since khugepaged_scan_pmd()'s return value already has a specific meaning (whether mmap_lock was unlocked or not), add a bool* argument to khugepaged_scan_pmd() to retrieve this information. Change khugepaged to take action based on the return values of khugepaged_scan_file() and khugepaged_scan_pmd() instead of acting deep within the collapsing functions themselves. hugepage_vma_revalidate() now returns SCAN_SUCCEED on success to be more consistent with enum scan_result propagation. Remove dependency on error pointers to communicate to khugepaged that allocation failed and it should sleep; instead just use the result of the scan (SCAN_ALLOC_HUGE_PAGE_FAIL if allocation fails). Link: https://lkml.kernel.org/r/20220706235936.2197195-6-zokeefe@google.com Signed-off-by: Zach O'Keefe <zokeefe@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: Chris Zankel <chris@zankel.net> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Helge Deller <deller@gmx.de> Cc: Hugh Dickins <hughd@google.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Zach O'Keefe
|
9710a78ab2 |
mm/khugepaged: dedup and simplify hugepage alloc and charging
The following code is duplicated in collapse_huge_page() and collapse_file(): gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE; new_page = khugepaged_alloc_page(hpage, gfp, node); if (!new_page) { result = SCAN_ALLOC_HUGE_PAGE_FAIL; goto out; } if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) { result = SCAN_CGROUP_CHARGE_FAIL; goto out; } count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC); Also, "node" is passed as an argument to both collapse_huge_page() and collapse_file() and obtained the same way, via khugepaged_find_target_node(). Move all this into a new helper, alloc_charge_hpage(), and remove the duplicate code from collapse_huge_page() and collapse_file(). Also, simplify khugepaged_alloc_page() by returning a bool indicating allocation success instead of a copy of the allocated struct page *. Link: https://lkml.kernel.org/r/20220706235936.2197195-5-zokeefe@google.com Signed-off-by: Zach O'Keefe <zokeefe@google.com> Suggested-by: Peter Xu <peterx@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: Chris Zankel <chris@zankel.net> Cc: David Hildenbrand <david@redhat.com> Cc: Helge Deller <deller@gmx.de> Cc: Hugh Dickins <hughd@google.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Zach O'Keefe
|
34d6b470ab |
mm/khugepaged: add struct collapse_control
Modularize hugepage collapse by introducing struct collapse_control. This structure serves to describe the properties of the requested collapse, as well as serve as a local scratch pad to use during the collapse itself. Start by moving global per-node khugepaged statistics into this new structure. Note that this structure is still statically allocated since CONFIG_NODES_SHIFT might be arbitrary large, and stack-allocating a MAX_NUMNODES-sized array could cause -Wframe-large-than= errors. [zokeefe@google.com: use minimal bits to store num page < HPAGE_PMD_NR] Link: https://lkml.kernel.org/r/20220720140603.1958773-2-zokeefe@google.com Link: https://lore.kernel.org/linux-mm/Ys2CeIm%2FQmQwWh9a@google.com/ [sfr@canb.auug.org.au: fix build] Link: https://lkml.kernel.org/r/20220721195508.15f1e07a@canb.auug.org.au [zokeefe@google.com: fix struct collapse_control load_node definition] Link: https://lore.kernel.org/linux-mm/202209021349.F73i5d6X-lkp@intel.com/ Link: https://lkml.kernel.org/r/20220903021221.1130021-1-zokeefe@google.com Link: https://lkml.kernel.org/r/20220706235936.2197195-4-zokeefe@google.com Signed-off-by: Zach O'Keefe <zokeefe@google.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: Chris Zankel <chris@zankel.net> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Helge Deller <deller@gmx.de> Cc: Hugh Dickins <hughd@google.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Yang Shi
|
c6a7f445a2 |
mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA
Patch series "mm: userspace hugepage collapse", v7.
Introduction
--------------------------------
This series provides a mechanism for userspace to induce a collapse of
eligible ranges of memory into transparent hugepages in process context,
thus permitting users to more tightly control their own hugepage
utilization policy at their own expense.
This idea was introduced by David Rientjes[5].
Interface
--------------------------------
The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
leverages the new process_madvise(2) call.
process_madvise(2)
Performs a synchronous collapse of the native pages
mapped by the list of iovecs into transparent hugepages.
This operation is independent of the system THP sysfs settings,
but attempts to collapse VMAs marked VM_NOHUGEPAGE will still fail.
THP allocation may enter direct reclaim and/or compaction.
When a range spans multiple VMAs, the semantics of the collapse
over of each VMA is independent from the others.
Caller must have CAP_SYS_ADMIN if not acting on self.
Return value follows existing process_madvise(2) conventions. A
“success” indicates that all hugepage-sized/aligned regions
covered by the provided range were either successfully
collapsed, or were already pmd-mapped THPs.
madvise(2)
Equivalent to process_madvise(2) on self, with 0 returned on
“success”.
Current Use-Cases
--------------------------------
(1) Immediately back executable text by THPs. Current support provided
by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
system which might impair services from serving at their full rated
load after (re)starting. Tricks like mremap(2)'ing text onto
anonymous memory to immediately realize iTLB performance prevents
page sharing and demand paging, both of which increase steady state
memory footprint. With MADV_COLLAPSE, we get the best of both
worlds: Peak upfront performance and lower RAM footprints. Note
that subsequent support for file-backed memory is required here.
(2) malloc() implementations that manage memory in hugepage-sized
chunks, but sometimes subrelease memory back to the system in
native-sized chunks via MADV_DONTNEED; zapping the pmd. Later,
when the memory is hot, the implementation could
madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain
hugepage coverage and dTLB performance. TCMalloc is such an
implementation that could benefit from this[6]. A prior study of
Google internal workloads during evaluation of Temeraire, a
hugepage-aware enhancement to TCMalloc, showed that nearly 20% of
all cpu cycles were spent in dTLB stalls, and that increasing
hugepage coverage by even small amount can help with that[7].
(3) userfaultfd-based live migration of virtual machines satisfy UFFD
faults by fetching native-sized pages over the network (to avoid
latency of transferring an entire hugepage). However, after guest
memory has been fully copied to the new host, MADV_COLLAPSE can
be used to immediately increase guest performance. Note that
subsequent support for file/shmem-backed memory is required here.
(4) HugeTLB high-granularity mapping allows HugeTLB a HugeTLB page to
be mapped at different levels in the page tables[8]. As it's not
"transparent" like THP, HugeTLB high-granularity mappings require
an explicit user API. It is intended that MADV_COLLAPSE be co-opted
for this use case[9]. Note that subsequent support for HugeTLB
memory is required here.
Future work
--------------------------------
Only private anonymous memory is supported by this series. File and
shmem memory support will be added later.
One possible user of this functionality is a userspace agent that
attempts to optimize THP utilization system-wide by allocating THPs
based on, for example, task priority, task performance requirements, or
heatmaps. For the latter, one idea that has already surfaced is using
DAMON to identify hot regions, and driving THP collapse through a new
DAMOS_COLLAPSE scheme[10].
This patch (of 17):
The khugepaged has optimization to reduce huge page allocation calls for
!CONFIG_NUMA by carrying the allocated but failed to collapse huge page to
the next loop. CONFIG_NUMA doesn't do so since the next loop may try to
collapse huge page from a different node, so it doesn't make too much
sense to carry it.
But when NUMA=n, the huge page is allocated by khugepaged_prealloc_page()
before scanning the address space, so it means huge page may be allocated
even though there is no suitable range for collapsing. Then the page
would be just freed if khugepaged already made enough progress. This
could make NUMA=n run have 5 times as much thp_collapse_alloc as NUMA=y
run. This problem actually makes things worse due to the way more
pointless THP allocations and makes the optimization pointless.
This could be fixed by carrying the huge page across scans, but it will
complicate the code further and the huge page may be carried indefinitely.
But if we take one step back, the optimization itself seems not worth
keeping nowadays since:
* Not too many users build NUMA=n kernel nowadays even though the kernel is
actually running on a non-NUMA machine. Some small devices may run NUMA=n
kernel, but I don't think they actually use THP.
* Since commit
|
||
Linus Torvalds
|
b90cb10531 | Linux 6.0-rc3 | ||
Linus Torvalds
|
b467192ec7 |
Seventeen hotfixes. Mostly memory management things. Ten patches are
cc:stable, addressing pre-6.0 issues. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYwvgrAAKCRDdBJ7gKXxA jlweAQC9dzE08Elxl4F7Uvxe+62JWVeflBRrT7sJ6jU1Gu3QcQEAhhI1Xit3/MGq pRytDBObGADxlA67c9eNq6J5pCT/7gE= =pD67 -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2022-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more hotfixes from Andrew Morton: "Seventeen hotfixes. Mostly memory management things. Ten patches are cc:stable, addressing pre-6.0 issues" * tag 'mm-hotfixes-stable-2022-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: .mailmap: update Luca Ceresoli's e-mail address mm/mprotect: only reference swap pfn page if type match squashfs: don't call kmalloc in decompressors mm/damon/dbgfs: avoid duplicate context directory creation mailmap: update email address for Colin King asm-generic: sections: refactor memory_intersects bootmem: remove the vmemmap pages from kmemleak in put_page_bootmem ocfs2: fix freeing uninitialized resource on ocfs2_dlm_shutdown Revert "memcg: cleanup racy sum avoidance code" mm/zsmalloc: do not attempt to free IS_ERR handle binder_alloc: add missing mmap_lock calls when using the VMA mm: re-allow pinning of zero pfns (again) vmcoreinfo: add kallsyms_num_syms symbol mailmap: update Guilherme G. Piccoli's email addresses writeback: avoid use-after-free after removing device shmem: update folio if shmem_replace_page() updates the page mm/hugetlb: avoid corrupting page->mapping in hugetlb_mcopy_atomic_pte |
||
Linus Torvalds
|
373eff576e |
bitmap fixes for v6.0-rc3
Hi Linus, Please pull (hopefully) the last portion of fixes from Sander for his UP rework series. The original series came from -mm tree, and it was not the latest version, that's why we need follow-ups. It fixes only a test introduced by that series. The test fails under certain configs. From Sander: This series fixes the reported issues, and implements the suggested improvements, for the version of the cpumask tests [1] that was merged with commit |
||
Luca Ceresoli
|
0ebafe2ea8 |
.mailmap: update Luca Ceresoli's e-mail address
My Bootlin address is preferred from now on. Link: https://lkml.kernel.org/r/20220826130515.3011951-1-luca.ceresoli@bootlin.com Signed-off-by: Luca Ceresoli <luca.ceresoli@bootlin.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Atish Patra <atishp@atishpatra.org> Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl> Cc: Thomas Petazzoni <thomas.petazzoni@bootlin.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
3d2f78f08c |
mm/mprotect: only reference swap pfn page if type match
Yu Zhao reported a bug after the commit "mm/swap: Add swp_offset_pfn() to
fetch PFN from swap entry" added a check in swp_offset_pfn() for swap type [1]:
kernel BUG at include/linux/swapops.h:117!
CPU: 46 PID: 5245 Comm: EventManager_De Tainted: G S O L 6.0.0-dbg-DEV #2
RIP: 0010:pfn_swap_entry_to_page+0x72/0xf0
Code: c6 48 8b 36 48 83 fe ff 74 53 48 01 d1 48 83 c1 08 48 8b 09 f6
c1 01 75 7b 66 90 48 89 c1 48 8b 09 f6 c1 01 74 74 5d c3 eb 9e <0f> 0b
48 ba ff ff ff ff 03 00 00 00 eb ae a9 ff 0f 00 00 75 13 48
RSP: 0018:ffffa59e73fabb80 EFLAGS: 00010282
RAX: 00000000ffffffe8 RBX: 0c00000000000000 RCX: ffffcd5440000000
RDX: 1ffffffffff7a80a RSI: 0000000000000000 RDI: 0c0000000000042b
RBP: ffffa59e73fabb80 R08: ffff9965ca6e8bb8 R09: 0000000000000000
R10: ffffffffa5a2f62d R11: 0000030b372e9fff R12: ffff997b79db5738
R13: 000000000000042b R14: 0c0000000000042b R15: 1ffffffffff7a80a
FS: 00007f549d1bb700(0000) GS:ffff99d3cf680000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000440d035b3180 CR3: 0000002243176004 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
change_pte_range+0x36e/0x880
change_p4d_range+0x2e8/0x670
change_protection_range+0x14e/0x2c0
mprotect_fixup+0x1ee/0x330
do_mprotect_pkey+0x34c/0x440
__x64_sys_mprotect+0x1d/0x30
It triggers because pfn_swap_entry_to_page() could be called upon e.g. a
genuine swap entry.
Fix it by only calling it when it's a write migration entry where the page*
is used.
[1] https://lore.kernel.org/lkml/CAOUHufaVC2Za-p8m0aiHw6YkheDcrO-C3wRGixwDS32VTS+k1w@mail.gmail.com/
Link: https://lkml.kernel.org/r/20220823221138.45602-1-peterx@redhat.com
Fixes:
|
||
Phillip Lougher
|
1f13dff09f |
squashfs: don't call kmalloc in decompressors
The decompressors may be called while in an atomic section. So move the kmalloc() out of this path, and into the "page actor" init function. This fixes a regression introduced by commit |
||
Badari Pulavarty
|
d26f607036 |
mm/damon/dbgfs: avoid duplicate context directory creation
When user tries to create a DAMON context via the DAMON debugfs interface
with a name of an already existing context, the context directory creation
fails but a new context is created and added in the internal data
structure, due to absence of the directory creation success check. As a
result, memory could leak and DAMON cannot be turned on. An example test
case is as below:
# cd /sys/kernel/debug/damon/
# echo "off" > monitor_on
# echo paddr > target_ids
# echo "abc" > mk_context
# echo "abc" > mk_context
# echo $$ > abc/target_ids
# echo "on" > monitor_on <<< fails
Return value of 'debugfs_create_dir()' is expected to be ignored in
general, but this is an exceptional case as DAMON feature is depending
on the debugfs functionality and it has the potential duplicate name
issue. This commit therefore fixes the issue by checking the directory
creation failure and immediately return the error in the case.
Link: https://lkml.kernel.org/r/20220821180853.2400-1-sj@kernel.org
Fixes:
|