No functional change.
Link: https://lkml.kernel.org/r/20221024062012.1520887-4-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
These interfaces will be used by drivers/base/memory.c by later patch, so
as a preparatory work move them to more common header file visible to the
file.
Link: https://lkml.kernel.org/r/20221024062012.1520887-3-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm, hwpoison: improve handling workload related to hugetlb
and memory_hotplug", v7.
This patchset tries to solve the issue among memory_hotplug, hugetlb and hwpoison.
In this patchset, memory hotplug handles hwpoison pages like below:
- hwpoison pages should not prevent memory hotremove,
- memory block with hwpoison pages should not be onlined.
This patch (of 4):
HWPoisoned page is not supposed to be accessed once marked, but currently
such accesses can happen during memory hotremove because
do_migrate_range() can be called before dissolve_free_huge_pages() is
called.
Clear HPageMigratable for hwpoisoned hugepages to prevent them from being
migrated. This should be done in hugetlb_lock to avoid race against
isolate_hugetlb().
get_hwpoison_huge_page() needs to have a flag to show it's called from
unpoison to take refcount of hwpoisoned hugepages, so add it.
[naoya.horiguchi@linux.dev: remove TestClearHPageMigratable and reduce to test and clear separately]
Link: https://lkml.kernel.org/r/20221025053559.GA2104800@ik1-406-35019.vs.sakura.ne.jp
Link: https://lkml.kernel.org/r/20221024062012.1520887-1-naoya.horiguchi@linux.dev
Link: https://lkml.kernel.org/r/20221024062012.1520887-2-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When creating a virtual machine, we will use memfd_create() to get a file
descriptor which can be used to create share memory mappings using the
mmap function, meanwhile the mmap() will set the MAP_POPULATE flag to
allocate physical pages for the virtual machine.
When allocating physical pages for the guest, the host can fallback to
allocate some CMA pages for the guest when over half of the zone's free
memory is in the CMA area.
In guest os, when the application wants to do some data transaction with
DMA, our QEMU will call VFIO_IOMMU_MAP_DMA ioctl to do longterm-pin and
create IOMMU mappings for the DMA pages. However, when calling
VFIO_IOMMU_MAP_DMA ioctl to pin the physical pages, we found it will be
failed to longterm-pin sometimes.
After some invetigation, we found the pages used to do DMA mapping can
contain some CMA pages, and these CMA pages will cause a possible failure
of the longterm-pin, due to failed to migrate the CMA pages. The reason
of migration failure may be temporary reference count or memory allocation
failure. So that will cause the VFIO_IOMMU_MAP_DMA ioctl returns error,
which makes the application failed to start.
I observed one migration failure case (which is not easy to reproduce) is
that, the 'thp_migration_fail' count is 1 and the 'thp_split_page_failed'
count is also 1.
That means when migrating a THP which is in CMA area, but can not allocate
a new THP due to memory fragmentation, so it will split the THP. However
THP split is also failed, probably the reason is temporary reference count
of this THP. And the temporary reference count can be caused by dropping
page caches (I observed the drop caches operation in the system), but we
can not drop the shmem page caches due to they are already dirty at that
time.
Especially for THP split failure, which is caused by temporary reference
count, we can try again to mitigate the failure of migration in this case
according to previous discussion [1].
[1] https://lore.kernel.org/all/470dc638-a300-f261-94b4-e27250e42f96@redhat.com/
Link: https://lkml.kernel.org/r/6784730480a1df82e8f4cba1ed088e4ac767994b.1666599848.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
With " mm/uffd: Fix vma check on userfault for wp" to fix the
registration, we'll be safe to remove the macro hacks now.
Link: https://lkml.kernel.org/r/20221024193336.1233616-3-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pass pfn/flags to put_ref_page(), then check MF_COUNT_INCREASED and drop
refcount to make the code look cleaner.
Link: https://lkml.kernel.org/r/20221021084611.53765-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
A trivial cleanup to move clearing of RestoreReserve into adding anon rmap
of private hugetlb mappings. It matches with the shared mappings where we
only clear the bit when adding into page cache, rather than spreading it
around the code paths.
Link: https://lkml.kernel.org/r/20221020193832.776173-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This is no longer used; all callers have been converted to use folios
instead. Somehow this manages to save 11 bytes of text.
Link: https://lkml.kernel.org/r/20221019183332.2802139-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Return the containing folio instead of the precise page. One of the
callers wants the folio and the other can do the folio->page conversion
itself. Nets 442 bytes of text size reduction, 478 bytes removed and 36
bytes added.
Link: https://lkml.kernel.org/r/20221019183332.2802139-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
It is for debug purposes and is called when a vmap area gets freed. This
event gives some indication about:
- a start address of released area;
- a current number of outstanding pages;
- a maximum number of allowed outstanding pages.
Link: https://lkml.kernel.org/r/20221018181053.434508-7-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This is for debug purposes and is called when all outstanding areas are
removed back to the vmap space. It gives some extra information about:
- a start:end range where set of vmap ares were freed;
- a number of purged areas which were backed off.
[urezki@gmail.com: simplify return boolean expression]
Link: https://lkml.kernel.org/r/20221020125247.5053-1-urezki@gmail.com
Link: https://lkml.kernel.org/r/20221018181053.434508-6-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This is for debug purpose and is called when an allocation attempt occurs.
This event gives some information about:
- start address of allocated area;
- size that is requested;
- alignment that is required;
- vstart/vend restriction;
- if an allocation fails.
Link: https://lkml.kernel.org/r/20221018181053.434508-5-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The priority of hotplug memory callback is defined in a different file.
And there are some callers using numbers directly. Collect them together
into include/linux/memory.h for easy reading. This allows us to sort
their priorities more intuitively without additional comments.
Link: https://lkml.kernel.org/r/20220923033347.3935160-9-liushixin2@huawei.com
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Waiman Long <longman@redhat.com>
Cc: zefan li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit 76ae847497 ("Documentation: raise minimum supported version of
GCC to 5.1") updated the minimum gcc version to 5.1. So the problem
mentioned in f02c696800 ("include/linux/memory.h: implement
register_hotmemory_notifier()") no longer exist. So we can now switch to
use hotplug_memory_notifier() directly rather than
register_hotmemory_notifier().
Link: https://lkml.kernel.org/r/20220923033347.3935160-6-liushixin2@huawei.com
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Waiman Long <longman@redhat.com>
Cc: zefan li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit 76ae847497 ("Documentation: raise minimum supported version of
GCC to 5.1") updated the minimum gcc version to 5.1. So the problem
mentioned in f02c696800 ("include/linux/memory.h: implement
register_hotmemory_notifier()") no longer exist. So we can now switch to
use hotplug_memory_notifier() directly rather than
register_hotmemory_notifier().
Link: https://lkml.kernel.org/r/20220923033347.3935160-5-liushixin2@huawei.com
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Waiman Long <longman@redhat.com>
Cc: zefan li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit 76ae847497 ("Documentation: raise minimum supported version of
GCC to 5.1") updated the minimum gcc version to 5.1. So the problem
mentioned in f02c696800 ("include/linux/memory.h: implement
register_hotmemory_notifier()") no longer exist. So we can now switch to
use hotplug_memory_notifier() directly rather than
register_hotmemory_notifier().
Link: https://lkml.kernel.org/r/20220923033347.3935160-4-liushixin2@huawei.com
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Waiman Long <longman@redhat.com>
Cc: zefan li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Since commit 2f031c6f04 ("mm/rmap: Convert rmap_walk() to take a
folio"), page_not_mapped() takes folio as parameter, rename it to be
consistent.
Link: https://lkml.kernel.org/r/20220927063826.159590-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We want an easy way to take a R/O or R/W longterm pin on a range and be
able to observe the content of the pinned pages, so we can properly test
how longterm puns interact with our COW logic.
[david@redhat.com: silence a warning on 32-bit]
Link: https://lkml.kernel.org/r/74adbb51-6e33-f636-8a9c-2ad87bd9007e@redhat.com
[yang.lee@linux.alibaba.com: ./mm/gup_test.c:281:2-3: Unneeded semicolon]
Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2455
Link: https://lkml.kernel.org/r/20221020024035.113619-1-yang.lee@linux.alibaba.com
Link: https://lkml.kernel.org/r/20220927110120.106906-7-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph von Recklinghausen <crecklin@redhat.com>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Migrate the workqueue_uaf test to the KUnit framework.
Initially, this test was intended to check that Generic KASAN prints
auxiliary stack traces for workqueues. Nevertheless, the test is enabled
for all modes to make that KASAN reports bad accesses in the tested
scenario.
The presence of auxiliary stack traces for the Generic mode needs to be
inspected manually.
Link: https://lkml.kernel.org/r/1d81b6cc2a58985126283d1e0de8e663716dd930.1664298455.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Migrate the kasan_rcu_uaf test to the KUnit framework.
Changes to the implementation of the test:
- Call rcu_barrier() after call_rcu() to make that the RCU callbacks get
triggered before the test is over.
- Cast pointer passed to rcu_dereference_protected as __rcu to get rid of
the Sparse warning.
- Check that KASAN prints a report via KUNIT_EXPECT_KASAN_FAIL.
Initially, this test was intended to check that Generic KASAN prints
auxiliary stack traces for RCU objects. Nevertheless, the test is enabled
for all modes to make that KASAN reports bad accesses in RCU callbacks.
The presence of auxiliary stack traces for the Generic mode needs to be
inspected manually.
Link: https://lkml.kernel.org/r/897ee08d6cd0ba7e8a4fbfd9d8502823a2f922e6.1664298455.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Switch KUnit-compatible KASAN tests from using per-task KUnit resources to
console tracepoints.
This allows for two things:
1. Migrating tests that trigger a KASAN report in the context of a task
other than current to KUnit framework.
This is implemented in the patches that follow.
2. Parsing and matching the contents of KASAN reports.
This is not yet implemented.
Link: https://lkml.kernel.org/r/9345acdd11e953b207b0ed4724ff780e63afeb36.1664298455.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Replace the checks for memcg is root memcg, with mem_cgroup_is_root()
helper.
Link: https://lkml.kernel.org/r/20220930134433.338103-1-kamalesh.babulal@oracle.com
Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tom Hromatka <tom.hromatka@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Using vma_lookup() verifies the start address is contained in the found
vma. This results in easier to read the code.
Link: https://lkml.kernel.org/r/20221007030345.5029-1-wangdeming@inspur.com
Signed-off-by: Deming Wang <wangdeming@inspur.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
After the rework of shmem_get_folio_gfp() to use a folio, the local
variable hindex is only needed to be set once before passing it to
shmem_add_to_page_cache().
Remove the unneeded initialization and assignments of the variable hindex
before the actual effective assignment and first use.
No functional change. No change in object code.
Link: https://lkml.kernel.org/r/20221007085027.6309-1-lukas.bulwahn@gmail.com
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Initially, find_get_entries() was being passed in the start offset as a
value. That left the calculation of the offset to the callers. This led
to complexity in the callers trying to keep track of the index.
Now find_get_entries() takes in a pointer to the start offset and updates
the value to be directly after the last entry found. If no entry is
found, the offset is not changed. This gets rid of multiple hacky
calculations that kept track of the start offset.
Link: https://lkml.kernel.org/r/20221017161800.2003-3-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Rework find_get_entries() and find_lock_entries()", v3.
Originally the callers of find_get_entries() and find_lock_entries() were
keeping track of the start index themselves as they traverse the search
range.
This resulted in hacky code such as in shmem_undo_range():
index = folio->index + folio_nr_pages(folio) - 1;
where the - 1 is only present to stay in the right spot after incrementing
index later. This sort of calculation was also being done on every folio
despite not even using index later within that function.
These patches change find_get_entries() and find_lock_entries() to
calculate the new index instead of leaving it to the callers so we can
avoid all these complications.
This patch (of 2):
Initially, find_lock_entries() was being passed in the start offset as a
value. That left the calculation of the offset to the callers. This led
to complexity in the callers trying to keep track of the index.
Now find_lock_entries() takes in a pointer to the start offset and updates
the value to be directly after the last entry found. If no entry is
found, the offset is not changed. This gets rid of multiple hacky
calculations that kept track of the start offset.
Link: https://lkml.kernel.org/r/20221017161800.2003-1-vishal.moola@gmail.com
Link: https://lkml.kernel.org/r/20221017161800.2003-2-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit 2555283eb4 ("mm/rmap: Fix anon_vma->degree ambiguity leading to
double-reuse") use num_children and num_active_vmas to replace the origin
degree to fix anon_vma UAF problem. Update the comment in anon_vma_clone
to fit this change.
Link: https://lkml.kernel.org/r/20221014013931.1565969-1-mawupeng1@huawei.com
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Helper function to retrieve hstate information from a hugetlb folio.
Link: https://lkml.kernel.org/r/20220922154207.1575343-6-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Colin Cross <ccross@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Remove the last caller of delete_from_page_cache() by converting the code
to its folio equivalent.
Link: https://lkml.kernel.org/r/20220922154207.1575343-5-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Colin Cross <ccross@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We noticed a 2% webserver throughput regression after upgrading from 5.6.
This could be tracked down to a shift in the anon/file reclaim balance
(confirmed with swappiness) that resulted in worse reclaim efficiency and
thus more kswapd activity for the same outcome.
The change that exposed the problem is aae466b005 ("mm/swap: implement
workingset detection for anonymous LRU"). By qualifying swapins based on
their refault distance, it lowered the cost of anon reclaim in this
workload, in turn causing (much) more anon scanning than before. Scanning
the anon list is more expensive due to the higher ratio of mmapped pages
that may rotate during reclaim, and so the result was an increase in %sys
time.
Right now, rotations aren't considered a cost when balancing scan pressure
between LRUs. We can end up with very few file refaults putting all the
scan pressure on hot anon pages that are rotated en masse, don't get
reclaimed, and never push back on the file LRU again. We still only
reclaim file cache in that case, but we burn a lot CPU rotating anon
pages. It's "fair" from an LRU age POV, but doesn't reflect the real cost
it imposes on the system.
Consider rotations as a secondary factor in balancing the LRUs. This
doesn't attempt to make a precise comparison between IO cost and CPU cost,
it just says: if reloads are about comparable between the lists, or
rotations are overwhelmingly different, adjust for CPU work.
This fixed the regression on our webservers. It has since been deployed
to the entire Meta fleet and hasn't caused any problems.
Link: https://lkml.kernel.org/r/20221013193113.726425-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
During discussions of this series [1], it was suggested that hugetlb
handling code in follow_page_mask could be simplified. At the beginning
of follow_page_mask, there currently is a call to follow_huge_addr which
'may' handle hugetlb pages. ia64 is the only architecture which provides
a follow_huge_addr routine that does not return error. Instead, at each
level of the page table a check is made for a hugetlb entry. If a hugetlb
entry is found, a call to a routine associated with that entry is made.
Currently, there are two checks for hugetlb entries at each page table
level. The first check is of the form:
if (p?d_huge())
page = follow_huge_p?d();
the second check is of the form:
if (is_hugepd())
page = follow_huge_pd().
We can replace these checks, as well as the special handling routines such
as follow_huge_p?d() and follow_huge_pd() with a single routine to handle
hugetlb vmas.
A new routine hugetlb_follow_page_mask is called for hugetlb vmas at the
beginning of follow_page_mask. hugetlb_follow_page_mask will use the
existing routine huge_pte_offset to walk page tables looking for hugetlb
entries. huge_pte_offset can be overwritten by architectures, and already
handles special cases such as hugepd entries.
[1] https://lore.kernel.org/linux-mm/cover.1661240170.git.baolin.wang@linux.alibaba.com/
[mike.kravetz@oracle.com: remove vma (pmd sharing) per Peter]
Link: https://lkml.kernel.org/r/20221028181108.119432-1-mike.kravetz@oracle.com
[mike.kravetz@oracle.com: remove left over hugetlb_vma_unlock_read()]
Link: https://lkml.kernel.org/r/20221030225825.40872-1-mike.kravetz@oracle.com
Link: https://lkml.kernel.org/r/20220919021348.22151-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When using the VMA iterator, the final execution will set the variable
'next' to NULL which causes the function to fail out. Restore the break
in the loop to exit the VMA iterator early without clearing NULL fixes the
issue.
Link: https://lore.kernel.org/lkml/29344.1666681759@jrobl/
Link: https://lkml.kernel.org/r/20221025161222.2634030-1-Liam.Howlett@oracle.com
Fixes: 763ecb0350 (mm: remove the vma linked list)
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reported-by: "J. R. Okajima" <hooanon05g@gmail.com>
Tested-by: "J. R. Okajima" <hooanon05g@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The kernel test robot flagged a recursive lock as a result of a conversion
from kmap_atomic() to kmap_local_folio()[Link]
The cause was due to the code depending on the kmap_atomic() side effect
of disabling page faults. In that case the code expects the fault to fail
and take the fallback case.
git archaeology implied that the recursion may not be an actual bug.[1]
However, depending on the implementation of the mmap_lock and the
condition of the call there may still be a deadlock.[2] So this is not
purely a lockdep issue. Considering a single threaded call stack there
are 3 options.
1) Different mm's are in play (no issue)
2) Readlock implementation is recursive and same mm is in play
(no issue)
3) Readlock implementation is _not_ recursive (issue)
The mmap_lock is recursive so with a single thread there is no issue.
However, Matthew pointed out a deadlock scenario when you consider
additional process' and threads thusly.
"The readlock implementation is only recursive if nobody else has taken a
write lock. If you have a multithreaded process, one of the other threads
can call mmap() and that will prevent recursion (due to fairness). Even
if it's a different process that you're trying to acquire the mmap read
lock on, you can still get into a deadly embrace. eg:
process A thread 1 takes read lock on own mmap_lock
process A thread 2 calls mmap, blocks taking write lock
process B thread 1 takes page fault, read lock on own mmap lock
process B thread 2 calls mmap, blocks taking write lock
process A thread 1 blocks taking read lock on process B
process B thread 1 blocks taking read lock on process A
Now all four threads are blocked waiting for each other."
Regardless using pagefault_disable() ensures that no matter what locking
implementation is used a deadlock will not occur. Add an explicit
pagefault_disable() and a big comment to explain this for future souls
looking at this code.
[1] https://lore.kernel.org/all/Y1MymJ%2FINb45AdaY@iweiny-desk3/
[2] https://lore.kernel.org/lkml/Y1bXBtGTCym77%2FoD@casper.infradead.org/
Link: https://lkml.kernel.org/r/20221025220108.2366043-1-ira.weiny@intel.com
Link: https://lore.kernel.org/r/202210211215.9dc6efb5-yujie.liu@intel.com
Fixes: 7a7256d5f5 ("shmem: convert shmem_mfill_atomic_pte() to use a folio")
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reported-by: kernel test robot <yujie.liu@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kmap() and kmap_atomic() are being deprecated in favor of
kmap_local_page() which is appropriate for any thread local context.[1]
A recent locking bug report with userfaultfd showed that the conversion of
the kmap_atomic()'s in those code flows requires care with regard to the
prevention of deadlock.[2]
git archaeology implied that the recursion may not be an actual bug.[3]
However, depending on the implementation of the mmap_lock and the
condition of the call there may still be a deadlock.[4] So this is not
purely a lockdep issue. Considering a single threaded call stack there
are 3 options.
1) Different mm's are in play (no issue)
2) Readlock implementation is recursive and same mm is in play
(no issue)
3) Readlock implementation is _not_ recursive (issue)
The mmap_lock is recursive so with a single thread there is no issue.
However, Matthew pointed out a deadlock scenario when you consider
additional process' and threads thusly.
"The readlock implementation is only recursive if nobody else has taken a
write lock. If you have a multithreaded process, one of the other threads
can call mmap() and that will prevent recursion (due to fairness). Even
if it's a different process that you're trying to acquire the mmap read
lock on, you can still get into a deadly embrace. eg:
process A thread 1 takes read lock on own mmap_lock
process A thread 2 calls mmap, blocks taking write lock
process B thread 1 takes page fault, read lock on own mmap lock
process B thread 2 calls mmap, blocks taking write lock
process A thread 1 blocks taking read lock on process B
process B thread 1 blocks taking read lock on process A
Now all four threads are blocked waiting for each other."
Regardless using pagefault_disable() ensures that no matter what locking
implementation is used a deadlock will not occur.
Complete kmap conversion in userfaultfd by replacing the kmap() and
kmap_atomic() calls with kmap_local_page(). When replacing the
kmap_atomic() call ensure page faults continue to be disabled to support
the correct fall back behavior and add a comment to inform future souls of
the requirement.
[1] https://lore.kernel.org/all/20220813220034.806698-1-ira.weiny@intel.com/
[2] https://lore.kernel.org/all/Y1Mh2S7fUGQ%2FiKFR@iweiny-desk3/
[3] https://lore.kernel.org/all/Y1MymJ%2FINb45AdaY@iweiny-desk3/
[4] https://lore.kernel.org/lkml/Y1bXBtGTCym77%2FoD@casper.infradead.org/
[ira.weiny@intel.com: v2]
Link: https://lkml.kernel.org/r/20221025220136.2366143-1-ira.weiny@intel.com
Link: https://lkml.kernel.org/r/20221024043452.1491677-1-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ensure that KMSAN builds replace memset/memcpy/memmove calls with the
respective __msan_XXX functions, and that none of the macros are redefined
twice. This should allow building kernel with both CONFIG_KMSAN and
CONFIG_FORTIFY_SOURCE.
Link: https://lkml.kernel.org/r/20221024212144.2852069-5-glider@google.com
Link: https://github.com/google/kmsan/issues/89
Signed-off-by: Alexander Potapenko <glider@google.com>
Reported-by: Tamas K Lengyel <tamas.lengyel@zentific.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
During THP migration, if THPs are not migrated but they are split and all
subpages are migrated successfully, migrate_pages() will still return the
number of THP pages that were not migrated. This will confuse the callers
of migrate_pages(). For example, the longterm pinning will failed though
all pages are migrated successfully.
Thus we should return 0 to indicate that all pages are migrated in this
case
Link: https://lkml.kernel.org/r/de386aa864be9158d2f3b344091419ea7c38b2f7.1666599848.git.baolin.wang@linux.alibaba.com
Fixes: b5bade978e ("mm: migrate: fix the return value of migrate_pages()")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Although page allocation always clears page->private in the first page or
head page of an allocation, it has never made a point of clearing
page->private in the tails (though 0 is often what is already there).
But now commit 71e2d666ef ("mm/huge_memory: do not clobber swp_entry_t
during THP split") issues a warning when page_tail->private is found to be
non-0 (unless it's swapcache).
Change that warning to dump page_tail (which also dumps head), instead of
just the head: so far we have seen dead000000000122, dead000000000003,
dead000000000001 or 0000000000000002 in the raw output for tail private.
We could just delete the warning, but today's consensus appears to want
page->private to be 0, unless there's a good reason for it to be set: so
now clear it in prep_compound_tail() (more general than just for THP; but
not for high order allocation, which makes no pass down the tails).
Link: https://lkml.kernel.org/r/1c4233bb-4e4d-5969-fbd4-96604268a285@google.com
Fixes: 71e2d666ef ("mm/huge_memory: do not clobber swp_entry_t during THP split")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
A common use case for hugetlbfs is for the application to create
memory pools backed by huge pages, which then get handed over to
some malloc library (eg. jemalloc) for further management.
That malloc library may be doing MADV_DONTNEED calls on memory
that is no longer needed, expecting those calls to happen on
PAGE_SIZE boundaries.
However, currently the MADV_DONTNEED code rounds up any such
requests to HPAGE_PMD_SIZE boundaries. This leads to undesired
outcomes when jemalloc expects a 4kB MADV_DONTNEED, but 2MB of
memory get zeroed out, instead.
Use of pre-built shared libraries means that user code does not
always know the page size of every memory arena in use.
Avoid unexpected data loss with MADV_DONTNEED by rounding up
only to PAGE_SIZE (in do_madvise), and rounding down to huge
page granularity.
That way programs will only get as much memory zeroed out as
they requested.
Link: https://lkml.kernel.org/r/20221021192805.366ad573@imladris.surriel.com
Fixes: 90e7e7f5ef ("mm: enable MADV_DONTNEED for hugetlb mappings")
Signed-off-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When !CONFIG_VM_BUG_ON, there is warning of
clang-analyzer-deadcode.DeadStores:
Value stored to 'mt' during its initialization is never read.
Link: https://lkml.kernel.org/r/20221021101555.7992-2-quic_aiquny@quicinc.com
Signed-off-by: Maria Yu <quic_aiquny@quicinc.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Doug Berger <opendmb@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In sysfs, we use attribute name "cpumap" or "cpus" for cpu mask and
"cpulist" or "cpus_list" for cpu list. For example, in my system,
$ cat /sys/devices/system/node/node0/cpumap
f,ffffffff
$ cat /sys/devices/system/cpu/cpu2/topology/core_cpus
0,00100004
$ cat cat /sys/devices/system/node/node0/cpulist
0-35
$ cat /sys/devices/system/cpu/cpu2/topology/core_cpus_list
2,20
It looks reasonable to use "nodemap" for node mask and "nodelist" for
node list. So, rename the attribute to follow the naming convention.
Link: https://lkml.kernel.org/r/20221020015122.290097-1-ying.huang@intel.com
Fixes: 9832fb8783 ("mm/demotion: expose memory tier details via sysfs")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Wei Xu <weixugc@google.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hesham Almatary <hesham.almatary@huawei.com>
Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit 6edda04ccc ("mm/kmemleak: prevent soft lockup in first object
iteration loop of kmemleak_scan()") adds cond_resched() in the first
object iteration loop of kmemleak_scan(). However, it turns that the 2nd
objection iteration loop can still cause soft lockup to happen in some
cases. So add a cond_resched() call in the 2nd and 3rd loops as well to
prevent that and for completeness.
Link: https://lkml.kernel.org/r/20221020175619.366317-1-longman@redhat.com
Fixes: 6edda04ccc ("mm/kmemleak: prevent soft lockup in first object iteration loop of kmemleak_scan()")
Signed-off-by: Waiman Long <longman@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The following has been observed when running stressng mmap since commit
b653db7735 ("mm: Clear page->private when splitting or migrating a page")
watchdog: BUG: soft lockup - CPU#75 stuck for 26s! [stress-ng:9546]
CPU: 75 PID: 9546 Comm: stress-ng Tainted: G E 6.0.0-revert-b653db77-fix+ #29 0357d79b60fb09775f678e4f3f64ef0579ad1374
Hardware name: SGI.COM C2112-4GP3/X10DRT-P-Series, BIOS 2.0a 05/09/2016
RIP: 0010:xas_descend+0x28/0x80
Code: cc cc 0f b6 0e 48 8b 57 08 48 d3 ea 83 e2 3f 89 d0 48 83 c0 04 48 8b 44 c6 08 48 89 77 18 48 89 c1 83 e1 03 48 83 f9 02 75 08 <48> 3d fd 00 00 00 76 08 88 57 12 c3 cc cc cc cc 48 c1 e8 02 89 c2
RSP: 0018:ffffbbf02a2236a8 EFLAGS: 00000246
RAX: ffff9cab7d6a0002 RBX: ffffe04b0af88040 RCX: 0000000000000002
RDX: 0000000000000030 RSI: ffff9cab60509b60 RDI: ffffbbf02a2236c0
RBP: 0000000000000000 R08: ffff9cab60509b60 R09: ffffbbf02a2236c0
R10: 0000000000000001 R11: ffffbbf02a223698 R12: 0000000000000000
R13: ffff9cab4e28da80 R14: 0000000000039c01 R15: ffff9cab4e28da88
FS: 00007fab89b85e40(0000) GS:ffff9cea3fcc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fab84e00000 CR3: 00000040b73a4003 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
xas_load+0x3a/0x50
__filemap_get_folio+0x80/0x370
? put_swap_page+0x163/0x360
pagecache_get_page+0x13/0x90
__try_to_reclaim_swap+0x50/0x190
scan_swap_map_slots+0x31e/0x670
get_swap_pages+0x226/0x3c0
folio_alloc_swap+0x1cc/0x240
add_to_swap+0x14/0x70
shrink_page_list+0x968/0xbc0
reclaim_page_list+0x70/0xf0
reclaim_pages+0xdd/0x120
madvise_cold_or_pageout_pte_range+0x814/0xf30
walk_pgd_range+0x637/0xa30
__walk_page_range+0x142/0x170
walk_page_range+0x146/0x170
madvise_pageout+0xb7/0x280
? asm_common_interrupt+0x22/0x40
madvise_vma_behavior+0x3b7/0xac0
? find_vma+0x4a/0x70
? find_vma+0x64/0x70
? madvise_vma_anon_name+0x40/0x40
madvise_walk_vmas+0xa6/0x130
do_madvise+0x2f4/0x360
__x64_sys_madvise+0x26/0x30
do_syscall_64+0x5b/0x80
? do_syscall_64+0x67/0x80
? syscall_exit_to_user_mode+0x17/0x40
? do_syscall_64+0x67/0x80
? syscall_exit_to_user_mode+0x17/0x40
? do_syscall_64+0x67/0x80
? do_syscall_64+0x67/0x80
? common_interrupt+0x8b/0xa0
entry_SYSCALL_64_after_hwframe+0x63/0xcd
The problem can be reproduced with the mmtests config
config-workload-stressng-mmap. It does not always happen and when it
triggers is variable but it has happened on multiple machines.
The intent of commit b653db7735 patch was to avoid the case where
PG_private is clear but folio->private is not-NULL. However, THP tail
pages uses page->private for "swp_entry_t if folio_test_swapcache()" as
stated in the documentation for struct folio. This patch only clobbers
page->private for tail pages if the head page was not in swapcache and
warns once if page->private had an unexpected value.
Link: https://lkml.kernel.org/r/20221019134156.zjyyn5aownakvztf@techsingularity.net
Fixes: b653db7735 ("mm: Clear page->private when splitting or migrating a page")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>