linux

mirror of https://github.com/torvalds/linux.git synced 2024-12-27 21:33:00 +00:00

A mirror of the official Linux kernel repository just in case

Zach O'Keefe 34488399fa mm/madvise: add file and shmem support to MADV_COLLAPSE	2022-10-03 14:03:33 -07:00

Add support for MADV_COLLAPSE to collapse shmem-backed and file-backed memory into THPs (requires CONFIG_READ_ONLY_THP_FOR_FS=y).

On success, the backing memory will be a hugepage. For the memory range and process provided, the page tables will synchronously have a huge pmd installed, mapping the THP. Other mappings of the file extent mapped by the memory range may be added to a set of entries that khugepaged will later process, attempting to update their page tables to map the THP by a pmd.

This functionality unlocks two important uses:

(1) Immediately back executable text by THPs. Current support provided by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system, which might impair services from serving at their full rated load after (re)starting. Tricks like mremap(2)'ing text onto anonymous memory to immediately realize iTLB performance prevent page sharing and demand paging, both of which increase steady state memory footprint. Now, we can have the best of both worlds: peak upfront performance and lower RAM footprints.

(2) userfaultfd-based live migration of virtual machines satisfies UFFD faults by fetching native-sized pages over the network (to avoid the latency of transferring an entire hugepage). However, after guest memory has been fully copied to the new host, MADV_COLLAPSE can be used to immediately increase guest performance.

Since khugepaged is single threaded, this change now introduces the possibility of collapse contexts racing in the file collapse path. There are a few important places to consider:

(1) hpage_collapse_scan_file(), when we xas_pause() and drop RCU. We could have the memory collapsed out from under us, but the next xas_for_each() iteration will correctly pick up the hugepage. The hugepage might not be up to date (insofar as copying of small page contents might not have completed - the page still may be locked), but regardless of what small page index we were iterating over, we'll find the hugepage and identify it as a suitably aligned compound page of order HPAGE_PMD_ORDER.

In the khugepaged path, we locklessly check the value of the pmd, and only add it to the deferred collapse array if we find a pmd mapping a pte table. This is fine, since other values that could have raced in right afterwards denote failure, or that the memory was successfully collapsed, so we don't need further processing.

In the madvise path, we'll take mmap_lock() in write to serialize against page table updates and will know what to do based on the true value of the pmd: recheck all ptes if we point to a pte table, directly install the pmd if the pmd has been cleared but memory not yet faulted, or do nothing at all if we find a huge pmd.

It's worth putting emphasis here on how we treat the none pmd. If khugepaged has processed this mm's page tables already, it will have left the pmd cleared (ready for refault by the process). Depending on the VMA flags and sysfs settings, amount of RAM on the machine, and the current load, this could be a relatively common occurrence - and as such is one we'd like to handle successfully in MADV_COLLAPSE. When we see the none pmd in collapse_pte_mapped_thp(), we've locked mmap_lock in write and checked (a) hugepage_vma_check() to see if the backing memory is appropriate still, along with VMA sizing and appropriate hugepage alignment within the file, and (b) we've found a hugepage head of order HPAGE_PMD_ORDER at the offset in the file mapped by our hugepage-aligned virtual address. Even though the common case is likely a race with khugepaged, given these checks (regardless of how we got here - we could be operating on a completely different file than originally checked in hpage_collapse_scan_file() for all we know) it should be safe to directly make the pmd a huge pmd pointing to this hugepage.

(2) collapse_file() is mostly serialized on the same file extent by lock sequence:

    |  lock hugepage
    |    lock mapping->i_pages
    |      lock 1st page
    |    unlock mapping->i_pages
    |        <page checks>
    |    lock mapping->i_pages
    |        page_ref_freeze(3)
    |        xas_store(hugepage)
    |    unlock mapping->i_pages
    |        page_ref_unfreeze(1)
    |      unlock 1st page
    V  unlock hugepage

Once a context (who already has their fresh hugepage locked) locks mapping->i_pages exclusively, it will hold said lock until it locks the first page, and it will hold that lock until after the hugepage has been added to the page cache (and will unlock the hugepage after page table update, though that isn't important here).

A racing context that loses the race for mapping->i_pages will then lose the race to locking the first page. Here - depending on how far the other racing context has gotten - we might find the new hugepage (in which case we'll exit cleanly when we check PageTransCompound()), or we'll find the "old" 1st small page (in which case we'll exit cleanly when we discover an unexpected refcount of 2 after isolate_lru_page()). This is assuming we are able to successfully lock the page we find - in the shmem path, we could just fail the trylock and exit cleanly anyway.

The failure path in collapse_file() is similar: once we hold the lock on the 1st small page, we are serialized against other collapse contexts. Before the 1st small page is unlocked, we add it back to the pagecache and unfreeze the refcount appropriately. Contexts who lost the race to the 1st small page will then find the same 1st small page with the correct refcount and will be able to proceed.

[zokeefe@google.com: don't check pmd value twice in collapse_pte_mapped_thp()]
  Link: https://lkml.kernel.org/r/20220927033854.477018-1-zokeefe@google.com
[shy828301@gmail.com: Delete hugepage_vma_revalidate_anon(), remove check for multi-add in khugepaged_add_pte_mapped_thp()]
  Link: https://lore.kernel.org/linux-mm/CAHbLzkrtpM=ic7cYAHcqkubah5VTR8N5=k5RT8MTvv5rN1Y91w@mail.gmail.com/
Link: https://lkml.kernel.org/r/20220907144521.3115321-4-zokeefe@google.com
Link: https://lkml.kernel.org/r/20220922224046.1143204-4-zokeefe@google.com
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
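
The commit message above describes MADV_COLLAPSE from the kernel side. As a rough userspace illustration only, the sketch below shows how a process might request a collapse over a mapped range. It is not taken from the kernel tree: collapse_range(), hpage_align_up() and hpage_align_down() are hypothetical helper names, the 2 MiB huge page size is an assumption that holds for x86-64 with 4 KiB base pages, and the retry loop on EAGAIN is modeled loosely on the behaviour hinted at by the selftest commit in the file listing. The fallback definition of MADV_COLLAPSE (25) matches the Linux uapi headers and is only needed with older libc headers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25 /* value from the Linux uapi headers */
#endif

/* Assumption: PMD-sized huge pages are 2 MiB (x86-64, 4 KiB base pages). */
#define HPAGE_SIZE ((uintptr_t)2 << 20)

/* Round an address up to the next huge page boundary. */
static uintptr_t hpage_align_up(uintptr_t addr)
{
    return (addr + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
}

/* Round an address down to the previous huge page boundary. */
static uintptr_t hpage_align_down(uintptr_t addr)
{
    return addr & ~(HPAGE_SIZE - 1);
}

/*
 * Hypothetical helper: ask the kernel to collapse every huge page sized,
 * huge page aligned extent fully contained in [addr, addr + len).
 * Retries up to max_retries times while madvise() fails with EAGAIN.
 * Returns 0 on success (or when the range holds no full extent),
 * -1 with errno set otherwise.
 */
static int collapse_range(void *addr, size_t len, int max_retries)
{
    uintptr_t start = hpage_align_up((uintptr_t)addr);
    uintptr_t end = hpage_align_down((uintptr_t)addr + len);
    int ret;

    assert(max_retries >= 0);

    if (end <= start)
        return 0; /* nothing huge page aligned to collapse */

    do {
        ret = madvise((void *)start, end - start, MADV_COLLAPSE);
    } while (ret != 0 && errno == EAGAIN && max_retries-- > 0);

    return ret;
}
```

A caller that has just mmap()ed program text or a shmem region could invoke collapse_range(base, size, 3) and treat a failure as non-fatal, since the collapse is best effort. On kernels without MADV_COLLAPSE, madvise() is expected to fail with EINVAL.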
arch	x86: kmsan: handle CPU entry area	2022-10-03 14:03:26 -07:00
block	block: kmsan: skip bio block merging logic for KMSAN	2022-10-03 14:03:23 -07:00
certs	Kbuild updates for v5.20	2022-08-10 10:40:41 -07:00
crypto	crypto: kmsan: disable accelerated configs under KMSAN	2022-10-03 14:03:22 -07:00
Documentation	mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds	2022-10-03 14:03:33 -07:00
drivers	crypto: kmsan: disable accelerated configs under KMSAN	2022-10-03 14:03:22 -07:00
fs	mm: fs: initialize fsdata passed to write_begin/write_end interface	2022-10-03 14:03:25 -07:00
include	mm/madvise: add file and shmem support to MADV_COLLAPSE	2022-10-03 14:03:33 -07:00
init	init: kmsan: call KMSAN initialization routines	2022-10-03 14:03:21 -07:00
io_uring	io_uring/net: save address for sendzc async execution	2022-08-25 07:52:30 -06:00
ipc	ipc/shm: use VMA iterator instead of linked list	2022-09-26 19:46:21 -07:00
kernel	mm/madvise: add file and shmem support to MADV_COLLAPSE	2022-10-03 14:03:33 -07:00
lib	kmsan: disable strscpy() optimization under KMSAN	2022-10-03 14:03:22 -07:00
LICENSES	LICENSES/LGPL-2.1: Add LGPL-2.1-or-later as valid identifiers	2021-12-16 14:33:10 +01:00
mm	mm/madvise: add file and shmem support to MADV_COLLAPSE	2022-10-03 14:03:33 -07:00
net	Including fixes from ipsec and netfilter (with one broken Fixes tag).	2022-08-25 14:03:58 -07:00
samples	Tracing updates for 5.20 / 6.0	2022-08-05 09:41:12 -07:00
scripts	kmsan: add KMSAN runtime core	2022-10-03 14:03:19 -07:00
security	security: kmsan: fix interoperability with auto-initialization	2022-10-03 14:03:23 -07:00
sound	sound fixes for 6.0-rc2	2022-08-19 09:46:11 -07:00
tools	selftests/vm: retry on EAGAIN for MADV_COLLAPSE selftest	2022-10-03 14:03:33 -07:00
usr	Not a lot of material this cycle. Many singleton patches against various	2022-05-27 11:22:03 -07:00
virt	KVM: Drop unnecessary initialization of "ops" in kvm_ioctl_create_device()	2022-08-19 04:05:43 -04:00
.clang-format	PCI/DOE: Add DOE mailbox support functions	2022-07-19 15:38:04 -07:00
.cocciconfig
.get_maintainer.ignore	get_maintainer: add Alan to .get_maintainer.ignore	2022-08-20 15:17:44 -07:00
.gitattributes
.gitignore	kbuild: split the second line of .mod into .usyms	2022-05-08 03:16:59 +09:00
.mailmap	.mailmap: update Luca Ceresoli's e-mail address	2022-08-28 14:02:46 -07:00
COPYING	COPYING: state that all contributions really are covered by this file	2020-02-10 13:32:20 -08:00
CREDITS	drm for 5.20/6.0	2022-08-03 19:52:08 -07:00
Kbuild	kbuild: rename hostprogs-y/always to hostprogs/always-y	2020-02-04 01:53:07 +09:00
Kconfig	kbuild: ensure full rebuild when the compiler is updated	2020-05-12 13:28:33 +09:00
MAINTAINERS	x86: kmsan: handle CPU entry area	2022-10-03 14:03:26 -07:00
Makefile	kmsan: add KMSAN runtime core	2022-10-03 14:03:19 -07:00
README

README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the reStructuredText markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems that may result from upgrading your kernel.