linux

History

Oscar Salvador 369fa227c2 mm: make alloc_contig_range handle free hugetlb pages

alloc_contig_range will fail if it ever sees a HugeTLB page within the range we are trying to allocate, even when that page is free and can be easily reallocated.

This has proved to be problematic for some users of alloc_contig_range, e.g. CMA and virtio-mem, where those would fail the call even when those pages lay in ZONE_MOVABLE and are free.

We can do better by trying to replace such a page.

Free hugepages are tricky to handle: so that no userspace application notices any disruption, we need to replace the current free hugepage with a new one.

In order to do that, a new function called alloc_and_dissolve_huge_page is introduced. This function will first try to get a new fresh hugepage, and if it succeeds, it will replace the old one in the free hugepage pool.

The free page replacement is done under hugetlb_lock, so no external users of hugetlb will notice the change. To allocate the new huge page, we use alloc_buddy_huge_page(), so we do not have to deal with any counters, and prep_new_huge_page() is not called. This is valuable because in case we need to free the new page, we only need to call __free_pages().

Once we know that the page to be replaced is a genuine 0-refcounted huge page, we remove the old page from the freelist by remove_hugetlb_page(). Then, we can call __prep_new_huge_page() and __prep_account_new_huge_page() for the new huge page to properly initialize it and increment the hstate->nr_huge_pages counter (previously decremented by remove_hugetlb_page()). Once done, the page is enqueued by enqueue_huge_page() and it is ready to be used.

There is one tricky case when the page's refcount is 0 because it is in the process of being released. A missing PageHugeFreed bit will tell us that freeing is in flight, so we retry after dropping the hugetlb_lock. The race window should be small and the next retry should make forward progress.

E.g:

    CPU0                              CPU1
    free_huge_page()                  isolate_or_dissolve_huge_page
                                        PageHuge() == T
                                        alloc_and_dissolve_huge_page
                                          alloc_buddy_huge_page()
                                          spin_lock_irq(hugetlb_lock)
                                          // PageHuge() && !PageHugeFreed &&
                                          // !PageCount()
                                          spin_unlock_irq(hugetlb_lock)
      spin_lock_irq(hugetlb_lock)
      1) update_and_free_page
           PageHuge() == F
           __free_pages()
      2) enqueue_huge_page
           SetPageHugeFreed()
      spin_unlock_irq(&hugetlb_lock)
                                          spin_lock_irq(hugetlb_lock)
                                          1) PageHuge() == F (freed by case#1 from CPU0)
                                          2) PageHuge() == T
                                               PageHugeFreed() == T
                                               - proceed with replacing the page

In the case above we retry, as the race window is quite small and we have a high chance of succeeding next time.

With regard to the allocation, we restrict it to the node the page belongs to with __GFP_THISNODE, meaning we do not fall back on other nodes' zones.

Note that gigantic hugetlb pages are fenced off since there is a cyclic dependency between them and alloc_contig_range.

Link: https://lkml.kernel.org/r/20210419075413.1064-6-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:22 -07:00
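The replacement protocol described in the commit message above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: struct sim_page, struct sim_pool, sim_pool_init() and sim_replace_once() are hypothetical names, a C11 atomic_flag stands in for hugetlb_lock, calloc()/free() stand in for alloc_buddy_huge_page()/__free_pages(), and returning -EBUSY for an in-use page is an assumption of this sketch. Unlike the description above, each call makes a single attempt and reports -EAGAIN instead of retrying internally.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for struct page. */
struct sim_page {
	bool huge;        /* models PageHuge() */
	bool huge_freed;  /* models PageHugeFreed(): page is on the free list */
	int refcount;     /* models the page reference count */
};

/* Hypothetical stand-in for the hstate counters guarded by hugetlb_lock. */
struct sim_pool {
	atomic_flag lock;
	int nr_huge_pages;
	int free_huge_pages;
};

static void sim_pool_init(struct sim_pool *pool, int nr, int nr_free)
{
	atomic_flag_clear(&pool->lock);
	pool->nr_huge_pages = nr;
	pool->free_huge_pages = nr_free;
}

static void sim_lock(struct sim_pool *pool)
{
	while (atomic_flag_test_and_set(&pool->lock))
		; /* spin until the flag is released */
}

static void sim_unlock(struct sim_pool *pool)
{
	atomic_flag_clear(&pool->lock);
}

/*
 * One attempt at replacing the free huge page 'old' with a fresh one.
 * Returns 0 on success (fresh page in *out_new) or if 'old' is no longer
 * a huge page (*out_new stays NULL), -EAGAIN if a free is in flight,
 * -EBUSY if 'old' is in use, -ENOMEM if the allocation fails.
 */
static int sim_replace_once(struct sim_pool *pool, struct sim_page *old,
			    struct sim_page **out_new)
{
	struct sim_page *new_page;
	int ret = 0;

	assert(pool && old && out_new);
	*out_new = NULL;

	/* Allocate first, outside the lock; no counters are touched yet. */
	new_page = calloc(1, sizeof(*new_page));
	if (!new_page)
		return -ENOMEM;

	sim_lock(pool);
	if (!old->huge) {
		ret = 0;            /* already dissolved by someone else */
	} else if (old->refcount > 0) {
		ret = -EBUSY;       /* in use; assumption of this sketch */
	} else if (!old->huge_freed) {
		ret = -EAGAIN;      /* freeing is in flight, caller retries */
	} else {
		/* Remove the old page from the free list. */
		old->huge_freed = false;
		pool->free_huge_pages--;
		pool->nr_huge_pages--;
		/* Initialize and account the new page, then enqueue it. */
		new_page->huge = true;
		pool->nr_huge_pages++;
		new_page->huge_freed = true;
		pool->free_huge_pages++;
		sim_unlock(pool);
		old->huge = false;  /* old page returns to the page allocator */
		*out_new = new_page;
		return 0;
	}
	sim_unlock(pool);
	free(new_page);             /* the new page was never accounted */
	return ret;
}
```

Because the new page is only accounted once the old one has been validated under the lock, every failure path can simply free it. A caller in this model would call sim_replace_once() again when it sees -EAGAIN, mirroring the retry in the CPU0/CPU1 example.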
..
kasan	kasan: record task_work_add() call stack	2021-04-30 11:20:42 -07:00
kfence	kfence: make compatible with kmemleak	2021-03-25 09:22:55 -07:00
backing-dev.c	mm/backing-dev.c: use might_alloc()	2021-02-26 09:41:01 -08:00
balloon_compaction.c
cleancache.c
cma_debug.c	mm/cma: change cma mutex to irq safe spinlock	2021-05-05 11:27:21 -07:00
cma.c	mm/cma: change cma mutex to irq safe spinlock	2021-05-05 11:27:21 -07:00
cma.h	mm/cma: change cma mutex to irq safe spinlock	2021-05-05 11:27:21 -07:00
compaction.c	mm: make alloc_contig_range handle free hugetlb pages	2021-05-05 11:27:22 -07:00
debug_page_ref.c
debug_vm_pgtable.c	mm: HUGE_VMAP arch support cleanup	2021-04-30 11:20:40 -07:00
debug.c	mm/debug: improve memcg debugging	2021-02-24 13:38:27 -08:00
dmapool.c	mm/dmapool: switch from strlcpy to strscpy	2021-04-30 11:20:39 -07:00
early_ioremap.c	mm/early_ioremap.c: use __func__ instead of function name	2021-02-26 09:41:02 -08:00
fadvise.c	mm, fadvise: improve the expensive remote LRU cache draining after FADV_DONTNEED	2020-10-13 18:38:29 -07:00
failslab.c
filemap.c	dax: account DAX entries as nrpages	2021-05-05 11:27:19 -07:00
frontswap.c	mm/frontswap: mark various intentional data races	2020-08-14 19:56:56 -07:00
gup_test.c	mm/gup_test.c: mark gup_test_init as __init function	2020-12-15 12:13:38 -08:00
gup_test.h	selftests/vm: gup_test: introduce the dump_pages() sub-test	2020-12-15 12:13:38 -08:00
gup.c	mm: gup: remove FOLL_SPLIT	2021-04-30 11:20:37 -07:00
highmem.c	mm/highmem: fix CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP	2021-03-25 09:22:55 -07:00
hmm.c	mm: do page fault accounting in handle_mm_fault	2020-08-12 10:58:02 -07:00
huge_memory.c	mm: huge_memory: debugfs for file-backed THP split	2021-05-05 11:27:21 -07:00
hugetlb_cgroup.c	hugetlb: make free_huge_page irq safe	2021-05-05 11:27:22 -07:00
hugetlb.c	mm: make alloc_contig_range handle free hugetlb pages	2021-05-05 11:27:22 -07:00
hwpoison-inject.c	mm,hwpoison-inject: don't pin for hwpoison_filter	2020-10-16 11:11:16 -07:00
init-mm.c	mm/gup: prevent gup_fast from racing with COW during fork	2020-12-15 12:13:39 -08:00
internal.h	mm,compaction: let isolate_migratepages_{range,block} return error codes	2021-05-05 11:27:22 -07:00
interval_tree.c	mm/interval_tree: add comments to improve code readability	2021-04-30 11:20:38 -07:00
io-mapping.c	mm: add a io_mapping_map_user helper	2021-04-30 11:20:39 -07:00
ioremap.c	mm: move vmap_range from mm/ioremap.c to mm/vmalloc.c	2021-04-30 11:20:40 -07:00
Kconfig	mm: generalize HUGETLB_PAGE_SIZE_VARIABLE	2021-05-05 11:27:20 -07:00
Kconfig.debug	mm, page_poison: remove CONFIG_PAGE_POISONING_ZERO	2020-12-15 12:13:46 -08:00
khugepaged.c	khugepaged: remove meaningless !pte_present() check in khugepaged_scan_pmd()	2021-05-05 11:27:21 -07:00
kmemleak.c	mm/kmemleak.c: fix a typo	2021-04-30 11:20:36 -07:00
ksm.c	mm: cleanup kstrto*() usage	2020-12-15 12:13:47 -08:00
list_lru.c	mm/list_lru.c: remove kvfree_rcu_local()	2021-02-24 13:38:30 -08:00
maccess.c	uaccess: add force_uaccess_{begin,end} helpers	2020-08-12 10:57:59 -07:00
madvise.c	mm/madvise: replace ptrace attach requirement for process_madvise	2021-03-13 11:27:30 -08:00
Makefile	mm: add a io_mapping_map_user helper	2021-04-30 11:20:39 -07:00
mapping_dirty_helpers.c	mm/mapping_dirty_helpers: guard hugepage pud's usage	2021-04-16 16:10:37 -07:00
memblock.c	memblock: remove return value of memblock_free_all()	2021-02-22 13:01:23 -08:00
memcontrol.c	mm: memcontrol: inline __memcg_kmem_{un}charge() into obj_cgroup_{un}charge_pages()	2021-04-30 11:20:38 -07:00
memfd.c
memory_hotplug.c	arm64: mte: Map hotplugged memory as Normal Tagged	2021-03-10 10:56:46 +00:00
memory-failure.c	mm/memory-failure: unnecessary amount of unmapping	2021-04-30 11:20:44 -07:00
memory.c	mm: apply_to_pte_range warn and fail if a large pte is encountered	2021-04-30 11:20:39 -07:00
mempolicy.c	mm/mempolicy: fix mpol_misplaced kernel-doc	2021-04-30 11:20:43 -07:00
mempool.c	kasan, mm: integrate page_alloc init with HW_TAGS	2021-04-30 11:20:41 -07:00
memremap.c	mm/memremap.c: fix improper SPDX comment style	2021-04-30 11:20:37 -07:00
memtest.c
migrate.c	mm/page_alloc: combine __alloc_pages and __alloc_pages_nodemask	2021-04-30 11:20:42 -07:00
mincore.c	inode: make init and permission helpers idmapped mount aware	2021-01-24 14:27:16 +01:00
mlock.c	mm/mlock: stop counting mlocked pages when none vma is found	2021-02-26 09:41:01 -08:00
mm_init.c	include/linux/page-flags-layout.h: cleanups	2021-04-30 11:20:42 -07:00
mmap_lock.c	mm: mmap_lock: add tracepoints around lock acquisition	2020-12-15 12:13:41 -08:00
mmap.c	Revert "mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio"	2021-04-30 11:20:39 -07:00
mmu_gather.c	mm: eliminate "expecting prototype" kernel-doc warnings	2021-04-16 16:10:36 -07:00
mmu_notifier.c	mm/mmu_notifiers: ensure range_end() is paired with range_start()	2021-03-25 09:22:55 -07:00
mmzone.c	mm/lru: replace pgdat lru_lock with lruvec lock	2020-12-15 14:48:04 -08:00
mprotect.c	mm/mprotect.c: optimize error detection in do_mprotect_pkey()	2021-02-24 13:38:30 -08:00
mremap.c	Revert "mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio"	2021-04-30 11:20:39 -07:00
msync.c	mm/msync: exit early when the flags is an MS_ASYNC and start < vm_start	2021-04-30 11:20:37 -07:00
nommu.c	mm/nommu: Fix return type of filemap_map_pages()	2021-01-28 14:10:31 +00:00
oom_kill.c	mm: eliminate "expecting prototype" kernel-doc warnings	2021-04-16 16:10:36 -07:00
page_alloc.c	mm,compaction: let isolate_migratepages_{range,block} return error codes	2021-05-05 11:27:22 -07:00
page_counter.c	mm: page_counter: mitigate consequences of a page_counter underflow	2021-04-30 11:20:38 -07:00
page_ext.c	mm: fix some spelling mistakes in comments	2020-12-15 22:46:19 -08:00
page_idle.c	mm: page_idle_get_page() does not need lru_lock	2020-12-15 14:48:03 -08:00
page_io.c	swap: fix swapfile read/write offset	2021-03-02 17:25:46 -07:00
page_isolation.c	mm/page_isolation: do not isolate the max order page	2020-12-15 12:13:45 -08:00
page_owner.c	mm: page_owner: detect page_owner recursion via task_struct	2021-04-30 11:20:36 -07:00
page_poison.c	mm: page_poison: print page info when corruption is caught	2021-04-30 11:20:36 -07:00
page_reporting.c	mm/page_reporting: use list_entry_is_head() in page_reporting_cycle()	2021-02-24 13:38:30 -08:00
page_reporting.h
page_vma_mapped.c	mm/page_vma_mapped.c: add colon to fix kernel-doc markups error for check_pte	2020-12-15 12:13:41 -08:00
page-writeback.c	mm: page-writeback: simplify memcg handling in test_clear_page_writeback()	2021-04-30 11:20:37 -07:00
pagewalk.c
percpu-internal.h	percpu: make pcpu_nr_empty_pop_pages per chunk type	2021-04-09 13:58:38 +00:00
percpu-km.c	mm: memcg/percpu: account percpu memory to memory cgroups	2020-08-12 10:57:55 -07:00
percpu-stats.c	percpu: make pcpu_nr_empty_pop_pages per chunk type	2021-04-09 13:58:38 +00:00
percpu-vm.c	mm/vmalloc: remove unmap_kernel_range	2021-04-30 11:20:40 -07:00
percpu.c	percpu: make pcpu_nr_empty_pop_pages per chunk type	2021-04-09 13:58:38 +00:00
pgalloc-track.h	mm: move p?d_alloc_track to separate header file	2020-08-07 11:33:26 -07:00
pgtable-generic.c	mm/pgtable-generic.c: optimize the VM_BUG_ON condition in pmdp_huge_clear_flush()	2021-02-24 13:38:30 -08:00
process_vm_access.c	mm/process_vm_access.c: include compat.h	2021-01-12 18:12:54 -08:00
ptdump.c	mm: ptdump: fix build failure	2021-04-16 16:10:37 -07:00
readahead.c	mm: Implement readahead_control pageset expansion	2021-04-23 10:14:29 +01:00
rmap.c	mm/rmap: correct obsolete comment of page_get_anon_vma()	2021-02-26 09:41:01 -08:00
rodata_test.c	mm/rodata_test.c: fix missing function declaration	2020-08-21 09:52:53 -07:00
shmem.c	shmem: allow reporting fanotify events with file handles on tmpfs	2021-04-19 16:03:48 +02:00
shuffle.c	mm: eliminate "expecting prototype" kernel-doc warnings	2021-04-16 16:10:36 -07:00
shuffle.h	mm/shuffle: remove dynamic reconfiguration	2020-08-07 11:33:29 -07:00
slab_common.c	mm/slab_common: provide "slab_merge" option for !IS_ENABLED(CONFIG_SLAB_MERGE_DEFAULT) builds	2021-04-30 11:20:36 -07:00
slab.c	kasan, mm: integrate slab init_on_free with HW_TAGS	2021-04-30 11:20:41 -07:00
slab.h	kasan, mm: integrate slab init_on_alloc with HW_TAGS	2021-04-30 11:20:41 -07:00
slob.c	mm: Don't build mm_dump_obj() on CONFIG_PRINTK=n kernels	2021-03-08 14:18:46 -08:00
slub.c	kasan, mm: integrate slab init_on_free with HW_TAGS	2021-04-30 11:20:41 -07:00
sparse-vmemmap.c	mm/sparse: only sub-section aligned range would be populated	2020-08-07 11:33:27 -07:00
sparse.c	mm/sparse: add the missing sparse_buffer_fini() in error branch	2021-04-30 11:20:39 -07:00
swap_cgroup.c
swap_slots.c	mm/swap_slots.c: remove redundant NULL check	2021-02-24 13:38:28 -08:00
swap_state.c	mm: stop accounting shadow entries	2021-05-05 11:27:19 -07:00
swap.c	mm: remove pagevec_lookup_entries	2021-02-26 09:40:59 -08:00
swapfile.c	swap: fix swapfile read/write offset	2021-03-02 17:25:46 -07:00
truncate.c	mm: stop accounting shadow entries	2021-05-05 11:27:19 -07:00
usercopy.c	mm/usercopy.c: delete duplicated word	2020-08-12 10:57:58 -07:00
userfaultfd.c	hugetlb: pass vma into huge_pte_alloc() and huge_pmd_share()	2021-05-05 11:27:20 -07:00
util.c	mm: move page_mapping_file to pagemap.h	2021-04-30 11:20:37 -07:00
vmacache.c	kernel: better document the use_mm/unuse_mm API contract	2020-06-10 19:14:18 -07:00
vmalloc.c	mm/vmalloc: remove an empty line	2021-04-30 11:20:40 -07:00
vmpressure.c
vmscan.c	mm/vmscan: restore zone_reclaim_mode ABI	2021-02-24 13:38:34 -08:00
vmstat.c	mm/vmstat.c: erase latency in vmstat_shepherd	2021-02-26 09:41:00 -08:00
workingset.c	mm: stop accounting shadow entries	2021-05-05 11:27:19 -07:00
z3fold.c	z3fold: prevent reclaim/free race for headless pages	2021-03-25 09:22:55 -07:00
zbud.c	mm: set the sleep_mapped to true for zbud and z3fold	2021-02-26 09:41:01 -08:00
zpool.c	mm/zswap: add the flag can_sleep_mapped	2021-02-26 09:41:01 -08:00
zsmalloc.c	mm/zsmalloc.c: use page_private() to access page->private	2021-02-26 09:41:01 -08:00
zswap.c	mm/zswap: add the flag can_sleep_mapped	2021-02-26 09:41:01 -08:00