linux/mm
Joao Martins 5b24eeef06 mm/page_alloc: split prep_compound_page into head and tail subparts
Patch series "mm, device-dax: Introduce compound pages in devmap", v7.

This series converts device-dax to use compound pages, and moves away
from the 'struct page per basepage on PMD/PUD' that is done today.

Doing so
 1) unlocks a few noticeable improvements on unpin_user_pages() and
    makes device-dax+altmap case 4x times faster in pinning (numbers
    below and in last patch)
 2) as mentioned in various other threads it's one important step
    towards cleaning up ZONE_DEVICE refcounting.

I've split the compound pages on devmap part from the rest based on
recent discussions on devmap pending and future work planned[5][6].
There is consensus that device-dax should be using compound pages to
represent its PMD/PUDs just like HugeTLB and THP, and that leads to less
specialization of the dax parts.  I will pursue the rest of the work in
parallel once this part is merged, particular the GUP-{slow,fast}
improvements [7] and the tail struct page deduplication memory savings
part[8].

To summarize what the series does:

Patch 1: Prepare hwpoisoning to work with dax compound pages.

Patches 2-3: Split the current utility function of prep_compound_page()
into head and tail and use those two helpers where appropriate to take
advantage of caches being warm after __init_single_page().  This is used
when initializing zone device when we bring up device-dax namespaces.

Patches 4-10: Add devmap support for compound pages in device-dax.
memmap_init_zone_device() initialize its metadata as compound pages, and
it introduces a new devmap property known as vmemmap_shift which
outlines how the vmemmap is structured (defaults to base pages as done
today).  The property describe the page order of the metadata
essentially.  While at it do a few cleanups in device-dax in patches
5-9.  Finally enable device-dax usage of devmap @vmemmap_shift to a
value based on its own @align property.  @vmemmap_shift returns 0 by
default (which is today's case of base pages in devmap, like fsdax or
the others) and the usage of compound devmap is optional.  Starting with
device-dax (*not* fsdax) we enable it by default.  There are a few
pinning improvements particular on the unpinning case and altmap, as
well as unpin_user_page_range_dirty_lock() being just as effective as
THP/hugetlb[0] pages.

    $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S -a -n 512 -w
    (pin_user_pages_fast 2M pages) put:~71 ms -> put:~22 ms
    [altmap]
    (pin_user_pages_fast 2M pages) get:~524ms put:~525 ms -> get: ~127ms put:~71ms

     $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S -a -n 512 -w
    (pin_user_pages_fast 2M pages) put:~513 ms -> put:~188 ms
    [altmap with -m 127004]
    (pin_user_pages_fast 2M pages) get:~4.1 secs put:~4.12 secs -> get:~1sec put:~563ms

Tested on x86 with 1Tb+ of pmem (alongside registering it with RDMA with
and without altmap), alongside gup_test selftests with dynamic dax
regions and static dax regions.  Coupled with ndctl unit tests for
dynamic dax devices that exercise all of this.  Note, for dynamic dax
regions I had to revert commit 8aa83e6395 ("x86/setup: Call
early_reserve_memory() earlier"), it is a known issue that this commit
broke efi_fake_mem=.

This patch (of 11):

Split the utility function prep_compound_page() into head and tail
counterparts, and use them accordingly.

This is in preparation for sharing the storage for compound page
metadata.

Link: https://lkml.kernel.org/r/20211202204422.26777-1-joao.m.martins@oracle.com
Link: https://lkml.kernel.org/r/20211202204422.26777-3-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15 16:30:25 +02:00
..
damon mm/damon/dbgfs: fix 'struct pid' leaks in 'dbgfs_target_ids_write()' 2021-12-31 09:20:12 -08:00
kasan mm: defer kmemleak object creation of module_alloc() 2022-01-15 16:30:25 +02:00
kfence kfence: fix memory leak when cat kfence objects 2021-12-25 12:20:55 -08:00
backing-dev.c mm: bdi: initialize bdi_min_ratio when bdi is unregistered 2021-12-10 17:10:56 -08:00
balloon_compaction.c mm: fix typos in comments 2021-05-07 00:26:35 -07:00
bootmem_info.c mm/bootmem_info.c: mark __init on register_page_bootmem_info_section 2021-09-03 09:58:14 -07:00
cleancache.c
cma_debug.c mm/cma: change cma mutex to irq safe spinlock 2021-05-05 11:27:21 -07:00
cma_sysfs.c mm: cma: support sysfs 2021-05-05 11:27:24 -07:00
cma.c memblock: rename memblock_free to memblock_phys_free 2021-11-06 13:30:41 -07:00
cma.h mm: cma: support sysfs 2021-05-05 11:27:24 -07:00
compaction.c Merge branch 'akpm' (patches from Andrew) 2021-11-06 14:08:17 -07:00
debug_page_ref.c
debug_vm_pgtable.c mm: debug_vm_pgtable: don't use __P000 directly 2021-11-06 13:30:33 -07:00
debug.c Merge branch 'akpm' (patches from Andrew) 2021-11-06 14:08:17 -07:00
dmapool.c mm/dmapool: use DEVICE_ATTR_RO macro 2021-06-29 10:53:52 -07:00
early_ioremap.c mm/early_ioremap.c: remove redundant early_ioremap_shutdown() 2021-09-08 11:50:24 -07:00
fadvise.c
failslab.c
filemap.c filemap: remove PageHWPoison check from next_uptodate_page() 2021-12-10 17:10:55 -08:00
folio-compat.c mm/filemap: Add FGP_STABLE 2021-10-18 07:49:41 -04:00
frontswap.c mm/mempool: minor coding style tweaks 2021-05-05 11:27:27 -07:00
gup_test.c selftests/vm: gup_test: test faulting in kernel, and verify pinnable pages 2021-05-05 11:27:26 -07:00
gup_test.h selftests/vm: gup_test: fix test flag 2021-05-05 11:27:26 -07:00
gup.c Merge branch 'akpm' (patches from Andrew) 2021-11-06 14:08:17 -07:00
highmem.c Fixes for 5.16 folios: 2021-11-25 10:13:56 -08:00
hmm.c mm/hmm: bypass devmap pte when all pfn requested flags are fulfilled 2021-09-08 18:45:52 -07:00
huge_memory.c Memory folios 2021-11-01 08:47:59 -07:00
hugetlb_cgroup.c hugetlb_cgroup: remove unused hugetlb_cgroup_from_counter macro 2021-11-06 13:30:39 -07:00
hugetlb_vmemmap.c mm: hugetlb: introduce CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON 2021-06-30 20:47:26 -07:00
hugetlb_vmemmap.h mm: hugetlb: introduce nr_free_vmemmap_pages in the struct hstate 2021-06-30 20:47:25 -07:00
hugetlb.c hugetlbfs: fix issue of preallocation of gigantic pages can't work 2021-12-10 17:10:56 -08:00
hwpoison-inject.c mm: hwpoison: don't drop slab caches for offlining non-LRU page 2021-09-03 09:58:15 -07:00
init-mm.c mm: add setup_initial_init_mm() helper 2021-07-08 11:48:21 -07:00
internal.h Merge branch 'akpm' (patches from Andrew) 2021-11-06 14:08:17 -07:00
interval_tree.c
io-mapping.c mm: add a io_mapping_map_user helper 2021-04-30 11:20:39 -07:00
ioremap.c mm: move ioremap_page_range to vmalloc.c 2021-09-08 11:50:24 -07:00
Kconfig percpu: km: ensure it is used with NOMMU (either UP or SMP) 2021-12-06 12:45:09 -05:00
Kconfig.debug
khugepaged.c Merge branch 'akpm' (patches from Andrew) 2021-11-06 14:08:17 -07:00
kmemleak.c kmemleak: fix kmemleak false positive report with HW tag-based kasan enable 2022-01-15 16:30:25 +02:00
ksm.c mm/migrate: Add folio_migrate_flags() 2021-10-18 07:49:39 -04:00
list_lru.c mm: list_lru: only add memcg-aware lrus to the global lru list 2021-11-06 13:30:35 -07:00
maccess.c ARM: 9115/1: mm/maccess: fix unaligned copy_{from,to}_kernel_nofault 2021-08-20 11:39:25 +01:00
madvise.c mm: use pidfd_get_task() 2021-10-14 13:29:22 +02:00
Makefile mm/util: Add folio_mapping() and folio_file_mapping() 2021-09-27 09:27:30 -04:00
mapping_dirty_helpers.c mm/mapping_dirty_helpers: remove double Note in kerneldoc 2021-07-01 11:06:02 -07:00
memblock.c arm64 fixes for -rc1 2021-11-10 11:29:30 -08:00
memcontrol.c mm: slab: make slab iterator functions static 2022-01-15 16:30:25 +02:00
memfd.c mm,hugetlb: remove mlock ulimit for SHM_HUGETLB 2021-11-09 10:02:48 -08:00
memory_hotplug.c treewide: Add missing includes masked by cgroup -> bpf dependency 2021-12-03 10:58:13 -08:00
memory-failure.c mm/hwpoison: clear MF_COUNT_INCREASED before retrying get_any_page() 2021-12-25 12:20:56 -08:00
memory.c Merge branch 'akpm' (patches from Andrew) 2021-11-06 14:08:17 -07:00
mempolicy.c mm: mempolicy: fix THP allocations escaping mempolicy restrictions 2021-12-25 12:20:55 -08:00
mempool.c mm: remove spurious blkdev.h includes 2021-10-18 06:17:01 -06:00
memremap.c mm/memcg: Convert mem_cgroup_uncharge() to take a folio 2021-09-27 09:27:31 -04:00
memtest.c
migrate.c mm/migrate.c: remove MIGRATE_PFN_LOCKED 2021-11-11 09:34:35 -08:00
mincore.c
mlock.c mm/memcg: Add folio_lruvec_relock_irq() and folio_lruvec_relock_irqsave() 2021-09-27 09:27:31 -04:00
mm_init.c include/linux/page-flags-layout.h: cleanups 2021-04-30 11:20:42 -07:00
mmap_lock.c mm: mmap_lock: fix disabling preemption directly 2021-07-23 17:43:28 -07:00
mmap.c mm,hugetlb: remove mlock ulimit for SHM_HUGETLB 2021-11-09 10:02:48 -08:00
mmu_gather.c
mmu_notifier.c
mmzone.c
mprotect.c mm/mprotect.c: avoid repeated assignment in do_mprotect_pkey() 2021-11-06 13:30:36 -07:00
mremap.c mm, hugepages: add mremap() support for hugepage backed vma 2021-11-06 13:30:39 -07:00
msync.c
nommu.c Merge branch 'akpm' (patches from Andrew) 2021-11-06 14:08:17 -07:00
oom_kill.c pidfd.v5.16 2021-11-10 16:02:08 -08:00
page_alloc.c mm/page_alloc: split prep_compound_page into head and tail subparts 2022-01-15 16:30:25 +02:00
page_counter.c
page_ext.c mm/page_ext.c: fix a comment 2021-11-06 13:30:34 -07:00
page_idle.c mm/idle_page_tracking: make PG_idle reusable 2021-09-08 11:50:24 -07:00
page_io.c for-5.16/block-2021-10-29 2021-11-01 09:19:50 -07:00
page_isolation.c mm/page_isolation: guard against possible putback unisolated page 2021-11-06 13:30:40 -07:00
page_owner.c mm/page_owner.c: modify the type of argument "order" in some functions 2021-11-11 09:34:35 -08:00
page_poison.c
page_reporting.c mm/page_reporting: allow driver to specify reporting order 2021-06-29 10:53:47 -07:00
page_reporting.h mm/page_reporting: export reporting order as module parameter 2021-06-29 10:53:47 -07:00
page_vma_mapped.c mm: device exclusive memory access 2021-07-01 11:06:03 -07:00
page-writeback.c folio: Add a function to get the host inode for a folio 2021-11-10 21:16:52 +00:00
pagewalk.c mm: pagewalk: fix walk for hugepage tables 2021-06-29 10:53:49 -07:00
percpu-internal.h Merge branch 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu 2021-07-01 17:17:24 -07:00
percpu-km.c percpu: flush tlb in pcpu_reclaim_populated() 2021-07-04 18:30:17 +00:00
percpu-stats.c percpu: rework memcg accounting 2021-06-05 20:43:15 +00:00
percpu-vm.c percpu: flush tlb in pcpu_reclaim_populated() 2021-07-04 18:30:17 +00:00
percpu.c memblock: use memblock_free for freeing virtual pointers 2021-11-06 13:30:41 -07:00
pgalloc-track.h mm: fix typos in comments 2021-05-07 00:26:35 -07:00
pgtable-generic.c mm/thp: fix __split_huge_pmd_locked() on shmem migration entry 2021-06-16 09:24:42 -07:00
process_vm_access.c mm/process_vm_access.c: remove duplicate include 2021-05-05 11:27:27 -07:00
ptdump.c
readahead.c Merge branch 'akpm' (patches from Andrew) 2021-11-06 14:08:17 -07:00
rmap.c Merge branch 'akpm' (patches from Andrew) 2021-11-06 14:08:17 -07:00
rodata_test.c
secretmem.c mm/secretmem: avoid letting secretmem_users drop to zero 2021-10-28 17:18:55 -07:00
shmem.c fs: Remove FS_THP_SUPPORT 2021-11-17 10:36:35 -05:00
shuffle.c
shuffle.h mm/shuffle: fix section mismatch warning 2021-05-22 15:09:07 -10:00
slab_common.c mm: slab: make slab iterator functions static 2022-01-15 16:30:25 +02:00
slab.c mm: emit the "free" trace report before freeing memory in kmem_cache_free() 2021-11-20 10:35:54 -08:00
slab.h mm: slab: make slab iterator functions static 2022-01-15 16:30:25 +02:00
slob.c mm: emit the "free" trace report before freeing memory in kmem_cache_free() 2021-11-20 10:35:54 -08:00
slub.c mm/slub: fix endianness bug for alloc/free_traces attributes 2021-12-10 17:10:56 -08:00
sparse-vmemmap.c mm: remove redundant smp_wmb() 2021-11-06 13:30:36 -07:00
sparse.c memblock: use memblock_free for freeing virtual pointers 2021-11-06 13:30:41 -07:00
swap_cgroup.c
swap_slots.c treewide: Add missing includes masked by cgroup -> bpf dependency 2021-12-03 10:58:13 -08:00
swap_state.c mm/workingset: Convert workingset_refault() to take a folio 2021-10-18 07:49:40 -04:00
swap.c mm/swap.c:put_pages_list(): reinitialise the page list 2021-11-20 10:35:54 -08:00
swapfile.c Merge branch 'akpm' (patches from Andrew) 2021-11-06 14:08:17 -07:00
truncate.c vfs: keep inodes with page cache off the inode shrinker LRU 2021-11-09 10:02:48 -08:00
usercopy.c
userfaultfd.c Revert "mm: shmem: don't truncate page if memory failure happens" 2021-11-13 12:03:03 -08:00
util.c mm: Remove folio_test_single 2021-11-17 10:36:35 -05:00
vmacache.c
vmalloc.c mm: defer kmemleak object creation of module_alloc() 2022-01-15 16:30:25 +02:00
vmpressure.c mm/vmpressure: fix data-race with memcg->socket_pressure 2021-11-06 13:30:40 -07:00
vmscan.c mm: vmscan: reduce throttling due to a failure to make progress -fix 2021-12-31 13:12:55 -08:00
vmstat.c mm: vmstat.c: make extfrag_index show more pretty 2021-11-06 13:30:42 -07:00
workingset.c Merge branch 'akpm' (patches from Andrew) 2021-11-09 10:11:53 -08:00
z3fold.c mm/z3fold: add kerneldoc fields for z3fold_pool 2021-07-01 11:06:03 -07:00
zbud.c mm/zbud: add kerneldoc fields for zbud_pool 2021-07-01 11:06:03 -07:00
zpool.c mm: fix typos in comments 2021-05-07 00:26:35 -07:00
zsmalloc.c mm/zsmalloc.c: close race window between zs_pool_dec_isolated() and zs_unregister_migration() 2021-11-06 13:30:43 -07:00
zswap.c mm/zswap.c: fix two bugs in zswap_writeback_entry() 2021-06-30 20:47:31 -07:00