linux/mm
Aneesh Kumar K.V c777e2a8b6 powerpc/mm: Fix Multi hit ERAT cause by recent THP update
With ppc64 we use the deposited pgtable_t to store the hash pte slot
information. We should not withdraw the deposited pgtable_t without
marking the pmd none. This ensure that low level hash fault handling
will skip this huge pte and we will handle them at upper levels.

Recent change to pmd splitting changed the above in order to handle the
race between pmd split and exit_mmap. The race is explained below.

Consider following race:

		CPU0				CPU1
shrink_page_list()
  add_to_swap()
    split_huge_page_to_list()
      __split_huge_pmd_locked()
        pmdp_huge_clear_flush_notify()
	// pmd_none() == true
					exit_mmap()
					  unmap_vmas()
					    zap_pmd_range()
					      // no action on pmd since pmd_none() == true
	pmd_populate()

As result the THP will not be freed. The leak is detected by check_mm():

	BUG: Bad rss-counter state mm:ffff880058d2e580 idx:1 val:512

The above required us to not mark pmd none during a pmd split.

The fix for ppc is to clear the huge pte of _PAGE_USER, so that low
level fault handling code skip this pte. At higher level we do take ptl
lock. That should serialze us against the pmd split. Once the lock is
acquired we do check the pmd again using pmd_same. That should always
return false for us and hence we should retry the access. We do the
pmd_same check in all case after taking plt with
THP (do_huge_pmd_wp_page, do_huge_pmd_numa_page and
huge_pmd_set_accessed)

Also make sure we wait for irq disable section in other cpus to finish
before flipping a huge pte entry with a regular pmd entry. Code paths
like find_linux_pte_or_hugepte depend on irq disable to get
a stable pte_t pointer. A parallel thp split need to make sure we
don't convert a pmd pte to a regular pmd entry without waiting for the
irq disable section to finish.

Fixes: eef1b3ba05 ("thp: implement split_huge_pmd()")
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-02-15 21:10:04 +11:00
..
kasan kasan: fix kmemleak false-positive in kasan_module_alloc() 2015-11-20 16:17:32 -08:00
backing-dev.c mm: memcontrol: export root_mem_cgroup 2016-01-14 16:00:49 -08:00
balloon_compaction.c virtio_balloon: fix race between migration and ballooning 2016-01-12 20:47:06 +02:00
bootmem.c x86/mm: Introduce max_possible_pfn 2015-12-06 12:46:31 +01:00
cleancache.c
cma_debug.c
cma.c mm/cma.c: suppress warning 2015-11-05 19:34:48 -08:00
cma.h
compaction.c mm/compaction.c: __compact_pgdat() code cleanuup 2016-01-14 16:00:49 -08:00
debug-pagealloc.c
debug.c mm: rework mapcount accounting to enable 4k mapping of THPs 2016-01-15 17:56:32 -08:00
dmapool.c mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd 2015-11-06 17:50:42 -08:00
early_ioremap.c mm/early_ioremap: use offset_in_page macro 2015-11-05 19:34:48 -08:00
fadvise.c
failslab.c mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM 2015-11-06 17:50:42 -08:00
filemap.c mm: differentiate page_mapped() from page_mapcount() for compound pages 2016-01-15 17:56:32 -08:00
frame_vector.c mm: fix docbook comment for get_vaddr_frames() 2015-11-05 19:34:48 -08:00
frontswap.c
gup.c mm: bring in additional flag for fixup_user_fault to signal unlock 2016-01-15 17:56:32 -08:00
highmem.c
huge_memory.c powerpc/mm: Fix Multi hit ERAT cause by recent THP update 2016-02-15 21:10:04 +11:00
hugetlb_cgroup.c mm: make compound_head() robust 2015-11-06 17:50:42 -08:00
hugetlb.c mm: rework mapcount accounting to enable 4k mapping of THPs 2016-01-15 17:56:32 -08:00
hwpoison-inject.c hwpoison: use page_cgroup_ino for filtering by memcg 2015-09-10 13:29:01 -07:00
init-mm.c
internal.h thp: reintroduce split_huge_page() 2016-01-15 17:56:32 -08:00
interval_tree.c
Kconfig mm: re-enable THP 2016-01-15 17:56:32 -08:00
Kconfig.debug
kmemcheck.c
kmemleak-test.c
kmemleak.c Revert "gfp: add __GFP_NOACCOUNT" 2016-01-14 16:00:49 -08:00
ksm.c mm/ksm.c: mark stable page dirty 2016-01-15 17:56:32 -08:00
list_lru.c memcg: simplify and inline __mem_cgroup_from_kmem 2015-11-05 19:34:48 -08:00
maccess.c mm/maccess.c: actually return -EFAULT from strncpy_from_unsafe 2015-11-05 19:34:48 -08:00
madvise.c mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called 2016-01-15 17:56:32 -08:00
Makefile media updates for v4.3-rc1 2015-09-11 16:42:39 -07:00
memblock.c mm/memblock: introduce for_each_memblock_type() 2016-01-14 16:00:49 -08:00
memcontrol.c memcg: only free spare array when readers are done 2016-01-15 17:56:32 -08:00
memory_hotplug.c x86, mm: introduce vmem_altmap to augment vmemmap_populate() 2016-01-15 17:56:32 -08:00
memory-failure.c mm: soft-offline: exit with failure for non anonymous thp 2016-01-15 17:56:32 -08:00
memory.c mm, dax: dax-pmd vs thp-pmd vs hugetlbfs-pmd 2016-01-15 17:56:32 -08:00
mempolicy.c mm: mempolicy: skip non-migratable VMAs when setting MPOL_MF_LAZY 2016-01-15 17:56:32 -08:00
mempool.c mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd 2015-11-06 17:50:42 -08:00
memtest.c
migrate.c thp: introduce deferred_split_huge_page() 2016-01-15 17:56:32 -08:00
mincore.c mm, thp: remove infrastructure for handling splitting PMDs 2016-01-15 17:56:32 -08:00
mlock.c mm/mlock.c: change can_do_mlock return value type to boolean 2016-01-15 17:56:32 -08:00
mm_init.c
mmap.c mm: fix locking order in mm_take_all_locks() 2016-01-15 17:56:32 -08:00
mmu_context.c
mmu_notifier.c mmu-notifier: add clear_young callback 2015-09-10 13:29:01 -07:00
mmzone.c mm/mmzone.c: memmap_valid_within() can be boolean 2016-01-14 16:00:49 -08:00
mprotect.c mm, dax: dax-pmd vs thp-pmd vs hugetlbfs-pmd 2016-01-15 17:56:32 -08:00
mremap.c mm, thp: remove infrastructure for handling splitting PMDs 2016-01-15 17:56:32 -08:00
msync.c mm/msync: use offset_in_page macro 2015-11-05 19:34:48 -08:00
nobootmem.c x86/mm: Introduce max_possible_pfn 2015-12-06 12:46:31 +01:00
nommu.c kmemcg: account certain kmem allocations to memcg 2016-01-14 16:00:49 -08:00
oom_kill.c mm, shmem: add internal shmem resident memory accounting 2016-01-14 16:00:49 -08:00
page_alloc.c mm/page_alloc.c: remove unused struct zone *z variable 2016-01-15 17:56:32 -08:00
page_counter.c mm: page_counter: let page_counter_try_charge() return bool 2015-11-05 19:34:48 -08:00
page_ext.c mm: introduce idle page tracking 2015-09-10 13:29:01 -07:00
page_idle.c mm: add page_check_address_transhuge() helper 2016-01-15 17:56:32 -08:00
page_io.c
page_isolation.c mm/page_isolation: do some cleanup in "undo_isolate_page_range" 2016-01-15 17:56:32 -08:00
page_owner.c
page-writeback.c mm: page_alloc: generalize the dirty balance reserve 2016-01-14 16:00:49 -08:00
pagewalk.c thp: rename split_huge_page_pmd() to split_huge_pmd() 2016-01-15 17:56:32 -08:00
percpu-km.c
percpu-vm.c
percpu.c mm/percpu: use offset_in_page macro 2015-11-05 19:34:48 -08:00
pgtable-generic.c mm, dax: dax-pmd vs thp-pmd vs hugetlbfs-pmd 2016-01-15 17:56:32 -08:00
process_vm_access.c
quicklist.c
readahead.c mm: move lru_to_page to mm_inline.h 2016-01-14 16:00:49 -08:00
rmap.c mm: fix locking order in mm_take_all_locks() 2016-01-15 17:56:32 -08:00
shmem.c memcg: adjust to support new THP refcounting 2016-01-15 17:56:32 -08:00
slab_common.c slab: add SLAB_ACCOUNT flag 2016-01-14 16:00:49 -08:00
slab.c mm/slab.c: add a helper function get_first_slab 2016-01-14 16:00:49 -08:00
slab.h slab: add SLAB_ACCOUNT flag 2016-01-14 16:00:49 -08:00
slob.c slab/slub: adjust kmem_cache_alloc_bulk API 2015-11-22 11:58:44 -08:00
slub.c page-flags: define PG_locked behavior on compound pages 2016-01-15 17:56:32 -08:00
sparse-vmemmap.c x86, mm: introduce vmem_altmap to augment vmemmap_populate() 2016-01-15 17:56:32 -08:00
sparse.c x86, mm: introduce vmem_altmap to augment vmemmap_populate() 2016-01-15 17:56:32 -08:00
swap_cgroup.c
swap_state.c mm: support madvise(MADV_FREE) 2016-01-15 17:56:32 -08:00
swap.c mm, x86: get_user_pages() for dax mappings 2016-01-15 17:56:32 -08:00
swapfile.c mm: make swapoff more robust against soft dirty 2016-01-15 17:56:32 -08:00
truncate.c
userfaultfd.c memcg: adjust to support new THP refcounting 2016-01-15 17:56:32 -08:00
util.c mm: prepare page_referenced() and page_idle to new THP refcounting 2016-01-15 17:56:32 -08:00
vmacache.c mm/vmacache: inline vmacache_valid_mm() 2015-11-05 19:34:48 -08:00
vmalloc.c mm/vmalloc.c: use macro IS_ALIGNED to judge the aligment 2016-01-15 17:56:32 -08:00
vmpressure.c memcg: avoid vmpressure oops when memcg disabled 2016-01-14 16:00:49 -08:00
vmscan.c mm: support madvise(MADV_FREE) 2016-01-15 17:56:32 -08:00
vmstat.c mm: support madvise(MADV_FREE) 2016-01-15 17:56:32 -08:00
workingset.c
zbud.c mm/zbud.c: use list_last_entry() instead of list_tail_entry() 2016-01-15 11:40:52 -08:00
zpool.c mm: zsmalloc: constify struct zs_pool name 2015-11-06 17:50:42 -08:00
zsmalloc.c zsmalloc: reorganize struct size_class to pack 4 bytes hole 2016-01-15 11:40:52 -08:00
zswap.c mm/zswap: change incorrect strncmp use to strcmp 2015-12-18 14:25:40 -08:00