From 8639a847b0e11f8d2daa3eafe15a9609c91fd357 Mon Sep 17 00:00:00 2001 From: Atsushi Kumagai Date: Thu, 28 Apr 2016 16:18:18 -0700 Subject: [PATCH 01/20] kexec: update VMCOREINFO for compound_order/dtor makedumpfile refers page.lru.next to get the order of compound pages for page filtering. However, now the order is stored in page.compound_order, hence VMCOREINFO should be updated to export the offset of page.compound_order. The fact is, page.compound_order was introduced already in kernel 4.0, but the offset of it was the same as page.lru.next until kernel 4.3, so this was not actual problem. The above can be said also for page.lru.prev and page.compound_dtor, it's necessary to detect hugetlbfs pages. Further, the content was changed from direct address to the ID which means dtor. The problem is that unnecessary hugepages won't be removed from a dump file in kernels 4.4.x and later. This means that extra disk space would be consumed. It's a problem, but not critical. Signed-off-by: Atsushi Kumagai Acked-by: Dave Young Cc: "Eric W. Biederman" Cc: Vivek Goyal Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/kexec_core.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index 8d34308ea449..cbbb4c724149 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -1415,6 +1415,8 @@ static int __init crash_save_vmcoreinfo_init(void) VMCOREINFO_OFFSET(page, lru); VMCOREINFO_OFFSET(page, _mapcount); VMCOREINFO_OFFSET(page, private); + VMCOREINFO_OFFSET(page, compound_dtor); + VMCOREINFO_OFFSET(page, compound_order); VMCOREINFO_OFFSET(pglist_data, node_zones); VMCOREINFO_OFFSET(pglist_data, nr_zones); #ifdef CONFIG_FLAT_NODE_MEM_MAP @@ -1447,8 +1449,8 @@ static int __init crash_save_vmcoreinfo_init(void) #ifdef CONFIG_X86 VMCOREINFO_NUMBER(KERNEL_IMAGE_SIZE); #endif -#ifdef CONFIG_HUGETLBFS - VMCOREINFO_SYMBOL(free_huge_page); +#ifdef CONFIG_HUGETLB_PAGE + VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR); #endif arch_crash_save_vmcoreinfo(); From d7f53518f713d3d9bf5ed150f943853fb94e7473 Mon Sep 17 00:00:00 2001 From: Atsushi Kumagai Date: Thu, 28 Apr 2016 16:18:21 -0700 Subject: [PATCH 02/20] kexec: export OFFSET(page.compound_head) to find out compound tail page PageAnon() always look at head page to check PAGE_MAPPING_ANON and tail page's page->mapping has just a poisoned data since commit 1c290f642101 ("mm: sanitize page->mapping for tail pages"). If makedumpfile checks page->mapping of a compound tail page to distinguish anonymous page as usual, it must fail in newer kernel. So it's necessary to export OFFSET(page.compound_head) to avoid checking compound tail pages. The problem is that unnecessary hugepages won't be removed from a dump file in kernels 4.5.x and later. This means that extra disk space would be consumed. It's a problem, but not critical. Signed-off-by: Atsushi Kumagai Acked-by: Dave Young Cc: "Eric W. Biederman" Cc: Vivek Goyal Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/kexec_core.c | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index cbbb4c724149..1391d3ee3b86 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -1417,6 +1417,7 @@ static int __init crash_save_vmcoreinfo_init(void) VMCOREINFO_OFFSET(page, private); VMCOREINFO_OFFSET(page, compound_dtor); VMCOREINFO_OFFSET(page, compound_order); + VMCOREINFO_OFFSET(page, compound_head); VMCOREINFO_OFFSET(pglist_data, node_zones); VMCOREINFO_OFFSET(pglist_data, nr_zones); #ifdef CONFIG_FLAT_NODE_MEM_MAP From 66ee95d16a7f1b7b4f1dd74a2d81c6e19dc29a14 Mon Sep 17 00:00:00 2001 From: Steve Capper Date: Thu, 28 Apr 2016 16:18:24 -0700 Subject: [PATCH 03/20] mm: exclude HugeTLB pages from THP page_mapped() logic HugeTLB pages cannot be split, so we use the compound_mapcount to track rmaps. Currently page_mapped() will check the compound_mapcount, but will also go through the constituent pages of a THP compound page and query the individual _mapcount's too. Unfortunately, page_mapped() does not distinguish between HugeTLB and THP compound pages and assumes that a compound page always needs to have HPAGE_PMD_NR pages querying. For most cases when dealing with HugeTLB this is just inefficient, but for scenarios where the HugeTLB page size is less than the pmd block size (e.g. when using contiguous bit on ARM) this can lead to crashes. This patch adjusts the page_mapped function such that we skip the unnecessary THP reference checks for HugeTLB pages. Fixes: e1534ae95004 ("mm: differentiate page_mapped() from page_mapcount() for compound pages") Signed-off-by: Steve Capper Acked-by: Kirill A. Shutemov Cc: Will Deacon Cc: Catalin Marinas Cc: Michal Hocko Cc: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index a55e5be0894f..79b6c18d0a38 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1031,6 +1031,8 @@ static inline bool page_mapped(struct page *page) page = compound_head(page); if (atomic_read(compound_mapcount_ptr(page)) >= 0) return true; + if (PageHuge(page)) + return false; for (i = 0; i < hpage_nr_pages(page); i++) { if (atomic_read(&page[i]._mapcount) >= 0) return true; From aa88b68c3b1dce8bc3fd54c8a7372a777ff265cd Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Thu, 28 Apr 2016 16:18:27 -0700 Subject: [PATCH 04/20] thp: keep huge zero page pinned until tlb flush Andrea has found[1] a race condition on MMU-gather based TLB flush vs split_huge_page() or shrinker which frees huge zero under us (patch 1/2 and 2/2 respectively). With new THP refcounting, we don't need patch 1/2: mmu_gather keeps the page pinned until flush is complete and the pin prevents the page from being split under us. We still need patch 2/2. This is simplified version of Andrea's patch. We don't need fancy encoding. [1] http://lkml.kernel.org/r/1447938052-22165-1-git-send-email-aarcange@redhat.com Signed-off-by: Kirill A. Shutemov Reported-by: Andrea Arcangeli Reviewed-by: Andrea Arcangeli Cc: "Aneesh Kumar K.V" Cc: Mel Gorman Cc: Hugh Dickins Cc: Johannes Weiner Cc: Dave Hansen Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/huge_mm.h | 5 +++++ mm/huge_memory.c | 6 +++--- mm/swap.c | 5 +++++ 3 files changed, 13 insertions(+), 3 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 7008623e24b1..d7b9e5346fba 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -152,6 +152,7 @@ static inline bool is_huge_zero_pmd(pmd_t pmd) } struct page *get_huge_zero_page(void); +void put_huge_zero_page(void); #else /* CONFIG_TRANSPARENT_HUGEPAGE */ #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; }) @@ -208,6 +209,10 @@ static inline bool is_huge_zero_page(struct page *page) return false; } +static inline void put_huge_zero_page(void) +{ + BUILD_BUG(); +} static inline struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmd, int flags) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 86f9f8b82f8e..5346de05f471 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -232,7 +232,7 @@ retry: return READ_ONCE(huge_zero_page); } -static void put_huge_zero_page(void) +void put_huge_zero_page(void) { /* * Counter should never go to zero here. Only shrinker can put @@ -1684,12 +1684,12 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, if (vma_is_dax(vma)) { spin_unlock(ptl); if (is_huge_zero_pmd(orig_pmd)) - put_huge_zero_page(); + tlb_remove_page(tlb, pmd_page(orig_pmd)); } else if (is_huge_zero_pmd(orig_pmd)) { pte_free(tlb->mm, pgtable_trans_huge_withdraw(tlb->mm, pmd)); atomic_long_dec(&tlb->mm->nr_ptes); spin_unlock(ptl); - put_huge_zero_page(); + tlb_remove_page(tlb, pmd_page(orig_pmd)); } else { struct page *page = pmd_page(orig_pmd); page_remove_rmap(page, true); diff --git a/mm/swap.c b/mm/swap.c index a0bc206b4ac6..03aacbcb013f 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -728,6 +728,11 @@ void release_pages(struct page **pages, int nr, bool cold) zone = NULL; } + if (is_huge_zero_page(page)) { + put_huge_zero_page(); + continue; + } + page = compound_head(page); if (!put_page_testzero(page)) continue; From 314e9b75b3f673a61ecbf0bba401830bc7dfe121 Mon Sep 17 00:00:00 2001 From: Krzysztof Kozlowski Date: Thu, 28 Apr 2016 16:18:30 -0700 Subject: [PATCH 05/20] mailmap: fix Krzysztof Kozlowski's misspelled name Patchwork introduced a garbled Polish character in commit 1e3012d0fdc5 ("crypto: s5p-sss - Use memcpy_toio for iomem annotated memory") so fix the mail mapping. Additionally prefer to use kernel.org account for personal work, instead of my gmail address. Signed-off-by: Krzysztof Kozlowski Cc: Herbert Xu Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- .mailmap | 1 + 1 file changed, 1 insertion(+) diff --git a/.mailmap b/.mailmap index 90c0aefc276d..c45e71f551cd 100644 --- a/.mailmap +++ b/.mailmap @@ -79,6 +79,7 @@ Kay Sievers Kenneth W Chen Konstantin Khlebnikov Koushik +Krzysztof Kozlowski Kuninori Morimoto Leonid I Ananiev Linas Vepstas From 3486b85a29c1741db99d0c522211c82d2b7a56d0 Mon Sep 17 00:00:00 2001 From: Konstantin Khlebnikov Date: Thu, 28 Apr 2016 16:18:32 -0700 Subject: [PATCH 06/20] mm/huge_memory: replace VM_NO_THP VM_BUG_ON with actual VMA check Khugepaged detects own VMAs by checking vm_file and vm_ops but this way it cannot distinguish private /dev/zero mappings from other special mappings like /dev/hpet which has no vm_ops and popultes PTEs in mmap. This fixes false-positive VM_BUG_ON and prevents installing THP where they are not expected. Link: http://lkml.kernel.org/r/CACT4Y+ZmuZMV5CjSFOeXviwQdABAgT7T+StKfTqan9YDtgEi5g@mail.gmail.com Fixes: 78f11a255749 ("mm: thp: fix /dev/zero MAP_PRIVATE and vm_flags cleanups") Signed-off-by: Konstantin Khlebnikov Reported-by: Dmitry Vyukov Acked-by: Vlastimil Babka Acked-by: Kirill A. Shutemov Cc: Dmitry Vyukov Cc: Andrea Arcangeli Cc: stable Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/huge_memory.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 5346de05f471..df67b53ae3c5 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1960,10 +1960,9 @@ int khugepaged_enter_vma_merge(struct vm_area_struct *vma, * page fault if needed. */ return 0; - if (vma->vm_ops) + if (vma->vm_ops || (vm_flags & VM_NO_THP)) /* khugepaged not yet working on file or special mappings */ return 0; - VM_BUG_ON_VMA(vm_flags & VM_NO_THP, vma); hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK; hend = vma->vm_end & HPAGE_PMD_MASK; if (hstart < hend) @@ -2352,8 +2351,7 @@ static bool hugepage_vma_check(struct vm_area_struct *vma) return false; if (is_vma_temporary_stack(vma)) return false; - VM_BUG_ON_VMA(vma->vm_flags & VM_NO_THP, vma); - return true; + return !(vma->vm_flags & VM_NO_THP); } static void collapse_huge_page(struct mm_struct *mm, From 28093f9f34cedeaea0f481c58446d9dac6dd620f Mon Sep 17 00:00:00 2001 From: Gerald Schaefer Date: Thu, 28 Apr 2016 16:18:35 -0700 Subject: [PATCH 07/20] numa: fix /proc//numa_maps for THP In gather_pte_stats() a THP pmd is cast into a pte, which is wrong because the layouts may differ depending on the architecture. On s390 this will lead to inaccurate numa_maps accounting in /proc because of misguided pte_present() and pte_dirty() checks on the fake pte. On other architectures pte_present() and pte_dirty() may work by chance, but there may be an issue with direct-access (dax) mappings w/o underlying struct pages when HAVE_PTE_SPECIAL is set and THP is available. In vm_normal_page() the fake pte will be checked with pte_special() and because there is no "special" bit in a pmd, this will always return false and the VM_PFNMAP | VM_MIXEDMAP checking will be skipped. On dax mappings w/o struct pages, an invalid struct page pointer would then be returned that can crash the kernel. This patch fixes the numa_maps THP handling by introducing new "_pmd" variants of the can_gather_numa_stats() and vm_normal_page() functions. Signed-off-by: Gerald Schaefer Cc: Naoya Horiguchi Cc: "Kirill A . Shutemov" Cc: Konstantin Khlebnikov Cc: Michal Hocko Cc: Vlastimil Babka Cc: Jerome Marchand Cc: Johannes Weiner Cc: Dave Hansen Cc: Mel Gorman Cc: Dan Williams Cc: Martin Schwidefsky Cc: Heiko Carstens Cc: Michael Holzheu Cc: [4.3+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/proc/task_mmu.c | 33 ++++++++++++++++++++++++++++++--- include/linux/mm.h | 2 ++ mm/memory.c | 40 ++++++++++++++++++++++++++++++++++++++++ 3 files changed, 72 insertions(+), 3 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 229cb546bee0..541583510cfb 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1518,6 +1518,32 @@ static struct page *can_gather_numa_stats(pte_t pte, struct vm_area_struct *vma, return page; } +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +static struct page *can_gather_numa_stats_pmd(pmd_t pmd, + struct vm_area_struct *vma, + unsigned long addr) +{ + struct page *page; + int nid; + + if (!pmd_present(pmd)) + return NULL; + + page = vm_normal_page_pmd(vma, addr, pmd); + if (!page) + return NULL; + + if (PageReserved(page)) + return NULL; + + nid = page_to_nid(page); + if (!node_isset(nid, node_states[N_MEMORY])) + return NULL; + + return page; +} +#endif + static int gather_pte_stats(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) { @@ -1527,14 +1553,14 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr, pte_t *orig_pte; pte_t *pte; +#ifdef CONFIG_TRANSPARENT_HUGEPAGE ptl = pmd_trans_huge_lock(pmd, vma); if (ptl) { - pte_t huge_pte = *(pte_t *)pmd; struct page *page; - page = can_gather_numa_stats(huge_pte, vma, addr); + page = can_gather_numa_stats_pmd(*pmd, vma, addr); if (page) - gather_stats(page, md, pte_dirty(huge_pte), + gather_stats(page, md, pmd_dirty(*pmd), HPAGE_PMD_SIZE/PAGE_SIZE); spin_unlock(ptl); return 0; @@ -1542,6 +1568,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr, if (pmd_trans_unstable(pmd)) return 0; +#endif orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); do { struct page *page = can_gather_numa_stats(*pte, vma, addr); diff --git a/include/linux/mm.h b/include/linux/mm.h index 79b6c18d0a38..864d7221de84 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1140,6 +1140,8 @@ struct zap_details { struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, pte_t pte); +struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr, + pmd_t pmd); int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address, unsigned long size); diff --git a/mm/memory.c b/mm/memory.c index 93897f23cc11..305537fc8640 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -789,6 +789,46 @@ out: return pfn_to_page(pfn); } +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr, + pmd_t pmd) +{ + unsigned long pfn = pmd_pfn(pmd); + + /* + * There is no pmd_special() but there may be special pmds, e.g. + * in a direct-access (dax) mapping, so let's just replicate the + * !HAVE_PTE_SPECIAL case from vm_normal_page() here. + */ + if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) { + if (vma->vm_flags & VM_MIXEDMAP) { + if (!pfn_valid(pfn)) + return NULL; + goto out; + } else { + unsigned long off; + off = (addr - vma->vm_start) >> PAGE_SHIFT; + if (pfn == vma->vm_pgoff + off) + return NULL; + if (!is_cow_mapping(vma->vm_flags)) + return NULL; + } + } + + if (is_zero_pfn(pfn)) + return NULL; + if (unlikely(pfn > highest_memmap_pfn)) + return NULL; + + /* + * NOTE! We still have PageReserved() pages in the page tables. + * eg. VDSO mappings can cause them to exist. + */ +out: + return pfn_to_page(pfn); +} +#endif + /* * copy one vm_area from one task to the other. Assumes the page tables * already present in the new task to be cleared in the whole range From 7bf52fb891b64b8d61caf0b82060adb9db761aec Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Thu, 28 Apr 2016 16:18:38 -0700 Subject: [PATCH 08/20] mm: vmscan: reclaim highmem zone if buffer_heads is over limit We have been reclaimed highmem zone if buffer_heads is over limit but commit 6b4f7799c6a5 ("mm: vmscan: invoke slab shrinkers from shrink_zone()") changed the behavior so it doesn't reclaim highmem zone although buffer_heads is over the limit. This patch restores the logic. Fixes: 6b4f7799c6a5 ("mm: vmscan: invoke slab shrinkers from shrink_zone()") Signed-off-by: Minchan Kim Cc: Johannes Weiner Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index b934223eaa45..c638b28310fc 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2553,7 +2553,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc) sc->gfp_mask |= __GFP_HIGHMEM; for_each_zone_zonelist_nodemask(zone, z, zonelist, - requested_highidx, sc->nodemask) { + gfp_zone(sc->gfp_mask), sc->nodemask) { enum zone_type classzone_idx; if (!populated_zone(zone)) From b06bad17c7435b600a1d7a35b56eff25e1d3dbc0 Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Thu, 28 Apr 2016 16:18:41 -0700 Subject: [PATCH 09/20] mm: call swap_slot_free_notify() with page lock held Kyeongdon reported below error which is BUG_ON(!PageSwapCache(page)) in page_swap_info. The reason is that page_endio in rw_page unlocks the page if read I/O is completed so we need to hold a PG_lock again to check PageSwapCache. Otherwise, the page can be removed from swapcache. Kernel BUG at c00f9040 [verbose debug info unavailable] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM Modules linked in: CPU: 4 PID: 13446 Comm: RenderThread Tainted: G W 3.10.84-g9f14aec-dirty #73 task: c3b73200 ti: dd192000 task.ti: dd192000 PC is at page_swap_info+0x10/0x2c LR is at swap_slot_free_notify+0x18/0x6c pc : [] lr : [] psr: 400f0113 sp : dd193d78 ip : c2deb1e4 fp : da015180 r10: 00000000 r9 : 000200da r8 : c120fe08 r7 : 00000000 r6 : 00000000 r5 : c249a6c0 r4 : = c249a6c0 r3 : 00000000 r2 : 40080009 r1 : 200f0113 r0 : = c249a6c0 .. .. Call Trace: page_swap_info+0x10/0x2c swap_slot_free_notify+0x18/0x6c swap_readpage+0x90/0x11c read_swap_cache_async+0x134/0x1ac swapin_readahead+0x70/0xb0 handle_pte_fault+0x320/0x6fc handle_mm_fault+0xc0/0xf0 do_page_fault+0x11c/0x36c do_DataAbort+0x34/0x118 Fixes: 3f2b1a04f44933f2 ("zram: revive swap_slot_free_notify") Signed-off-by: Minchan Kim Tested-by: Kyeongdon Kim Cc: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_io.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/mm/page_io.c b/mm/page_io.c index cd92e3d67a32..985f23cfa79b 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -353,7 +353,11 @@ int swap_readpage(struct page *page) ret = bdev_read_page(sis->bdev, swap_page_sector(page), page); if (!ret) { - swap_slot_free_notify(page); + if (trylock_page(page)) { + swap_slot_free_notify(page); + unlock_page(page); + } + count_vm_event(PSWPIN); return 0; } From d7e69488bd04de165667f6bc741c1c0ec6042ab9 Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Thu, 28 Apr 2016 16:18:44 -0700 Subject: [PATCH 10/20] mm/hwpoison: fix wrong num_poisoned_pages accounting Currently, migration code increses num_poisoned_pages on *failed* migration page as well as successfully migrated one at the trial of memory-failure. It will make the stat wrong. As well, it marks the page as PG_HWPoison even if the migration trial failed. It would mean we cannot recover the corrupted page using memory-failure facility. This patches fixes it. Signed-off-by: Minchan Kim Reported-by: Vlastimil Babka Acked-by: Naoya Horiguchi Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/migrate.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/mm/migrate.c b/mm/migrate.c index 6c822a7b27e0..f9dfb18a4eba 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -975,7 +975,13 @@ out: dec_zone_page_state(page, NR_ISOLATED_ANON + page_is_file_cache(page)); /* Soft-offlined page shouldn't go through lru cache list */ - if (reason == MR_MEMORY_FAILURE) { + if (reason == MR_MEMORY_FAILURE && rc == MIGRATEPAGE_SUCCESS) { + /* + * With this release, we free successfully migrated + * page and set PG_HWPoison on just freed page + * intentionally. Although it's rather weird, it's how + * HWPoison flag works at the moment. + */ put_page(page); if (!test_set_page_hwpoison(page)) num_poisoned_pages_inc(); From eeb68d1d2d48b5bfe9d79d8eac35df576bb79a99 Mon Sep 17 00:00:00 2001 From: Frank Rowand Date: Thu, 28 Apr 2016 16:18:47 -0700 Subject: [PATCH 11/20] .mailmap: add Frank Rowand Set current email address to replace obsolete email addresses. Signed-off-by: Frank Rowand Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- .mailmap | 3 +++ 1 file changed, 3 insertions(+) diff --git a/.mailmap b/.mailmap index c45e71f551cd..c156a8b4d845 100644 --- a/.mailmap +++ b/.mailmap @@ -48,6 +48,9 @@ Felix Kuhling Felix Moeller Filipe Lautert Franck Bui-Huu +Frank Rowand +Frank Rowand +Frank Rowand Frank Zago Greg Kroah-Hartman Greg Kroah-Hartman From fd901c95388b3bd5a6f749ed1d677a672b992298 Mon Sep 17 00:00:00 2001 From: Vlastimil Babka Date: Thu, 28 Apr 2016 16:18:49 -0700 Subject: [PATCH 12/20] mm: wake kcompactd before kswapd's short sleep When kswapd goes to sleep it checks if the node is balanced and at first it sleeps only for HZ/10 time, then rechecks if the node is still balanced and nobody has woken it during the initial sleep. Only then it goes fully sleep until an allocation slowpath wakes it up again. For higher-order allocations, waking up kcompactd is done only before the full sleep. This turns out to be an issue in case another high-order allocation fails during the initial sleep. It will wake kswapd up, however kswapd considers the zone balanced from the order-0 perspective, and will just quickly try to sleep again. So if there's a longer stream of high-order allocations hitting the slowpath and waking up kswapd, it might never actually wake up kcompactd, which may be considered a regression from kswapd-based compaction. In the worst case, it might be that a single allocation that cannot direct reclaim/compact itself is waking kswapd in the retry loop and preventing kcompactd from being woken up and unblocking it. This patch makes sure kcompactd is woken up in such situations by simply moving the wakeup before the short initial sleep. More efficient solution would be to wake kcompactd immediately instead of kswapd if the node is already order-0 balanced, but in that case we should also move reset_isolation_suitable() call to kcompactd so it's not adding to the allocator's latency. Since it's late in the 4.6 cycle, let's go with the simpler change for now. Fixes: accf62422b3a ("mm, kswapd: replace kswapd compaction with waking up kcompactd") Signed-off-by: Vlastimil Babka Cc: Andrea Arcangeli Cc: "Kirill A. Shutemov" Cc: Rik van Riel Cc: Joonsoo Kim Cc: Mel Gorman Cc: David Rientjes Cc: Michal Hocko Cc: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 28 ++++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index c638b28310fc..142cb61f4822 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3318,6 +3318,20 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, /* Try to sleep for a short interval */ if (prepare_kswapd_sleep(pgdat, order, remaining, balanced_classzone_idx)) { + /* + * Compaction records what page blocks it recently failed to + * isolate pages from and skips them in the future scanning. + * When kswapd is going to sleep, it is reasonable to assume + * that pages and compaction may succeed so reset the cache. + */ + reset_isolation_suitable(pgdat); + + /* + * We have freed the memory, now we should compact it to make + * allocation of the requested order possible. + */ + wakeup_kcompactd(pgdat, order, classzone_idx); + remaining = schedule_timeout(HZ/10); finish_wait(&pgdat->kswapd_wait, &wait); prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE); @@ -3341,20 +3355,6 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, */ set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold); - /* - * Compaction records what page blocks it recently failed to - * isolate pages from and skips them in the future scanning. - * When kswapd is going to sleep, it is reasonable to assume - * that pages and compaction may succeed so reset the cache. - */ - reset_isolation_suitable(pgdat); - - /* - * We have freed the memory, now we should compact it to make - * allocation of the requested order possible. - */ - wakeup_kcompactd(pgdat, order, classzone_idx); - if (!kthread_should_stop()) schedule(); From bdab42dfc974d15303afbf259f340f374a453974 Mon Sep 17 00:00:00 2001 From: James Morse Date: Thu, 28 Apr 2016 16:18:52 -0700 Subject: [PATCH 13/20] kcov: don't trace the code coverage code Kcov causes the compiler to add a call to __sanitizer_cov_trace_pc() in every basic block. Ftrace patches in a call to _mcount() to each function it has annotated. Letting these mechanisms annotate each other is a bad thing. Break the loop by adding 'notrace' to __sanitizer_cov_trace_pc() so that ftrace won't try to patch this code. This patch lets arm64 with KCOV and STACK_TRACER boot. Signed-off-by: James Morse Acked-by: Dmitry Vyukov Cc: Alexander Potapenko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/kcov.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/kcov.c b/kernel/kcov.c index 3efbee0834a8..78bed7125515 100644 --- a/kernel/kcov.c +++ b/kernel/kcov.c @@ -43,7 +43,7 @@ struct kcov { * Entry point from instrumented code. * This is called once per basic-block/edge. */ -void __sanitizer_cov_trace_pc(void) +void notrace __sanitizer_cov_trace_pc(void) { struct task_struct *t; enum kcov_mode mode; From 36f05ae8bce904b4c8105363e6227a79d343bda6 Mon Sep 17 00:00:00 2001 From: Andrey Ryabinin Date: Thu, 28 Apr 2016 16:18:55 -0700 Subject: [PATCH 14/20] kcov: don't profile branches in kcov Profiling 'if' statements in __sanitizer_cov_trace_pc() leads to unbound recursion and crash: __sanitizer_cov_trace_pc() -> ftrace_likely_update -> __sanitizer_cov_trace_pc() ... Define DISABLE_BRANCH_PROFILING to disable this tracer. Signed-off-by: Andrey Ryabinin Cc: Dmitry Vyukov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/kcov.c | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/kcov.c b/kernel/kcov.c index 78bed7125515..a02f2dddd1d7 100644 --- a/kernel/kcov.c +++ b/kernel/kcov.c @@ -1,5 +1,6 @@ #define pr_fmt(fmt) "kcov: " fmt +#define DISABLE_BRANCH_PROFILING #include #include #include From a320817c68e3fa1fc3ddaa709a1ad45cf533693b Mon Sep 17 00:00:00 2001 From: Ananth N Mavinakayanahalli Date: Thu, 28 Apr 2016 16:18:58 -0700 Subject: [PATCH 15/20] Ananth has moved The current ID is going away soon... update email address Signed-off-by: Ananth N Mavinakayanahalli Cc: Masami Hiramatsu Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- MAINTAINERS | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index 849133608da7..7f72d54f065d 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -6400,7 +6400,7 @@ F: mm/kmemleak.c F: mm/kmemleak-test.c KPROBES -M: Ananth N Mavinakayanahalli +M: Ananth N Mavinakayanahalli M: Anil S Keshavamurthy M: "David S. Miller" M: Masami Hiramatsu From b73413647ee36406561618f68d0661d49dc47489 Mon Sep 17 00:00:00 2001 From: xuejiufei Date: Thu, 28 Apr 2016 16:19:01 -0700 Subject: [PATCH 16/20] ocfs2/dlm: return zero if deref_done message is successfully handled dlm_deref_lockres_done_handler() should return zero if the message is successfully handled. Fixes: 60d663cb5273 ("ocfs2/dlm: add DEREF_DONE message"). Signed-off-by: xuejiufei Reviewed-by: Joseph Qi Cc: Mark Fasheh Cc: Joel Becker Cc: Junxiao Bi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/ocfs2/dlm/dlmmaster.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c index 9aed6e202201..13719d3f35f8 100644 --- a/fs/ocfs2/dlm/dlmmaster.c +++ b/fs/ocfs2/dlm/dlmmaster.c @@ -2455,6 +2455,8 @@ int dlm_deref_lockres_done_handler(struct o2net_msg *msg, u32 len, void *data, spin_unlock(&dlm->spinlock); + ret = 0; + done: dlm_put(dlm); return ret; From c2e7e00b715d3c65f301bec8559d6af4ef8098ab Mon Sep 17 00:00:00 2001 From: Konstantin Khlebnikov Date: Thu, 28 Apr 2016 16:19:03 -0700 Subject: [PATCH 17/20] mm/memory-failure: fix race with compound page split/merge get_hwpoison_page() must recheck relation between head and tail pages. n-horiguchi said: without this recheck, the race causes kernel to pin an irrelevant page, and finally makes kernel crash for refcount mismatch. Signed-off-by: Konstantin Khlebnikov Acked-by: Naoya Horiguchi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory-failure.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 78f5f2641b91..ca5acee53b7a 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -888,7 +888,15 @@ int get_hwpoison_page(struct page *page) } } - return get_page_unless_zero(head); + if (get_page_unless_zero(head)) { + if (head == compound_head(page)) + return 1; + + pr_info("MCE: %#lx cannot catch tail\n", page_to_pfn(page)); + put_page(head); + } + + return 0; } EXPORT_SYMBOL_GPL(get_hwpoison_page); From 99f23c2cded85a377325aa9fd374ffa3d55d1088 Mon Sep 17 00:00:00 2001 From: Vladimir Zapolskiy Date: Thu, 28 Apr 2016 16:19:06 -0700 Subject: [PATCH 18/20] rapidio: fix potential NULL pointer dereference The change fixes improper check for a returned error value by class_create() function, which on error returns ERR_PTR() value, thus the original check always results in a dead code on error path. Signed-off-by: Vladimir Zapolskiy Signed-off-by: Alexandre Bounine Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/rapidio/devices/rio_mport_cdev.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/rapidio/devices/rio_mport_cdev.c b/drivers/rapidio/devices/rio_mport_cdev.c index 5d4d91846357..96168b819044 100644 --- a/drivers/rapidio/devices/rio_mport_cdev.c +++ b/drivers/rapidio/devices/rio_mport_cdev.c @@ -2669,9 +2669,9 @@ static int __init mport_init(void) /* Create device class needed by udev */ dev_class = class_create(THIS_MODULE, DRV_NAME); - if (!dev_class) { + if (IS_ERR(dev_class)) { rmcd_error("Unable to create " DRV_NAME " class"); - return -EINVAL; + return PTR_ERR(dev_class); } ret = alloc_chrdev_region(&dev_number, 0, RIO_MAX_MPORTS, DRV_NAME); From 33334e25769c6ad69b983379578f42581d99a2f9 Mon Sep 17 00:00:00 2001 From: Alexander Potapenko Date: Thu, 28 Apr 2016 16:19:09 -0700 Subject: [PATCH 19/20] lib/stackdepot.c: allow the stack trace hash to be zero Do not bail out from depot_save_stack() if the stack trace has zero hash. Initially depot_save_stack() silently dropped stack traces with zero hashes, however there's actually no point in reserving this zero value. Reported-by: Joonsoo Kim Signed-off-by: Alexander Potapenko Acked-by: Andrey Ryabinin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/stackdepot.c | 4 ---- 1 file changed, 4 deletions(-) diff --git a/lib/stackdepot.c b/lib/stackdepot.c index 654c9d87e83a..9e0b0315a724 100644 --- a/lib/stackdepot.c +++ b/lib/stackdepot.c @@ -210,10 +210,6 @@ depot_stack_handle_t depot_save_stack(struct stack_trace *trace, goto fast_exit; hash = hash_stack(trace->entries, trace->nr_entries); - /* Bad luck, we won't store this stack. */ - if (hash == 0) - goto exit; - bucket = &stack_table[hash & STACK_HASH_MASK]; /* From 7c88a292dfcd6979e839493ef18d04770eb3bad0 Mon Sep 17 00:00:00 2001 From: Xishi Qiu Date: Thu, 28 Apr 2016 16:19:11 -0700 Subject: [PATCH 20/20] Documentation/sysctl/vm.txt: update numa_zonelist_order description Commit 3193913ce62c ("mm: page_alloc: default node-ordering on 64-bit NUMA, zone-ordering on 32-bit") changes the default value of numa_zonelist_order. Update the document. Signed-off-by: Xishi Qiu Cc: Mel Gorman Cc: Johannes Weiner Cc: Rik van Riel Cc: David Rientjes Cc: Kamezawa Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/sysctl/vm.txt | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index cb0368459da3..34a5fece3121 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -581,15 +581,16 @@ Specify "[Nn]ode" for node order "Zone Order" orders the zonelists by zone type, then by node within each zone. Specify "[Zz]one" for zone order. -Specify "[Dd]efault" to request automatic configuration. Autoconfiguration -will select "node" order in following case. -(1) if the DMA zone does not exist or -(2) if the DMA zone comprises greater than 50% of the available memory or -(3) if any node's DMA zone comprises greater than 70% of its local memory and - the amount of local memory is big enough. +Specify "[Dd]efault" to request automatic configuration. -Otherwise, "zone" order will be selected. Default order is recommended unless -this is causing problems for your system/application. +On 32-bit, the Normal zone needs to be preserved for allocations accessible +by the kernel, so "zone" order will be selected. + +On 64-bit, devices that require DMA32/DMA are relatively rare, so "node" +order will be selected. + +Default order is recommended unless this is causing problems for your +system/application. ==============================================================