eb1d7a65f0
15278 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Yafang Shao
|
eb1d7a65f0 |
mm, fadvise: improve the expensive remote LRU cache draining after FADV_DONTNEED
Our users reported that there're some random latency spikes when their RT process is running. Finally we found that latency spike is caused by FADV_DONTNEED. Which may call lru_add_drain_all() to drain LRU cache on remote CPUs, and then waits the per-cpu work to complete. The wait time is uncertain, which may be tens millisecond. That behavior is unreasonable, because this process is bound to a specific CPU and the file is only accessed by itself, IOW, there should be no pagecache pages on a per-cpu pagevec of a remote CPU. That unreasonable behavior is partially caused by the wrong comparation of the number of invalidated pages and the number of the target. For example, if (count < (end_index - start_index + 1)) The count above is how many pages were invalidated in the local CPU, and (end_index - start_index + 1) is how many pages should be invalidated. The usage of (end_index - start_index + 1) is incorrect, because they are virtual addresses, which may not mapped to pages. Besides that, there may be holes between start and end. So we'd better check whether there are still pages on per-cpu pagevec after drain the local cpu, and then decide whether or not to call lru_add_drain_all(). After I applied it with a hotfix to our production environment, most of the lru_add_drain_all() can be avoided. Suggested-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/20200923133318.14373-1-laoar.shao@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
27a83a609b |
mm/filemap: fix filemap_map_pages for THP
We dereference page->mapping and page->index directly after calling
find_subpage() and these fields are not valid for tail pages. While
commit
|
||
Matthew Wilcox (Oracle)
|
a8cf7f272b |
mm: add find_lock_head
Add a new FGP_HEAD flag which avoids calling find_subpage() and add a convenience wrapper for it. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Auld <matthew.auld@intel.com> Cc: William Kucharski <william.kucharski@oracle.com> Link: https://lkml.kernel.org/r/20200910183318.20139-9-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
63ec1973dd |
mm/shmem: return head page from find_lock_entry
Convert shmem_getpage_gfp() (the only remaining caller of find_lock_entry()) to cope with a head page being returned instead of the subpage for the index. [willy@infradead.org: fix BUG()s] Link https://lore.kernel.org/linux-mm/20200912032042.GA6583@casper.infradead.org/ Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Auld <matthew.auld@intel.com> Cc: William Kucharski <william.kucharski@oracle.com> Link: https://lkml.kernel.org/r/20200910183318.20139-8-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
a6de4b4873 |
mm: convert find_get_entry to return the head page
There are only four callers remaining of find_get_entry(). get_shadow_from_swap_cache() only wants to see shadow entries and doesn't care about which page is returned. Push the find_subpage() call into find_lock_entry(), find_get_incore_page() and pagecache_get_page(). [willy@infradead.org: fix oops] Link: https://lkml.kernel.org/r/20200914112738.GM6583@casper.infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Auld <matthew.auld@intel.com> Cc: William Kucharski <william.kucharski@oracle.com> Link: https://lkml.kernel.org/r/20200910183318.20139-7-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
9dfc8ff34b |
i915: use find_lock_page instead of find_lock_entry
i915 does not want to see value entries. Switch it to use find_lock_page() instead, and remove the export of find_lock_entry(). Move find_lock_entry() and find_get_entry() to mm/internal.h to discourage any future use. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: William Kucharski <william.kucharski@oracle.com> Link: https://lkml.kernel.org/r/20200910183318.20139-6-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
e6e88712e4 |
mm: optimise madvise WILLNEED
Instead of calling find_get_entry() for every page index, use an XArray iterator to skip over NULL entries, and avoid calling get_page(), because we only want the swap entries. [willy@infradead.org: fix LTP soft lockups] Link: https://lkml.kernel.org/r/20200914165032.GS6583@casper.infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Qian Cai <cai@redhat.com> Link: https://lkml.kernel.org/r/20200910183318.20139-4-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
f5df8635c5 |
mm: use find_get_incore_page in memcontrol
The current code does not protect against swapoff of the underlying swap device, so this is a bug fix as well as a worthwhile reduction in code complexity. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Auld <matthew.auld@intel.com> Cc: William Kucharski <william.kucharski@oracle.com> Link: https://lkml.kernel.org/r/20200910183318.20139-3-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
61ef186557 |
mm: factor find_get_incore_page out of mincore_page
Patch series "Return head pages from find_*_entry", v2. This patch series started out as part of the THP patch set, but it has some nice effects along the way and it seems worth splitting it out and submitting separately. Currently find_get_entry() and find_lock_entry() return the page corresponding to the requested index, but the first thing most callers do is find the head page, which we just threw away. As part of auditing all the callers, I found some misuses of the APIs and some plain inefficiencies that I've fixed. The diffstat is unflattering, but I added more kernel-doc and a new wrapper. This patch (of 8); Provide this functionality from the swap cache. It's useful for more than just mincore(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Matthew Auld <matthew.auld@intel.com> Cc: Huang Ying <ying.huang@intel.com> Link: https://lkml.kernel.org/r/20200910183318.20139-1-willy@infradead.org Link: https://lkml.kernel.org/r/20200910183318.20139-2-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
John Hubbard
|
bac3cf4d01 |
mm, dump_page: rename head_mapcount() --> head_compound_mapcount()
Rename head_pincount() --> head_compound_pincount(). These names are more accurate (or less misleading) than the original ones. Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Qian Cai <cai@lca.pw> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: William Kucharski <william.kucharski@oracle.com> Link: https://lkml.kernel.org/r/20200807183358.105097-1-jhubbard@nvidia.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
853322a671 |
mm/debug.c: do not dereference i_ino blindly
__dump_page() checks i_dentry is fetchable and i_ino is earlier in the struct than i_ino, so it ought to work fine, but it's possible that struct randomisation has reordered i_ino after i_dentry and the pointer is just wild enough that i_dentry is fetchable and i_ino isn't. Also print the inode number if the dentry is invalid. Reported-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Link: https://lkml.kernel.org/r/20200819185710.28180-1-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Dan Williams
|
b7b3c01b19 |
mm/memremap_pages: support multiple ranges per invocation
In support of device-dax growing the ability to front physically dis-contiguous ranges of memory, update devm_memremap_pages() to track multiple ranges with a single reference counter and devm instance. Convert all [devm_]memremap_pages() users to specify the number of ranges they are mapping in their 'struct dev_pagemap' instance. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: David Airlie <airlied@linux.ie> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: "Jérôme Glisse" <jglisse@redhat.co Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brice Goglin <Brice.Goglin@inria.fr> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hulk Robot <hulkci@huawei.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Jason Yan <yanaijie@huawei.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: Jia He <justin.he@arm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: kernel test robot <lkp@intel.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: Will Deacon <will@kernel.org> Link: https://lkml.kernel.org/r/159643103789.4062302.18426128170217903785.stgit@dwillia2-desk3.amr.corp.intel.com Link: https://lkml.kernel.org/r/160106116293.30709.13350662794915396198.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Dan Williams
|
a4574f63ed |
mm/memremap_pages: convert to 'struct range'
The 'struct resource' in 'struct dev_pagemap' is only used for holding resource span information. The other fields, 'name', 'flags', 'desc', 'parent', 'sibling', and 'child' are all unused wasted space. This is in preparation for introducing a multi-range extension of devm_memremap_pages(). The bulk of this change is unwinding all the places internal to libnvdimm that used 'struct resource' unnecessarily, and replacing instances of 'struct dev_pagemap'.res with 'struct dev_pagemap'.range. P2PDMA had a minor usage of the resource flags field, but only to report failures with "%pR". That is replaced with an open coded print of the range. [dan.carpenter@oracle.com: mm/hmm/test: use after free in dmirror_allocate_chunk()] Link: https://lkml.kernel.org/r/20200926121402.GA7467@kadam Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> [xen] Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: David Airlie <airlied@linux.ie> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Juergen Gross <jgross@suse.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brice Goglin <Brice.Goglin@inria.fr> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hulk Robot <hulkci@huawei.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Jason Yan <yanaijie@huawei.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Jia He <justin.he@arm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: kernel test robot <lkp@intel.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: Will Deacon <will@kernel.org> Link: https://lkml.kernel.org/r/159643103173.4062302.768998885691711532.stgit@dwillia2-desk3.amr.corp.intel.com Link: https://lkml.kernel.org/r/160106115761.30709.13539840236873663620.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Dan Williams
|
a035b6bf86 |
mm/memory_hotplug: introduce default phys_to_target_node() implementation
In preparation to set a fallback value for dev_dax->target_node, introduce generic fallback helpers for phys_to_target_node() A generic implementation based on node-data or memblock was proposed, but as noted by Mike: "Here again, I would prefer to add a weak default for phys_to_target_node() because the "generic" implementation is not really generic. The fallback to reserved ranges is x86 specfic because on x86 most of the reserved areas is not in memblock.memory. AFAIK, no other architecture does this." The info message in the generic memory_add_physaddr_to_nid() implementation is fixed up to properly reflect that memory_add_physaddr_to_nid() communicates "online" node info and phys_to_target_node() indicates "target / to-be-onlined" node info. [akpm@linux-foundation.org: fix CONFIG_MEMORY_HOTPLUG=n build] Link: https://lkml.kernel.org/r/202008252130.7YrHIyMI%25lkp@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Jia He <justin.he@arm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brice Goglin <Brice.Goglin@inria.fr> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: David Airlie <airlied@linux.ie> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: Will Deacon <will@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Hulk Robot <hulkci@huawei.com> Cc: Jason Yan <yanaijie@huawei.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: kernel test robot <lkp@intel.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Link: https://lkml.kernel.org/r/159643097768.4062302.3135192588966888630.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Hui Su
|
1abbef4f51 |
mm,kmemleak-test.c: move kmemleak-test.c to samples dir
kmemleak-test.c is just a kmemleak test module, which also can not be used as a built-in kernel module. Thus, i think it may should not be in mm dir, and move the kmemleak-test.c to samples/kmemleak/kmemleak-test.c. Fix the spelling of built-in by the way. Signed-off-by: Hui Su <sh_def@163.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Cc: David S. Miller <davem@davemloft.net> Cc: Rob Herring <robh@kernel.org> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> Cc: Divya Indi <divya.indi@oracle.com> Cc: Tomas Winkler <tomas.winkler@intel.com> Cc: David Howells <dhowells@redhat.com> Link: https://lkml.kernel.org/r/20200925183729.GA172837@rlk Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Davidlohr Bueso
|
c4b28963fd |
mm/kmemleak: rely on rcu for task stack scanning
kmemleak_scan() currently relies on the big tasklist_lock hammer to stabilize iterating through the tasklist. Instead, this patch proposes simply using rcu along with the rcu-safe for_each_process_thread flavor (without changing scan semantics), which doesn't make use of next_thread/p->thread_group and thus cannot race with exit. Furthermore, any races with fork() and not seeing the new child should be benign as it's not running yet and can also be detected by the next scan. Avoiding the tasklist_lock could prove beneficial for performance considering the scan operation is done periodically. I have seen improvements of 30%-ish when doing similar replacements on very pathological microbenchmarks (ie stressing get/setpriority(2)). However my main motivation is that it's one less user of the global lock, something that Linus has long time wanted to see gone eventually (if ever) even if the traditional fairness issues has been dealt with now with qrwlocks. Of course this is a very long ways ahead. This patch also kills another user of the deprecated tsk->thread_group. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Qian Cai <cai@lca.pw> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lkml.kernel.org/r/20200820203902.11308-1-dave@stgolabs.net Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Abel Wu
|
9cf7a11183 |
mm/slub: make add_full() condition more explicit
The commit below is incomplete, as it didn't handle the add_full() part.
commit
|
||
Abel Wu
|
9f986d998a |
mm/slub: fix missing ALLOC_SLOWPATH stat when bulk alloc
The ALLOC_SLOWPATH statistics is missing in bulk allocation now. Fix it by doing statistics in alloc slow path. Signed-off-by: Abel Wu <wuyun.wu@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Hewenliang <hewenliang4@huawei.com> Cc: Hu Shiyuan <hushiyuan@huawei.com> Link: http://lkml.kernel.org/r/20200811022427.1363-1-wuyun.wu@huawei.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Abel Wu
|
c270cf3041 |
mm/slub.c: branch optimization in free slowpath
The two conditions are mutually exclusive and gcc compiler will optimise this into if-else-like pattern. Given that the majority of free_slowpath is free_frozen, let's provide some hint to the compilers. Tests (perf bench sched messaging -g 20 -l 400000, executed 10x after reboot) are done and the summarized result: un-patched patched max. 192.316 189.851 min. 187.267 186.252 avg. 189.154 188.086 stdev. 1.37 0.99 Signed-off-by: Abel Wu <wuyun.wu@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Hewenliang <hewenliang4@huawei.com> Cc: Hu Shiyuan <hushiyuan@huawei.com> Link: http://lkml.kernel.org/r/20200813101812.1617-1-wuyun.wu@huawei.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Mateusz Nosek
|
c1ff3f9549 |
mm/slab.c: clean code by removing redundant if condition
The removed code was unnecessary and changed nothing in the flow, since in case of returning NULL by 'kmem_cache_alloc_node' returning 'freelist' from the function in question is the same as returning NULL. Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: https://lkml.kernel.org/r/20200915230329.13002-1-mateusznosek0@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Linus Torvalds
|
3ad11d7ac8 |
block-5.10-2020-10-12
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+EWUgQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpnoxEADCVSNBRkpV0OVkOEC3wf8EGhXhk01Jnjtl u5Mg2V55hcgJ0thQxBV/V28XyqmsEBrmAVi0Yf8Vr9Qbq4Ze08Wae4ChS4rEOyh1 jTcGYWx5aJB3ChLvV/HI0nWQ3bkj03mMrL3SW8rhhf5DTyKHsVeTenpx42Qu/FKf fRzi09FSr3Pjd0B+EX6gunwJnlyXQC5Fa4AA0GhnXJzAznANXxHkkcXu8a6Yw75x e28CfhIBliORsK8sRHLoUnPpeTe1vtxCBhBMsE+gJAj9ZUOWMzvNFIPP4FvfawDy 6cCQo2m1azJ/IdZZCDjFUWyjh+wxdKMp+NNryEcoV+VlqIoc3n98rFwrSL+GIq5Z WVwEwq+AcwoMCsD29Lu1ytL2PQ/RVqcJP5UheMrbL4vzefNfJFumQVZLIcX0k943 8dFL2QHL+H/hM9Dx5y5rjeiWkAlq75v4xPKVjh/DHb4nehddCqn/+DD5HDhNANHf c1kmmEuYhvLpIaC4DHjE6DwLh8TPKahJjwsGuBOTr7D93NUQD+OOWsIhX6mNISIl FFhP8cd0/ZZVV//9j+q+5B4BaJsT+ZtwmrelKFnPdwPSnh+3iu8zPRRWO+8P8fRC YvddxuJAmE6BLmsAYrdz6Xb/wqfyV44cEiyivF0oBQfnhbtnXwDnkDWSfJD1bvCm ZwfpDh2+Tg== =LzyE -----END PGP SIGNATURE----- Merge tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: - Series of merge handling cleanups (Baolin, Christoph) - Series of blk-throttle fixes and cleanups (Baolin) - Series cleaning up BDI, seperating the block device from the backing_dev_info (Christoph) - Removal of bdget() as a generic API (Christoph) - Removal of blkdev_get() as a generic API (Christoph) - Cleanup of is-partition checks (Christoph) - Series reworking disk revalidation (Christoph) - Series cleaning up bio flags (Christoph) - bio crypt fixes (Eric) - IO stats inflight tweak (Gabriel) - blk-mq tags fixes (Hannes) - Buffer invalidation fixes (Jan) - Allow soft limits for zone append (Johannes) - Shared tag set improvements (John, Kashyap) - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel) - DM no-wait support (Mike, Konstantin) - Request allocation improvements (Ming) - Allow md/dm/bcache to use IO stat helpers (Song) - Series improving blk-iocost (Tejun) - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang, Xianting, Yang, Yufen, yangerkun) * tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits) block: fix uapi blkzoned.h comments blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue blk-mq: get rid of the dead flush handle code path block: get rid of unnecessary local variable block: fix comment and add lockdep assert blk-mq: use helper function to test hw stopped block: use helper function to test queue register block: remove redundant mq check block: invoke blk_mq_exit_sched no matter whether have .exit_sched percpu_ref: don't refer to ref->data if it isn't allocated block: ratelimit handle_bad_sector() message blk-throttle: Re-use the throtl_set_slice_end() blk-throttle: Open code __throtl_de/enqueue_tg() blk-throttle: Move service tree validation out of the throtl_rb_first() blk-throttle: Move the list operation after list validation blk-throttle: Fix IO hang for a corner case blk-throttle: Avoid tracking latency if low limit is invalid blk-throttle: Avoid getting the current time if tg->last_finish_time is 0 blk-throttle: Remove a meaningless parameter for throtl_downgrade_state() block: Remove redundant 'return' statement ... |
||
Linus Torvalds
|
85ed13e78d |
Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull compat iovec cleanups from Al Viro: "Christoph's series around import_iovec() and compat variant thereof" * 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: security/keys: remove compat_keyctl_instantiate_key_iov mm: remove compat_process_vm_{readv,writev} fs: remove compat_sys_vmsplice fs: remove the compat readv/writev syscalls fs: remove various compat readv/writev helpers iov_iter: transparently handle compat iovecs in import_iovec iov_iter: refactor rw_copy_check_uvector and import_iovec iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c compat.h: fix a spelling error in <linux/compat.h> |
||
Linus Torvalds
|
50d228345a |
As hoped, things calmed down for docs this cycle; fewer changes and almost
no conflicts at all. This pull includes: - A reworked and expanded user-mode Linux document - Some simplifications and improvements for submitting-patches.rst - An emergency fix for (some) problems with Sphinx 3.x - Some welcome automarkup improvements to automatically generate cross-references to struct definitions and other documents - The usual collection of translation updates, typo fixes, etc. -----BEGIN PGP SIGNATURE----- iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAl+ErNYPHGNvcmJldEBs d24ubmV0AAoJEBdDWhNsDH5Y284H/3bv9fahbg16AJcKYqJXFHGpDs3CsASPnJqQ 9HQoV5tg6Qd4kI3oFb+30l8SK73Wr2t685/DhOPDRR/vN3B5M1vOQvPRL/dEqiwi aUEhtMbnC/trSbteXsjGDWT+1EnI/+R3NFV++WiRp1XxE4DRXL3xySTeviR0IW+V rQxU7VCcVp0bklVH+gqjrsvqU5iZeckyZB6evc8X92ThhzjNprR5KVxxgl1wxcu/ dPYizHoKYVoLVNw50rwPGt2hmq9RpyDM6Xh9UhLHcA57ENyzr8NNTJBOT0tVMTWV smU01X/ECoy54kj1w8AKP+f7F0G7DUU+Jz68X0X/kYPq520dUs4= =Ovox -----END PGP SIGNATURE----- Merge tag 'docs-5.10' of git://git.lwn.net/linux Pull documentation updates from Jonathan Corbet: "As hoped, things calmed down for docs this cycle; fewer changes and almost no conflicts at all. This includes: - A reworked and expanded user-mode Linux document - Some simplifications and improvements for submitting-patches.rst - An emergency fix for (some) problems with Sphinx 3.x - Some welcome automarkup improvements to automatically generate cross-references to struct definitions and other documents - The usual collection of translation updates, typo fixes, etc" * tag 'docs-5.10' of git://git.lwn.net/linux: (81 commits) gpiolib: Update indentation in driver.rst for code excerpts Documentation/admin-guide: tainted-kernels: Fix typo occured Documentation: better locations for sysfs-pci, sysfs-tagging docs: programming-languages: refresh blurb on clang support Documentation: kvm: fix a typo Documentation: Chinese translation of Documentation/arm64/amu.rst doc: zh_CN: index files in arm64 subdirectory mailmap: add entry for <mstarovoitov@marvell.com> doc: seq_file: clarify role of *pos in ->next() docs: trace: ring-buffer-design.rst: use the new SPDX tag Documentation: kernel-parameters: clarify "module." parameters Fix references to nommu-mmap.rst docs: rewrite admin-guide/sysctl/abi.rst docs: fb: Remove vesafb scrollback boot option docs: fb: Remove sstfb scrollback boot option docs: fb: Remove matroxfb scrollback boot option docs: fb: Remove framebuffer scrollback boot option docs: replace the old User Mode Linux HowTo with a new one Documentation/admin-guide: blockdev/ramdisk: remove use of "rdev" Documentation/admin-guide: README & svga: remove use of "rdev" ... |
||
Linus Torvalds
|
ed016af52e |
These are the locking updates for v5.10:
- Add deadlock detection for recursive read-locks. The rationale is outlined in: |
||
Linus Torvalds
|
6734e20e39 |
arm64 updates for 5.10
- Userspace support for the Memory Tagging Extension introduced by Armv8.5. Kernel support (via KASAN) is likely to follow in 5.11. - Selftests for MTE, Pointer Authentication and FPSIMD/SVE context switching. - Fix and subsequent rewrite of our Spectre mitigations, including the addition of support for PR_SPEC_DISABLE_NOEXEC. - Support for the Armv8.3 Pointer Authentication enhancements. - Support for ASID pinning, which is required when sharing page-tables with the SMMU. - MM updates, including treating flush_tlb_fix_spurious_fault() as a no-op. - Perf/PMU driver updates, including addition of the ARM CMN PMU driver and also support to handle CPU PMU IRQs as NMIs. - Allow prefetchable PCI BARs to be exposed to userspace using normal non-cacheable mappings. - Implementation of ARCH_STACKWALK for unwinding. - Improve reporting of unexpected kernel traps due to BPF JIT failure. - Improve robustness of user-visible HWCAP strings and their corresponding numerical constants. - Removal of TEXT_OFFSET. - Removal of some unused functions, parameters and prototypes. - Removal of MPIDR-based topology detection in favour of firmware description. - Cleanups to handling of SVE and FPSIMD register state in preparation for potential future optimisation of handling across syscalls. - Cleanups to the SDEI driver in preparation for support in KVM. - Miscellaneous cleanups and refactoring work. -----BEGIN PGP SIGNATURE----- iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAl+AUXMQHHdpbGxAa2Vy bmVsLm9yZwAKCRC3rHDchMFjNFc1B/4q2Kabe+pPu7s1f58Q+OTaEfqcr3F1qh27 F1YpFZUYxg0GPfPsFrnbJpo5WKo7wdR9ceI9yF/GHjs7A/MSoQJis3pG6SlAd9c0 nMU5tCwhg9wfq6asJtl0/IPWem6cqqhdzC6m808DjeHuyi2CCJTt0vFWH3OeHEhG cfmLfaSNXOXa/MjEkT8y1AXJ/8IpIpzkJeCRA1G5s18PXV9Kl5bafIo9iqyfKPLP 0rJljBmoWbzuCSMc81HmGUQI4+8KRp6HHhyZC/k0WEVgj3LiumT7am02bdjZlTnK BeNDKQsv2Jk8pXP2SlrI3hIUTz0bM6I567FzJEokepvTUzZ+CVBi =9J8H -----END PGP SIGNATURE----- Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Will Deacon: "There's quite a lot of code here, but much of it is due to the addition of a new PMU driver as well as some arm64-specific selftests which is an area where we've traditionally been lagging a bit. In terms of exciting features, this includes support for the Memory Tagging Extension which narrowly missed 5.9, hopefully allowing userspace to run with use-after-free detection in production on CPUs that support it. Work is ongoing to integrate the feature with KASAN for 5.11. Another change that I'm excited about (assuming they get the hardware right) is preparing the ASID allocator for sharing the CPU page-table with the SMMU. Those changes will also come in via Joerg with the IOMMU pull. We do stray outside of our usual directories in a few places, mostly due to core changes required by MTE. Although much of this has been Acked, there were a couple of places where we unfortunately didn't get any review feedback. Other than that, we ran into a handful of minor conflicts in -next, but nothing that should post any issues. Summary: - Userspace support for the Memory Tagging Extension introduced by Armv8.5. Kernel support (via KASAN) is likely to follow in 5.11. - Selftests for MTE, Pointer Authentication and FPSIMD/SVE context switching. - Fix and subsequent rewrite of our Spectre mitigations, including the addition of support for PR_SPEC_DISABLE_NOEXEC. - Support for the Armv8.3 Pointer Authentication enhancements. - Support for ASID pinning, which is required when sharing page-tables with the SMMU. - MM updates, including treating flush_tlb_fix_spurious_fault() as a no-op. - Perf/PMU driver updates, including addition of the ARM CMN PMU driver and also support to handle CPU PMU IRQs as NMIs. - Allow prefetchable PCI BARs to be exposed to userspace using normal non-cacheable mappings. - Implementation of ARCH_STACKWALK for unwinding. - Improve reporting of unexpected kernel traps due to BPF JIT failure. - Improve robustness of user-visible HWCAP strings and their corresponding numerical constants. - Removal of TEXT_OFFSET. - Removal of some unused functions, parameters and prototypes. - Removal of MPIDR-based topology detection in favour of firmware description. - Cleanups to handling of SVE and FPSIMD register state in preparation for potential future optimisation of handling across syscalls. - Cleanups to the SDEI driver in preparation for support in KVM. - Miscellaneous cleanups and refactoring work" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (148 commits) Revert "arm64: initialize per-cpu offsets earlier" arm64: random: Remove no longer needed prototypes arm64: initialize per-cpu offsets earlier kselftest/arm64: Check mte tagged user address in kernel kselftest/arm64: Verify KSM page merge for MTE pages kselftest/arm64: Verify all different mmap MTE options kselftest/arm64: Check forked child mte memory accessibility kselftest/arm64: Verify mte tag inclusion via prctl kselftest/arm64: Add utilities and a test to validate mte memory perf: arm-cmn: Fix conversion specifiers for node type perf: arm-cmn: Fix unsigned comparison to less than zero arm64: dbm: Invalidate local TLB when setting TCR_EL1.HD arm64: mm: Make flush_tlb_fix_spurious_fault() a no-op arm64: Add support for PR_SPEC_DISABLE_NOEXEC prctl() option arm64: Pull in task_stack_page() to Spectre-v4 mitigation code KVM: arm64: Allow patching EL2 vectors even with KASLR is not enabled arm64: Get rid of arm64_ssbd_state KVM: arm64: Convert ARCH_WORKAROUND_2 to arm64_get_spectre_v4_state() KVM: arm64: Get rid of kvm_arm_have_ssbd() KVM: arm64: Simplify handling of ARCH_WORKAROUND_2 ... |
||
Vijay Balakrishna
|
4aab2be098 |
mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged
When memory is hotplug added or removed the min_free_kbytes should be
recalculated based on what is expected by khugepaged. Currently after
hotplug, min_free_kbytes will be set to a lower default and higher
default set when THP enabled is lost.
This change restores min_free_kbytes as expected for THP consumers.
[vijayb@linux.microsoft.com: v5]
Link: https://lkml.kernel.org/r/1601398153-5517-1-git-send-email-vijayb@linux.microsoft.com
Fixes:
|
||
Miaohe Lin
|
bc4fe4cdd6 |
mm: mmap: Fix general protection fault in unlink_file_vma()
The syzbot reported the below general protection fault: general protection fault, probably for non-canonical address 0xe00eeaee0000003b: 0000 [#1] PREEMPT SMP KASAN KASAN: maybe wild-memory-access in range [0x00777770000001d8-0x00777770000001df] CPU: 1 PID: 10488 Comm: syz-executor721 Not tainted 5.9.0-rc3-syzkaller #0 RIP: 0010:unlink_file_vma+0x57/0xb0 mm/mmap.c:164 Call Trace: free_pgtables+0x1b3/0x2f0 mm/memory.c:415 exit_mmap+0x2c0/0x530 mm/mmap.c:3184 __mmput+0x122/0x470 kernel/fork.c:1076 mmput+0x53/0x60 kernel/fork.c:1097 exit_mm kernel/exit.c:483 [inline] do_exit+0xa8b/0x29f0 kernel/exit.c:793 do_group_exit+0x125/0x310 kernel/exit.c:903 get_signal+0x428/0x1f00 kernel/signal.c:2757 arch_do_signal+0x82/0x2520 arch/x86/kernel/signal.c:811 exit_to_user_mode_loop kernel/entry/common.c:136 [inline] exit_to_user_mode_prepare+0x1ae/0x200 kernel/entry/common.c:167 syscall_exit_to_user_mode+0x7e/0x2e0 kernel/entry/common.c:242 entry_SYSCALL_64_after_hwframe+0x44/0xa9 It's because the ->mmap() callback can change vma->vm_file and fput the original file. But the commit |
||
Hugh Dickins
|
033b5d7755 |
mm/khugepaged: fix filemap page_to_pgoff(page) != offset
There have been elusive reports of filemap_fault() hitting its
VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page) on kernels built
with CONFIG_READ_ONLY_THP_FOR_FS=y.
Suren has hit it on a kernel with CONFIG_READ_ONLY_THP_FOR_FS=y and
CONFIG_NUMA is not set: and he has analyzed it down to how khugepaged
without NUMA reuses the same huge page after collapse_file() failed
(whereas NUMA targets its allocation to the respective node each time).
And most of us were usually testing with CONFIG_NUMA=y kernels.
collapse_file(old start)
new_page = khugepaged_alloc_page(hpage)
__SetPageLocked(new_page)
new_page->index = start // hpage->index=old offset
new_page->mapping = mapping
xas_store(&xas, new_page)
filemap_fault
page = find_get_page(mapping, offset)
// if offset falls inside hpage then
// compound_head(page) == hpage
lock_page_maybe_drop_mmap()
__lock_page(page)
// collapse fails
xas_store(&xas, old page)
new_page->mapping = NULL
unlock_page(new_page)
collapse_file(new start)
new_page = khugepaged_alloc_page(hpage)
__SetPageLocked(new_page)
new_page->index = start // hpage->index=new offset
new_page->mapping = mapping // mapping becomes valid again
// since compound_head(page) == hpage
// page_to_pgoff(page) got changed
VM_BUG_ON_PAGE(page_to_pgoff(page) != offset)
An initial patch replaced __SetPageLocked() by lock_page(), which did
fix the race which Suren illustrates above. But testing showed that it's
not good enough: if the racing task's __lock_page() gets delayed long
after its find_get_page(), then it may follow collapse_file(new start)'s
successful final unlock_page(), and crash on the same VM_BUG_ON_PAGE.
It could be fixed by relaxing filemap_fault()'s VM_BUG_ON_PAGE to a
check and retry (as is done for mapping), with similar relaxations in
find_lock_entry() and pagecache_get_page(): but it's not obvious what
else might get caught out; and khugepaged non-NUMA appears to be unique
in exposing a page to page cache, then revoking, without going through
a full cycle of freeing before reuse.
Instead, non-NUMA khugepaged_prealloc_page() release the old page
if anyone else has a reference to it (1% of cases when I tested).
Although never reported on huge tmpfs, I believe its find_lock_entry()
has been at similar risk; but huge tmpfs does not rely on khugepaged
for its normal working nearly so much as READ_ONLY_THP_FOR_FS does.
Reported-by: Denis Lisov <dennis.lissov@gmail.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206569
Link: https://lore.kernel.org/linux-mm/?q=20200219144635.3b7417145de19b65f258c943%40linux-foundation.org
Reported-by: Qian Cai <cai@lca.pw>
Link: https://lore.kernel.org/linux-xfs/?q=20200616013309.GB815%40lca.pw
Reported-and-analyzed-by: Suren Baghdasaryan <surenb@google.com>
Fixes:
|
||
Ingo Molnar
|
e705d39796 |
Merge branch 'locking/urgent' into locking/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Linus Torvalds
|
f3c64eda3e |
mm: avoid early COW write protect games during fork()
In commit |
||
Joonsoo Kim
|
1d91df85f3 |
mm/page_alloc: handle a missing case for memalloc_nocma_{save/restore} APIs
memalloc_nocma_{save/restore} APIs can be used to skip page allocation
on CMA area, but, there is a missing case and the page on CMA area could
be allocated even if APIs are used. This patch handles this case to fix
the potential issue.
For now, these APIs are used to prevent long-term pinning on the CMA
page. When the long-term pinning is requested on the CMA page, it is
migrated to the non-CMA page before pinning. This non-CMA page is
allocated by using memalloc_nocma_{save/restore} APIs. If APIs doesn't
work as intended, the CMA page is allocated and it is pinned for a long
time. This long-term pin for the CMA page causes cma_alloc() failure
and it could result in wrong behaviour on the device driver who uses the
cma_alloc().
Missing case is an allocation from the pcplist. MIGRATE_MOVABLE pcplist
could have the pages on CMA area so we need to skip it if ALLOC_CMA
isn't specified.
Fixes:
|
||
Eric Farman
|
484cfaca95 |
mm, slub: restore initial kmem_cache flags
The routine that applies debug flags to the kmem_cache slabs
inadvertantly prevents non-debug flags from being applied to those
same objects. That is, if slub_debug=<flag>,<slab> is specified,
non-debugged slabs will end up having flags of zero, and the slabs
may be unusable.
Fix this by including the input flags for non-matching slabs with the
contents of slub_debug, so that the caches are created as expected
alongside any debugging options that may be requested. With this, we
can remove the check for a NULL slub_debug_string, since it's covered
by the loop itself.
Fixes:
|
||
Christoph Hellwig
|
c3973b401e |
mm: remove compat_process_vm_{readv,writev}
Now that import_iovec handles compat iovecs, the native syscalls can be used for the compat case as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
||
Christoph Hellwig
|
89cd35c58b |
iov_iter: transparently handle compat iovecs in import_iovec
Use in compat_syscall to import either native or the compat iovecs, and remove the now superflous compat_import_iovec. This removes the need for special compat logic in most callers, and the remaining ones can still be simplified by using __import_iovec with a bool compat parameter. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
||
Christoph Hellwig
|
bfdc59701d |
iov_iter: refactor rw_copy_check_uvector and import_iovec
Split rw_copy_check_uvector into two new helpers with more sensible calling conventions: - iovec_from_user copies a iovec from userspace either into the provided stack buffer if it fits, or allocates a new buffer for it. Returns the actually used iovec. It also verifies that iov_len does fit a signed type, and handles compat iovecs if the compat flag is set. - __import_iovec consolidates the native and compat versions of import_iovec. It calls iovec_from_user, then validates each iovec actually points to user addresses, and ensures the total length doesn't overflow. This has two major implications: - the access_process_vm case loses the total lenght checking, which wasn't required anyway, given that each call receives two iovecs for the local and remote side of the operation, and it verifies the total length on the local side already. - instead of a single loop there now are two loops over the iovecs. Given that the iovecs are cache hot this doesn't make a major difference Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
||
Linus Torvalds
|
702bfc891d |
io_uring-5.9-2020-10-02
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl93Z48QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpmp4EACwxi4UVnL0zhaOBmXfqxDuaXViwkfVZNxx d40y+DcCewnpZMk2G9cES8OKG+Tu2GFX2yl1m2XdrIWJ6jpnGFKJOkNQGfPDQrT3 fI7qFrEDeSVeLUMMBxtvZLW8w2D0KcNCgla4h/ESXI9xtPTZdYXhYQY0zfuWalUC ZplUgAWlHx82qJari7ZmIfeVtpAoujTvkccRe+/RtPv5vO+UsvP7kqPSCYMGqhHS 7z5gK3Nw+PNMWrzZVZ6Rw5nLeExx9PJGgiEkitEjn7mRJELXV9eWnTt9D0eVwaec WO7OSQmrJLmMFER4ZhkDNJkXZFvlYUCygnwJQmH70LflRqUEA00O6wX4J32O3NIg fIDWKMGGANFU5atL+RHqfQgUYq0GY1UsIvZxJnwRwv1QssmJoQq9fpT6VYqiQMik 2JAeWyMqTGI4vRNmVJKTR/13SpRUYrvS3wHN53kCaBBhE5Y/vFksgOGgXZBG/TPk odpegeJOTa5xuS0YcKIK6yL/xHENct1Y1BtVjczrXKJz0E90n5ZdIR0lEg6Ij3B1 jZUwKiS2sY09eBaJIQvtD4hIaw5VgqtwinKTyt7MBw/6pCqJpSZtaV0Uvgvjq/Se 1ifUo4cWwQBccZLgWeWoEalio2fNIyb+J+sm7eu9Xygjl67U2M8oMfAN2JjkM7As btLazer4lg== =fo3Z -----END PGP SIGNATURE----- Merge tag 'io_uring-5.9-2020-10-02' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: - fix for async buffered reads if read-ahead is fully disabled (Hao) - double poll match fix - ->show_fdinfo() potential ABBA deadlock complaint fix * tag 'io_uring-5.9-2020-10-02' of git://git.kernel.dk/linux-block: io_uring: fix async buffered reads when readahead is disabled io_uring: fix potential ABBA deadlock in ->show_fdinfo() io_uring: always delete double poll wait entry on match |
||
Hao Xu
|
c8d317aa18 |
io_uring: fix async buffered reads when readahead is disabled
The async buffered reads feature is not working when readahead is turned off. There are two things to concern: - when doing retry in io_read, not only the IOCB_WAITQ flag but also the IOCB_NOWAIT flag is still set, which makes it goes to would_block phase in generic_file_buffered_read() and then return -EAGAIN. After that, the io-wq thread work is queued, and later doing the async reads in the old way. - even if we remove IOCB_NOWAIT when doing retry, the feature is still not running properly, since in generic_file_buffered_read() it goes to lock_page_killable() after calling mapping->a_ops->readpage() to do IO, and thus causing process to sleep. Fixes: |
||
Jason A. Donenfeld
|
a4d63c3732 |
mm: do not rely on mm == current->mm in __get_user_pages_locked
It seems likely this block was pasted from internal_get_user_pages_fast,
which is not passed an mm struct and therefore uses current's. But
__get_user_pages_locked is passed an explicit mm, and current->mm is not
always valid. This was hit when being called from i915, which uses:
pin_user_pages_remote->
__get_user_pages_remote->
__gup_longterm_locked->
__get_user_pages_locked
Before, this would lead to an OOPS:
BUG: kernel NULL pointer dereference, address: 0000000000000064
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
CPU: 10 PID: 1431 Comm: kworker/u33:1 Tainted: P S U O 5.9.0-rc7+ #140
Hardware name: LENOVO 20QTCTO1WW/20QTCTO1WW, BIOS N2OET47W (1.34 ) 08/06/2020
Workqueue: i915-userptr-acquire __i915_gem_userptr_get_pages_worker [i915]
RIP: 0010:__get_user_pages_remote+0xd7/0x310
Call Trace:
__i915_gem_userptr_get_pages_worker+0xc8/0x260 [i915]
process_one_work+0x1ca/0x390
worker_thread+0x48/0x3c0
kthread+0x114/0x130
ret_from_fork+0x1f/0x30
CR2: 0000000000000064
This commit fixes the problem by using the mm pointer passed to the
function rather than the bogus one in current.
Fixes:
|
||
Peter Xu
|
d042035eaf |
mm/thp: Split huge pmds/puds if they're pinned when fork()
Pinned pages shouldn't be write-protected when fork() happens, because follow up copy-on-write on these pages could cause the pinned pages to be replaced by random newly allocated pages. For huge PMDs, we split the huge pmd if pinning is detected. So that future handling will be done by the PTE level (with our latest changes, each of the small pages will be copied). We can achieve this by let copy_huge_pmd() return -EAGAIN for pinned pages, so that we'll fallthrough in copy_pmd_range() and finally land the next copy_pte_range() call. Huge PUDs will be even more special - so far it does not support anonymous pages. But it can actually be done the same as the huge PMDs even if the split huge PUDs means to erase the PUD entries. It'll guarantee the follow up fault ins will remap the same pages in either parent/child later. This might not be the most efficient way, but it should be easy and clean enough. It should be fine, since we're tackling with a very rare case just to make sure userspaces that pinned some thps will still work even without MADV_DONTFORK and after they fork()ed. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Peter Xu
|
70e806e4e6 |
mm: Do early cow for pinned pages during fork() for ptes
This allows copy_pte_range() to do early cow if the pages were pinned on the source mm. Currently we don't have an accurate way to know whether a page is pinned or not. The only thing we have is page_maybe_dma_pinned(). However that's good enough for now. Especially, with the newly added mm->has_pinned flag to make sure we won't affect processes that never pinned any pages. It would be easier if we can do GFP_KERNEL allocation within copy_one_pte(). Unluckily, we can't because we're with the page table locks held for both the parent and child processes. So the page allocation needs to be done outside copy_one_pte(). Some trick is there in copy_present_pte(), majorly the wrprotect trick to block concurrent fast-gup. Comments in the function should explain better in place. Oleg Nesterov reported a (probably harmless) bug during review that we didn't reset entry.val properly in copy_pte_range() so that potentially there's chance to call add_swap_count_continuation() multiple times on the same swp entry. However that should be harmless since even if it happens, the same function (add_swap_count_continuation()) will return directly noticing that there're enough space for the swp counter. So instead of a standalone stable patch, it is touched up in this patch directly. Link: https://lore.kernel.org/lkml/20200914143829.GA1424636@nvidia.com/ Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Peter Xu
|
7a4830c380 |
mm/fork: Pass new vma pointer into copy_page_range()
This prepares for the future work to trigger early cow on pinned pages during fork(). No functional change intended. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Peter Xu
|
008cfe4418 |
mm: Introduce mm_struct.has_pinned
(Commit message majorly collected from Jason Gunthorpe) Reduce the chance of false positive from page_maybe_dma_pinned() by keeping track if the mm_struct has ever been used with pin_user_pages(). This allows cases that might drive up the page ref_count to avoid any penalty from handling dma_pinned pages. Future work is planned, to provide a more sophisticated solution, likely to turn it into a real counter. For now, make it atomic_t but use it as a boolean for simplicity. Suggested-by: Jason Gunthorpe <jgg@ziepe.ca> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Linus Torvalds
|
8fb1e91033 |
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton: "9 patches. Subsystems affected by this patch series: mm (thp, memcg, gup, migration, memory-hotplug), lib, and x86" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm: don't rely on system state to detect hot-plug operations mm: replace memmap_context by meminit_context arch/x86/lib/usercopy_64.c: fix __copy_user_flushcache() cache writeback lib/memregion.c: include memregion.h lib/string.c: implement stpcpy mm/migrate: correct thp migration stats mm/gup: fix gup_fast with dynamic page table folding mm: memcontrol: fix missing suffix of workingset_restore mm, THP, swap: fix allocating cluster for swapfile by mistake |
||
Minchan Kim
|
ce2684254b |
mm: validate pmd after splitting
syzbot reported the following KASAN splat:
general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
CPU: 1 PID: 6826 Comm: syz-executor142 Not tainted 5.9.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__lock_acquire+0x84/0x2ae0 kernel/locking/lockdep.c:4296
Code: ff df 8a 04 30 84 c0 0f 85 e3 16 00 00 83 3d 56 58 35 08 00 0f 84 0e 17 00 00 83 3d 25 c7 f5 07 00 74 2c 4c 89 e8 48 c1 e8 03 <80> 3c 30 00 74 12 4c 89 ef e8 3e d1 5a 00 48 be 00 00 00 00 00 fc
RSP: 0018:ffffc90004b9f850 EFLAGS: 00010006
Call Trace:
lock_acquire+0x140/0x6f0 kernel/locking/lockdep.c:5006
__raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
_raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
spin_lock include/linux/spinlock.h:354 [inline]
madvise_cold_or_pageout_pte_range+0x52f/0x25c0 mm/madvise.c:389
walk_pmd_range mm/pagewalk.c:89 [inline]
walk_pud_range mm/pagewalk.c:160 [inline]
walk_p4d_range mm/pagewalk.c:193 [inline]
walk_pgd_range mm/pagewalk.c:229 [inline]
__walk_page_range+0xe7b/0x1da0 mm/pagewalk.c:331
walk_page_range+0x2c3/0x5c0 mm/pagewalk.c:427
madvise_pageout_page_range mm/madvise.c:521 [inline]
madvise_pageout mm/madvise.c:557 [inline]
madvise_vma mm/madvise.c:946 [inline]
do_madvise+0x12d0/0x2090 mm/madvise.c:1145
__do_sys_madvise mm/madvise.c:1171 [inline]
__se_sys_madvise mm/madvise.c:1169 [inline]
__x64_sys_madvise+0x76/0x80 mm/madvise.c:1169
do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The backing vma was shmem.
In case of split page of file-backed THP, madvise zaps the pmd instead
of remapping of sub-pages. So we need to check pmd validity after
split.
Reported-by: syzbot+ecf80462cb7d5d552bc7@syzkaller.appspotmail.com
Fixes:
|
||
Laurent Dufour
|
f85086f95f |
mm: don't rely on system state to detect hot-plug operations
In register_mem_sect_under_node() the system_state's value is checked to
detect whether the call is made during boot time or during an hot-plug
operation. Unfortunately, that check against SYSTEM_BOOTING is wrong
because regular memory is registered at SYSTEM_SCHEDULING state. In
addition, memory hot-plug operation can be triggered at this system
state by the ACPI [1]. So checking against the system state is not
enough.
The consequence is that on system with interleaved node's ranges like this:
Early memory node ranges
node 1: [mem 0x0000000000000000-0x000000011fffffff]
node 2: [mem 0x0000000120000000-0x000000014fffffff]
node 1: [mem 0x0000000150000000-0x00000001ffffffff]
node 0: [mem 0x0000000200000000-0x000000048fffffff]
node 2: [mem 0x0000000490000000-0x00000007ffffffff]
This can be seen on PowerPC LPAR after multiple memory hot-plug and
hot-unplug operations are done. At the next reboot the node's memory
ranges can be interleaved and since the call to link_mem_sections() is
made in topology_init() while the system is in the SYSTEM_SCHEDULING
state, the node's id is not checked, and the sections registered to
multiple nodes:
$ ls -l /sys/devices/system/memory/memory21/node*
total 0
lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1
lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2
In that case, the system is able to boot but if later one of theses
memory blocks is hot-unplugged and then hot-plugged, the sysfs
inconsistency is detected and this is triggering a BUG_ON():
kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
Call Trace:
add_memory_resource+0x23c/0x340 (unreliable)
__add_memory+0x5c/0xf0
dlpar_add_lmb+0x1b4/0x500
dlpar_memory+0x1f8/0xb80
handle_dlpar_errorlog+0xc0/0x190
dlpar_store+0x198/0x4a0
kobj_attr_store+0x30/0x50
sysfs_kf_write+0x64/0x90
kernfs_fop_write+0x1b0/0x290
vfs_write+0xe8/0x290
ksys_write+0xdc/0x130
system_call_exception+0x160/0x270
system_call_common+0xf0/0x27c
This patch addresses the root cause by not relying on the system_state
value to detect whether the call is due to a hot-plug operation. An
extra parameter is added to link_mem_sections() detailing whether the
operation is due to a hot-plug operation.
[1] According to Oscar Salvador, using this qemu command line, ACPI
memory hotplug operations are raised at SYSTEM_SCHEDULING state:
$QEMU -enable-kvm -machine pc -smp 4,sockets=4,cores=1,threads=1 -cpu host -monitor pty \
-m size=$MEM,slots=255,maxmem=4294967296k \
-numa node,nodeid=0,cpus=0-3,mem=512 -numa node,nodeid=1,mem=512 \
-object memory-backend-ram,id=memdimm0,size=134217728 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0 \
-object memory-backend-ram,id=memdimm1,size=134217728 -device pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1 \
-object memory-backend-ram,id=memdimm2,size=134217728 -device pc-dimm,node=0,memdev=memdimm2,id=dimm2,slot=2 \
-object memory-backend-ram,id=memdimm3,size=134217728 -device pc-dimm,node=0,memdev=memdimm3,id=dimm3,slot=3 \
-object memory-backend-ram,id=memdimm4,size=134217728 -device pc-dimm,node=1,memdev=memdimm4,id=dimm4,slot=4 \
-object memory-backend-ram,id=memdimm5,size=134217728 -device pc-dimm,node=1,memdev=memdimm5,id=dimm5,slot=5 \
-object memory-backend-ram,id=memdimm6,size=134217728 -device pc-dimm,node=1,memdev=memdimm6,id=dimm6,slot=6 \
Fixes:
|
||
Laurent Dufour
|
c1d0da8335 |
mm: replace memmap_context by meminit_context
Patch series "mm: fix memory to node bad links in sysfs", v3. Sometimes, firmware may expose interleaved memory layout like this: Early memory node ranges node 1: [mem 0x0000000000000000-0x000000011fffffff] node 2: [mem 0x0000000120000000-0x000000014fffffff] node 1: [mem 0x0000000150000000-0x00000001ffffffff] node 0: [mem 0x0000000200000000-0x000000048fffffff] node 2: [mem 0x0000000490000000-0x00000007ffffffff] In that case, we can see memory blocks assigned to multiple nodes in sysfs: $ ls -l /sys/devices/system/memory/memory21 total 0 lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1 lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2 -rw-r--r-- 1 root root 65536 Aug 24 05:27 online -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index drwxr-xr-x 2 root root 0 Aug 24 05:27 power -r--r--r-- 1 root root 65536 Aug 24 05:27 removable -rw-r--r-- 1 root root 65536 Aug 24 05:27 state lrwxrwxrwx 1 root root 0 Aug 24 05:25 subsystem -> ../../../../bus/memory -rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent -r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones The same applies in the node's directory with a memory21 link in both the node1 and node2's directory. This is wrong but doesn't prevent the system to run. However when later, one of these memory blocks is hot-unplugged and then hot-plugged, the system is detecting an inconsistency in the sysfs layout and a BUG_ON() is raised: kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084! LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4 CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25 Call Trace: add_memory_resource+0x23c/0x340 (unreliable) __add_memory+0x5c/0xf0 dlpar_add_lmb+0x1b4/0x500 dlpar_memory+0x1f8/0xb80 handle_dlpar_errorlog+0xc0/0x190 dlpar_store+0x198/0x4a0 kobj_attr_store+0x30/0x50 sysfs_kf_write+0x64/0x90 kernfs_fop_write+0x1b0/0x290 vfs_write+0xe8/0x290 ksys_write+0xdc/0x130 system_call_exception+0x160/0x270 system_call_common+0xf0/0x27c This has been seen on PowerPC LPAR. The root cause of this issue is that when node's memory is registered, the range used can overlap another node's range, thus the memory block is registered to multiple nodes in sysfs. There are two issues here: (a) The sysfs memory and node's layouts are broken due to these multiple links (b) The link errors in link_mem_sections() should not lead to a system panic. To address (a) register_mem_sect_under_node should not rely on the system state to detect whether the link operation is triggered by a hot plug operation or not. This is addressed by the patches 1 and 2 of this series. Issue (b) will be addressed separately. This patch (of 2): The memmap_context enum is used to detect whether a memory operation is due to a hot-add operation or happening at boot time. Make it general to the hotplug operation and rename it as meminit_context. There is no functional change introduced by this patch Suggested-by: David Hildenbrand <david@redhat.com> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J . Wysocki" <rafael@kernel.org> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Zi Yan
|
6c5c7b9f33 |
mm/migrate: correct thp migration stats
PageTransHuge returns true for both thp and hugetlb, so thp stats was
counting both thp and hugetlb migrations. Exclude hugetlb migration by
setting is_thp variable right.
Clean up thp handling code too when we are there.
Fixes:
|
||
Vasily Gorbik
|
d3f7b1bb20 |
mm/gup: fix gup_fast with dynamic page table folding
Currently to make sure that every page table entry is read just once gup_fast walks perform READ_ONCE and pass pXd value down to the next gup_pXd_range function by value e.g.: static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) ... pudp = pud_offset(&p4d, addr); This function passes a reference on that local value copy to pXd_offset, and might get the very same pointer in return. This happens when the level is folded (on most arches), and that pointer should not be iterated. On s390 due to the fact that each task might have different 5,4 or 3-level address translation and hence different levels folded the logic is more complex and non-iteratable pointer to a local copy leads to severe problems. Here is an example of what happens with gup_fast on s390, for a task with 3-level paging, crossing a 2 GB pud boundary: // addr = 0x1007ffff000, end = 0x10080001000 static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) { unsigned long next; pud_t *pudp; // pud_offset returns &p4d itself (a pointer to a value on stack) pudp = pud_offset(&p4d, addr); do { // on second iteratation reading "random" stack value pud_t pud = READ_ONCE(*pudp); // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390 next = pud_addr_end(addr, end); ... } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack return 1; } This happens since s390 moved to common gup code with commit |
||
Muchun Song
|
8d3fe09d8d |
mm: memcontrol: fix missing suffix of workingset_restore
We forget to add the suffix to the workingset_restore string, so fix it.
And also update the documentation of cgroup-v2.rst.
Fixes:
|
||
Gao Xiang
|
4166343058 |
mm, THP, swap: fix allocating cluster for swapfile by mistake
SWP_FS is used to make swap_{read,write}page() go through the filesystem, and it's only used for swap files over NFS. So, !SWP_FS means non NFS for now, it could be either file backed or device backed. Something similar goes with legacy SWP_FILE. So in order to achieve the goal of the original patch, SWP_BLKDEV should be used instead. FS corruption can be observed with SSD device + XFS + fragmented swapfile due to CONFIG_THP_SWAP=y. I reproduced the issue with the following details: Environment: QEMU + upstream kernel + buildroot + NVMe (2 GB) Kernel config: CONFIG_BLK_DEV_NVME=y CONFIG_THP_SWAP=y Some reproducible steps: mkfs.xfs -f /dev/nvme0n1 mkdir /tmp/mnt mount /dev/nvme0n1 /tmp/mnt bs="32k" sz="1024m" # doesn't matter too much, I also tried 16m xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw xfs_io -f -c "pwrite -F -S 0 -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fsync" /tmp/mnt/sw mkswap /tmp/mnt/sw swapon /tmp/mnt/sw stress --vm 2 --vm-bytes 600M # doesn't matter too much as well Symptoms: - FS corruption (e.g. checksum failure) - memory corruption at: 0xd2808010 - segfault Fixes: |