mirror of
https://github.com/torvalds/linux.git
synced 2024-11-11 14:42:24 +00:00
1686766493
1579 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Roman Gushchin
|
1686766493 |
mm,page_alloc,cma: conditionally prefer cma pageblocks for movable allocations
Currently a cma area is barely used by the page allocator because it's used only as a fallback from movable, however kswapd tries hard to make sure that the fallback path isn't used. This results in a system evicting memory and pushing data into swap, while lots of CMA memory is still available. This happens despite the fact that alloc_contig_range is perfectly capable of moving any movable allocations out of the way of an allocation. To effectively use the cma area let's alter the rules: if the zone has more free cma pages than the half of total free pages in the zone, use cma pageblocks first and fallback to movable blocks in the case of failure. [guro@fb.com: ifdef the cma-specific code] Link: http://lkml.kernel.org/r/20200311225832.GA178154@carbon.DHCP.thefacebook.com Co-developed-by: Rik van Riel <riel@surriel.com> Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Qian Cai <cai@lca.pw> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Joonsoo Kim <js1304@gmail.com> Link: http://lkml.kernel.org/r/20200306150102.3e77354b@imladris.surriel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Wei Yang
|
58b7f1194f |
mm/page_alloc.c: extract check_[new|free]_page_bad() common part to page_bad_reason()
We share similar code in check_[new|free]_page_bad() to get the page's bad reason. Let's extract it and reduce code duplication. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Link: http://lkml.kernel.org/r/20200411220357.9636-6-richard.weiyang@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Wei Yang
|
534fe5e3c4 |
mm/page_alloc.c: rename free_pages_check() to check_free_page()
free_pages_check() is the counterpart of check_new_page(). Rename it to use the same naming convention. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@suse.com> Link: http://lkml.kernel.org/r/20200411220357.9636-5-richard.weiyang@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Wei Yang
|
0d0c48a274 |
mm/page_alloc.c: rename free_pages_check_bad() to check_free_page_bad()
free_pages_check_bad() is the counterpart of check_new_page_bad(). Rename it to use the same naming convention. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@suse.com> Link: http://lkml.kernel.org/r/20200411220357.9636-4-richard.weiyang@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Wei Yang
|
82a3241a8f |
mm/page_alloc.c: bad_flags is not necessary for bad_page()
After commit
|
||
Wei Yang
|
833d8a426f |
mm/page_alloc.c: bad_[reason|flags] is not necessary when PageHWPoison
Patch series "mm/page_alloc.c: cleanup on check page", v3.
This patchset does some cleanup related to check page.
1. Remove unnecessary bad_reason assignment
2. Remove bad_flags to bad_page()
3. Rename function for naming convention
4. Extract common part to check page
Thanks for suggestions from David Rientjes and Anshuman Khandual.
This patch (of 5):
Since function returns directly, bad_[reason|flags] is not used any where.
And move this to the first.
This is a following cleanup for commit
|
||
Mike Rapoport
|
8a1b25fe3c |
mm: simplify find_min_pfn_with_active_regions()
find_min_pfn_with_active_regions() calls find_min_pfn_for_node() with nid parameter set to MAX_NUMNODES. This makes the find_min_pfn_for_node() traverse all memblock memory regions although the first PFN in the system can be easily found with memblock_start_of_DRAM(). Use memblock_start_of_DRAM() in find_min_pfn_with_active_regions() and drop now unused find_min_pfn_for_node(). Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Hoan Tran <hoan@os.amperecomputing.com> [arm64] Cc: Baoquan He <bhe@redhat.com> Cc: Brian Cain <bcain@codeaurora.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Simek <monstr@monstr.eu> Cc: Nick Hu <nickhu@andestech.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/20200412194859.12663-21-rppt@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Mike Rapoport
|
854e8848c5 |
mm: clean up free_area_init_node() and its helpers
free_area_init_node() now always uses memblock info and the zone PFN limits so it does not need the backwards compatibility functions to calculate the zone spanned and absent pages. The removal of the compat_ versions of zone_{abscent,spanned}_pages_in_node() in turn, makes zone_size and zhole_size parameters unused. The node_start_pfn is determined by get_pfn_range_for_nid(), so there is no need to pass it to free_area_init_node(). As a result, the only required parameter to free_area_init_node() is the node ID, all the rest are removed along with no longer used compat_zone_{abscent,spanned}_pages_in_node() helpers. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Hoan Tran <hoan@os.amperecomputing.com> [arm64] Cc: Baoquan He <bhe@redhat.com> Cc: Brian Cain <bcain@codeaurora.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Simek <monstr@monstr.eu> Cc: Nick Hu <nickhu@andestech.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/20200412194859.12663-20-rppt@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Mike Rapoport
|
bc9331a19d |
mm: rename free_area_init_node() to free_area_init_memoryless_node()
free_area_init_node() is only used by x86 to initialize a memory-less nodes. Make its name reflect this and drop all the function parameters except node ID as they are anyway zero. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Hoan Tran <hoan@os.amperecomputing.com> [arm64] Cc: Baoquan He <bhe@redhat.com> Cc: Brian Cain <bcain@codeaurora.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Simek <monstr@monstr.eu> Cc: Nick Hu <nickhu@andestech.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/20200412194859.12663-19-rppt@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Mike Rapoport
|
51930df580 |
mm: free_area_init: allow defining max_zone_pfn in descending order
Some architectures (e.g. ARC) have the ZONE_HIGHMEM zone below the ZONE_NORMAL. Allowing free_area_init() parse max_zone_pfn array even it is sorted in descending order allows using free_area_init() on such architectures. Add top -> down traversal of max_zone_pfn array in free_area_init() and use the latter in ARC node/zone initialization. [rppt@kernel.org: ARC fix] Link: http://lkml.kernel.org/r/20200504153901.GM14260@kernel.org [rppt@linux.ibm.com: arc: free_area_init(): take into account PAE40 mode] Link: http://lkml.kernel.org/r/20200507205900.GH683243@linux.ibm.com [akpm@linux-foundation.org: declare arch_has_descending_max_zone_pfns()] Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Hoan Tran <hoan@os.amperecomputing.com> [arm64] Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Brian Cain <bcain@codeaurora.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Simek <monstr@monstr.eu> Cc: Nick Hu <nickhu@andestech.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Guenter Roeck <linux@roeck-us.net> Link: http://lkml.kernel.org/r/20200412194859.12663-18-rppt@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Mike Rapoport
|
acd3f5c441 |
mm: remove early_pfn_in_nid() and CONFIG_NODES_SPAN_OTHER_NODES
The memmap_init() function was made to iterate over memblock regions and as the result the early_pfn_in_nid() function became obsolete. Since CONFIG_NODES_SPAN_OTHER_NODES is only used to pick a stub or a real implementation of early_pfn_in_nid(), it is also not needed anymore. Remove both early_pfn_in_nid() and the CONFIG_NODES_SPAN_OTHER_NODES. Co-developed-by: Hoan Tran <Hoan@os.amperecomputing.com> Signed-off-by: Hoan Tran <Hoan@os.amperecomputing.com> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Hoan Tran <hoan@os.amperecomputing.com> [arm64] Cc: Baoquan He <bhe@redhat.com> Cc: Brian Cain <bcain@codeaurora.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Simek <monstr@monstr.eu> Cc: Nick Hu <nickhu@andestech.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/20200412194859.12663-17-rppt@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Baoquan He
|
73a6e474cb |
mm: memmap_init: iterate over memblock regions rather that check each PFN
When called during boot the memmap_init_zone() function checks if each PFN is valid and actually belongs to the node being initialized using early_pfn_valid() and early_pfn_in_nid(). Each such check may cost up to O(log(n)) where n is the number of memory banks, so for large amount of memory overall time spent in early_pfn*() becomes substantial. Since the information is anyway present in memblock, we can iterate over memblock memory regions in memmap_init() and only call memmap_init_zone() for PFN ranges that are know to be valid and in the appropriate node. [cai@lca.pw: fix a compilation warning from Clang] Link: http://lkml.kernel.org/r/CF6E407F-17DC-427C-8203-21979FB882EF@lca.pw [bhe@redhat.com: fix the incorrect hole in fast_isolate_freepages()] Link: http://lkml.kernel.org/r/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw Link: http://lkml.kernel.org/r/20200521014407.29690-1-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Hoan Tran <hoan@os.amperecomputing.com> [arm64] Cc: Brian Cain <bcain@codeaurora.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Simek <monstr@monstr.eu> Cc: Nick Hu <nickhu@andestech.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Qian Cai <cai@lca.pw> Link: http://lkml.kernel.org/r/20200412194859.12663-16-rppt@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Mike Rapoport
|
9691a071aa |
mm: use free_area_init() instead of free_area_init_nodes()
free_area_init() has effectively became a wrapper for free_area_init_nodes() and there is no point of keeping it. Still free_area_init() name is shorter and more general as it does not imply necessity to initialize multiple nodes. Rename free_area_init_nodes() to free_area_init(), update the callers and drop old version of free_area_init(). Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Hoan Tran <hoan@os.amperecomputing.com> [arm64] Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Brian Cain <bcain@codeaurora.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Simek <monstr@monstr.eu> Cc: Nick Hu <nickhu@andestech.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/20200412194859.12663-6-rppt@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Mike Rapoport
|
fa3354e4ea |
mm: free_area_init: use maximal zone PFNs rather than zone sizes
Currently, architectures that use free_area_init() to initialize memory map and node and zone structures need to calculate zone and hole sizes. We can use free_area_init_nodes() instead and let it detect the zone boundaries while the architectures will only have to supply the possible limits for the zones. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Hoan Tran <hoan@os.amperecomputing.com> [arm64] Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Brian Cain <bcain@codeaurora.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Simek <monstr@monstr.eu> Cc: Nick Hu <nickhu@andestech.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/20200412194859.12663-5-rppt@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Mike Rapoport
|
3f08a302f5 |
mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP option
CONFIG_HAVE_MEMBLOCK_NODE_MAP is used to differentiate initialization of nodes and zones structures between the systems that have region to node mapping in memblock and those that don't. Currently all the NUMA architectures enable this option and for the non-NUMA systems we can presume that all the memory belongs to node 0 and therefore the compile time configuration option is not required. The remaining few architectures that use DISCONTIGMEM without NUMA are easily updated to use memblock_add_node() instead of memblock_add() and thus have proper correspondence of memblock regions to NUMA nodes. Still, free_area_init_node() must have a backward compatible version because its semantics with and without CONFIG_HAVE_MEMBLOCK_NODE_MAP is different. Once all the architectures will use the new semantics, the entire compatibility layer can be dropped. To avoid addition of extra run time memory to store node id for architectures that keep memblock but have only a single node, the node id field of the memblock_region is guarded by CONFIG_NEED_MULTIPLE_NODES and the corresponding accessors presume that in those cases it is always 0. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Hoan Tran <hoan@os.amperecomputing.com> [arm64] Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Cc: Baoquan He <bhe@redhat.com> Cc: Brian Cain <bcain@codeaurora.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Simek <monstr@monstr.eu> Cc: Nick Hu <nickhu@andestech.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/20200412194859.12663-4-rppt@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Mike Rapoport
|
6f24fbd38c |
mm: make early_pfn_to_nid() and related defintions close to each other
early_pfn_to_nid() and its helper __early_pfn_to_nid() are spread around include/linux/mm.h, include/linux/mmzone.h and mm/page_alloc.c. Drop unused stub for __early_pfn_to_nid() and move its actual generic implementation close to its users. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Hoan Tran <hoan@os.amperecomputing.com> [arm64] Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Brian Cain <bcain@codeaurora.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Simek <monstr@monstr.eu> Cc: Nick Hu <nickhu@andestech.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/20200412194859.12663-3-rppt@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Mike Rapoport
|
d622abf74f |
mm: memblock: replace dereferences of memblock_region.nid with API calls
Patch series "mm: rework free_area_init*() funcitons". After the discussion [1] about removal of CONFIG_NODES_SPAN_OTHER_NODES and CONFIG_HAVE_MEMBLOCK_NODE_MAP options, I took it a bit further and updated the node/zone initialization. Since all architectures have memblock, it is possible to use only the newer version of free_area_init_node() that calculates the zone and node boundaries based on memblock node mapping and architectural limits on possible zone PFNs. The architectures that still determined zone and hole sizes can be switched to the generic code and the old code that took those zone and hole sizes can be simply removed. And, since it all started from the removal of CONFIG_NODES_SPAN_OTHER_NODES, the memmap_init() is now updated to iterate over memblocks and so it does not need to perform early_pfn_to_nid() query for every PFN. [1] https://lore.kernel.org/lkml/1585420282-25630-1-git-send-email-Hoan@os.amperecomputing.com This patch (of 21): There are several places in the code that directly dereference memblock_region.nid despite this field being defined only when CONFIG_HAVE_MEMBLOCK_NODE_MAP=y. Replace these with calls to memblock_get_region_nid() to improve code robustness and to avoid possible breakage when CONFIG_HAVE_MEMBLOCK_NODE_MAP will be removed. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Hoan Tran <hoan@os.amperecomputing.com> [arm64] Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Brian Cain <bcain@codeaurora.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Simek <monstr@monstr.eu> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nick Hu <nickhu@andestech.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/20200412194859.12663-1-rppt@kernel.org Link: http://lkml.kernel.org/r/20200412194859.12663-2-rppt@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Linus Torvalds
|
94709049fb |
Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton: "A few little subsystems and a start of a lot of MM patches. Subsystems affected by this patch series: squashfs, ocfs2, parisc, vfs. With mm subsystems: slab-generic, slub, debug, pagecache, gup, swap, memcg, pagemap, memory-failure, vmalloc, kasan" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (128 commits) kasan: move kasan_report() into report.c mm/mm_init.c: report kasan-tag information stored in page->flags ubsan: entirely disable alignment checks under UBSAN_TRAP kasan: fix clang compilation warning due to stack protector x86/mm: remove vmalloc faulting mm: remove vmalloc_sync_(un)mappings() x86/mm/32: implement arch_sync_kernel_mappings() x86/mm/64: implement arch_sync_kernel_mappings() mm/ioremap: track which page-table levels were modified mm/vmalloc: track which page-table levels were modified mm: add functions to track page directory modifications s390: use __vmalloc_node in stack_alloc powerpc: use __vmalloc_node in alloc_vm_stack arm64: use __vmalloc_node in arch_alloc_vmap_stack mm: remove vmalloc_user_node_flags mm: switch the test_vmalloc module to use __vmalloc_node mm: remove __vmalloc_node_flags_caller mm: remove both instances of __vmalloc_node_flags mm: remove the prot argument to __vmalloc_node mm: remove the pgprot argument to __vmalloc ... |
||
Christoph Hellwig
|
88dca4ca5a |
mm: remove the pgprot argument to __vmalloc
The pgprot argument to __vmalloc is always PAGE_KERNEL now, so remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Michael Kelley <mikelley@microsoft.com> [hyperv] Acked-by: Gao Xiang <xiang@kernel.org> [erofs] Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Wei Liu <wei.liu@kernel.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Airlie <airlied@linux.ie> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Link: http://lkml.kernel.org/r/20200414131348.444715-22-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
NeilBrown
|
8d92890bd6 |
mm/writeback: discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead
After an NFS page has been written it is considered "unstable" until a
COMMIT request succeeds. If the COMMIT fails, the page will be
re-written.
These "unstable" pages are currently accounted as "reclaimable", either
in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a
'reclaimable' count. This might have made sense when sending the COMMIT
required a separate action by the VFS/MM (e.g. releasepage() used to
send a COMMIT). However now that all writes generated by ->writepages()
will automatically be followed by a COMMIT (since commit
|
||
Linus Torvalds
|
533b220f7b |
arm64 updates for 5.8
- Branch Target Identification (BTI) * Support for ARMv8.5-BTI in both user- and kernel-space. This allows branch targets to limit the types of branch from which they can be called and additionally prevents branching to arbitrary code, although kernel support requires a very recent toolchain. * Function annotation via SYM_FUNC_START() so that assembly functions are wrapped with the relevant "landing pad" instructions. * BPF and vDSO updates to use the new instructions. * Addition of a new HWCAP and exposure of BTI capability to userspace via ID register emulation, along with ELF loader support for the BTI feature in .note.gnu.property. * Non-critical fixes to CFI unwind annotations in the sigreturn trampoline. - Shadow Call Stack (SCS) * Support for Clang's Shadow Call Stack feature, which reserves platform register x18 to point at a separate stack for each task that holds only return addresses. This protects function return control flow from buffer overruns on the main stack. * Save/restore of x18 across problematic boundaries (user-mode, hypervisor, EFI, suspend, etc). * Core support for SCS, should other architectures want to use it too. * SCS overflow checking on context-switch as part of the existing stack limit check if CONFIG_SCHED_STACK_END_CHECK=y. - CPU feature detection * Removed numerous "SANITY CHECK" errors when running on a system with mismatched AArch32 support at EL1. This is primarily a concern for KVM, which disabled support for 32-bit guests on such a system. * Addition of new ID registers and fields as the architecture has been extended. - Perf and PMU drivers * Minor fixes and cleanups to system PMU drivers. - Hardware errata * Unify KVM workarounds for VHE and nVHE configurations. * Sort vendor errata entries in Kconfig. - Secure Monitor Call Calling Convention (SMCCC) * Update to the latest specification from Arm (v1.2). * Allow PSCI code to query the SMCCC version. - Software Delegated Exception Interface (SDEI) * Unexport a bunch of unused symbols. * Minor fixes to handling of firmware data. - Pointer authentication * Add support for dumping the kernel PAC mask in vmcoreinfo so that the stack can be unwound by tools such as kdump. * Simplification of key initialisation during CPU bringup. - BPF backend * Improve immediate generation for logical and add/sub instructions. - vDSO - Minor fixes to the linker flags for consistency with other architectures and support for LLVM's unwinder. - Clean up logic to initialise and map the vDSO into userspace. - ACPI - Work around for an ambiguity in the IORT specification relating to the "num_ids" field. - Support _DMA method for all named components rather than only PCIe root complexes. - Minor other IORT-related fixes. - Miscellaneous * Initialise debug traps early for KGDB and fix KDB cacheflushing deadlock. * Minor tweaks to early boot state (documentation update, set TEXT_OFFSET to 0x0, increase alignment of PE/COFF sections). * Refactoring and cleanup -----BEGIN PGP SIGNATURE----- iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAl7U9csQHHdpbGxAa2Vy bmVsLm9yZwAKCRC3rHDchMFjNLBHCACs/YU4SM7Om5f+7QnxIKao5DBr2CnGGvdC yTfDghFDTLQVv3MufLlfno3yBe5G8sQpcZfcc+hewfcGoMzVZXu8s7LzH6VSn9T9 jmT3KjDMrg0RjSHzyumJp2McyelTk0a4FiKArSIIKsJSXUyb1uPSgm7SvKVDwEwU JGDzL9IGilmq59GiXfDzGhTZgmC37QdwRoRxDuqtqWQe5CHoRXYexg87HwBKOQxx HgU9L7ehri4MRZfpyjaDrr6quJo3TVnAAKXNBh3mZAskVS9ZrfKpEH0kYWYuqybv znKyHRecl/rrGePV8RTMtrwnSdU26zMXE/omsVVauDfG9hqzqm+Q =w3qi -----END PGP SIGNATURE----- Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Will Deacon: "A sizeable pile of arm64 updates for 5.8. Summary below, but the big two features are support for Branch Target Identification and Clang's Shadow Call stack. The latter is currently arm64-only, but the high-level parts are all in core code so it could easily be adopted by other architectures pending toolchain support Branch Target Identification (BTI): - Support for ARMv8.5-BTI in both user- and kernel-space. This allows branch targets to limit the types of branch from which they can be called and additionally prevents branching to arbitrary code, although kernel support requires a very recent toolchain. - Function annotation via SYM_FUNC_START() so that assembly functions are wrapped with the relevant "landing pad" instructions. - BPF and vDSO updates to use the new instructions. - Addition of a new HWCAP and exposure of BTI capability to userspace via ID register emulation, along with ELF loader support for the BTI feature in .note.gnu.property. - Non-critical fixes to CFI unwind annotations in the sigreturn trampoline. Shadow Call Stack (SCS): - Support for Clang's Shadow Call Stack feature, which reserves platform register x18 to point at a separate stack for each task that holds only return addresses. This protects function return control flow from buffer overruns on the main stack. - Save/restore of x18 across problematic boundaries (user-mode, hypervisor, EFI, suspend, etc). - Core support for SCS, should other architectures want to use it too. - SCS overflow checking on context-switch as part of the existing stack limit check if CONFIG_SCHED_STACK_END_CHECK=y. CPU feature detection: - Removed numerous "SANITY CHECK" errors when running on a system with mismatched AArch32 support at EL1. This is primarily a concern for KVM, which disabled support for 32-bit guests on such a system. - Addition of new ID registers and fields as the architecture has been extended. Perf and PMU drivers: - Minor fixes and cleanups to system PMU drivers. Hardware errata: - Unify KVM workarounds for VHE and nVHE configurations. - Sort vendor errata entries in Kconfig. Secure Monitor Call Calling Convention (SMCCC): - Update to the latest specification from Arm (v1.2). - Allow PSCI code to query the SMCCC version. Software Delegated Exception Interface (SDEI): - Unexport a bunch of unused symbols. - Minor fixes to handling of firmware data. Pointer authentication: - Add support for dumping the kernel PAC mask in vmcoreinfo so that the stack can be unwound by tools such as kdump. - Simplification of key initialisation during CPU bringup. BPF backend: - Improve immediate generation for logical and add/sub instructions. vDSO: - Minor fixes to the linker flags for consistency with other architectures and support for LLVM's unwinder. - Clean up logic to initialise and map the vDSO into userspace. ACPI: - Work around for an ambiguity in the IORT specification relating to the "num_ids" field. - Support _DMA method for all named components rather than only PCIe root complexes. - Minor other IORT-related fixes. Miscellaneous: - Initialise debug traps early for KGDB and fix KDB cacheflushing deadlock. - Minor tweaks to early boot state (documentation update, set TEXT_OFFSET to 0x0, increase alignment of PE/COFF sections). - Refactoring and cleanup" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (148 commits) KVM: arm64: Move __load_guest_stage2 to kvm_mmu.h KVM: arm64: Check advertised Stage-2 page size capability arm64/cpufeature: Add get_arm64_ftr_reg_nowarn() ACPI/IORT: Remove the unused __get_pci_rid() arm64/cpuinfo: Add ID_MMFR4_EL1 into the cpuinfo_arm64 context arm64/cpufeature: Add remaining feature bits in ID_AA64PFR1 register arm64/cpufeature: Add remaining feature bits in ID_AA64PFR0 register arm64/cpufeature: Add remaining feature bits in ID_AA64ISAR0 register arm64/cpufeature: Add remaining feature bits in ID_MMFR4 register arm64/cpufeature: Add remaining feature bits in ID_PFR0 register arm64/cpufeature: Introduce ID_MMFR5 CPU register arm64/cpufeature: Introduce ID_DFR1 CPU register arm64/cpufeature: Introduce ID_PFR2 CPU register arm64/cpufeature: Make doublelock a signed feature in ID_AA64DFR0 arm64/cpufeature: Drop TraceFilt feature exposure from ID_DFR0 register arm64/cpufeature: Add explicit ftr_id_isar0[] for ID_ISAR0 register arm64: mm: Add asid_gen_match() helper firmware: smccc: Fix missing prototype warning for arm_smccc_version_init arm64: vdso: Fix CFI directives in sigreturn trampoline arm64: vdso: Don't prefix sigreturn trampoline with a BTI C instruction ... |
||
Sami Tolvanen
|
628d06a48f |
scs: Add page accounting for shadow call stack allocations
This change adds accounting for the memory allocated for shadow stacks. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Will Deacon <will@kernel.org> |
||
Henry Willard
|
14f69140ff |
mm: limit boost_watermark on small zones
Commit |
||
David Hildenbrand
|
e84fe99b68 |
mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous()
Without CONFIG_PREEMPT, it can happen that we get soft lockups detected, e.g., while booting up. watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.6.0-next-20200331+ #4 Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014 RIP: __pageblock_pfn_to_page+0x134/0x1c0 Call Trace: set_zone_contiguous+0x56/0x70 page_alloc_init_late+0x166/0x176 kernel_init_freeable+0xfa/0x255 kernel_init+0xa/0x106 ret_from_fork+0x35/0x40 The issue becomes visible when having a lot of memory (e.g., 4TB) assigned to a single NUMA node - a system that can easily be created using QEMU. Inside VMs on a hypervisor with quite some memory overcommit, this is fairly easy to trigger. Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Shile Zhang <shile.zhang@linux.alibaba.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Shile Zhang <shile.zhang@linux.alibaba.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200416073417.5003-1-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Jason Yan
|
8b885f53b0 |
mm/page_alloc: make pcpu_drain_mutex and pcpu_drain static
Fix the following sparse warning: mm/page_alloc.c:106:1: warning: symbol 'pcpu_drain_mutex' was not declared. Should it be static? mm/page_alloc.c:107:1: warning: symbol '__pcpu_scope_pcpu_drain' was not declared. Should it be static? Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Jason Yan <yanaijie@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200407023925.46438-1-yanaijie@huawei.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Randy Dunlap
|
e6a0a7ad1c |
mm/page_alloc.c: fix kernel-doc warning
Add description of function parameter 'mt' to fix kernel-doc warning: mm/page_alloc.c:3246: warning: Function parameter or member 'mt' not described in '__putback_isolated_page' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Link: http://lkml.kernel.org/r/02998bd4-0b82-2f15-2570-f86130304d1e@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Alexander Duyck
|
36e66c554b |
mm: introduce Reported pages
In order to pave the way for free page reporting in virtualized environments we will need a way to get pages out of the free lists and identify those pages after they have been returned. To accomplish this, this patch adds the concept of a Reported Buddy, which is essentially meant to just be the Uptodate flag used in conjunction with the Buddy page type. To prevent the reported pages from leaking outside of the buddy lists I added a check to clear the PageReported bit in the del_page_from_free_list function. As a result any reported page that is split, merged, or allocated will have the flag cleared prior to the PageBuddy value being cleared. The process for reporting pages is fairly simple. Once we free a page that meets the minimum order for page reporting we will schedule a worker thread to start 2s or more in the future. That worker thread will begin working from the lowest supported page reporting order up to MAX_ORDER - 1 pulling unreported pages from the free list and storing them in the scatterlist. When processing each individual free list it is necessary for the worker thread to release the zone lock when it needs to stop and report the full scatterlist of pages. To reduce the work of the next iteration the worker thread will rotate the free list so that the first unreported page in the free list becomes the first entry in the list. It will then call a reporting function providing information on how many entries are in the scatterlist. Once the function completes it will return the pages to the free area from which they were allocated and start over pulling more pages from the free areas until there are no longer enough pages to report on to keep the worker busy, or we have processed as many pages as were contained in the free area when we started processing the list. The worker thread will work in a round-robin fashion making its way though each zone requesting reporting, and through each reportable free list within that zone. Once all free areas within the zone have been processed it will check to see if there have been any requests for reporting while it was processing. If so it will reschedule the worker thread to start up again in roughly 2s and exit. Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nitesh Narayan Lal <nitesh@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pagupta@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Wang <wei.w.wang@intel.com> Cc: Yang Zhang <yang.zhang.wz@gmail.com> Cc: wei qi <weiqi4@huawei.com> Link: http://lkml.kernel.org/r/20200211224635.29318.19750.stgit@localhost.localdomain Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Alexander Duyck
|
624f58d8f4 |
mm: add function __putback_isolated_page
There are cases where we would benefit from avoiding having to go through the allocation and free cycle to return an isolated page. Examples for this might include page poisoning in which we isolate a page and then put it back in the free list without ever having actually allocated it. This will enable us to also avoid notifiers for the future free page reporting which will need to avoid retriggering page reporting when returning pages that have been reported on. Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nitesh Narayan Lal <nitesh@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pagupta@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Wang <wei.w.wang@intel.com> Cc: Yang Zhang <yang.zhang.wz@gmail.com> Cc: wei qi <weiqi4@huawei.com> Link: http://lkml.kernel.org/r/20200211224624.29318.89287.stgit@localhost.localdomain Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Alexander Duyck
|
6ab0136310 |
mm: use zone and order instead of free area in free_list manipulators
In order to enable the use of the zone from the list manipulator functions I will need access to the zone pointer. As it turns out most of the accessors were always just being directly passed &zone->free_area[order] anyway so it would make sense to just fold that into the function itself and pass the zone and order as arguments instead of the free area. In order to be able to reference the zone we need to move the declaration of the functions down so that we have the zone defined before we define the list manipulation functions. Since the functions are only used in the file mm/page_alloc.c we can just move them there to reduce noise in the header. Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pankaj Gupta <pagupta@redhat.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nitesh Narayan Lal <nitesh@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Wang <wei.w.wang@intel.com> Cc: Yang Zhang <yang.zhang.wz@gmail.com> Cc: wei qi <weiqi4@huawei.com> Link: http://lkml.kernel.org/r/20200211224613.29318.43080.stgit@localhost.localdomain Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Alexander Duyck
|
a2129f2479 |
mm: adjust shuffle code to allow for future coalescing
Patch series "mm / virtio: Provide support for free page reporting", v17. This series provides an asynchronous means of reporting free guest pages to a hypervisor so that the memory associated with those pages can be dropped and reused by other processes and/or guests on the host. Using this it is possible to avoid unnecessary I/O to disk and greatly improve performance in the case of memory overcommit on the host. When enabled we will be performing a scan of free memory every 2 seconds while pages of sufficiently high order are being freed. In each pass at least one sixteenth of each free list will be reported. By doing this we avoid racing against other threads that may be causing a high amount of memory churn. The lowest page order currently scanned when reporting pages is pageblock_order so that this feature will not interfere with the use of Transparent Huge Pages in the case of virtualization. Currently this is only in use by virtio-balloon however there is the hope that at some point in the future other hypervisors might be able to make use of it. In the virtio-balloon/QEMU implementation the hypervisor is currently using MADV_DONTNEED to indicate to the host kernel that the page is currently free. It will be zeroed and faulted back into the guest the next time the page is accessed. To track if a page is reported or not the Uptodate flag was repurposed and used as a Reported flag for Buddy pages. We walk though the free list isolating pages and adding them to the scatterlist until we either encounter the end of the list or have processed at least one sixteenth of the pages that were listed in nr_free prior to us starting. If we fill the scatterlist before we reach the end of the list we rotate the list so that the first unreported page we encounter is moved to the head of the list as that is where we will resume after we have freed the reported pages back into the tail of the list. Below are the results from various benchmarks. I primarily focused on two tests. The first is the will-it-scale/page_fault2 test, and the other is a modified version of will-it-scale/page_fault1 that was enabled to use THP. I did this as it allows for better visibility into different parts of the memory subsystem. The guest is running with 32G for RAM on one node of a E5-2630 v3. The host has had some features such as CPU turbo disabled in the BIOS. Test page_fault1 (THP) page_fault2 Name tasks Process Iter STDEV Process Iter STDEV Baseline 1 1012402.50 0.14% 361855.25 0.81% 16 8827457.25 0.09% 3282347.00 0.34% Patches Applied 1 1007897.00 0.23% 361887.00 0.26% 16 8784741.75 0.39% 3240669.25 0.48% Patches Enabled 1 1010227.50 0.39% 359749.25 0.56% 16 8756219.00 0.24% 3226608.75 0.97% Patches Enabled 1 1050982.00 4.26% 357966.25 0.14% page shuffle 16 8672601.25 0.49% 3223177.75 0.40% Patches enabled 1 1003238.00 0.22% 360211.00 0.22% shuffle w/ RFC 16 8767010.50 0.32% 3199874.00 0.71% The results above are for a baseline with a linux-next-20191219 kernel, that kernel with this patch set applied but page reporting disabled in virtio-balloon, the patches applied and page reporting fully enabled, the patches enabled with page shuffling enabled, and the patches applied with page shuffling enabled and an RFC patch that makes used of MADV_FREE in QEMU. These results include the deviation seen between the average value reported here versus the high and/or low value. I observed that during the test memory usage for the first three tests never dropped whereas with the patches fully enabled the VM would drop to using only a few GB of the host's memory when switching from memhog to page fault tests. Any of the overhead visible with this patch set enabled seems due to page faults caused by accessing the reported pages and the host zeroing the page before giving it back to the guest. This overhead is much more visible when using THP than with standard 4K pages. In addition page shuffling seemed to increase the amount of faults generated due to an increase in memory churn. The overehad is reduced when using MADV_FREE as we can avoid the extra zeroing of the pages when they are reintroduced to the host, as can be seen when the RFC is applied with shuffling enabled. The overall guest size is kept fairly small to only a few GB while the test is running. If the host memory were oversubscribed this patch set should result in a performance improvement as swapping memory in the host can be avoided. A brief history on the background of free page reporting can be found at: https://lore.kernel.org/lkml/29f43d5796feed0dec8e8bb98b187d9dac03b900.camel@linux.intel.com/ This patch (of 9): Move the head/tail adding logic out of the shuffle code and into the __free_one_page function since ultimately that is where it is really needed anyway. By doing this we should be able to reduce the overhead and can consolidate all of the list addition bits in one spot. Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: David Hildenbrand <david@redhat.com> Cc: Yang Zhang <yang.zhang.wz@gmail.com> Cc: Pankaj Gupta <pagupta@redhat.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Nitesh Narayan Lal <nitesh@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Wei Wang <wei.w.wang@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: wei qi <weiqi4@huawei.com> Link: http://lkml.kernel.org/r/20200211224602.29318.84523.stgit@localhost.localdomain Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Rik van Riel
|
1da2f328fa |
mm,thp,compaction,cma: allow THP migration for CMA allocations
The code to implement THP migrations already exists, and the code for CMA to clear out a region of memory already exists. Only a few small tweaks are needed to allow CMA to move THP memory when attempting an allocation from alloc_contig_range. With these changes, migrating THPs from a CMA area works when allocating a 1GB hugepage from CMA memory. [riel@surriel.com: fix hugetlbfs pages per Mike, cleanup per Vlastimil] Link: http://lkml.kernel.org/r/20200228104700.0af2f18d@imladris.surriel.com Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Joonsoo Kim <js1304@gmail.com> Link: http://lkml.kernel.org/r/20200227213238.1298752-2-riel@surriel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Rik van Riel
|
b06eda091e |
mm,compaction,cma: add alloc_contig flag to compact_control
Patch series "fix THP migration for CMA allocations", v2. Transparent huge pages are allocated with __GFP_MOVABLE, and can end up in CMA memory blocks. Transparent huge pages also have most of the infrastructure in place to allow migration. However, a few pieces were missing, causing THP migration to fail when attempting to use CMA to allocate 1GB hugepages. With these patches in place, THP migration from CMA blocks seems to work, both for anonymous THPs and for tmpfs/shmem THPs. This patch (of 2): Add information to struct compact_control to indicate that the allocator would really like to clear out this specific part of memory, used by for example CMA. Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Cc: Joonsoo Kim <js1304@gmail.com> Link: http://lkml.kernel.org/r/20200227213238.1298752-1-riel@surriel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
chenqiwu
|
fe925c0cb0 |
mm/page_alloc: simplify page_is_buddy() for better code readability
Simplify page_is_buddy() to reduce the redundant code for better code readability. Signed-off-by: chenqiwu <chenqiwu@xiaomi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Link: http://lkml.kernel.org/r/1583853751-5525-1-git-send-email-qiwuchen55@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Mateusz Nosek
|
97ce86f93c |
mm/page_alloc.c: micro-optimisation Remove unnecessary branch
Previously if branch condition was false, the assignment was not executed. The assignment can be safely executed even when the condition is false and it is not incorrect as it assigns the value of 'nodemask' to 'ac.nodemask' which already has the same value. So as the assignment can be executed unconditionally, the branch can be removed. Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: http://lkml.kernel.org/r/20200307225335.31300-1-mateusznosek0@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
chenqiwu
|
76089d0082 |
mm/page_alloc.c: use free_area_empty() instead of open-coding
Use free_area_empty() API to replace list_empty() for better code readability. Signed-off-by: chenqiwu <chenqiwu@xiaomi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: http://lkml.kernel.org/r/1583674354-7713-1-git-send-email-qiwuchen55@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Mateusz Nosek
|
736838e964 |
mm, pagealloc: micro-optimisation: save two branches on hot page allocation path
This patch makes ALLOC_KSWAPD equal to __GFP_KSWAPD_RECLAIM (cast to int). Thanks to that code like: if (gfp_mask & __GFP_KSWAPD_RECLAIM) alloc_flags |= ALLOC_KSWAPD; can be changed to: alloc_flags |= (__force int) (gfp_mask &__GFP_KSWAPD_RECLAIM); Thanks to this one branch less is generated in the assembly. In case of ALLOC_KSWAPD flag two branches are saved, first one in code that always executes in the beginning of page allocation and the second one in loop in page allocator slowpath. Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: http://lkml.kernel.org/r/20200304162118.14784-1-mateusznosek0@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Joel Savitz
|
ee8eb9a5fe |
mm/page_alloc: increase default min_free_kbytes bound
Currently, the vm.min_free_kbytes sysctl value is capped at a hardcoded 64M in init_per_zone_wmark_min (unless it is overridden by khugepaged initialization). This value has not been modified since 2005, and enterprise-grade systems now frequently have hundreds of GB of RAM and multiple 10, 40, or even 100 GB NICs. We have seen page allocation failures on heavily loaded systems related to NIC drivers. These issues were resolved by an increase to vm.min_free_kbytes. This patch increases the hardcoded value by a factor of 4 as a temporary solution. Further work to make the calculation of vm.min_free_kbytes more consistent throughout the kernel would be desirable. As an example of the inconsistency of the current method, this value is recalculated by init_per_zone_wmark_min() in the case of memory hotplug which will override the value set by set_recommended_min_free_kbytes() called during khugepaged initialization even if khugepaged remains enabled, however an on/off toggle of khugepaged will then recalculate and set the value via set_recommended_min_free_kbytes(). Signed-off-by: Joel Savitz <jsavitz@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Rafael Aquini <aquini@redhat.com> Link: http://lkml.kernel.org/r/20200220150103.5183-1-jsavitz@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Roman Gushchin
|
f4b00eab50 |
mm: kmem: rename memcg_kmem_(un)charge() into memcg_kmem_(un)charge_page()
Rename (__)memcg_kmem_(un)charge() into (__)memcg_kmem_(un)charge_page() to better reflect what they are actually doing: 1) call __memcg_kmem_(un)charge_memcg() to actually charge or uncharge the current memcg 2) set or clear the PageKmemcg flag Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Link: http://lkml.kernel.org/r/20200109202659.752357-4-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
John Hubbard
|
47e29d32af |
mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages
For huge pages (and in fact, any compound page), the GUP_PIN_COUNTING_BIAS scheme tends to overflow too easily, each tail page increments the head page->_refcount by GUP_PIN_COUNTING_BIAS (1024). That limits the number of huge pages that can be pinned. This patch removes that limitation, by using an exact form of pin counting for compound pages of order > 1. The "order > 1" is required because this approach uses the 3rd struct page in the compound page, and order 1 compound pages only have two pages, so that won't work there. A new struct page field, hpage_pinned_refcount, has been added, replacing a padding field in the union (so no new space is used). This enhancement also has a useful side effect: huge pages and compound pages (of order > 1) do not suffer from the "potential false positives" problem that is discussed in the page_dma_pinned() comment block. That is because these compound pages have extra space for tracking things, so they get exact pin counts instead of overloading page->_refcount. Documentation/core-api/pin_user_pages.rst is updated accordingly. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20200211001536.1027652-8-jhubbard@nvidia.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Vlastimil Babka
|
667c790169 |
revert "topology: add support for node_to_mem_node() to determine the fallback node"
This reverts commit |
||
Aneesh Kumar K.V
|
1f8d75c1b7 |
mm/memmap_init: update variable name in memmap_init_zone
Patch series "mm/memory_hotplug: Shrink zones before removing memory", v6. This series fixes the access of uninitialized memmaps when shrinking zones/nodes and when removing memory. Also, it contains all fixes for crashes that can be triggered when removing certain namespace using memunmap_pages() - ZONE_DEVICE, reported by Aneesh. We stop trying to shrink ZONE_DEVICE, as it's buggy, fixing it would be more involved (we don't have SECTION_IS_ONLINE as an indicator), and shrinking is only of limited use (set_zone_contiguous() cannot detect the ZONE_DEVICE as contiguous). We continue shrinking !ZONE_DEVICE zones, however, I reduced the amount of code to a minimum. Shrinking is especially necessary to keep zone->contiguous set where possible, especially, on memory unplug of DIMMs at zone boundaries. -------------------------------------------------------------------------- Zones are now properly shrunk when offlining memory blocks or when onlining failed. This allows to properly shrink zones on memory unplug even if the separate memory blocks of a DIMM were onlined to different zones or re-onlined to a different zone after offlining. Example: :/# cat /proc/zoneinfo Node 1, zone Movable spanned 0 present 0 managed 0 :/# echo "online_movable" > /sys/devices/system/memory/memory41/state :/# echo "online_movable" > /sys/devices/system/memory/memory43/state :/# cat /proc/zoneinfo Node 1, zone Movable spanned 98304 present 65536 managed 65536 :/# echo 0 > /sys/devices/system/memory/memory43/online :/# cat /proc/zoneinfo Node 1, zone Movable spanned 32768 present 32768 managed 32768 :/# echo 0 > /sys/devices/system/memory/memory41/online :/# cat /proc/zoneinfo Node 1, zone Movable spanned 0 present 0 managed 0 This patch (of 6): The third argument is actually number of pages. Change the variable name from size to nr_pages to indicate this better. No functional change in this patch. Link: http://lkml.kernel.org/r/20191006085646.5768-3-david@redhat.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pankaj Gupta <pagupta@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
David Hildenbrand
|
4c6058814e |
mm: factor out next_present_section_nr()
Let's move it to the header and use the shorter variant from mm/page_alloc.c (the original one will also check "__highest_present_section_nr + 1", which is not necessary). While at it, make the section_nr in next_pfn() const. In next_pfn(), we now return section_nr_to_pfn(-1) instead of -1 once we exceed __highest_present_section_nr, which doesn't make a difference in the caller as it is big enough (>= all sane end_pfn). Link: http://lkml.kernel.org/r/20200113144035.10848-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Baoquan He <bhe@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: "Jin, Zhi" <zhi.jin@intel.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
David Hildenbrand
|
948c436e46 |
mm/page_alloc: fix and rework pfn handling in memmap_init_zone()
Let's update the pfn manually whenever we continue the loop. This makes
the code easier to read but also less error prone (and we can directly fix
one issue).
When overlap_memmap_init() returns true, pfn is updated to
"memblock_region_memory_end_pfn(r)". So it already points at the *next*
pfn to process. Incrementing the pfn another time is wrong, we might
leave one uninitialized. I spotted this by inspecting the code, so I have
no idea if this is relevant in practise (with kernelcore=mirror).
Link: http://lkml.kernel.org/r/20200113144035.10848-2-david@redhat.com
Fixes:
|
||
David Hildenbrand
|
4b094b7851 |
mm/page_alloc.c: initialize memmap of unavailable memory directly
Let's make sure that all memory holes are actually marked PageReserved(), that page_to_pfn() produces reliable results, and that these pages are not detected as "mmap" pages due to the mapcount. E.g., booting a x86-64 QEMU guest with 4160 MB: [ 0.010585] Early memory node ranges [ 0.010586] node 0: [mem 0x0000000000001000-0x000000000009efff] [ 0.010588] node 0: [mem 0x0000000000100000-0x00000000bffdefff] [ 0.010589] node 0: [mem 0x0000000100000000-0x0000000143ffffff] max_pfn is 0x144000. Before this change: [root@localhost ~]# ./page-types -r -a 0x144000, flags page-count MB symbolic-flags long-symbolic-flags 0x0000000000000800 16384 64 ___________M_______________________________ mmap total 16384 64 After this change: [root@localhost ~]# ./page-types -r -a 0x144000, flags page-count MB symbolic-flags long-symbolic-flags 0x0000000100000000 16384 64 ___________________________r_______________ reserved total 16384 64 IOW, especially the unavailable physical memory ("memory hole") in the last section would not get properly marked PageReserved() and is indicated to be "mmap" memory. Drop the trace of that function from include/linux/mm.h - nobody else needs it, and rename it accordingly. Note: The fake zone/node might not be covered by the zone/node span. This is not an urgent issue (for now, we had the same node/zone due to the zeroing). We'll need a clean way to mark memory holes (e.g., using a page type PageHole() if possible or a fake ZONE_INVALID) and eventually stop marking these memory holes PageReserved(). Link: http://lkml.kernel.org/r/20191211163201.17179-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Bob Picco <bob.picco@oracle.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
David Hildenbrand
|
e822969cab |
mm/page_alloc.c: fix uninitialized memmaps on a partially populated last section
Patch series "mm: fix max_pfn not falling on section boundary", v2. Playing with different memory sizes for a x86-64 guest, I discovered that some memmaps (highest section if max_mem does not fall on the section boundary) are marked as being valid and online, but contain garbage. We have to properly initialize these memmaps. Looking at /proc/kpageflags and friends, I found some more issues, partially related to this. This patch (of 3): If max_pfn is not aligned to a section boundary, we can easily run into BUGs. This can e.g., be triggered on x86-64 under QEMU by specifying a memory size that is not a multiple of 128MB (e.g., 4097MB, but also 4160MB). I was told that on real HW, we can easily have this scenario (esp., one of the main reasons sub-section hotadd of devmem was added). The issue is, that we have a valid memmap (pfn_valid()) for the whole section, and the whole section will be marked "online". pfn_to_online_page() will succeed, but the memmap contains garbage. E.g., doing a "./page-types -r -a 0x144001" when QEMU was started with "-m 4160M" - (see tools/vm/page-types.c): [ 200.476376] BUG: unable to handle page fault for address: fffffffffffffffe [ 200.477500] #PF: supervisor read access in kernel mode [ 200.478334] #PF: error_code(0x0000) - not-present page [ 200.479076] PGD 59614067 P4D 59614067 PUD 59616067 PMD 0 [ 200.479557] Oops: 0000 [#4] SMP NOPTI [ 200.479875] CPU: 0 PID: 603 Comm: page-types Tainted: G D W 5.5.0-rc1-next-20191209 #93 [ 200.480646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4 [ 200.481648] RIP: 0010:stable_page_flags+0x4d/0x410 [ 200.482061] Code: f3 ff 41 89 c0 48 b8 00 00 00 00 01 00 00 00 45 84 c0 0f 85 cd 02 00 00 48 8b 53 08 48 8b 2b 48f [ 200.483644] RSP: 0018:ffffb139401cbe60 EFLAGS: 00010202 [ 200.484091] RAX: fffffffffffffffe RBX: fffffbeec5100040 RCX: 0000000000000000 [ 200.484697] RDX: 0000000000000001 RSI: ffffffff9535c7cd RDI: 0000000000000246 [ 200.485313] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000 [ 200.485917] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000144001 [ 200.486523] R13: 00007ffd6ba55f48 R14: 00007ffd6ba55f40 R15: ffffb139401cbf08 [ 200.487130] FS: 00007f68df717580(0000) GS:ffff9ec77fa00000(0000) knlGS:0000000000000000 [ 200.487804] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 200.488295] CR2: fffffffffffffffe CR3: 0000000135d48000 CR4: 00000000000006f0 [ 200.488897] Call Trace: [ 200.489115] kpageflags_read+0xe9/0x140 [ 200.489447] proc_reg_read+0x3c/0x60 [ 200.489755] vfs_read+0xc2/0x170 [ 200.490037] ksys_pread64+0x65/0xa0 [ 200.490352] do_syscall_64+0x5c/0xa0 [ 200.490665] entry_SYSCALL_64_after_hwframe+0x49/0xbe But it can be triggered much easier via "cat /proc/kpageflags > /dev/null" after cold/hot plugging a DIMM to such a system: [root@localhost ~]# cat /proc/kpageflags > /dev/null [ 111.517275] BUG: unable to handle page fault for address: fffffffffffffffe [ 111.517907] #PF: supervisor read access in kernel mode [ 111.518333] #PF: error_code(0x0000) - not-present page [ 111.518771] PGD a240e067 P4D a240e067 PUD a2410067 PMD 0 This patch fixes that by at least zero-ing out that memmap (so e.g., page_to_pfn() will not crash). Commit |
||
Qian Cai
|
3d680bdf60 |
mm/page_isolation: fix potential warning from user
It makes sense to call the WARN_ON_ONCE(zone_idx(zone) == ZONE_MOVABLE) from start_isolate_page_range(), but should avoid triggering it from userspace, i.e, from is_mem_section_removable() because it could crash the system by a non-root user if warn_on_panic is set. While at it, simplify the code a bit by removing an unnecessary jump label. Link: http://lkml.kernel.org/r/20200120163915.1469-1-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Suggested-by: Michal Hocko <mhocko@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Qian Cai
|
4a55c0474a |
mm/hotplug: silence a lockdep splat with printk()
It is not that hard to trigger lockdep splats by calling printk from under zone->lock. Most of them are false positives caused by lock chains introduced early in the boot process and they do not cause any real problems (although most of the early boot lock dependencies could happen after boot as well). There are some console drivers which do allocate from the printk context as well and those should be fixed. In any case, false positives are not that trivial to workaround and it is far from optimal to lose lockdep functionality for something that is a non-issue. So change has_unmovable_pages() so that it no longer calls dump_page() itself - instead it returns a "struct page *" of the unmovable page back to the caller so that in the case of a has_unmovable_pages() failure, the caller can call dump_page() after releasing zone->lock. Also, make dump_page() is able to report a CMA page as well, so the reason string from has_unmovable_pages() can be removed. Even though has_unmovable_pages doesn't hold any reference to the returned page this should be reasonably safe for the purpose of reporting the page (dump_page) because it cannot be hotremoved in the context of memory unplug. The state of the page might change but that is the case even with the existing code as zone->lock only plays role for free pages. While at it, remove a similar but unnecessary debug-only printk() as well. A sample of one of those lockdep splats is, WARNING: possible circular locking dependency detected ------------------------------------------------------ test.sh/8653 is trying to acquire lock: ffffffff865a4460 (console_owner){-.-.}, at: console_unlock+0x207/0x750 but task is already holding lock: ffff88883fff3c58 (&(&zone->lock)->rlock){-.-.}, at: __offline_isolated_pages+0x179/0x3e0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&(&zone->lock)->rlock){-.-.}: __lock_acquire+0x5b3/0xb40 lock_acquire+0x126/0x280 _raw_spin_lock+0x2f/0x40 rmqueue_bulk.constprop.21+0xb6/0x1160 get_page_from_freelist+0x898/0x22c0 __alloc_pages_nodemask+0x2f3/0x1cd0 alloc_pages_current+0x9c/0x110 allocate_slab+0x4c6/0x19c0 new_slab+0x46/0x70 ___slab_alloc+0x58b/0x960 __slab_alloc+0x43/0x70 __kmalloc+0x3ad/0x4b0 __tty_buffer_request_room+0x100/0x250 tty_insert_flip_string_fixed_flag+0x67/0x110 pty_write+0xa2/0xf0 n_tty_write+0x36b/0x7b0 tty_write+0x284/0x4c0 __vfs_write+0x50/0xa0 vfs_write+0x105/0x290 redirected_tty_write+0x6a/0xc0 do_iter_write+0x248/0x2a0 vfs_writev+0x106/0x1e0 do_writev+0xd4/0x180 __x64_sys_writev+0x45/0x50 do_syscall_64+0xcc/0x76c entry_SYSCALL_64_after_hwframe+0x49/0xbe -> #2 (&(&port->lock)->rlock){-.-.}: __lock_acquire+0x5b3/0xb40 lock_acquire+0x126/0x280 _raw_spin_lock_irqsave+0x3a/0x50 tty_port_tty_get+0x20/0x60 tty_port_default_wakeup+0xf/0x30 tty_port_tty_wakeup+0x39/0x40 uart_write_wakeup+0x2a/0x40 serial8250_tx_chars+0x22e/0x440 serial8250_handle_irq.part.8+0x14a/0x170 serial8250_default_handle_irq+0x5c/0x90 serial8250_interrupt+0xa6/0x130 __handle_irq_event_percpu+0x78/0x4f0 handle_irq_event_percpu+0x70/0x100 handle_irq_event+0x5a/0x8b handle_edge_irq+0x117/0x370 do_IRQ+0x9e/0x1e0 ret_from_intr+0x0/0x2a cpuidle_enter_state+0x156/0x8e0 cpuidle_enter+0x41/0x70 call_cpuidle+0x5e/0x90 do_idle+0x333/0x370 cpu_startup_entry+0x1d/0x1f start_secondary+0x290/0x330 secondary_startup_64+0xb6/0xc0 -> #1 (&port_lock_key){-.-.}: __lock_acquire+0x5b3/0xb40 lock_acquire+0x126/0x280 _raw_spin_lock_irqsave+0x3a/0x50 serial8250_console_write+0x3e4/0x450 univ8250_console_write+0x4b/0x60 console_unlock+0x501/0x750 vprintk_emit+0x10d/0x340 vprintk_default+0x1f/0x30 vprintk_func+0x44/0xd4 printk+0x9f/0xc5 -> #0 (console_owner){-.-.}: check_prev_add+0x107/0xea0 validate_chain+0x8fc/0x1200 __lock_acquire+0x5b3/0xb40 lock_acquire+0x126/0x280 console_unlock+0x269/0x750 vprintk_emit+0x10d/0x340 vprintk_default+0x1f/0x30 vprintk_func+0x44/0xd4 printk+0x9f/0xc5 __offline_isolated_pages.cold.52+0x2f/0x30a offline_isolated_pages_cb+0x17/0x30 walk_system_ram_range+0xda/0x160 __offline_pages+0x79c/0xa10 offline_pages+0x11/0x20 memory_subsys_offline+0x7e/0xc0 device_offline+0xd5/0x110 state_store+0xc6/0xe0 dev_attr_store+0x3f/0x60 sysfs_kf_write+0x89/0xb0 kernfs_fop_write+0x188/0x240 __vfs_write+0x50/0xa0 vfs_write+0x105/0x290 ksys_write+0xc6/0x160 __x64_sys_write+0x43/0x50 do_syscall_64+0xcc/0x76c entry_SYSCALL_64_after_hwframe+0x49/0xbe other info that might help us debug this: Chain exists of: console_owner --> &(&port->lock)->rlock --> &(&zone->lock)->rlock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&(&zone->lock)->rlock); lock(&(&port->lock)->rlock); lock(&(&zone->lock)->rlock); lock(console_owner); *** DEADLOCK *** 9 locks held by test.sh/8653: #0: ffff88839ba7d408 (sb_writers#4){.+.+}, at: vfs_write+0x25f/0x290 #1: ffff888277618880 (&of->mutex){+.+.}, at: kernfs_fop_write+0x128/0x240 #2: ffff8898131fc218 (kn->count#115){.+.+}, at: kernfs_fop_write+0x138/0x240 #3: ffffffff86962a80 (device_hotplug_lock){+.+.}, at: lock_device_hotplug_sysfs+0x16/0x50 #4: ffff8884374f4990 (&dev->mutex){....}, at: device_offline+0x70/0x110 #5: ffffffff86515250 (cpu_hotplug_lock.rw_sem){++++}, at: __offline_pages+0xbf/0xa10 #6: ffffffff867405f0 (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x87/0x2f0 #7: ffff88883fff3c58 (&(&zone->lock)->rlock){-.-.}, at: __offline_isolated_pages+0x179/0x3e0 #8: ffffffff865a4920 (console_lock){+.+.}, at: vprintk_emit+0x100/0x340 stack backtrace: Hardware name: HPE ProLiant DL560 Gen10/ProLiant DL560 Gen10, BIOS U34 05/21/2019 Call Trace: dump_stack+0x86/0xca print_circular_bug.cold.31+0x243/0x26e check_noncircular+0x29e/0x2e0 check_prev_add+0x107/0xea0 validate_chain+0x8fc/0x1200 __lock_acquire+0x5b3/0xb40 lock_acquire+0x126/0x280 console_unlock+0x269/0x750 vprintk_emit+0x10d/0x340 vprintk_default+0x1f/0x30 vprintk_func+0x44/0xd4 printk+0x9f/0xc5 __offline_isolated_pages.cold.52+0x2f/0x30a offline_isolated_pages_cb+0x17/0x30 walk_system_ram_range+0xda/0x160 __offline_pages+0x79c/0xa10 offline_pages+0x11/0x20 memory_subsys_offline+0x7e/0xc0 device_offline+0xd5/0x110 state_store+0xc6/0xe0 dev_attr_store+0x3f/0x60 sysfs_kf_write+0x89/0xb0 kernfs_fop_write+0x188/0x240 __vfs_write+0x50/0xa0 vfs_write+0x105/0x290 ksys_write+0xc6/0x160 __x64_sys_write+0x43/0x50 do_syscall_64+0xcc/0x76c entry_SYSCALL_64_after_hwframe+0x49/0xbe Link: http://lkml.kernel.org/r/20200117181200.20299-1-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
David Hildenbrand
|
fe4c86c916 |
mm: remove "count" parameter from has_unmovable_pages()
Now that the memory isolate notifier is gone, the parameter is always 0. Drop it and cleanup has_unmovable_pages(). Link: http://lkml.kernel.org/r/20191114131911.11783-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Qian Cai <cai@lca.pw> Cc: Pingfan Liu <kernelfans@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Wei Yang <richardw.yang@linux.intel.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Arun KS <arunks@codeaurora.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Kirill A. Shutemov
|
3f1353552e |
mm/page_alloc: skip non present sections on zone initialization
memmap_init_zone() can be called on the ranges with holes during the boot. It will skip any non-valid PFNs one-by-one. It works fine as long as holes are not too big. But huge holes in the memory map causes a problem. It takes over 20 seconds to walk 32TiB hole. x86-64 with 5-level paging allows for much larger holes in the memory map which would practically hang the system. Deferred struct page init doesn't help here. It only works on the present ranges. Skipping non-present sections would fix the issue. Link: http://lkml.kernel.org/r/20191230093828.24613-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: "Jin, Zhi" <zhi.jin@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Vlastimil Babka
|
8e57f8acbb |
mm, debug_pagealloc: don't rely on static keys too early
Commit |