I observed that for each of the shared file-backed page faults, we're very
likely to retry one more time for the 1st write fault upon no page. It's
because we'll need to release the mmap lock for dirty rate limit purpose
with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).
Then after that throttling we return VM_FAULT_RETRY.
We did that probably because VM_FAULT_RETRY is the only way we can return
to the fault handler at that time telling it we've released the mmap lock.
However that's not ideal because it's very likely the fault does not need
to be retried at all since the pgtable was well installed before the
throttling, so the next continuous fault (including taking mmap read lock,
walk the pgtable, etc.) could be in most cases unnecessary.
It's not only slowing down page faults for shared file-backed, but also add
more mmap lock contention which is in most cases not needed at all.
To observe this, one could try to write to some shmem page and look at
"pgfault" value in /proc/vmstat, then we should expect 2 counts for each
shmem write simply because we retried, and vm event "pgfault" will capture
that.
To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
show that we've completed the whole fault and released the lock. It's also
a hint that we should very possibly not need another fault immediately on
this page because we've just completed it.
This patch provides a ~12% perf boost on my aarch64 test VM with a simple
program sequentially dirtying 400MB shmem file being mmap()ed and these are
the time it needs:
Before: 650.980 ms (+-1.94%)
After: 569.396 ms (+-1.38%)
I believe it could help more than that.
We need some special care on GUP and the s390 pgfault handler (for gmap
code before returning from pgfault), the rest changes in the page fault
handlers should be relatively straightforward.
Another thing to mention is that mm_account_fault() does take this new
fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.
I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
them as-is.
Link: https://lkml.kernel.org/r/20220530183450.42886-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vineet Gupta <vgupta@kernel.org>
Acked-by: Guo Ren <guoren@kernel.org>
Acked-by: Max Filippov <jcmvbkbc@gmail.com>
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> [arm part]
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Stafford Horne <shorne@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Richard Weinberger <richard@nod.at>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Will Deacon <will@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Chris Zankel <chris@zankel.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Helge Deller <deller@gmx.de>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pull arch/sh updates from Rich Felker.
* tag 'sh-for-5.16' of git://git.libc.org/linux-sh:
sh: pgtable-3level: Fix cast to pointer from integer of different size
sh: fix READ/WRITE redefinition warnings
sh: define __BIG_ENDIAN for math-emu
sh: math-emu: drop unused functions
sh: fix kconfig unmet dependency warning for FRAME_POINTER
sh: Cleanup about SPARSE_IRQ
sh: kdump: add some attribute to function
maple: fix wrong return value of maple_bus_init().
sh: boot: avoid unneeded rebuilds under arch/sh/boot/compressed/
sh: boot: add intermediate vmlinux.bin* to targets instead of extra-y
sh: boards: Fix the cacography in irq.c
sh: check return code of request_irq
sh: fix trivial misannotations
Trivial misannotations in
* get_user() (__gu_addr is a userland pointer there)
* ip_fast_csum() (sum is __wsum, not unsigned int)
* csum_and_copy_to_user() (destination is void *, not const void * -
mea culpa)
* __clear_user() (to is a userland pointer)
* several places in kernel/traps_32.c (regs->pc is a userland pointer
when regs is a userland pt_regs)
* math-emu/math.c: READ() and WRITE() casts of address should be to
userland pointer.
No changes in code generation and those take care of the majority of
noise from sparse on sh builds.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Signed-off-by: Rich Felker <dalias@libc.org>
Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code.
Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com>
Signed-off-by: Rich Felker <dalias@libc.org>
Pull arch/sh updates from Rich Felker:
"Cleanup, SECCOMP_FILTER support, message printing fixes, and other
changes to arch/sh"
* tag 'sh-for-5.9' of git://git.libc.org/linux-sh: (34 commits)
sh: landisk: Add missing initialization of sh_io_port_base
sh: bring syscall_set_return_value in line with other architectures
sh: Add SECCOMP_FILTER
sh: Rearrange blocks in entry-common.S
sh: switch to copy_thread_tls()
sh: use the generic dma coherent remap allocator
sh: don't allow non-coherent DMA for NOMMU
dma-mapping: consolidate the NO_DMA definition in kernel/dma/Kconfig
sh: unexport register_trapped_io and match_trapped_io_handler
sh: don't include <asm/io_trapped.h> in <asm/io.h>
sh: move the ioremap implementation out of line
sh: move ioremap_fixed details out of <asm/io.h>
sh: remove __KERNEL__ ifdefs from non-UAPI headers
sh: sort the selects for SUPERH alphabetically
sh: remove -Werror from Makefiles
sh: Replace HTTP links with HTTPS ones
arch/sh/configs: remove obsolete CONFIG_SOC_CAMERA*
sh: stacktrace: Remove stacktrace_ops.stack()
sh: machvec: Modernize printing of kernel messages
sh: pci: Modernize printing of kernel messages
...
Move the internal implementation details of ioremap out of line, no need
to expose any of this to drivers for a slow path API.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Felker <dalias@libc.org>
ioremap_fixed is an internal implementation detail and should not be
exposed to drivers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Felker <dalias@libc.org>
The sh build is full of warnings when building with gcc 9.2.1. While
fixing those would be great, at least avoid failing the build.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rich Felker <dalias@libc.org>
This reverts commit 2deebe4d56.
printk_address() is always used as a continuation of the previous
logging, hence it should not include a log level.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Rich Felker <dalias@libc.org>
Somewhere along the patch handling path, both the old "printk(KERN_ALERT
....)" and the new "pr_alert(...)" were retained, leading to the
duplicate printing of "PC:".
Drop the old one.
Fixes: eaabf98b09 ("sh: fault: modernize printing of kernel messages")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Rich Felker <dalias@libc.org>
Function dma_alloc_coherent use in buf already zeroes out memory,
so memset is not needed.
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Signed-off-by: Rich Felker <dalias@libc.org>
The pgd kmem_cache allocation both specified __GFP_ZERO and had a
constructor which makes no sense. Remove __GFP_ZERO and zero the user
parts of the pgd explicitly.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Rich Felker <dalias@libc.org>
After removal of CONFIG_HAVE_MEMBLOCK_NODE_MAP we have two equivalent
functions that call memory_present() for each region in memblock.memory:
sparse_memory_present_with_active_regions() and membocks_present().
Moreover, all architectures have a call to either of these functions
preceding the call to sparse_init() and in the most cases they are called
one after the other.
Mark the regions from memblock.memory as present during sparce_init() by
making sparse_init() call memblocks_present(), make memblocks_present()
and memory_present() functions static and remove redundant
sparse_memory_present_with_active_regions() function.
Also remove no longer required HAVE_MEMORY_PRESENT configuration option.
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200712083130.22919-1-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: cleanup usage of <asm/pgalloc.h>"
Most architectures have very similar versions of pXd_alloc_one() and
pXd_free_one() for intermediate levels of page table. These patches add
generic versions of these functions in <asm-generic/pgalloc.h> and enable
use of the generic functions where appropriate.
In addition, functions declared and defined in <asm/pgalloc.h> headers are
used mostly by core mm and early mm initialization in arch and there is no
actual reason to have the <asm/pgalloc.h> included all over the place.
The first patch in this series removes unneeded includes of
<asm/pgalloc.h>
In the end it didn't work out as neatly as I hoped and moving
pXd_alloc_track() definitions to <asm-generic/pgalloc.h> would require
unnecessary changes to arches that have custom page table allocations, so
I've decided to move lib/ioremap.c to mm/ and make pgalloc-track.h local
to mm/.
This patch (of 8):
In most cases <asm/pgalloc.h> header is required only for allocations of
page table memory. Most of the .c files that include that header do not
use symbols declared in <asm/pgalloc.h> and do not require that header.
As for the other header files that used to include <asm/pgalloc.h>, it is
possible to move that include into the .c file that actually uses symbols
from <asm/pgalloc.h> and drop the include from the header file.
The process was somewhat automated using
sed -i -E '/[<"]asm\/pgalloc\.h/d' \
$(grep -L -w -f /tmp/xx \
$(git grep -E -l '[<"]asm/pgalloc\.h'))
where /tmp/xx contains all the symbols defined in
arch/*/include/asm/pgalloc.h.
[rppt@linux.ibm.com: fix powerpc warning]
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Link: http://lkml.kernel.org/r/20200627143453.31835-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200627143453.31835-2-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 84af7a6194 ("checkpatch: kconfig: prefer 'help' over
'---help---'"), the number of '---help---' has been gradually
decreasing, but there are still more than 2400 instances.
This commit finishes the conversion. While I touched the lines,
I also fixed the indentation.
There are a variety of indentation styles found.
a) 4 spaces + '---help---'
b) 7 spaces + '---help---'
c) 8 spaces + '---help---'
d) 1 space + 1 tab + '---help---'
e) 1 tab + '---help---' (correct indentation)
f) 1 tab + 1 space + '---help---'
g) 1 tab + 2 spaces + '---help---'
In order to convert all of them to 1 tab + 'help', I ran the
following commend:
$ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
The powerpc 32-bit implementation of pgtable has nice shortcuts for
accessing kernel PMD and PTE for a given virtual address. Make these
helpers available for all architectures.
[rppt@linux.ibm.com: microblaze: fix page table traversal in setup_rt_frame()]
Link: http://lkml.kernel.org/r/20200518191511.GD1118872@kernel.org
[akpm@linux-foundation.org: s/pmd_ptr_k/pmd_off_k/ in various powerpc places]
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-9-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, the log-level of show_stack() depends on a platform
realization. It creates situations where the headers are printed with
lower log level or higher than the stacktrace (depending on a platform or
user).
Furthermore, it forces the logic decision from user to an architecture
side. In result, some users as sysrq/kdb/etc are doing tricks with
temporary rising console_loglevel while printing their messages. And in
result it not only may print unwanted messages from other CPUs, but also
omit printing at all in the unlucky case where the printk() was deferred.
Introducing log-level parameter and KERN_UNSUPPRESSED [1] seems an easier
approach than introducing more printk buffers. Also, it will consolidate
printings with headers.
Add log level argument to printk_address() as a preparation to introduce
show_stack_loglvl().
As a good side-effect show_fault_oops() now prints the address with
KERN_EMREG as the rest of output, making sure there won't be situation
where "PC: " is printed without actual address.
[1]: https://lore.kernel.org/lkml/20190528002412.1625-1-dima@arista.com/T/#u
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Rich Felker <dalias@libc.org>
Link: http://lkml.kernel.org/r/20200418201944.482088-32-dima@arista.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull arch/sh updates from Rich Felker:
"Fix for arch/sh build regression with newer binutils, removal of SH5,
fixes for module exports, and misc cleanup"
* tag 'sh-for-5.8' of git://git.libc.org/linux-sh:
sh: remove sh5 support
sh: add missing EXPORT_SYMBOL() for __delay
sh: Convert ins[bwl]/outs[bwl] macros to inline functions
sh: Convert iounmap() macros to inline functions
sh: Add missing DECLARE_EXPORT() for __ashiftrt_r4_xx
sh: configs: Cleanup old Kconfig IO scheduler options
arch/sh: vmlinux.scr
sh: Replace CONFIG_MTD_M25P80 with CONFIG_MTD_SPI_NOR in sh7757lcr_defconfig
sh: sh4a: Bring back tmu3_device early device
sh5 never became a product and has probably never really worked.
Remove it by recursively deleting all associated Kconfig options
and all corresponding files.
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rich Felker <dalias@libc.org>
devm_memremap_pages() is currently used by the PCI P2PDMA code to create
struct page mappings for IO memory. At present, these mappings are
created with PAGE_KERNEL which implies setting the PAT bits to be WB.
However, on x86, an mtrr register will typically override this and force
the cache type to be UC-. In the case firmware doesn't set this
register it is effectively WB and will typically result in a machine
check exception when it's accessed.
Other arches are not currently likely to function correctly seeing they
don't have any MTRR registers to fall back on.
To solve this, provide a way to specify the pgprot value explicitly to
arch_add_memory().
Of the arches that support MEMORY_HOTPLUG: x86_64, and arm64 need a
simple change to pass the pgprot_t down to their respective functions
which set up the page tables. For x86_32, set the page tables
explicitly using _set_memory_prot() (seeing they are already mapped).
For ia64, s390 and sh, reject anything but PAGE_KERNEL settings -- this
should be fine, for now, seeing these architectures don't support
ZONE_DEVICE.
A check in __add_pages() is also added to ensure the pgprot parameter
was set for all arches.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Eric Badger <ebadger@gigaio.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200306170846.9333-7-logang@deltatee.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>