linux/arch/arm/lib/uaccess_with_memcpy.c

283 lines
6.2 KiB
C

// SPDX-License-Identifier: GPL-2.0-only
[ARM] alternative copy_to_user/clear_user implementation This implements {copy_to,clear}_user() by faulting in the userland pages and then using the regular kernel mem{cpy,set}() to copy the data (while holding the page table lock). This is a win if the regular mem{cpy,set}() implementations are faster than the user copy functions, which is the case e.g. on Feroceon, where 8-word STMs (which memcpy() uses under the right conditions) give significantly higher memory write throughput than a sequence of individual 32bit stores. Here are numbers for page sized buffers on some Feroceon cores: - copy_to_user on Orion5x goes from 51 MB/s to 83 MB/s - clear_user on Orion5x goes from 89MB/s to 314MB/s - copy_to_user on Kirkwood goes from 240 MB/s to 356 MB/s - clear_user on Kirkwood goes from 367 MB/s to 1108 MB/s - copy_to_user on Disco-Duo goes from 248 MB/s to 398 MB/s - clear_user on Disco-Duo goes from 328 MB/s to 1741 MB/s Because the setup cost is non negligible, this is worthwhile only if the amount of data to copy is large enough. The operation falls back to the standard implementation when the amount of data is below a certain threshold. This threshold was determined empirically, however some targets could benefit from a lower runtime determined value for optimal results eventually. In the copy_from_user() case, this technique does not provide any worthwhile performance gain due to the fact that any kind of read access allocates the cache and subsequent 32bit loads are just as fast as the equivalent 8-word LDM. Signed-off-by: Lennert Buytenhek <buytenh@marvell.com> Signed-off-by: Nicolas Pitre <nico@marvell.com> Tested-by: Martin Michlmayr <tbm@cyrius.com>
2009-03-09 18:30:09 +00:00
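The fallback rule described in the commit message above (use the bulk path only when the buffer is large enough, otherwise use the standard copy routine) can be pictured with a small user-space sketch. This is not the kernel implementation: `BULK_COPY_THRESHOLD`, `copy_slow`, `copy_bulk`, `copy_dispatch` and `enum copy_path` are hypothetical names chosen for illustration, and the cutoff value of 64 is an arbitrary assumption rather than the empirically determined one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cutoff; the commit message only says the real value was found empirically. */
#define BULK_COPY_THRESHOLD 64

enum copy_path { COPY_PATH_SLOW, COPY_PATH_BULK };

/* Stand-in for the standard user-copy routine: one store per byte. */
static size_t copy_slow(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t i;

	for (i = 0; i < n; i++)
		d[i] = s[i];
	return 0;	/* bytes NOT copied, as copy_to_user() reports */
}

/* Stand-in for the memcpy()-based path, which has a higher setup cost in the kernel. */
static size_t copy_bulk(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);
	return 0;	/* bytes NOT copied */
}

/* Pick a path by size and record which one ran in *path. */
static size_t copy_dispatch(void *dst, const void *src, size_t n,
			    enum copy_path *path)
{
	assert(dst != NULL && src != NULL && path != NULL);

	if (n < BULK_COPY_THRESHOLD) {
		*path = COPY_PATH_SLOW;
		return copy_slow(dst, src, n);
	}
	*path = COPY_PATH_BULK;
	return copy_bulk(dst, src, n);
}
```

Both paths return the number of bytes left uncopied, so a caller cannot tell which one ran except through the `path` out-parameter, which exists here only to make the choice observable.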
/*
 *  linux/arch/arm/lib/uaccess_with_memcpy.c
 *
 *  Written by: Lennert Buytenhek and Nicolas Pitre
 *  Copyright (C) 2009 Marvell Semiconductor
 */

#include <linux/kernel.h>
#include <linux/ctype.h>
#include <linux/uaccess.h>
#include <linux/rwsem.h>
#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/hardirq.h> /* for in_atomic() */
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. 
This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 08:04:11 +00:00
#include <linux/gfp.h>
#include <linux/highmem.h>
ARM: 7858/1: mm: make UACCESS_WITH_MEMCPY huge page aware The memory pinning code in uaccess_with_memcpy.c does not check for HugeTLB or THP pmds, and will enter an infinite loop should a __copy_to_user or __clear_user occur against a huge page. This patch adds detection code for huge pages to pin_page_for_write. As this code can be executed in a fast path it refers to the actual pmds rather than the vma. If a HugeTLB or THP is found (they have the same pmd representation on ARM), the page table spinlock is taken to prevent modification whilst the page is pinned. On ARM, huge pages are only represented as pmds, thus no huge pud checks are performed. (For huge puds one would lock the page table in a similar manner as in the pmd case). Two helper functions are introduced; pmd_thp_or_huge will check whether or not a page is huge or transparent huge (which have the same pmd layout on ARM), and pmd_hugewillfault will detect whether or not a page fault will occur on write to the page. Running the following test (with the chunking from read_zero removed): $ dd if=/dev/zero of=/dev/null bs=10M count=1024 Gave: 2.3 GB/s backed by normal pages, 2.9 GB/s backed by huge pages, 5.1 GB/s backed by huge pages, with page mask=HPAGE_MASK. After some discussion, it was decided not to adopt the HPAGE_MASK, as this would have a significant detrimental effect on the overall system latency due to page_table_lock being held for too long. This could be revisited if split huge page locks are adopted. Signed-off-by: Steve Capper <steve.capper@linaro.org> Reviewed-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2013-10-14 08:49:10 +00:00
#include <linux/hugetlb.h>
#include <asm/current.h>
#include <asm/page.h>
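The huge-page branch of `pin_page_for_write()` follows a check, lock, re-check pattern: an unlocked test of the entry, then the same test repeated with the page table lock held, because the entry may change in between. A minimal single-threaded sketch of that control flow, under stated assumptions, is given here. `struct toy_lock`, `struct toy_entry` and `toy_pin_leaf` are hypothetical stand-ins, not kernel types, and unlike the kernel function this sketch simply returns 0 for a non-leaf entry instead of continuing to the pte level.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy lock that only tracks whether it is held; not a real spinlock. */
struct toy_lock {
	bool held;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = true;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

/* Toy page-table entry with the two properties the huge-page path tests. */
struct toy_entry {
	bool is_leaf;		/* plays the role of pmd_leaf() */
	bool will_fault;	/* plays the role of pmd_hugewillfault() */
};

/*
 * Returns 1 with the lock held and *lockp set when the entry is pinned.
 * Returns 0 with the lock released in every other case.
 */
static int toy_pin_leaf(struct toy_entry *e, struct toy_lock *lock,
			struct toy_lock **lockp)
{
	/* Unlocked check: only a hint, the entry may change before we lock. */
	if (!e->is_leaf)
		return 0;

	toy_lock_acquire(lock);

	/* Re-check under the lock: still a leaf, and writable without a fault? */
	if (!e->is_leaf || e->will_fault) {
		toy_lock_release(lock);
		return 0;
	}

	*lockp = lock;
	return 1;
}
```

On success the caller owns the lock and is responsible for releasing it after the write, which is why the lock pointer is handed back through `lockp`.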
static int
pin_page_for_write(const void __user *_addr, pte_t **ptep, spinlock_t **ptlp)
{
	unsigned long addr = (unsigned long)_addr;
	pgd_t *pgd;
arm: add support for folded p4d page tables Implement primitives necessary for the 4th level folding, add walks of p4d level where appropriate, and remove __ARCH_USE_5LEVEL_HACK. [rppt@linux.ibm.com: fix kexec] Link: http://lkml.kernel.org/r/20200508174232.GA759899@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Brian Cain <bcain@codeaurora.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: James Morse <james.morse@arm.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Julien Thierry <julien.thierry.kdev@gmail.com> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/20200414153455.21744-3-rppt@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04 23:46:19 +00:00
	p4d_t *p4d;
	pmd_t *pmd;
	pte_t *pte;
	pud_t *pud;
	spinlock_t *ptl;

	pgd = pgd_offset(current->mm, addr);
	if (unlikely(pgd_none(*pgd) || pgd_bad(*pgd)))
		return 0;

	p4d = p4d_offset(pgd, addr);
	if (unlikely(p4d_none(*p4d) || p4d_bad(*p4d)))
		return 0;

	pud = pud_offset(p4d, addr);
	if (unlikely(pud_none(*pud) || pud_bad(*pud)))
		return 0;

	pmd = pmd_offset(pud, addr);
	if (unlikely(pmd_none(*pmd)))
		return 0;

	/*
	 * A pmd can be bad if it refers to a HugeTLB or THP page.
	 *
	 * Both THP and HugeTLB pages have the same pmd layout
	 * and should not be manipulated by the pte functions.
	 *
	 * Lock the page table for the destination and check
	 * to see that it's still huge and whether or not we will
	 * need to fault on write.
	 */
mm/arm: remove pmd_thp_or_huge() ARM/ARM64 used to define pmd_thp_or_huge(). Now this macro is completely redundant. Remove it and use pmd_leaf(). Link: https://lkml.kernel.org/r/20240318200404.448346-14-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Mark Salter <msalter@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Fabio Estevam <festevam@denx.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-18 20:04:03 +00:00
	if (unlikely(pmd_leaf(*pmd))) {
		ptl = &current->mm->page_table_lock;
		spin_lock(ptl);
		if (unlikely(!pmd_leaf(*pmd)
			|| pmd_hugewillfault(*pmd))) {
			spin_unlock(ptl);
			return 0;
		}

		*ptep = NULL;
		*ptlp = ptl;
		return 1;
	}

	if (unlikely(pmd_bad(*pmd)))
		return 0;

	pte = pte_offset_map_lock(current->mm, pmd, addr, &ptl);
	if (unlikely(!pte))
		return 0;
	if (unlikely(!pte_present(*pte) || !pte_young(*pte) ||
	    !pte_write(*pte) || !pte_dirty(*pte))) {
		pte_unmap_unlock(pte, ptl);
		return 0;
	}

	*ptep = pte;
	*ptlp = ptl;

	return 1;
}

static unsigned long noinline
__copy_to_user_memcpy(void __user *to, const void *from, unsigned long n)
{
	unsigned long ua_flags;
	int atomic;

	/* the mmap semaphore is taken only if not in an atomic context */
	atomic = faulthandler_disabled();
	if (!atomic)
		mmap_read_lock(current->mm);
	while (n) {
		pte_t *pte;
		spinlock_t *ptl;
		int tocopy;

		while (!pin_page_for_write(to, &pte, &ptl)) {
			if (!atomic)
				mmap_read_unlock(current->mm);
			if (__put_user(0, (char __user *)to))
				goto out;
			if (!atomic)
				mmap_read_lock(current->mm);
		}

		tocopy = (~(unsigned long)to & ~PAGE_MASK) + 1;
		if (tocopy > n)
			tocopy = n;
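		/*
		 * Illustrative note (not part of the upstream source): the
		 * tocopy expression above yields the number of bytes from
		 * 'to' up to the end of its page. Assuming 4 KiB pages, so
		 * that ~PAGE_MASK == 0xfff:
		 *
		 *   to = 0x1ff0: (~0x1ff0 & 0xfff) + 1 = 0x00f + 1 = 16
		 *   to = 0x2000: (~0x2000 & 0xfff) + 1 = 0xfff + 1 = 4096
		 *
		 * Each pass therefore stays within the single page pinned
		 * by pin_page_for_write(), clamped to the n bytes remaining.
		 */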
		ua_flags = uaccess_save_and_enable();
		__memcpy((void *)to, from, tocopy);
		uaccess_restore(ua_flags);
		to += tocopy;
		from += tocopy;
		n -= tocopy;
		if (pte)
			pte_unmap_unlock(pte, ptl);
		else
			spin_unlock(ptl);
	}
	if (!atomic)
		mmap_read_unlock(current->mm);
out:
	return n;
}

unsigned long
arm_copy_to_user(void __user *to, const void *from, unsigned long n)
{
	/*
	 * This test is stubbed out of the main function above to keep
	 * the overhead for small copies low by avoiding a large
	 * register dump on the stack just to reload them right away.
	 * With frame pointer disabled, tail call optimization kicks in
	 * as well making this test almost invisible.
	 */
	if (n < 64) {
		unsigned long ua_flags = uaccess_save_and_enable();
		n = __copy_to_user_std(to, from, n);
		uaccess_restore(ua_flags);
	} else {
		n = __copy_to_user_memcpy(uaccess_mask_range_ptr(to, n),
					  from, n);
	}
	return n;
}

static unsigned long noinline
__clear_user_memset(void __user *addr, unsigned long n)
{
	unsigned long ua_flags;
	mmap_read_lock(current->mm);
	while (n) {
		pte_t *pte;
		spinlock_t *ptl;
		int tocopy;

		while (!pin_page_for_write(addr, &pte, &ptl)) {
			mmap_read_unlock(current->mm);
			if (__put_user(0, (char __user *)addr))
				goto out;
			mmap_read_lock(current->mm);
		}

		tocopy = (~(unsigned long)addr & ~PAGE_MASK) + 1;
		if (tocopy > n)
			tocopy = n;

		ua_flags = uaccess_save_and_enable();
		__memset((void *)addr, 0, tocopy);
		uaccess_restore(ua_flags);
		addr += tocopy;
		n -= tocopy;
ARM: 7858/1: mm: make UACCESS_WITH_MEMCPY huge page aware The memory pinning code in uaccess_with_memcpy.c does not check for HugeTLB or THP pmds, and will enter an infinite loop should a __copy_to_user or __clear_user occur against a huge page. This patch adds detection code for huge pages to pin_page_for_write. As this code can be executed in a fast path it refers to the actual pmds rather than the vma. If a HugeTLB or THP is found (they have the same pmd representation on ARM), the page table spinlock is taken to prevent modification whilst the page is pinned. On ARM, huge pages are only represented as pmds, thus no huge pud checks are performed. (For huge puds one would lock the page table in a similar manner as in the pmd case). Two helper functions are introduced; pmd_thp_or_huge will check whether or not a page is huge or transparent huge (which have the same pmd layout on ARM), and pmd_hugewillfault will detect whether or not a page fault will occur on write to the page. Running the following test (with the chunking from read_zero removed): $ dd if=/dev/zero of=/dev/null bs=10M count=1024 Gave: 2.3 GB/s backed by normal pages, 2.9 GB/s backed by huge pages, 5.1 GB/s backed by huge pages, with page mask=HPAGE_MASK. After some discussion, it was decided not to adopt the HPAGE_MASK, as this would have a significant detrimental effect on the overall system latency due to page_table_lock being held for too long. This could be revisited if split huge page locks are adopted. Signed-off-by: Steve Capper <steve.capper@linaro.org> Reviewed-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2013-10-14 08:49:10 +00:00
		if (pte)
			pte_unmap_unlock(pte, ptl);
		else
			spin_unlock(ptl);
	}
	mmap_read_unlock(current->mm);

out:
	return n;
}

unsigned long arm_clear_user(void __user *addr, unsigned long n)
{
	/* See rationale for this in __copy_to_user() above. */
	if (n < 64) {
		unsigned long ua_flags = uaccess_save_and_enable();
		n = __clear_user_std(addr, n);
		uaccess_restore(ua_flags);
	} else {
		n = __clear_user_memset(addr, n);
	}
	return n;
}

#if 0

/*
 * This code is disabled by default, but kept around in case the chosen
 * thresholds need to be revalidated.  A runtime-determined variable
 * threshold would add some (small) overhead, and so far measurements on
 * the targets concerned have not shown a worthwhile variation.
 *
 * Note that a fairly precise sched_clock() implementation is needed
 * for the results to be meaningful.
 */

#include <linux/vmalloc.h>

static int __init test_size_treshold(void)
{
	struct page *src_page, *dst_page;
	void *user_ptr, *kernel_ptr;
	unsigned long long t0, t1, t2;
	int size, ret;

	ret = -ENOMEM;
	src_page = alloc_page(GFP_KERNEL);
	if (!src_page)
		goto no_src;
	dst_page = alloc_page(GFP_KERNEL);
	if (!dst_page)
		goto no_dst;
	kernel_ptr = page_address(src_page);
arm/mm: enable ARCH_HAS_VM_GET_PAGE_PROT This enables ARCH_HAS_VM_GET_PAGE_PROT on the platform and exports standard vm_get_page_prot() implementation via DECLARE_VM_GET_PAGE_PROT, which looks up a private and static protection_map[] array. Subsequently all __SXXX and __PXXX macros can be dropped which are no longer needed. Link: https://lkml.kernel.org/r/20220711070600.2378316-24-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Brian Cain <bcain@quicinc.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Rich Felker <dalias@libc.org> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-11 07:05:57 +00:00
	user_ptr = vmap(&dst_page, 1, VM_IOREMAP, __pgprot(__PAGE_COPY));
	if (!user_ptr)
		goto no_vmap;

	/* warm up the src page dcache */
	ret = __copy_to_user_memcpy(user_ptr, kernel_ptr, PAGE_SIZE);

	for (size = PAGE_SIZE; size >= 4; size /= 2) {
		t0 = sched_clock();
		ret |= __copy_to_user_memcpy(user_ptr, kernel_ptr, size);
		t1 = sched_clock();
		ret |= __copy_to_user_std(user_ptr, kernel_ptr, size);
		t2 = sched_clock();
		printk("copy_to_user: %d %llu %llu\n", size, t1 - t0, t2 - t1);
	}

	for (size = PAGE_SIZE; size >= 4; size /= 2) {
		t0 = sched_clock();
		ret |= __clear_user_memset(user_ptr, size);
		t1 = sched_clock();
		ret |= __clear_user_std(user_ptr, size);
		t2 = sched_clock();
		printk("clear_user: %d %llu %llu\n", size, t1 - t0, t2 - t1);
	}

	if (ret)
		ret = -EFAULT;

	vunmap(user_ptr);
no_vmap:
	put_page(dst_page);
no_dst:
	put_page(src_page);
no_src:
	return ret;
}

subsys_initcall(test_size_treshold);

#endif
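
The per-page chunking in __clear_user_memset() relies on the expression `(~(unsigned long)addr & ~PAGE_MASK) + 1`, which gives the number of bytes from addr up to the end of its page. The following is a minimal userspace sketch of that arithmetic, not kernel code. It assumes a 4 KiB page, and the names SKETCH_PAGE_SIZE, SKETCH_PAGE_MASK, bytes_to_page_end() and count_chunks() are hypothetical helpers introduced only for illustration.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages. The kernel takes PAGE_SIZE/PAGE_MASK from its headers. */
#define SKETCH_PAGE_SIZE 4096UL
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

/*
 * Bytes from addr up to and including the last byte of its page.
 * Same expression as the tocopy computation in the loop above.
 */
static unsigned long bytes_to_page_end(uintptr_t addr)
{
	return (~(unsigned long)addr & ~SKETCH_PAGE_MASK) + 1;
}

/*
 * Hypothetical helper: count how many per-page chunks the range
 * [addr, addr + n) is split into when it is walked the same way
 * as the while (n) loop above (without any pinning or locking).
 */
static unsigned long count_chunks(uintptr_t addr, unsigned long n)
{
	unsigned long chunks = 0;

	while (n) {
		unsigned long tocopy = bytes_to_page_end(addr);

		if (tocopy > n)
			tocopy = n;	/* final, partial chunk */
		addr += tocopy;
		n -= tocopy;
		chunks++;
	}
	return chunks;
}
```

With these assumptions, an 8-byte range starting 4 bytes before a page boundary is handled as two chunks of 4 bytes each, so no single memset in the loop ever crosses into a page that has not been pinned.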