Pull thermal management updates from Zhang Rui:
"We have a lot of SOC changes and a few thermal core fixes this time.
The biggest change is about exynos thermal driver restructure. The
patch set adds TMU (Thermal management Unit) driver support for
exynos5440 platform. There are 3 instances of the TMU controllers so
necessary cleanup/re-structure is done to handle multiple thermal
zone.
The next biggest change is the introduction of the imx thermal driver.
It adds the imx thermal support using Temperature Monitor (TEMPMON)
block found on some Freescale i.MX SoCs. The driver uses syscon
regmap interface to access TEMPMON control registers and calibration
data, and supports cpufreq as the cooling device.
Highlights:
- restructure exynos thermal driver.
- introduce new imx thermal driver.
- fix a bug in thermal core, which powers on the fans unexpectedly
after resume from suspend"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux: (46 commits)
drivers: thermal: add check when unregistering cpu cooling
thermal: thermal_core: allow binding with limits on bind_params
drivers: thermal: make usage of CONFIG_THERMAL_HWMON optional
drivers: thermal: parent virtual hwmon with thermal zone
thermal: hwmon: move hwmon support to single file
thermal: exynos: Clean up non-DT remnants
thermal: exynos: Fix potential NULL pointer dereference
thermal: exynos: Fix typos in Kconfig
thermal: ti-soc-thermal: Ensure to compute thermal trend
thermal: ti-soc-thermal: Set the bandgap mask counter delay value
thermal: ti-soc-thermal: Initialize counter_delay field for TI DRA752 sensors
thermal: step_wise: return instance->target by default
thermal: step_wise: cdev only needs update on a new target state
Thermal/cpu_cooling: Return directly for the cpu out of allowed_cpus in the cpufreq_thermal_notifier()
thermal: exynos_tmu: fix wrong error check for mapped memory
thermal: imx: implement thermal alarm interrupt handling
thermal: imx: dynamic passive and SoC specific critical trip points
Documentation: thermal: Explain the exynos thermal driver model
ARM: dts: thermal: exynos: Add documentation for Exynos SoC thermal bindings
thermal: exynos: Support for TMU regulator defined at device tree
...
Pull CIFS fixes from Steve French:
"CIFS update including case insensitive file name matching improvements
for UTF-8 to Unicode, various small cifs fixes, SMB2/SMB3 leasing
improvements, support for following SMB2 symlinks, SMB3 packet signing
improvements"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6: (25 commits)
CIFS: Respect epoch value from create lease context v2
CIFS: Add create lease v2 context for SMB3
CIFS: Move parsing lease buffer to ops struct
CIFS: Move creating lease buffer to ops struct
CIFS: Store lease state itself rather than a mapped oplock value
CIFS: Replace clientCanCache* bools with an integer
[CIFS] quiet sparse compile warning
cifs: Start using per session key for smb2/3 for signature generation
cifs: Add a variable specific to NTLMSSP for key exchange.
cifs: Process post session setup code in respective dialect functions.
CIFS: convert to use le32_add_cpu()
CIFS: Fix missing lease break
CIFS: Fix a memory leak when a lease break comes
cifs: add winucase_convert.pl to Documentation/ directory
cifs: convert case-insensitive dentry ops to use new case conversion routines
cifs: add new case-insensitive conversion routines that are based on wchar_t's
[CIFS] Add Scott to list of cifs contributors
cifs: Move and expand MAX_SERVER_SIZE definition
cifs: Expand max share name length to 256
cifs: Move string length definitions to uapi
...
Merge first patch-bomb from Andrew Morton:
- Some pidns/fork/exec tweaks
- OCFS2 updates
- Most of MM - there remain quite a few memcg parts which depend on
pending core cgroups changes. Which might have been already merged -
I'll check tomorrow...
- Various misc stuff all over the place
- A few block bits which I never got around to sending to Jens -
relatively minor things.
- MAINTAINERS maintenance
- A small number of lib/ updates
- checkpatch updates
- epoll
- firmware/dmi-scan
- Some kprobes work for S390
- drivers/rtc updates
- hfsplus feature work
- vmcore feature work
- rbtree upgrades
- AOE updates
- pktcdvd cleanups
- PPS
- memstick
- w1
- New "inittmpfs" feature, which does the obvious
- More IPC work from Davidlohr.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (303 commits)
lz4: fix compression/decompression signedness mismatch
ipc: drop ipc_lock_check
ipc, shm: drop shm_lock_check
ipc: drop ipc_lock_by_ptr
ipc, shm: guard against non-existant vma in shmdt(2)
ipc: document general ipc locking scheme
ipc,msg: drop msg_unlock
ipc: rename ids->rw_mutex
ipc,shm: shorten critical region for shmat
ipc,shm: cleanup do_shmat pasta
ipc,shm: shorten critical region for shmctl
ipc,shm: make shmctl_nolock lockless
ipc,shm: introduce shmctl_nolock
ipc: drop ipcctl_pre_down
ipc,shm: shorten critical region in shmctl_down
ipc,shm: introduce lockless functions to obtain the ipc object
initmpfs: use initramfs if rootfstype= or root= specified
initmpfs: make rootfs use tmpfs when CONFIG_TMPFS enabled
initmpfs: move rootfs code from fs/ramfs/ to init/
initmpfs: move bdi setup from init_rootfs to init_ramfs
...
LZ4 compression and decompression functions require different in
signedness input/output parameters: unsigned char for compression and
signed char for decompression.
Change decompression API to require "(const) unsigned char *".
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Kyungsik Lee <kyungsik.lee@lge.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Yann Collet <yann.collet.73@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since in some situations the lock can be shared for readers, we shouldn't
be calling it a mutex, rename it to rwsem.
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the rootfs code was a wrapper around ramfs, having them in the same
file made sense. Now that it can wrap another filesystem type, move it in
with the init code instead.
This also allows a subsequent patch to access rootfstype= command line
arg.
Signed-off-by: Rob Landley <rob@landley.net>
Cc: Jeff Layton <jlayton@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Stephen Warren <swarren@nvidia.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jim Cromie <jim.cromie@gmail.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is
one such possible user), the following race can happen:
radix_tree_preload()
...
radix_tree_insert()
radix_tree_node_alloc()
if (rtp->nr) {
ret = rtp->nodes[rtp->nr - 1];
<interrupt>
...
radix_tree_preload()
...
radix_tree_insert()
radix_tree_node_alloc()
if (rtp->nr) {
ret = rtp->nodes[rtp->nr - 1];
And we give out one radix tree node twice. That clearly results in radix
tree corruption with different results (usually OOPS) depending on which
two users of radix tree race.
We fix the problem by making radix_tree_node_alloc() always allocate fresh
radix tree nodes when in interrupt. Using preloading when in interrupt
doesn't make sense since all the allocations have to be atomic anyway and
we cannot steal nodes from process-context users because some users rely
on radix_tree_insert() succeeding after radix_tree_preload().
in_interrupt() check is somewhat ugly but we cannot simply key off passed
gfp_mask as that is acquired from root_gfp_mask() and thus the same for
all preload users.
Another part of the fix is to avoid node preallocation in
radix_tree_preload() when passed gfp_mask doesn't allow waiting. Again,
preallocation in such case doesn't make sense and when preallocation would
happen in interrupt we could possibly leak some allocated nodes. However,
some users of radix_tree_preload() require following radix_tree_insert()
to succeed. To avoid unexpected effects for these users,
radix_tree_preload() only warns if passed gfp mask doesn't allow waiting
and we provide a new function radix_tree_maybe_preload() for those users
which get different gfp mask from different call sites and which are
prepared to handle radix_tree_insert() failure.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Because deletion (of the entire tree) is a relatively common use of the
rbtree_postorder iteration, and because doing it safely means fiddling
with temporary storage, provide a helper to simplify postorder rbtree
iteration.
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Reviewed-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: David Woodhouse <David.Woodhouse@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Postorder iteration yields all of a node's children prior to yielding the
node itself, and this particular implementation also avoids examining the
leaf links in a node after that node has been yielded.
In what I expect will be its most common usage, postorder iteration allows
the deletion of every node in an rbtree without modifying the rbtree nodes
(no _requirement_ that they be nulled) while avoiding referencing child
nodes after they have been "deleted" (most commonly, freed).
I have only updated zswap to use this functionality at this point, but
numerous bits of code (most notably in the filesystem drivers) use a hand
rolled postorder iteration that NULLs child links as it traverses the
tree. Each of those instances could be replaced with this common
implementation.
1 & 2 add rbtree postorder iteration functions.
3 adds testing of the iteration to the rbtree runtime tests
4 allows building the rbtree runtime tests as builtins
5 updates zswap.
This patch:
Add postorder iteration functions for rbtree. These are useful for safely
freeing an entire rbtree without modifying the tree at all.
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Reviewed-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: David Woodhouse <David.Woodhouse@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For zfcpdump we can't map the HSA storage because it is only available via
a read interface. Therefore, for the new vmcore mmap feature we have
introduce a new mechanism to create mappings on demand.
This patch introduces a new architecture function remap_oldmem_pfn_range()
that should be used to create mappings with remap_pfn_range() for oldmem
areas that can be directly mapped. For zfcpdump this is everything
besides of the HSA memory. For the areas that are not mapped by
remap_oldmem_pfn_range() a generic vmcore a new generic vmcore fault
handler mmap_vmcore_fault() is called.
This handler works as follows:
* Get already available or new page from page cache (find_or_create_page)
* Check if /proc/vmcore page is filled with data (PageUptodate)
* If yes:
Return that page
* If no:
Fill page using __vmcore_read(), set PageUptodate, and return page
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Jan Willeke <willeke@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For s390 we want to use /proc/vmcore for our SCSI stand-alone dump
(zfcpdump). We have support where the first HSA_SIZE bytes are saved into
a hypervisor owned memory area (HSA) before the kdump kernel is booted.
When the kdump kernel starts, it is restricted to use only HSA_SIZE bytes.
The advantages of this mechanism are:
* No crashkernel memory has to be defined in the old kernel.
* Early boot problems (before kexec_load has been done) can be dumped
* Non-Linux systems can be dumped.
We modify the s390 copy_oldmem_page() function to read from the HSA memory
if memory below HSA_SIZE bytes is requested.
Since we cannot use the kexec tool to load the kernel in this scenario,
we have to build the ELF header in the 2nd (kdump/new) kernel.
So with the following patch set we would like to introduce the new
function that the ELF header for /proc/vmcore can be created in the 2nd
kernel memory.
The following steps are done during zfcpdump execution:
1. Production system crashes
2. User boots a SCSI disk that has been prepared with the zfcpdump tool
3. Hypervisor saves CPU state of boot CPU and HSA_SIZE bytes of memory into HSA
4. Boot loader loads kernel into low memory area
5. Kernel boots and uses only HSA_SIZE bytes of memory
6. Kernel saves registers of non-boot CPUs
7. Kernel does memory detection for dump memory map
8. Kernel creates ELF header for /proc/vmcore
9. /proc/vmcore uses this header for initialization
10. The zfcpdump user space reads /proc/vmcore to write dump to SCSI disk
- copy_oldmem_page() copies from HSA for memory below HSA_SIZE
- copy_oldmem_page() copies from real memory for memory above HSA_SIZE
Currently for s390 we create the ELF core header in the 2nd kernel with a
small trick. We relocate the addresses in the ELF header in a way that
for the /proc/vmcore code it seems to be in the 1st kernel (old) memory
and the read_from_oldmem() returns the correct data. This allows the
/proc/vmcore code to use the ELF header in the 2nd kernel.
This patch:
Exchange the old mechanism with the new and much cleaner function call
override feature that now offcially allows to create the ELF core header
in the 2nd kernel.
To use the new feature the following function have to be defined
by the architecture backend code to read from new memory:
* elfcorehdr_alloc: Allocate ELF header
* elfcorehdr_free: Free the memory of the ELF header
* elfcorehdr_read: Read from ELF header
* elfcorehdr_read_notes: Read from ELF notes
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Jan Willeke <willeke@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nobody except search_binary_handler() should touch ->recursion_depth, "int
depth" buys nothing but complicates the code, kill it.
Probably we should also kill "fn" and the !NULL check, ->load_binary
should be always defined. And it can not go away after read_unlock() or
this code is buggy anyway.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: Zach Levis <zml@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current two insn slot caches both use module_alloc/module_free to
allocate and free insn slot cache pages.
For s390 this is not sufficient since there is the need to allocate insn
slots that are either within the vmalloc module area or within dma memory.
Therefore add a mechanism which allows to specify an own allocator for an
own insn slot cache.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current kpropes insn caches allocate memory areas for insn slots
with module_alloc(). The assumption is that the kernel image and module
area are both within the same +/- 2GB memory area.
This however is not true for s390 where the kernel image resides within
the first 2GB (DMA memory area), but the module area is far away in the
vmalloc area, usually somewhere close below the 4TB area.
For new pc relative instructions s390 needs insn slots that are within
+/- 2GB of each area. That way we can patch displacements of
pc-relative instructions within the insn slots just like x86 and
powerpc.
The module area works already with the normal insn slot allocator,
however there is currently no way to get insn slots that are within the
first 2GB on s390 (aka DMA area).
Therefore this patch set modifies the kprobes insn slot cache code in
order to allow to specify a custom allocator for the insn slot cache
pages. In addition architecure can now have private insn slot caches
withhout the need to modify common code.
Patch 1 unifies and simplifies the current insn and optinsn caches
implementation. This is a preparation which allows to add more
insn caches in a simple way.
Patch 2 adds the possibility to specify a custom allocator.
Patch 3 makes s390 use the new insn slot mechanisms and adds support for
pc-relative instructions with long displacements.
This patch (of 3):
The two insn caches (insn, and optinsn) each have an own mutex and
alloc/free functions (get_[opt]insn_slot() / free_[opt]insn_slot()).
Since there is the need for yet another insn cache which satifies dma
allocations on s390, unify and simplify the current implementation:
- Move the per insn cache mutex into struct kprobe_insn_cache.
- Move the alloc/free functions to kprobe.h so they are simply
wrappers for the generic __get_insn_slot/__free_insn_slot functions.
The implementation is done with a DEFINE_INSN_CACHE_OPS() macro
which provides the alloc/free functions for each cache if needed.
- move the struct kprobe_insn_cache to kprobe.h which allows to generate
architecture specific insn slot caches outside of the core kprobes
code.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Unclutter -Wmissing-prototypes warning types (enabled at make W=1)
linux/include/linux/syscalls.h:190:18: warning: no previous prototype for 'SyS_semctl' [-Wmissing-prototypes]
asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__)) \
^
linux/include/linux/syscalls.h:183:2: note: in expansion of macro '__SYSCALL_DEFINEx'
__SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
^
by adding forward declarations right before definitions.
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All of the other non-trivial !SMP versions of functions in smp.h are
out-of-line in up.c. Move on_each_cpu() there as well.
This allows us to get rid of the #include <linux/irqflags.h>. The
drawback is that this makes both the x86_64 and i386 defconfig !SMP
kernels about 200 bytes larger each.
Signed-off-by: David Daney <david.daney@cavium.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As in commit f21afc25f9 ("smp.h: Use local_irq_{save,restore}() in
!SMP version of on_each_cpu()"), we don't want to enable irqs if they
are not already enabled. There are currently no known problematical
callers of these functions, but since it is a known failure pattern, we
preemptively fix them.
Since they are not trivial functions, make them non-inline by moving
them to up.c. This also makes it so we don't have to fix #include
dependancies for preempt_{disable,enable}.
Signed-off-by: David Daney <david.daney@cavium.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
compound lock is introduced by commit e9da73d67("thp: compound_lock."), it
is used to serialize put_page against __split_huge_page_refcount(). In
addition, transparent hugepages will be splitted in hwpoison handler and
just one subpage will be poisoned. There is unnecessary to hold compound
lock for hugetlbfs page. This patch replace compound_trans_order by
compond_order in the place where the page is hugetlbfs page.
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The feature prevents mistrusted filesystems (ie: FUSE mounts created by
unprivileged users) to grow a large number of dirty pages before
throttling. For such filesystems balance_dirty_pages always check bdi
counters against bdi limits. I.e. even if global "nr_dirty" is under
"freerun", it's not allowed to skip bdi checks. The only use case for now
is fuse: it sets bdi max_ratio to 1% by default and system administrators
are supposed to expect that this limit won't be exceeded.
The feature is on if a BDI is marked by BDI_CAP_STRICTLIMIT flag. A
filesystem may set the flag when it initializes its BDI.
The problematic scenario comes from the fact that nobody pays attention to
the NR_WRITEBACK_TEMP counter (i.e. number of pages under fuse
writeback). The implementation of fuse writeback releases original page
(by calling end_page_writeback) almost immediately. A fuse request queued
for real processing bears a copy of original page. Hence, if userspace
fuse daemon doesn't finalize write requests in timely manner, an
aggressive mmap writer can pollute virtually all memory by those temporary
fuse page copies. They are carefully accounted in NR_WRITEBACK_TEMP, but
nobody cares.
To make further explanations shorter, let me use "NR_WRITEBACK_TEMP
problem" as a shortcut for "a possibility of uncontrolled grow of amount
of RAM consumed by temporary pages allocated by kernel fuse to process
writeback".
The problem was very easy to reproduce. There is a trivial example
filesystem implementation in fuse userspace distribution: fusexmp_fh.c. I
added "sleep(1);" to the write methods, then recompiled and mounted it.
Then created a huge file on the mount point and run a simple program which
mmap-ed the file to a memory region, then wrote a data to the region. An
hour later I observed almost all RAM consumed by fuse writeback. Since
then some unrelated changes in kernel fuse made it more difficult to
reproduce, but it is still possible now.
Putting this theoretical happens-in-the-lab thing aside, there is another
thing that really hurts real world (FUSE) users. This is write-through
page cache policy FUSE currently uses. I.e. handling write(2), kernel
fuse populates page cache and flushes user data to the server
synchronously. This is excessively suboptimal. Pavel Emelyanov's patches
("writeback cache policy") solve the problem, but they also make resolving
NR_WRITEBACK_TEMP problem absolutely necessary. Otherwise, simply copying
a huge file to a fuse mount would result in memory starvation. Miklos,
the maintainer of FUSE, believes strictlimit feature the way to go.
And eventually putting FUSE topics aside, there is one more use-case for
strictlimit feature. Using a slow USB stick (mass storage) in a machine
with huge amount of RAM installed is a well-known pain. Let's make simple
computations. Assuming 64GB of RAM installed, existing implementation of
balance_dirty_pages will start throttling only after 9.6GB of RAM becomes
dirty (freerun == 15% of total RAM). So, the command "cp 9GB_file
/media/my-usb-storage/" may return in a few seconds, but subsequent
"umount /media/my-usb-storage/" will take more than two hours if effective
throughput of the storage is, to say, 1MB/sec.
After inclusion of strictlimit feature, it will be trivial to add a knob
(e.g. /sys/devices/virtual/bdi/x:y/strictlimit) to enable it on demand.
Manually or via udev rule. May be I'm wrong, but it seems to be quite a
natural desire to limit the amount of dirty memory for some devices we are
not fully trust (in the sense of sustainable throughput).
[akpm@linux-foundation.org: fix warning in page-writeback.c]
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's not used globally and could be static.
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is based on KOSAKI's work and I add a little more description,
please refer https://lkml.org/lkml/2012/6/14/74.
Currently, I found system can enter a state that there are lots of free
pages in a zone but only order-0 and order-1 pages which means the zone is
heavily fragmented, then high order allocation could make direct reclaim
path's long stall(ex, 60 seconds) especially in no swap and no compaciton
enviroment. This problem happened on v3.4, but it seems issue still lives
in current tree, the reason is do_try_to_free_pages enter live lock:
kswapd will go to sleep if the zones have been fully scanned and are still
not balanced. As kswapd thinks there's little point trying all over again
to avoid infinite loop. Instead it changes order from high-order to
0-order because kswapd think order-0 is the most important. Look at
73ce02e9 in detail. If watermarks are ok, kswapd will go back to sleep
and may leave zone->all_unreclaimable =3D 0. It assume high-order users
can still perform direct reclaim if they wish.
Direct reclaim continue to reclaim for a high order which is not a
COSTLY_ORDER without oom-killer until kswapd turn on
zone->all_unreclaimble= . This is because to avoid too early oom-kill.
So it means direct_reclaim depends on kswapd to break this loop.
In worst case, direct-reclaim may continue to page reclaim forever when
kswapd sleeps forever until someone like watchdog detect and finally kill
the process. As described in:
http://thread.gmane.org/gmane.linux.kernel.mm/103737
We can't turn on zone->all_unreclaimable from direct reclaim path because
direct reclaim path don't take any lock and this way is racy. Thus this
patch removes zone->all_unreclaimable field completely and recalculates
zone reclaimable state every time.
Note: we can't take the idea that direct-reclaim see zone->pages_scanned
directly and kswapd continue to use zone->all_unreclaimable. Because, it
is racy. commit 929bea7c71 (vmscan: all_unreclaimable() use
zone->all_unreclaimable as a name) describes the detail.
[akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com>
Cc: Ying Han <yinghan@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: Neil Zhang <zhangwm@marvell.com>
Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Lisa Du <cldu@marvell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently munlock_vma_pages_range() calls follow_page_mask() to obtain
each individual struct page. This entails repeated full page table
translations and page table lock taken for each page separately.
This patch avoids the costly follow_page_mask() where possible, by
iterating over ptes within single pmd under single page table lock. The
first pte is obtained by get_locked_pte() for non-THP page acquired by the
initial follow_page_mask(). The rest of the on-stack pagevec for munlock
is filled up using pte_walk as long as pte_present() and vm_normal_page()
are sufficient to obtain the struct page.
After this patch, a 14% speedup was measured for munlocking a 56GB large
memory area with THP disabled.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jörn Engel <joern@logfs.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michel Lespinasse <walken@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel reported that in case if vma area get unmapped and then mapped (or
expanded) in-place, the soft dirty tracker won't be able to recognize this
situation since it works on pte level and ptes are get zapped on unmap,
loosing soft dirty bit of course.
So to resolve this situation we need to track actions on vma level, there
VM_SOFTDIRTY flag comes in. When new vma area created (or old expanded)
we set this bit, and keep it here until application calls for clearing
soft dirty bit.
Thus when user space application track memory changes now it can detect if
vma area is renewed.
Reported-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Rob Landley <rob@landley.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Current early_pfn_to_nid() on arch that support memblock go over
memblock.memory one by one, so will take too many try near the end.
We can use existing memblock_search to find the node id for given pfn,
that could save some time on bigger system that have many entries
memblock.memory array.
Here are the timing differences for several machines. In each case with
the patch less time was spent in __early_pfn_to_nid().
3.11-rc5 with patch difference (%)
-------- ---------- --------------
UV1: 256 nodes 9TB: 411.66 402.47 -9.19 (2.23%)
UV2: 255 nodes 16TB: 1141.02 1138.12 -2.90 (0.25%)
UV2: 64 nodes 2TB: 128.15 126.53 -1.62 (1.26%)
UV2: 32 nodes 2TB: 121.87 121.07 -0.80 (0.66%)
Time in seconds.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Russ Anderson <rja@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently hugepage migration works well only for pmd-based hugepages
(mainly due to lack of testing,) so we had better not enable migration of
other levels of hugepages until we are ready for it.
Some users of hugepage migration (mbind, move_pages, and migrate_pages) do
page table walk and check pud/pmd_huge() there, so they are safe. But the
other users (softoffline and memory hotremove) don't do this, so without
this patch they can try to migrate unexpected types of hugepages.
To prevent this, we introduce hugepage_migration_support() as an
architecture dependent check of whether hugepage are implemented on a pmd
basis or not. And on some architecture multiple sizes of hugepages are
available, so hugepage_migration_support() also checks hugepage size.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Until now we can't offline memory blocks which contain hugepages because a
hugepage is considered as an unmovable page. But now with this patch
series, a hugepage has become movable, so by using hugepage migration we
can offline such memory blocks.
What's different from other users of hugepage migration is that we need to
decompose all the hugepages inside the target memory block into free buddy
pages after hugepage migration, because otherwise free hugepages remaining
in the memory block intervene the memory offlining. For this reason we
introduce new functions dissolve_free_huge_page() and
dissolve_free_huge_pages().
Other than that, what this patch does is straightforwardly to add hugepage
migration code, that is, adding hugepage code to the functions which scan
over pfn and collect hugepages to be migrated, and adding a hugepage
allocation function to alloc_migrate_target().
As for larger hugepages (1GB for x86_64), it's not easy to do hotremove
over them because it's larger than memory block. So we now simply leave
it to fail as it is.
[yongjun_wei@trendmicro.com.cn: remove duplicated include]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Extend do_mbind() to handle vma with VM_HUGETLB set. We will be able to
migrate hugepage with mbind(2) after applying the enablement patch which
comes later in this series.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently migrate_huge_page() takes a pointer to a hugepage to be migrated
as an argument, instead of taking a pointer to the list of hugepages to be
migrated. This behavior was introduced in commit 189ebff28 ("hugetlb:
simplify migrate_huge_page()"), and was OK because until now hugepage
migration is enabled only for soft-offlining which migrates only one
hugepage in a single call.
But the situation will change in the later patches in this series which
enable other users of page migration to support hugepage migration. They
can kick migration for both of normal pages and hugepages in a single
call, so we need to go back to original implementation which uses linked
lists to collect the hugepages to be migrated.
With this patch, soft_offline_huge_page() switches to use migrate_pages(),
and migrate_huge_page() is not used any more. So let's remove it.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently hugepage migration is available only for soft offlining, but
it's also useful for some other users of page migration (clearly because
users of hugepage can enjoy the benefit of mempolicy and memory hotplug.)
So this patchset tries to extend such users to support hugepage migration.
The target of this patchset is to enable hugepage migration for NUMA
related system calls (migrate_pages(2), move_pages(2), and mbind(2)), and
memory hotplug.
This patchset does not add hugepage migration for memory compaction,
because users of memory compaction mainly expect to construct thp by
arranging raw pages, and there's little or no need to compact hugepages.
CMA, another user of page migration, can have benefit from hugepage
migration, but is not enabled to support it for now (just because of lack
of testing and expertise in CMA.)
Hugepage migration of non pmd-based hugepage (for example 1GB hugepage in
x86_64, or hugepages in architectures like ia64) is not enabled for now
(again, because of lack of testing.)
As for how these are achived, I extended the API (migrate_pages()) to
handle hugepage (with patch 1 and 2) and adjusted code of each caller to
check and collect movable hugepages (with patch 3-7). Remaining 2 patches
are kind of miscellaneous ones to avoid unexpected behavior. Patch 8 is
about making sure that we only migrate pmd-based hugepages. And patch 9
is about choosing appropriate zone for hugepage allocation.
My test is mainly functional one, simply kicking hugepage migration via
each entry point and confirm that migration is done correctly. Test code
is available here:
git://github.com/Naoya-Horiguchi/test_hugepage_migration_extension.git
And I always run libhugetlbfs test when changing hugetlbfs's code. With
this patchset, no regression was found in the test.
This patch (of 9):
Before enabling each user of page migration to support hugepage,
this patch enables the list of pages for migration to link not only
LRU pages, but also hugepages. As a result, putback_movable_pages()
and migrate_pages() can handle both of LRU pages and hugepages.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In struct gen_pool_chunk, end_addr means the end address of memory chunk
(inclusive), but in the implementation it is treated as address + size of
memory chunk (exclusive), so it points to the address plus one instead of
correct ending address.
The ending address of memory chunk plus one will cause overflow on the
memory chunk including the last address of memory map, e.g. when starting
address is 0xFFF00000 and size is 0x100000 on 32bit machine, ending
address will be 0x100000000.
Use correct ending address like starting address + size - 1.
[akpm@linux-foundation.org: add comment to struct gen_pool_chunk:end_addr]
Signed-off-by: Joonyoung Shim <jy0922.shim@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The main idea behind this patchset is to reduce the vmstat update overhead
by avoiding interrupt enable/disable and the use of per cpu atomics.
This patch (of 3):
It is better to have a separate folding function because
refresh_cpu_vm_stats() also does other things like expire pages in the
page allocator caches.
If we have a separate function then refresh_cpu_vm_stats() is only called
from the local cpu which allows additional optimizations.
The folding function is only called when a cpu is being downed and
therefore no other processor will be accessing the counters. Also
simplifies synchronization.
[akpm@linux-foundation.org: fix UP build]
Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Tejun Heo <tj@kernel.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PageSwapCache() is always false when !CONFIG_SWAP, so compiler
properly discard related code. Therefore, we don't need #ifdef explicitly.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Each zone that holds userspace pages of one workload must be aged at a
speed proportional to the zone size. Otherwise, the time an individual
page gets to stay in memory depends on the zone it happened to be
allocated in. Asymmetry in the zone aging creates rather unpredictable
aging behavior and results in the wrong pages being reclaimed, activated
etc.
But exactly this happens right now because of the way the page allocator
and kswapd interact. The page allocator uses per-node lists of all zones
in the system, ordered by preference, when allocating a new page. When
the first iteration does not yield any results, kswapd is woken up and the
allocator retries. Due to the way kswapd reclaims zones below the high
watermark while a zone can be allocated from when it is above the low
watermark, the allocator may keep kswapd running while kswapd reclaim
ensures that the page allocator can keep allocating from the first zone in
the zonelist for extended periods of time. Meanwhile the other zones
rarely see new allocations and thus get aged much slower in comparison.
The result is that the occasional page placed in lower zones gets
relatively more time in memory, even gets promoted to the active list
after its peers have long been evicted. Meanwhile, the bulk of the
working set may be thrashing on the preferred zone even though there may
be significant amounts of memory available in the lower zones.
Even the most basic test -- repeatedly reading a file slightly bigger than
memory -- shows how broken the zone aging is. In this scenario, no single
page should be able stay in memory long enough to get referenced twice and
activated, but activation happens in spades:
$ grep active_file /proc/zoneinfo
nr_inactive_file 0
nr_active_file 0
nr_inactive_file 0
nr_active_file 8
nr_inactive_file 1582
nr_active_file 11994
$ cat data data data data >/dev/null
$ grep active_file /proc/zoneinfo
nr_inactive_file 0
nr_active_file 70
nr_inactive_file 258753
nr_active_file 443214
nr_inactive_file 149793
nr_active_file 12021
Fix this with a very simple round robin allocator. Each zone is allowed a
batch of allocations that is proportional to the zone's size, after which
it is treated as full. The batch counters are reset when all zones have
been tried and the allocator enters the slowpath and kicks off kswapd
reclaim. Allocation and reclaim is now fairly spread out to all
available/allowable zones:
$ grep active_file /proc/zoneinfo
nr_inactive_file 0
nr_active_file 0
nr_inactive_file 174
nr_active_file 4865
nr_inactive_file 53
nr_active_file 860
$ cat data data data data >/dev/null
$ grep active_file /proc/zoneinfo
nr_inactive_file 0
nr_active_file 0
nr_inactive_file 666622
nr_active_file 4988
nr_inactive_file 190969
nr_active_file 937
When zone_reclaim_mode is enabled, allocations will now spread out to all
zones on the local node, not just the first preferred zone (which on a 4G
node might be a tiny Normal zone).
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Paul Bolle <paul.bollee@gmail.com>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Tested-by: Kevin Hilman <khilman@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the current code, the value of fallback_migratetype that is printed
using the mm_page_alloc_extfrag tracepoint, is the value of the
migratetype *after* it has been set to the preferred migratetype (if the
ownership was changed). Obviously that wouldn't have been the original
intent. (We already have a separate 'change_ownership' field to tell
whether the ownership of the pageblock was changed from the
fallback_migratetype to the preferred type.)
The intent of the fallback_migratetype field is to show the migratetype
from which we borrowed pages in order to satisfy the allocation request.
So fix the code to print that value correctly.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
swap cluster allocation is to get better request merge to improve
performance. But the cluster is shared globally, if multiple tasks are
doing swap, this will cause interleave disk access. While multiple tasks
swap is quite common, for example, each numa node has a kswapd thread
doing swap and multiple threads/processes doing direct page reclaim.
ioscheduler can't help too much here, because tasks don't send swapout IO
down to block layer in the meantime. Block layer does merge some IOs, but
a lot not, depending on how many tasks are doing swapout concurrently. In
practice, I've seen a lot of small size IO in swapout workloads.
We makes the cluster allocation per-cpu here. The interleave disk access
issue goes away. All tasks swapout to their own cluster, so swapout will
become sequential, which can be easily merged to big size IO. If one CPU
can't get its per-cpu cluster (for example, there is no free cluster
anymore in the swap), it will fallback to scan swap_map. The CPU can
still continue swap. We don't need recycle free swap entries of other
CPUs.
In my test (swap to a 2-disk raid0 partition), this improves around 10%
swapout throughput, and request size is increased significantly.
How does this impact swap readahead is uncertain though. On one side,
page reclaim always isolates and swaps several adjancent pages, this will
make page reclaim write the pages sequentially and benefit readahead. On
the other side, several CPU write pages interleave means the pages don't
live _sequentially_ but relatively _near_. In the per-cpu allocation
case, if adjancent pages are written by different cpus, they will live
relatively _far_. So how this impacts swap readahead depends on how many
pages page reclaim isolates and swaps one time. If the number is big,
this patch will benefit swap readahead. Of course, this is about
sequential access pattern. The patch has no impact for random access
pattern, because the new cluster allocation algorithm is just for SSD.
Alternative solution is organizing swap layout to be per-mm instead of
this per-cpu approach. In the per-mm layout, we allocate a disk range for
each mm, so pages of one mm live in swap disk adjacently. per-mm layout
has potential issues of lock contention if multiple reclaimers are swap
pages from one mm. For a sequential workload, per-mm layout is better to
implement swap readahead, because pages from the mm are adjacent in disk.
But per-cpu layout isn't very bad in this workload, as page reclaim always
isolates and swaps several pages one time, such pages will still live in
disk sequentially and readahead can utilize this. For a random workload,
per-mm layout isn't beneficial of request merge, because it's quite
possible pages from different mm are swapout in the meantime and IO can't
be merged in per-mm layout. while with per-cpu layout we can merge
requests from any mm. Considering random workload is more popular in
workloads with swap (and per-cpu approach isn't too bad for sequential
workload too), I'm choosing per-cpu layout.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Shaohua Li <shli@fusionio.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kyungmin Park <kmpark@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
swap can do cluster discard for SSD, which is good, but there are some
problems here:
1. swap do the discard just before page reclaim gets a swap entry and
writes the disk sectors. This is useless for high end SSD, because an
overwrite to a sector implies a discard to original sector too. A
discard + overwrite == overwrite.
2. the purpose of doing discard is to improve SSD firmware garbage
collection. Idealy we should send discard as early as possible, so
firmware can do something smart. Sending discard just after swap entry
is freed is considered early compared to sending discard before write.
Of course, if workload is already bound to gc speed, sending discard
earlier or later doesn't make
3. block discard is a sync API, which will delay scan_swap_map()
significantly.
4. Write and discard command can be executed parallel in PCIe SSD.
Making swap discard async can make execution more efficiently.
This patch makes swap discard async and moves discard to where swap entry
is freed. Discard and write have no dependence now, so above issues can
be avoided. Idealy we should do discard for any freed sectors, but some
SSD discard is very slow. This patch still does discard for a whole
cluster.
My test does a several round of 'mmap, write, unmap', which will trigger a
lot of swap discard. In a fusionio card, with this patch, the test
runtime is reduced to 18% of the time without it, so around 5.5x faster.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Shaohua Li <shli@fusionio.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kyungmin Park <kmpark@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I'm using a fast SSD to do swap. scan_swap_map() sometimes uses up to
20~30% CPU time (when cluster is hard to find, the CPU time can be up to
80%), which becomes a bottleneck. scan_swap_map() scans a byte array to
search a 256 page cluster, which is very slow.
Here I introduced a simple algorithm to search cluster. Since we only
care about 256 pages cluster, we can just use a counter to track if a
cluster is free. Every 256 pages use one int to store the counter. If
the counter of a cluster is 0, the cluster is free. All free clusters
will be added to a list, so searching cluster is very efficient. With
this, scap_swap_map() overhead disappears.
This might help low end SD card swap too. Because if the cluster is
aligned, SD firmware can do flash erase more efficiently.
We only enable the algorithm for SSD. Hard disk swap isn't fast enough
and has downside with the algorithm which might introduce regression (see
below).
The patch slightly changes which cluster is choosen. It always adds free
cluster to list tail. This can help wear leveling for low end SSD too.
And if no cluster found, the scan_swap_map() will do search from the end
of last cluster. So if no cluster found, the scan_swap_map() will do
search from the end of last free cluster, which is random. For SSD, this
isn't a problem at all.
Another downside is the cluster must be aligned to 256 pages, which will
reduce the chance to find a cluster. I would expect this isn't a big
problem for SSD because of the non-seek penality. (And this is the reason
I only enable the algorithm for SSD).
Signed-off-by: Shaohua Li <shli@fusionio.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kyungmin Park <kmpark@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The previous patch doing vmstats for TLB flushes ("mm: vmstats: tlb flush
counters") effectively missed UP since arch/x86/mm/tlb.c is only compiled
for SMP.
UP systems do not do remote TLB flushes, so compile those counters out on
UP.
arch/x86/kernel/cpu/mtrr/generic.c calls __flush_tlb() directly. This is
probably an optimization since both the mtrr code and __flush_tlb() write
cr4. It would probably be safe to make that a flush_tlb_all() (and then
get these statistics), but the mtrr code is ancient and I'm hesitant to
touch it other than to just stick in the counters.
[akpm@linux-foundation.org: tweak comments]
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I was investigating some TLB flush scaling issues and realized that we do
not have any good methods for figuring out how many TLB flushes we are
doing.
It would be nice to be able to do these in generic code, but the
arch-independent calls don't explicitly specify whether we actually need
to do remote flushes or not. In the end, we really need to know if we
actually _did_ global vs. local invalidations, so that leaves us with few
options other than to muck with the counters from arch-specific code.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Simple cleanup. Every user of vma_set_policy() does the same work, this
looks a bit annoying imho. And the new trivial helper which does
mpol_dup() + vma_set_policy() to simplify the callers.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Read block device partition table from command line. The partition used
for fixed block device (eMMC) embedded device. It is no MBR, save
storage space. Bootloader can be easily accessed by absolute address of
data on the block device. Users can easily change the partition.
This code reference MTD partition, source "drivers/mtd/cmdlinepart.c"
About the partition verbose reference
"Documentation/block/cmdline-partition.txt"
[akpm@linux-foundation.org: fix printk text]
[yongjun_wei@trendmicro.com.cn: fix error return code in parse_parts()]
Signed-off-by: Cai Zhiyong <caizhiyong@huawei.com>
Cc: Karel Zak <kzak@redhat.com>
Cc: "Wanglin (Albert)" <albert.wanglin@huawei.com>
Cc: Marius Groeger <mag@sysgo.de>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Brian Norris <computersforpeace@gmail.com>
Cc: Artem Bityutskiy <dedekind@infradead.org>
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
task_struct->pid/tgid should go away.
1. Change same_thread_group() to use task->signal for comparison.
2. Change has_group_leader_pid(task) to compare task_pid(task) with
signal->leader_pid.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Sergey Dyasly <dserrg@gmail.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Revert commit c846ef7deb ("include/linux/smp.h:on_each_cpu(): switch
back to a macro"). It turns out that the problematic linux/irqflags.h
include was fixed within ia64 and mn10300.
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: David Daney <david.daney@cavium.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking fixes from David Miller:
1) Brown paper bag fix in HTB scheduler, class options set incorrectly
due to a typoe. Fix from Vimalkumar.
2) It's possible for the ipv6 FIB garbage collector to run before all
the necessary datastructure are setup during init, defer the
notifier registry to avoid this problem. Fix from Michal Kubecek.
3) New i40e ethernet driver from the Intel folks.
4) Add new qmi wwan device IDs, from Bjørn Mork.
5) Doorbell lock in bnx2x driver is not initialized properly in some
configurations, fix from Ariel Elior.
6) Revert an ipv6 packet option padding change that broke standardized
ipv6 implementation test suites. From Jiri Pirko.
7) Fix synchronization of ARP information in bonding layer, from
Nikolay Aleksandrov.
8) Fix missing error return resulting in illegal memory accesses in
openvswitch, from Daniel Borkmann.
9) SCTP doesn't signal poll events properly due to mistaken operator
precedence, fix also from Daniel Borkmann.
10) __netdev_pick_tx() passes wrong index to sk_tx_queue_set() which
essentially disables caching of TX queue in sockets :-/ Fix from
Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (29 commits)
net_sched: htb: fix a typo in htb_change_class()
net: qmi_wwan: add new Qualcomm devices
ipv6: don't call fib6_run_gc() until routing is ready
net: tilegx driver: avoid compiler warning
fib6_rules: fix indentation
irda: vlsi_ir: Remove casting the return value which is a void pointer
irda: donauboe: Remove casting the return value which is a void pointer
net: fix multiqueue selection
net: sctp: fix smatch warning in sctp_send_asconf_del_ip
net: sctp: fix bug in sctp_poll for SOCK_SELECT_ERR_QUEUE
net: fib: fib6_add: fix potential NULL pointer dereference
net: ovs: flow: fix potential illegal memory access in __parse_flow_nlattrs
bcm63xx_enet: remove deprecated IRQF_DISABLED
net: korina: remove deprecated IRQF_DISABLED
macvlan: Move skb_clone check closer to call
qlcnic: Fix warning reported by kbuild test robot.
bonding: fix bond_arp_rcv setting and arp validate desync state
bonding: fix store_arp_validate race with mode change
ipv6/exthdrs: accept tlv which includes only padding
bnx2x: avoid atomic allocations during initialization
...
When loading the ipv6 module, ndisc_init() is called before
ip6_route_init(). As the former registers a handler calling
fib6_run_gc(), this opens a window to run the garbage collector
before necessary data structures are initialized. If a network
device is initialized in this window, adding MAC address to it
triggers a NETDEV_CHANGEADDR event, leading to a crash in
fib6_clean_all().
Take the event handler registration out of ndisc_init() into a
separate function ndisc_late_init() and move it after
ip6_route_init().
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
The first fixes namespace issues which causes a kernel
NULL pointer dereference, the second fixes uevent
handling to work better with udev, and the third
switches some code to use srlcpy instead of strncpy
in order to be safer.
All changes have been baking in for-next for at least
2 weeks.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
Comment: GPGTools - http://gpgtools.org
iQIcBAABAgAGBQJSMJjZAAoJEDZk62b0Tg6x81sQAKa60QStBKhnL65bvG+ooIsS
mhwfmFyaWOKw1ezwY2Vk0+JnmKDBpKmqjjwyL3nLP18TcRZStPiFdcJBKWl+czge
FTv14t54CcjysYPbYN7+gUap4F5mfg0mcHaR0UGow505dNyjwd7mqkZhy1IqhdvP
Ue/h0RE46GeNtdirxrKBdEfW/7TAL0tcoRgjKu0ev1V2sXCJZywuXgkzWjByRXwT
JOg04gGnYThuek0/KUPRhf0KxB0CyKrZiics7LGb40HkYYxs7ahADACttLyiDr8l
GntfHXLgvVlU5QcSbKRfLp0zNbi7AxWmJrwYsEwpas4tUw1Q+pVJ2EE2Ameuq5G+
LrMGmRVQCVYw8UN+OYUO7glhXEJcCPJj6vxgm+NVXx24yaQyGI1aTsIEjHwZ/hkm
wlQHC47z6/fIypkXpsU6pYWF/r3GwXHokYReejATQWEPIzIxvHeThe0jjqMLth7F
zmsHZTpmECqtti1fizy5wBZD25wAIxdf+rf8nKy1VvcSN4s08ESSlC/kV/siNeko
efFnL8xbjP5SPEVoBtXM6eTDHrQ0S+ACSGWtp0FGXKOW4PKzS60ve2Stp+FYZgQc
WgXI7+NBU6Z9z+cZ9bsY0hrGwK1YZiR4F3KJ5ofTuxAO6n7zd+N3fGBuQJ2tiW9P
pKtIXNozWqnAU9Wx4rGa
=YbFT
-----END PGP SIGNATURE-----
Merge tag 'for-linus-3.12-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs
Pull 9p updates from Eric Van Hensbergen:
"Minor 9p fixes and tweaks for 3.12 merge window
The first fixes namespace issues which causes a kernel NULL pointer
dereference, the second fixes uevent handling to work better with
udev, and the third switches some code to use srlcpy instead of
strncpy in order to be safer.
All changes have been baking in for-next for at least 2 weeks"
* tag 'for-linus-3.12-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
fs/9p: avoid accessing utsname after namespace has been torn down
9p: send uevent after adding/removing mount_tag attribute
fs: 9p: use strlcpy instead of strncpy