With the goal of allowing tmem backends (zcache, ramster, Xen tmem) to
be built/loaded as modules rather than built-in and enabled by a boot
parameter, this patch provides "lazy initialization", allowing backends
to register to frontswap even after swapon was run. Before a backend
registers all calls to init are recorded and the creation of tmem_pools
delayed until a backend registers or until a frontswap store is
attempted.
Signed-off-by: Stefan Hengelein <ilendir@googlemail.com>
Signed-off-by: Florian Schmaus <fschmaus@gmail.com>
Signed-off-by: Andor Daam <andor.daam@googlemail.com>
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
[v1: Fixes per Seth Jennings suggestions]
[v2: Removed FRONTSWAP_HAS_.. ]
[v3: Fix up per Bob Liu <lliubbo@gmail.com> recommendations]
[v4: Fix up per Andrew's comments]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Bob Liu <lliubbo@gmail.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge second batch of fixes from Andrew Morton:
- various misc bits
- some printk updates
- a new "SRAM" driver.
- MAINTAINERS updates
- the backlight driver queue
- checkpatch updates
- a few init/ changes
- a huge number of drivers/rtc changes
- fatfs updates
- some lib/idr.c work
- some renaming of the random driver interfaces
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (285 commits)
net: rename random32 to prandom
net/core: remove duplicate statements by do-while loop
net/core: rename random32() to prandom_u32()
net/netfilter: rename random32() to prandom_u32()
net/sched: rename random32() to prandom_u32()
net/sunrpc: rename random32() to prandom_u32()
scsi: rename random32() to prandom_u32()
lguest: rename random32() to prandom_u32()
uwb: rename random32() to prandom_u32()
video/uvesafb: rename random32() to prandom_u32()
mmc: rename random32() to prandom_u32()
drbd: rename random32() to prandom_u32()
kernel/: rename random32() to prandom_u32()
mm/: rename random32() to prandom_u32()
lib/: rename random32() to prandom_u32()
x86: rename random32() to prandom_u32()
x86: pageattr-test: remove srandom32 call
uuid: use prandom_bytes()
raid6test: use prandom_bytes()
sctp: convert sctp_assoc_set_id() to use idr_alloc_cyclic()
...
Pull cgroup updates from Tejun Heo:
- Fixes and a lot of cleanups. Locking cleanup is finally complete.
cgroup_mutex is no longer exposed to individual controlelrs which
used to cause nasty deadlock issues. Li fixed and cleaned up quite a
bit including long standing ones like racy cgroup_path().
- device cgroup now supports proper hierarchy thanks to Aristeu.
- perf_event cgroup now supports proper hierarchy.
- A new mount option "__DEVEL__sane_behavior" is added. As indicated
by the name, this option is to be used for development only at this
point and generates a warning message when used. Unfortunately,
cgroup interface currently has too many brekages and inconsistencies
to implement a consistent and unified hierarchy on top. The new flag
is used to collect the behavior changes which are necessary to
implement consistent unified hierarchy. It's likely that this flag
won't be used verbatim when it becomes ready but will be enabled
implicitly along with unified hierarchy.
The option currently disables some of broken behaviors in cgroup core
and also .use_hierarchy switch in memcg (will be routed through -mm),
which can be used to make very unusual hierarchy where nesting is
partially honored. It will also be used to implement hierarchy
support for blk-throttle which would be impossible otherwise without
introducing a full separate set of control knobs.
This is essentially versioning of interface which isn't very nice but
at this point I can't see any other options which would allow keeping
the interface the same while moving towards hierarchy behavior which
is at least somewhat sane. The planned unified hierarchy is likely
to require some level of adaptation from userland anyway, so I think
it'd be best to take the chance and update the interface such that
it's supportable in the long term.
Maintaining the existing interface does complicate cgroup core but
shouldn't put too much strain on individual controllers and I think
it'd be manageable for the foreseeable future. Maybe we'll be able
to drop it in a decade.
Fix up conflicts (including a semantic one adding a new #include to ppc
that was uncovered by header the file changes) as per Tejun.
* 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (45 commits)
cpuset: fix compile warning when CONFIG_SMP=n
cpuset: fix cpu hotplug vs rebuild_sched_domains() race
cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()
cgroup: restore the call to eventfd->poll()
cgroup: fix use-after-free when umounting cgroupfs
cgroup: fix broken file xattrs
devcg: remove parent_cgroup.
memcg: force use_hierarchy if sane_behavior
cgroup: remove cgrp->top_cgroup
cgroup: introduce sane_behavior mount option
move cgroupfs_root to include/linux/cgroup.h
cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix
cgroup: make cgroup_path() not print double slashes
Revert "cgroup: remove bind() method from cgroup_subsys."
perf: make perf_event cgroup hierarchical
cgroup: implement cgroup_is_descendant()
cgroup: make sure parent won't be destroyed before its children
cgroup: remove bind() method from cgroup_subsys.
devcg: remove broken_hierarchy tag
cgroup: remove cgroup_lock_is_held()
...
Use preferable function name which implies using a pseudo-random
number generator.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The memcg is not referenced, so it can be destroyed at anytime right
after we exit rcu read section, so it's not safe to access it.
To fix this, we call css_tryget() to get a reference while we're still
in rcu read section.
This also removes a bogus comment above __memcg_create_cache_enqueue().
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are times when HIGHMEM is enabled, but we don't prefer
CONFIG_BOUNCE to be enabled. CONFIG_BOUNCE can reduce the block device
throughput, and this is not ideal for machines where we don't gain much
by enabling it. So provide an option to deselect CONFIG_BOUNCE. The
observation was made while measuring eMMC throughput using iozone on an
ARM device with 1GB RAM.
Signed-off-by: Vinayak Menon <vinayakm.list@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, we do memset() before reserving the area. This may not cause
any problem, but it is somewhat weird. So change execution order.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jiang Liu <liuj97@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove unused argument and make function static, because there is no user
outside of nobootmem.c
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jiang Liu <liuj97@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PFN_PHYS() is a phys_addr_t, which can be u32 or u64.
Fix the build warning when phys_addr_t is u32.
mm/memory_hotplug.c: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 2 has type 'unsigned int' [-Wformat]: => 1685:3
mm/memory_hotplug.c: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 3 has type 'unsigned int' [-Wformat]: => 1685:3
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As pointed out by Andrew Morton, the swap-over-NFS writeback is not
setting PageWriteback before it is queued for direct IO. While swap
pages do not participate in BDI or process dirty accounting and the IO
is synchronous, the writeback bit is still required and not setting it
in this case was an oversight. swapoff depends on the page writeback to
synchronoise all pending writes on a swap page before it is reused.
Swapcache freeing and reuse depend on checking the PageWriteback under
lock to ensure the page is safe to reuse.
Direct IO handlers and the direct IO handler for NFS do not deal with
PageWriteback as they are synchronous writes. In the case of NFS, it
schedules pages (or a page in the case of swap) for IO and then waits
synchronously for IO to complete in nfs_direct_write(). It is
recognised that this is a slowdown from normal swap handling which is
asynchronous and uses a completion handler. Shoving PageWriteback
handling down into direct IO handlers looks like a bad fit to handle the
swap case although it may have to be dealt with some day if swap is
converted to use direct IO in general and bmap is finally done away
with. At that point it will be necessary to refit asynchronous direct
IO with completion handlers onto the swap subsystem.
As swapcache currently depends on PageWriteback to protect against
races, this patch sets PageWriteback under the page lock before queueing
it for direct IO. It is cleared when the direct IO handler returns. IO
errors are treated similarly to the direct-to-bio case except PageError
is not set as in the case of swap-over-NFS, it is likely to be a
transient error.
It was asked what prevents such a page being reclaimed in parallel.
With this patch applied, such a page will now be skipped (most of the
time) or blocked until the writeback completes. Reclaim checks
PageWriteback under the page lock before calling try_to_free_swap and
the page lock should prevent the page being requeued for IO before it is
freed.
This and Jerome's related patch should considered for -stable as far
back as 3.6 when swap-over-NFS was introduced.
[akpm@linux-foundation.org: use pr_err_ratelimited()]
[akpm@linux-foundation.org: remove hopefully-unneeded cast in printk]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org> [3.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 62c230bc17 ("mm: add support for a filesystem to activate
swap files and use direct_IO for writing swap pages"), swap_writepage()
calls direct_IO on swap files. However, in that case the page isn't
redirtied if I/O fails, and is therefore handled afterwards as if it has
been successfully written to the swap file, leading to memory corruption
when the page is eventually swapped back in.
This patch sets the page dirty when direct_IO() fails. It fixes a
memory corruption that happened while using swap-over-NFS.
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org> [3.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A memcg may livelock when oom if the process that grabs the hierarchy's
oom lock is never the first process with PF_EXITING set in the memcg's
task iteration.
The oom killer, both global and memcg, will defer if it finds an
eligible process that is in the process of exiting and it is not being
ptraced. The idea is to allow it to exit without using memory reserves
before needlessly killing another process.
This normally works fine except in the memcg case with a large number of
threads attached to the oom memcg. In this case, the memcg oom killer
only gets called for the process that grabs the hierarchy's oom lock;
all others end up blocked on the memcg's oom waitqueue. Thus, if the
process that grabs the hierarchy's oom lock is never the first
PF_EXITING process in the memcg's task iteration, the oom killer is
constantly deferred without anything making progress.
The fix is to give PF_EXITING processes access to memory reserves so
that we've marked them as oom killed without any iteration. This allows
__mem_cgroup_try_charge() to succeed so that the process may exit. This
makes the memcg oom killer exemption for TIF_MEMDIE tasks, now
immediately granted for processes with pending SIGKILLs and those in the
exit path, to be equivalent to what is done for the global oom killer.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Current implementation of huge zero page uses pfn value 0 to indicate
that the page hasn't allocated yet. It assumes that buddy page
allocator can't return page with pfn == 0.
Let's rework the code to store 'struct page *' of huge zero page, not
its pfn. This way we can avoid the weak assumption.
[akpm@linux-foundation.org: fix sparse warning]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Minchan Kim <minchan@kernel.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This might cause a use-after-free bug.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Cc: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are two convenient ways to report errors to userspace
1) retun error to original syscall for example write(2)
2) mark mapping with error flag and return it on later fsync(2)
Second one is broken if (mapping->nrpages == 0) This is real-life
situation because after error pages are likey to be truncated or
invalidated.
We have to return an error regardless to number of pages in the mapping.
#Original testcase: git@github.com:dmonakhov/xfstests.git
MOUNT_OPTIONS="-b1024"
./check shared/305
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no comment for parameter nid of memblock_insert_region().
This patch adds comment for it.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In page reclaim, huge page is split. split_huge_page() adds tail pages
to LRU list. Since we are reclaiming a huge page, it's better we
reclaim all subpages of the huge page instead of just the head page.
This patch adds split tail pages to shrink page list so the tail pages
can be reclaimed soon.
Before this patch, run a swap workload:
thp_fault_alloc 3492
thp_fault_fallback 608
thp_collapse_alloc 6
thp_collapse_alloc_failed 0
thp_split 916
With this patch:
thp_fault_alloc 4085
thp_fault_fallback 16
thp_collapse_alloc 90
thp_collapse_alloc_failed 0
thp_split 1272
fallback allocation is reduced a lot.
[akpm@linux-foundation.org: fix CONFIG_SWAP=n build]
Signed-off-by: Shaohua Li <shli@fusionio.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To prevent flooding the swap device with writebacks, frontswap backends
need to count and limit the number of outstanding writebacks. The
incrementing of the counter can be done before the call to
__swap_writepage(). However, the caller must receive a notification
when the writeback completes in order to decrement the counter.
To achieve this functionality, this patch modifies __swap_writepage() to
take the bio completion callback function as an argument.
end_swap_bio_write(), the normal bio completion function, is also made
non-static so that code doing the accounting can call it after the
accounting is done.
There should be no behavioural change to existing code.
Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
swap_writepage() is currently where frontswap hooks into the swap write
path to capture pages with the frontswap_store() function. However, if
a frontswap backend wants to "resume" the writeback of a page to the
swap device, it can't call swap_writepage() as the page will simply
reenter the backend.
This patch separates swap_writepage() into a top and bottom half, the
bottom half named __swap_writepage() to allow a frontswap backend, like
zswap, to resume writeback beyond the frontswap_store() hook.
__add_to_swap_cache() is also made non-static so that the page for which
writeback is to be resumed can be added to the swap cache.
Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a corner case for MAP_FIXED when requested mapping length is larger
than rlimit for virtual memory. In such case any overlapping mappings
are unmapped before we check for the limit and return ENOMEM.
The check is moved before the loop that unmaps overlapping parts of
existing mappings. When we are about to hit the limit (currently mapped
pages + len > limit) we scan for overlapping pages and check again
accounting for them.
This fixes situation when userspace program expects that the previous
mappings are preserved after the mmap() syscall has returned with error.
(POSIX clearly states that successfull mapping shall replace any
previous mappings.)
This corner case was found and can be tested with LTP testcase:
testcases/open_posix_testsuite/conformance/interfaces/mmap/24-2.c
In this case the mmap, which is clearly over current limit, unmaps
dynamic libraries and the testcase segfaults right after returning into
userspace.
I've also looked at the second instance of the unmapping loop in the
do_brk(). The do_brk() is called from brk() syscall and from vm_brk().
The brk() syscall checks for overlapping mappings and bails out when
there are any (so it can't be triggered from the brk syscall). The
vm_brk() is called only from binmft handlers so it shouldn't be
triggered unless binmft handler created overlapping mappings.
Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
Reviewed-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With this patch userland applications that want to maintain the
interactivity/memory allocation cost can use the pressure level
notifications. The levels are defined like this:
The "low" level means that the system is reclaiming memory for new
allocations. Monitoring this reclaiming activity might be useful for
maintaining cache level. Upon notification, the program (typically
"Activity Manager") might analyze vmstat and act in advance (i.e.
prematurely shutdown unimportant services).
The "medium" level means that the system is experiencing medium memory
pressure, the system might be making swap, paging out active file
caches, etc. Upon this event applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from a disk.
The "critical" level means that the system is actively thrashing, it is
about to out of memory (OOM) or even the in-kernel OOM killer is on its
way to trigger. Applications should do whatever they can to help the
system. It might be too late to consult with vmstat or any other
statistics, so it's advisable to take an immediate action.
The events are propagated upward until the event is handled, i.e. the
events are not pass-through. Here is what this means: for example you
have three cgroups: A->B->C. Now you set up an event listener on
cgroups A, B and C, and suppose group C experiences some pressure. In
this situation, only group C will receive the notification, i.e. groups
A and B will not receive it. This is done to avoid excessive
"broadcasting" of messages, which disturbs the system and which is
especially bad if we are low on memory or thrashing. So, organize the
cgroups wisely, or propagate the events manually (or, ask us to
implement the pass-through events, explaining why would you need them.)
Performance wise, the memory pressure notifications feature itself is
lightweight and does not require much of bookkeeping, in contrast to the
rest of memcg features. Unfortunately, as of current memcg
implementation, pages accounting is an inseparable part and cannot be
turned off. The good news is that there are some efforts[1] to improve
the situation; plus, implementing the same, fully API-compatible[2]
interface for CONFIG_MEMCG=n case (e.g. embedded) is also a viable
option, so it will not require any changes on the userland side.
[1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
[2] http://lkml.org/lkml/2013/2/21/454
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix CONFIG_CGROPUPS=n warnings]
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Leonid Moiseichuk <leonid.moiseichuk@nokia.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In madvise(), there doesn't seem to be any reason for taking the
¤t->mm->mmap_sem before start and len_in have been validated.
Incidentally, this removes the need for the out: label.
[akpm@linux-foundation.org: s/out_plug/out/, per David]
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__remove_pages() is only necessary for CONFIG_MEMORY_HOTREMOVE. PowerPC
pseries will return -EOPNOTSUPP if unsupported.
Adding an #ifdef causes several other functions it depends on to also
become unnecessary, which saves in .text when disabled (it's disabled in
most defconfigs besides powerpc, including x86). remove_memory_block()
becomes static since it is not referenced outside of
drivers/base/memory.c.
Build tested on x86 and powerpc with CONFIG_MEMORY_HOTREMOVE both enabled
and disabled.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change __remove_pages() to call release_mem_region_adjustable(). This
allows a requested memory range to be released from the iomem_resource
table even if it does not match exactly to an resource entry but still
fits into. The resource entries initialized at bootup usually cover the
whole contiguous memory ranges and may not necessarily match with the
size of memory hot-delete requests.
If release_mem_region_adjustable() failed, __remove_pages() emits a
warning message and continues to proceed as it was the case with
release_mem_region(). release_mem_region(), which is defined to
__release_region(), emits a warning message and returns no error since a
void function.
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by : Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: T Makphaibulchoke <tmac@hp.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The comment over migrate_pages() looks quite weird, and makes it hard to
grasp what it is trying to say. Rewrite it more comprehensibly.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the memory barrier in __do_huge_pmd_anonymous_page doesn't
work. Because lru_cache_add_lru uses pagevec so it could miss spinlock
easily so above rule was broken so user might see inconsistent data.
I was not first person who pointed out the problem. Mel and Peter
pointed out a few months ago and Peter pointed out further that even
spin_lock/unlock can't make sure of it:
http://marc.info/?t=134333512700004
In particular:
*A = a;
LOCK
UNLOCK
*B = b;
may occur as:
LOCK, STORE *B, STORE *A, UNLOCK
At last, Hugh pointed out that even we don't need memory barrier in
there because __SetPageUpdate already have done it from Nick's commit
0ed361dec3 ("mm: fix PageUptodate data race") explicitly.
So this patch fixes comment on THP and adds same comment for
do_anonymous_page, too because everybody except Hugh was missing that.
It means we need a comment about that.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CONFIG_HOTPLUG is going away as an option, cleanup CONFIG_HOTPLUG
ifdefs in mm files.
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Just a trivial issue I stumbled on while doing something else...
Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alter the admin and user reserves of the previous patches in this series
when memory is added or removed.
If memory is added and the reserves have been eliminated or increased
above the default max, then we'll trust the admin.
If memory is removed and there isn't enough free memory, then we need to
reset the reserves.
Otherwise keep the reserve set by the admin.
The reserve reset code is the same as the reserve initialization code.
I tested hot addition and removal by triggering it via sysfs. The
reserves shrunk when they were set high and memory was removed. They
were reset higher when memory was added again.
[akpm@linux-foundation.org: use register_hotmemory_notifier()]
[akpm@linux-foundation.org: init_user_reserve() and init_admin_reserve can no longer be __meminit]
[fengguang.wu@intel.com: make init_reserve_notifier() static]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Andrew Shewmaker <agshew@gmail.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add an admin_reserve_kbytes knob to allow admins to change the hardcoded
memory reserve to something other than 3%, which may be multiple
gigabytes on large memory systems. Only about 8MB is necessary to
enable recovery in the default mode, and only a few hundred MB are
required even when overcommit is disabled.
This affects OVERCOMMIT_GUESS and OVERCOMMIT_NEVER.
admin_reserve_kbytes is initialized to min(3% free pages, 8MB)
I arrived at 8MB by summing the RSS of sshd or login, bash, and top.
Please see first patch in this series for full background, motivation,
testing, and full changelog.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: make init_admin_reserve() static]
Signed-off-by: Andrew Shewmaker <agshew@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add user_reserve_kbytes knob.
Limit the growth of the memory reserved for other user processes to
min(3% current process size, user_reserve_pages). Only about 8MB is
necessary to enable recovery in the default mode, and only a few hundred
MB are required even when overcommit is disabled.
user_reserve_pages defaults to min(3% free pages, 128MB)
I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ...
then adding the RSS of each.
This only affects OVERCOMMIT_NEVER mode.
Background
1. user reserve
__vm_enough_memory reserves a hardcoded 3% of the current process size for
other applications when overcommit is disabled. This was done so that a
user could recover if they launched a memory hogging process. Without the
reserve, a user would easily run into a message such as:
bash: fork: Cannot allocate memory
2. admin reserve
Additionally, a hardcoded 3% of free memory is reserved for root in both
overcommit 'guess' and 'never' modes. This was intended to prevent a
scenario where root-cant-log-in and perform recovery operations.
Note that this reserve shrinks, and doesn't guarantee a useful reserve.
Motivation
The two hardcoded memory reserves should be updated to account for current
memory sizes.
Also, the admin reserve would be more useful if it didn't shrink too much.
When the current code was originally written, 1GB was considered
"enterprise". Now the 3% reserve can grow to multiple GB on large memory
systems, and it only needs to be a few hundred MB at most to enable a user
or admin to recover a system with an unwanted memory hogging process.
I've found that reducing these reserves is especially beneficial for a
specific type of application load:
* single application system
* one or few processes (e.g. one per core)
* allocating all available memory
* not initializing every page immediately
* long running
I've run scientific clusters with this sort of load. A long running job
sometimes failed many hours (weeks of CPU time) into a calculation. They
weren't initializing all of their memory immediately, and they weren't
using calloc, so I put systems into overcommit 'never' mode. These
clusters run diskless and have no swap.
However, with the current reserves, a user wishing to allocate as much
memory as possible to one process may be prevented from using, for
example, almost 2GB out of 32GB.
The effect is less, but still significant when a user starts a job with
one process per core. I have repeatedly seen a set of processes
requesting the same amount of memory fail because one of them could not
allocate the amount of memory a user would expect to be able to allocate.
For example, Message Passing Interfce (MPI) processes, one per core. And
it is similar for other parallel programming frameworks.
Changing this reserve code will make the overcommit never mode more useful
by allowing applications to allocate nearly all of the available memory.
Also, the new admin_reserve_kbytes will be safer than the current behavior
since the hardcoded 3% of available memory reserve can shrink to something
useless in the case where applications have grabbed all available memory.
Risks
* "bash: fork: Cannot allocate memory"
The downside of the first patch-- which creates a tunable user reserve
that is only used in overcommit 'never' mode--is that an admin can set
it so low that a user may not be able to kill their process, even if
they already have a shell prompt.
Of course, a user can get in the same predicament with the current 3%
reserve--they just have to launch processes until 3% becomes negligible.
* root-cant-log-in problem
The second patch, adding the tunable rootuser_reserve_pages, allows
the admin to shoot themselves in the foot by setting it too small. They
can easily get the system into a state where root-can't-log-in.
However, the new admin_reserve_kbytes will be safer than the current
behavior since the hardcoded 3% of available memory reserve can shrink
to something useless in the case where applications have grabbed all
available memory.
Alternatives
* Memory cgroups provide a more flexible way to limit application memory.
Not everyone wants to set up cgroups or deal with their overhead.
* We could create a fourth overcommit mode which provides smaller reserves.
The size of useful reserves may be drastically different depending
on the whether the system is embedded or enterprise.
* Force users to initialize all of their memory or use calloc.
Some users don't want/expect the system to overcommit when they malloc.
Overcommit 'never' mode is for this scenario, and it should work well.
The new user and admin reserve tunables are simple to use, with low
overhead compared to cgroups. The patches preserve current behavior where
3% of memory is less than 128MB, except that the admin reserve doesn't
shrink to an unusable size under pressure. The code allows admins to tune
for embedded and enterprise usage.
FAQ
* How is the root-cant-login problem addressed?
What happens if admin_reserve_pages is set to 0?
Root is free to shoot themselves in the foot by setting
admin_reserve_kbytes too low.
On x86_64, the minimum useful reserve is:
8MB for overcommit 'guess'
128MB for overcommit 'never'
admin_reserve_pages defaults to min(3% free memory, 8MB)
So, anyone switching to 'never' mode needs to adjust
admin_reserve_pages.
* How do you calculate a minimum useful reserve?
A user or the admin needs enough memory to login and perform
recovery operations, which includes, at a minimum:
sshd or login + bash (or some other shell) + top (or ps, kill, etc.)
For overcommit 'guess', we can sum resident set sizes (RSS)
because we only need enough memory to handle what the recovery
programs will typically use. On x86_64 this is about 8MB.
For overcommit 'never', we can take the max of their virtual sizes (VSZ)
and add the sum of their RSS. We use VSZ instead of RSS because mode
forces us to ensure we can fulfill all of the requested memory allocations--
even if the programs only use a fraction of what they ask for.
On x86_64 this is about 128MB.
When swap is enabled, reserves are useful even when they are as
small as 10MB, regardless of overcommit mode.
When both swap and overcommit are disabled, then the admin should
tune the reserves higher to be absolutley safe. Over 230MB each
was safest in my testing.
* What happens if user_reserve_pages is set to 0?
Note, this only affects overcomitt 'never' mode.
Then a user will be able to allocate all available memory minus
admin_reserve_kbytes.
However, they will easily see a message such as:
"bash: fork: Cannot allocate memory"
And they won't be able to recover/kill their application.
The admin should be able to recover the system if
admin_reserve_kbytes is set appropriately.
* What's the difference between overcommit 'guess' and 'never'?
"Guess" allows an allocation if there are enough free + reclaimable
pages. It has a hardcoded 3% of free pages reserved for root.
"Never" allows an allocation if there is enough swap + a configurable
percentage (default is 50) of physical RAM. It has a hardcoded 3% of
free pages reserved for root, like "Guess" mode. It also has a
hardcoded 3% of the current process size reserved for additional
applications.
* Why is overcommit 'guess' not suitable even when an app eventually
writes to every page? It takes free pages, file pages, available
swap pages, reclaimable slab pages into consideration. In other words,
these are all pages available, then why isn't overcommit suitable?
Because it only looks at the present state of the system. It
does not take into account the memory that other applications have
malloced, but haven't initialized yet. It overcommits the system.
Test Summary
There was little change in behavior in the default overcommit 'guess'
mode with swap enabled before and after the patch. This was expected.
Systems run most predictably (i.e. no oom kills) in overcommit 'never'
mode with swap enabled. This also allowed the most memory to be allocated
to a user application.
Overcommit 'guess' mode without swap is a bad idea. It is easy to
crash the system. None of the other tested combinations crashed.
This matches my experience on the Roadrunner supercomputer.
Without the tunable user reserve, a system in overcommit 'never' mode
and without swap does not allow the admin to recover, although the
admin can.
With the new tunable reserves, a system in overcommit 'never' mode
and without swap can be configured to:
1. maximize user-allocatable memory, running close to the edge of
recoverability
2. maximize recoverability, sacrificing allocatable memory to
ensure that a user cannot take down a system
Test Description
Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap
System is booted into multiuser console mode, with unnecessary services
turned off. Caches were dropped before each test.
Hogs are user memtester processes that attempt to allocate all free memory
as reported by /proc/meminfo
In overcommit 'never' mode, memory_ratio=100
Test Results
3.9.0-rc1-mm1
Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
---------- ---- ---- ------------- ---- ------------- --------------
guess yes 1 5432/5432 no yes yes
guess yes 4 5444/5444 1 yes yes
guess no 1 5302/5449 no yes yes
guess no 4 - crash no no
never yes 1 5460/5460 1 yes yes
never yes 4 5460/5460 1 yes yes
never no 1 5218/5432 no no yes
never no 4 5203/5448 no no yes
3.9.0-rc1-mm1-tunablereserves
User and Admin Recovery show their respective reserves, if applicable.
Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
---------- ---- ---- ------------- ---- ------------- --------------
guess yes 1 5419/5419 no - yes 8MB yes
guess yes 4 5436/5436 1 - yes 8MB yes
guess no 1 5440/5440 * - yes 8MB yes
guess no 4 - crash - no 8MB no
* process would successfully mlock, then the oom killer would pick it
never yes 1 5446/5446 no 10MB yes 20MB yes
never yes 4 5456/5456 no 10MB yes 20MB yes
never no 1 5387/5429 no 128MB no 8MB barely
never no 1 5323/5428 no 226MB barely 8MB barely
never no 1 5323/5428 no 226MB barely 8MB barely
never no 1 5359/5448 no 10MB no 10MB barely
never no 1 5323/5428 no 0MB no 10MB barely
never no 1 5332/5428 no 0MB no 50MB yes
never no 1 5293/5429 no 0MB no 90MB yes
never no 1 5001/5427 no 230MB yes 338MB yes
never no 4* 4998/5424 no 230MB yes 338MB yes
* more memtesters were launched, able to allocate approximately another 100MB
Future Work
- Test larger memory systems.
- Test an embedded image.
- Test other architectures.
- Time malloc microbenchmarks.
- Would it be useful to be able to set overcommit policy for
each memory cgroup?
- Some lines are slightly above 80 chars.
Perhaps define a macro to convert between pages and kb?
Other places in the kernel do this.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: make init_user_reserve() static]
Signed-off-by: Andrew Shewmaker <agshew@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Squishes a statement-with-no-effect warning, removes some ifdefs and
shrinks .text by 2 bytes.
Note that this code fails to check for blocking_notifier_chain_register()
failures.
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
powerpc and x86 were opencoding copies of setup_nr_node_ids(), which
page_alloc provides but makes static. Make it avaliable to the archs in
linux/mm.h.
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When booting on a large memory system, the kernel spends considerable
time in memmap_init_zone() setting up memory zones. Analysis shows
significant time spent in __early_pfn_to_nid().
The routine memmap_init_zone() checks each PFN to verify the nid is
valid. __early_pfn_to_nid() sequentially scans the list of pfn ranges
to find the right range and returns the nid. This does not scale well.
On a 4 TB (single rack) system there are 308 memory ranges to scan. The
higher the PFN the more time spent sequentially spinning through memory
ranges.
Since memmap_init_zone() increments pfn, it will almost always be
looking for the same range as the previous pfn, so check that range
first. If it is in the same range, return that nid. If not, scan the
list as before.
A 4 TB (single rack) UV1 system takes 512 seconds to get through the
zone code. This performance optimization reduces the time by 189
seconds, a 36% improvement.
A 2 TB (single rack) UV2 system goes from 212.7 seconds to 99.8 seconds,
a 112.9 second (53%) reduction.
[akpm@linux-foundation.org: make the statics __meminitdata]
[akpm@linux-foundation.org: fix comment formatting]
[akpm@linux-foundation.org: fix ia64, per yinghai]
[akpm@linux-foundation.org: add missing semicolon, per Tony]
Signed-off-by: Russ Anderson <rja@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Tested-by: "Luck, Tony" <tony.luck@intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The following problem was reported against a distribution kernel when
zone_reclaim was enabled but the same problem applies to the mainline
kernel. The reproduction case was as follows
1. Run numactl -m +0 dd if=largefile of=/dev/null
This allocates a large number of clean pages in node 0
2. numactl -N +0 memhog 0.5*Mg
This start a memory-using application in node 0.
The expected behaviour is that the clean pages get reclaimed and the
application uses node 0 for its memory. The observed behaviour was that
the memory for the memhog application was allocated off-node since
commits cd38b115d5 ("mm: page allocator: initialise ZLC for first zone
eligible for zone_reclaim") and commit 76d3fbf8fb ("mm: page
allocator: reconsider zones for allocation after direct reclaim").
The assumption of those patches was that it was always preferable to
allocate quickly than stall for long periods of time and they were meant
to take care that the zone was only marked full when necessary but an
important case was missed.
In the allocator fast path, only the low watermarks are checked. If the
zones free pages are between the low and min watermark then allocations
from the allocators slow path will succeed. However, zone_reclaim will
only reclaim SWAP_CLUSTER_MAX or 1<<order pages. There is no guarantee
that this will meet the low watermark causing the zone to be marked full
prematurely.
This patch will only mark the zone full after zone_reclaim if it the min
watermarks are checked or if page reclaim failed to make sufficient
progress.
[mhocko@suse.cz: fix alloc_flags test]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Hedi Berriche <hedi@sgi.com>
Tested-by: Hedi Berriche <hedi@sgi.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The sparse code, when asking the architecture to populate the vmemmap,
specifies the section range as a starting page and a number of pages.
This is an awkward interface, because none of the arch-specific code
actually thinks of the range in terms of 'struct page' units and always
translates it to bytes first.
In addition, later patches mix huge page and regular page backing for
the vmemmap. For this, they need to call vmemmap_populate_basepages()
on sub-section ranges with PAGE_SIZE and PMD_SIZE in mind. But these
are not necessarily multiples of the 'struct page' size and so this unit
is too coarse.
Just translate the section range into bytes once in the generic sparse
code, then pass byte ranges down the stack.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Bernhard Schmidt <Bernhard.Schmidt@lrz.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: David S. Miller <davem@davemloft.net>
Tested-by: David S. Miller <davem@davemloft.net>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hot-adding memory on x86_64 normally requires huge page allocation.
When this is done to a VM guest, it's usually because the system is
already tight on memory, so the request tends to fail. Try to avoid
this by adding __GFP_REPEAT to the allocation flags.
Addresses http://bugs.debian.org/699913
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Bernhard Schmidt <Bernhard.Schmidt@lrz.de>
Tested-by: Bernhard Schmidt <Bernhard.Schmidt@lrz.de>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Particularly in oom conditions, it's troublesome that hugetlb memory is
not displayed. All other meminfo that is emitted will not add up to
what is expected, and there is no artifact left in the kernel log to
show that a potentially significant amount of memory is actually
allocated as hugepages which are not available to be reclaimed.
Booting with hugepages=8192 on the command line, this memory is now
shown in oom conditions. For example, with echo m >
/proc/sysrq-trigger:
Node 0 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB
Node 1 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB
Node 2 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB
Node 3 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On architectures where a pgd entry may be shared between user and kernel
(e.g. ARM+LPAE), freeing page tables needs a ceiling other than 0.
This patch introduces a generic USER_PGTABLES_CEILING that arch code can
override. It is the responsibility of the arch code setting the ceiling
to ensure the complete freeing of the page tables (usually in
pgd_free()).
[catalin.marinas@arm.com: commit log; shift_arg_pages(), asm-generic/pgtables.h changes]
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: <stable@vger.kernel.org> [3.3+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 2d11085e40 ("memcg: do not create memsw files if swap
accounting is disabled") memsw files are created only if memcg swap
accounting is enabled so it doesn't make any sense to check for it
explicitly in mem_cgroup_read(), mem_cgroup_write() and
mem_cgroup_reset().
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the WARN_ON_ONCE(!mm) check as the comment suggested. Kernel
code calls find_vma only when it is absolutely sure that the mm_struct
arg to it is non-NULL.
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: k80c <k80ck80c@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now, vmap_area_list is exported as VMCOREINFO for makedumpfile to get
the start address of vmalloc region (vmalloc_start). The address which
contains vmalloc_start value is represented as below:
vmap_area_list.next - OFFSET(vmap_area.list) + OFFSET(vmap_area.va_start)
However, both OFFSET(vmap_area.va_start) and OFFSET(vmap_area.list)
aren't exported as VMCOREINFO.
So this patch exports them externally with small cleanup.
[akpm@linux-foundation.org: vmalloc.h should include list.h for list_head]
Signed-off-by: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now, there is no need to maintain vmlist after initializing vmalloc. So
remove related code and data structure.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Although our intention is to unexport internal structure entirely, but
there is one exception for kexec. kexec dumps address of vmlist and
makedumpfile uses this information.
We are about to remove vmlist, then another way to retrieve information
of vmalloc layer is needed for makedumpfile. For this purpose, we
export vmap_area_list, instead of vmlist.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is a preparatory step for removing vmlist entirely. For
above purpose, we change iterating a vmap_list codes to iterating a
vmap_area_list. It is somewhat trivial change, but just one thing
should be noticed.
Using vmap_area_list in vmallocinfo() introduce ordering problem in SMP
system. In s_show(), we retrieve some values from vm_struct.
vm_struct's values is not fully setup when va->vm is assigned. Full
setup is notified by removing VM_UNLIST flag without holding a lock.
When we see that VM_UNLIST is removed, it is not ensured that vm_struct
has proper values in view of other CPUs. So we need smp_[rw]mb for
ensuring that proper values is assigned when we see that VM_UNLIST is
removed.
Therefore, this patch not only change a iteration list, but also add a
appropriate smp_[rw]mb to right places.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is a preparatory step for removing vmlist entirely. For
above purpose, we change iterating a vmap_list codes to iterating a
vmap_area_list. It is somewhat trivial change, but just one thing
should be noticed.
vmlist is lack of information about some areas in vmalloc address space.
For example, vm_map_ram() allocate area in vmalloc address space, but it
doesn't make a link with vmlist. To provide full information about
vmalloc address space is better idea, so we don't use va->vm and use
vmap_area directly. This makes get_vmalloc_info() more precise.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now, when we hold a vmap_area_lock, va->vm can't be discarded. So we can
safely access to va->vm when iterating a vmap_area_list with holding a
vmap_area_lock. With this property, change iterating vmlist codes in
vread/vwrite() to iterating vmap_area_list.
There is a little difference relate to lock, because vmlist_lock is mutex,
but, vmap_area_lock is spin_lock. It may introduce a spinning overhead
during vread/vwrite() is executing. But, these are debug-oriented
functions, so this overhead is not real problem for common case.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Inserting and removing an entry to vmlist is linear time complexity, so
it is inefficient. Following patches will try to remove vmlist
entirely. This patch is preparing step for it.
For removing vmlist, iterating vmlist codes should be changed to
iterating a vmap_area_list. Before implementing that, we should make
sure that when we iterate a vmap_area_list, accessing to va->vm doesn't
cause a race condition. This patch ensure that when iterating a
vmap_area_list, there is no race condition for accessing to vm_struct.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now get_vmalloc_info() is in fs/proc/mmu.c. There is no reason that this
code must be here and it's implementation needs vmlist_lock and it iterate
a vmlist which may be internal data structure for vmalloc.
It is preferable that vmlist_lock and vmlist is only used in vmalloc.c
for maintainability. So move the code to vmalloc.c
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Walking a bio's page mappings has proved problematic, so create a new
bio flag to indicate that a bio's data needs to be snapshotted in order
to guarantee stable pages during writeback. Next, for the one user
(ext3/jbd) of snapshotting, hook all the places where writes can be
initiated without PG_writeback set, and set BIO_SNAP_STABLE there.
We must also flag journal "metadata" bios for stable writeout, since
file data can be written through the journal. Finally, the
MS_SNAP_STABLE mount flag (only used by ext3) is now superfluous, so get
rid of it.
[akpm@linux-foundation.org: rename _submit_bh()'s `flags' to `bio_flags', delobotomize the _submit_bh declaration]
[akpm@linux-foundation.org: teeny cleanup]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit abf09bed3c ("s390/mm: implement software dirty bits")
introduced another difference in the pte layout vs. the pmd layout on
s390, thoroughly breaking the s390 support for hugetlbfs. This requires
replacing some more pte_xxx functions in mm/hugetlbfs.c with a
huge_pte_xxx version.
This patch introduces those huge_pte_xxx functions and their generic
implementation in asm-generic/hugetlb.h, which will now be included on
all architectures supporting hugetlbfs apart from s390. This change
will be a no-op for those architectures.
[akpm@linux-foundation.org: fix warning]
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.cz> [for !s390 parts]
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mem_cgroup_iter basically does two things currently. It takes care of
the house keeping (reference counting, raclaim cookie) and it iterates
through a hierarchy tree (by using cgroup generic tree walk). The code
would be much more easier to follow if we move the iteration outside of
the function (to __mem_cgrou_iter_next) so the distinction is more
clear. This patch doesn't introduce any functional changes.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Ying Han <yinghan@google.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current implementation of mem_cgroup_iter has to consider both css
and memcg to find out whether no group has been found (css==NULL - aka
the loop is completed) and that no memcg is associated with the found
node (!memcg - aka css_tryget failed because the group is no longer
alive). This leads to awkward tweaks like tests for css && !memcg to
skip the current node.
It will be much easier if we got rid off css variable altogether and
only rely on memcg. In order to do that the iteration part has to skip
dead nodes. This sounds natural to me and as a nice side effect we will
get a simple invariant that memcg is always alive when non-NULL and all
nodes have been visited otherwise.
We could get rid of the surrounding while loop but keep it in for now to
make review easier. It will go away in the following patch.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Ying Han <yinghan@google.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that the per-node-zone-priority iterator caches memory cgroups
rather than their css ids we have to be careful and remove them from the
iterator when they are on the way out otherwise they might live for
unbounded amount of time even though their group is already gone (until
the global/targeted reclaim triggers the zone under priority to find out
the group is dead and let it to find the final rest).
We can fix this issue by relaxing rules for the last_visited memcg.
Instead of taking a reference to the css before it is stored into
iter->last_visited we can just store its pointer and track the number of
removed groups from each memcg's subhierarchy.
This number would be stored into iterator everytime when a memcg is
cached. If the iter count doesn't match the curent walker root's one we
will start from the root again. The group counter is incremented
upwards the hierarchy every time a group is removed.
The iter_lock can be dropped because racing iterators cannot leak the
reference anymore as the reference count is not elevated for
last_visited when it is cached.
Locking rules got a bit complicated by this change though. The iterator
primarily relies on rcu read lock which makes sure that once we see a
valid last_visited pointer then it will be valid for the whole RCU walk.
smp_rmb makes sure that dead_count is read before last_visited and
last_dead_count while smp_wmb makes sure that last_visited is updated
before last_dead_count so the up-to-date last_dead_count cannot point to
an outdated last_visited. css_tryget then makes sure that the
last_visited is still alive in case the iteration races with the cached
group removal (css is invalidated before mem_cgroup_css_offline
increments dead_count).
In short:
mem_cgroup_iter
rcu_read_lock()
dead_count = atomic_read(parent->dead_count)
smp_rmb()
if (dead_count != iter->last_dead_count)
last_visited POSSIBLY INVALID -> last_visited = NULL
if (!css_tryget(iter->last_visited))
last_visited DEAD -> last_visited = NULL
next = find_next(last_visited)
css_tryget(next)
css_put(last_visited) // css would be invalidated and parent->dead_count
// incremented if this was the last reference
iter->last_visited = next
smp_wmb()
iter->last_dead_count = dead_count
rcu_read_unlock()
cgroup_rmdir
cgroup_destroy_locked
atomic_add(CSS_DEACT_BIAS, &css->refcnt) // subsequent css_tryget fail
mem_cgroup_css_offline
mem_cgroup_invalidate_reclaim_iterators
while(parent = parent_mem_cgroup)
atomic_inc(parent->dead_count)
css_put(css) // last reference held by cgroup core
Spotted by Ying Han.
Original idea from Johannes Weiner.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ying Han <yinghan@google.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mem_cgroup_iter curently relies on css->id when walking down a group
hierarchy tree. This is really awkward because the tree walk depends on
the groups creation ordering. The only guarantee is that a parent node is
visited before its children.
Example:
1) mkdir -p a a/d a/b/c
2) mkdir -a a/b/c a/d
Will create the same trees but the tree walks will be different:
1) a, d, b, c
2) a, b, c, d
Commit 574bd9f7c7 ("cgroup: implement generic child / descendant walk
macros") has introduced generic cgroup tree walkers which provide either
pre-order or post-order tree walk. This patch converts css->id based
iteration to pre-order tree walk to keep the semantic with the original
iterator where parent is always visited before its subtree.
cgroup_for_each_descendant_pre suggests using post_create and
pre_destroy for proper synchronization with groups addidition resp.
removal. This implementation doesn't use those because a new memory
cgroup is initialized sufficiently for iteration in mem_cgroup_css_alloc
already and css reference counting enforces that the group is alive for
both the last seen cgroup and the found one resp. it signals that the
group is dead and it should be skipped.
If the reclaim cookie is used we need to store the last visited group
into the iterator so we have to be careful that it doesn't disappear in
the mean time. Elevated reference count on the css keeps it alive even
though the group have been removed (parked waiting for the last dput so
that it can be freed).
Per node-zone-prio iter_lock has been introduced to ensure that
css_tryget and iter->last_visited is set atomically. Otherwise two
racing walkers could both take a references and only one release it
leading to a css leak (which pins cgroup dentry).
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Ying Han <yinghan@google.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The patchset tries to make mem_cgroup_iter saner in the way how it walks
hierarchies. css->id based traversal is far from being ideal as it is not
deterministic because it depends on the creation ordering. Additional to
that css_id is considered a burden for cgroup maintainers because it is
quite some code and memcg is the last user of it. After this series only
the swap accounting uses css_id but that one will follow up later.
Diffstat (if we exclude removed/added comments) looks quite
promising. We got rid of some code:
$ git diff mmotm... | grep -v "^[+-][[:space:]]*[/ ]\*" | diffstat
b/include/linux/cgroup.h | 3 ---
kernel/cgroup.c | 33 ---------------------------------
mm/memcontrol.c | 4 +++-
3 files changed, 3 insertions(+), 37 deletions(-)
The first patch is just preparatory and it changes when we release css of
the previously returned memcg. Nothing controlversial.
The second patch is the core of the patchset and it replaces css_get_next
based on css_id by the generic cgroup pre-order. This brings some
chalanges for the last visited group caching during the reclaim
(mem_cgroup_per_zone::reclaim_iter). We have to use memcg pointers
directly now which means that we have to keep a reference to those groups'
css to keep them alive.
I also folded iter_lock introduced by https://lkml.org/lkml/2013/1/3/295
in the previous version into this patch. Johannes felt the race I was
describing should be mostly harmless and I haven't been able to trigger it
so the lock doesn't deserve its own patch. It is still needed
temporarily, though, because the reference counting on iter->last_visited
depends on it. It will go away with the next patch.
The next patch fixups an unbounded cgroup removal holdoff caused by the
elevated css refcount. The issue has been observed by Ying Han. Johannes
wasn't impressed by the previous version of the fix
(https://lkml.org/lkml/2013/2/8/379) which cleaned up pending references
during mem_cgroup_css_offline when a group is removed. He has suggested a
different way when the iterator checks whether a cached memcg is still
valid or no. More on that in the patch but the basic idea is that every
memcg tracks the number removed subgroups and iterator records this number
when a group is cached. These numbers are checked before
iter->last_visited is about to be used and the iteration is restarted if
it is invalid.
The fourth and fifth patches are an attempt for simplification of the
mem_cgroup_iter. css juggling is removed and the iteration logic is moved
to a helper so that the reference counting and iteration are separated.
The last patch just removes css_get_next as there is no user for it any
longer.
My testing looked as follows:
A (use_hierarchy=1, limit_in_bytes=150M)
/|\
1 2 3
Children groups were created so that the number is never higher than 3 and
their limits were random between 50-100M. Each group hosts a kernel build
(starting with tar -xf so the tree is not shared and make -jNUM_CPUs/3)
and terminated after random time - up to 5 minutes) and then it is
removed.
This should exercise both leaf and hierarchical reclaim as well as races
with cgroup removals and debugging messages I added on top proved that.
100 groups were created during the test.
This patch:
css reference counting keeps the cgroup alive even though it has been
already removed. mem_cgroup_iter relies on this fact and takes a
reference to the returned group. The reference is then released on the
next iteration or mem_cgroup_iter_break. mem_cgroup_iter currently
releases the reference right after it gets the last css_id.
This is correct because neither prev's memcg nor cgroup are accessed after
then. This will change in the next patch so we need to hold the group
alive a bit longer so let's move the css_put at the end of the function.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Ying Han <yinghan@google.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The original goal of this patchset is to fix the bug reported by
https://bugzilla.kernel.org/show_bug.cgi?id=53501
Now it has also been expanded to reduce common code used by memory
initializion.
This is the second part, which applies to the previous part at:
http://marc.info/?l=linux-mm&m=136289696323825&w=2
It introduces a helper function free_highmem_page() to free highmem
pages into the buddy system when initializing mm subsystem.
Introduction of free_highmem_page() is one step forward to clean up
accesses and modificaitons of totalhigh_pages, totalram_pages and
zone->managed_pages etc. I hope we could remove all references to
totalhigh_pages from the arch/ subdirectory.
We have only tested these patchset on x86 platforms, and have done basic
compliation tests using cross-compilers from ftp.kernel.org. That means
some code may not pass compilation on some architectures. So any help
to test this patchset are welcomed!
There are several other parts still under development:
Part3: refine code to manage totalram_pages, totalhigh_pages and
zone->managed_pages
Part4: introduce helper functions to simplify mem_init() and remove the
global variable num_physpages.
This patch:
Introduce helper function free_highmem_page(), which will be used by
architectures with HIGHMEM enabled to free highmem pages into the buddy
system.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Suzuki K. Poulose" <suzuki@in.ibm.com>
Cc: Alexander Graf <agraf@suse.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Attilio Rao <attilio.rao@citrix.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Cong Wang <amwang@redhat.com>
Cc: David Daney <david.daney@cavium.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rik van Riel <riel@redhat.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The original goal of this patchset is to fix the bug reported by
https://bugzilla.kernel.org/show_bug.cgi?id=53501
Now it has also been expanded to reduce common code used by memory
initializion.
This is the first part, which applies to v3.9-rc1.
It introduces following common helper functions to simplify
free_initmem() and free_initrd_mem() on different architectures:
adjust_managed_page_count():
will be used to adjust totalram_pages, totalhigh_pages,
zone->managed_pages when reserving/unresering a page.
__free_reserved_page():
free a reserved page into the buddy system without adjusting
page statistics info
free_reserved_page():
free a reserved page into the buddy system and adjust page
statistics info
mark_page_reserved():
mark a page as reserved and adjust page statistics info
free_reserved_area():
free a continous ranges of pages by calling free_reserved_page()
free_initmem_default():
default method to free __init pages.
We have only tested these patchset on x86 platforms, and have done basic
compliation tests using cross-compilers from ftp.kernel.org. That means
some code may not pass compilation on some architectures. So any help to
test this patchset are welcomed!
There are several other parts still under development:
Part2: introduce free_highmem_page() to simplify freeing highmem pages
Part3: refine code to manage totalram_pages, totalhigh_pages and
zone->managed_pages
Part4: introduce helper functions to simplify mem_init() and remove the
global variable num_physpages.
This patch:
Code to deal with reserved/managed pages are duplicated by many
architectures, so introduce common help functions to reduce duplicated
code. These common help functions will also be used to concentrate code
to modify totalram_pages and zone->managed_pages, which makes the code
much more clear.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Anatolij Gustschin <agust@denx.de>
Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Liqin <liqin.chen@sunplusct.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Howells <dhowells@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Local variable total_scanned is no longer used.
Signed-off-by: Hillf Danton <dhillf@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a typo "end_pft" in the comment of walk_memory_range().
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This came to light when calling memblock allocator from arc port (for
copying flattended DT). If a "0" alignment is passed, the allocator
round_up() call incorrectly rounds up the size to 0.
round_up(num, alignto) => ((num - 1) | (alignto -1)) + 1
While the obvious allocation failure causes kernel to panic, it is better
to warn the caller to fix the code.
Tejun suggested that instead of BUG_ON(!align) - which might be
ineffective due to pending console init and such, it is better to WARN_ON,
and continue the boot with a reasonable default align.
Caller passing @size need not be handled similarly as the subsequent
panic will indicate that anyhow.
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have to recompute pgoff if the given page is huge, since result based
on HPAGE_SIZE is not approapriate for scanning the vma interval tree, as
shown by commit 36e4f20af8 ("hugetlb: do not use vma_hugecache_offset()
for vma_prio_tree_foreach").
Signed-off-by: Hillf Danton <dhillf@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Create a CONFIG_MMU=y stub for ramfs_nommu_expand_for_mapping() in the
usual fashion.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Wolfram Sang <wsa@the-dreams.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On large systems with a lot of memory, walking all RAM to determine page
types may take a half second or even more.
In non-blockable contexts, the page allocator will emit a page allocation
failure warning unless __GFP_NOWARN is specified. In such contexts, irqs
are typically disabled and such a lengthy delay may even result in NMI
watchdog timeouts.
To fix this, suppress the page walk in such contexts when printing the
page allocation failure warning.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the events API to trace filemap loading and unloading of file pieces
into the page cache.
This patch aims at tracing the eviction reload cycle of executable and
shared libraries pages in a memory constrained environment.
The typical usage is to spot a specific device and inode (for example
/lib/libc.so) to see the eviction cycles, and find out if frequently
used code is rather spread across many pages (bad) or coallesced (good).
Signed-off-by: Robert Jarzmik <robert.jarzmik@free.fr>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently page_action() does not check dirty flag to determine whether
the error page is "clean mlocked/unevictable LRU" page. This doesn't
cause any misjudgement because we do matching against "dirty
mlocked/unevictable LRU" just before the check. But in order to make
code consistent and/or to avoid potential regression, we had better
check dirty flag explicitly.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Suggested-by: Chen Gong <gong.chen@linux.intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Here's the big char / misc driver update for 3.10-rc1
A number of various driver updates, the majority being new functionality
in the MEI driver subsystem (it's now a subsystem, it started out just a
single driver), extcon updates, memory updates, hyper-v updates, and a
bunch of other small stuff that doesn't fit in any other tree.
All of these have been in linux-next for a while
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iEYEABECAAYFAlF+mtYACgkQMUfUDdst+ymFXQCfdLsD4Cxz+jkgW+tljh9i70XD
OFkAnRPMMhLS8/kddf02lLMYzYUFdy1U
=zaFJ
-----END PGP SIGNATURE-----
Merge tag 'char-misc-3.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver update from Greg Kroah-Hartman:
"Here's the big char / misc driver update for 3.10-rc1
A number of various driver updates, the majority being new
functionality in the MEI driver subsystem (it's now a subsystem, it
started out just a single driver), extcon updates, memory updates,
hyper-v updates, and a bunch of other small stuff that doesn't fit in
any other tree.
All of these have been in linux-next for a while"
* tag 'char-misc-3.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (148 commits)
Tools: hv: Fix a checkpatch warning
tools: hv: skip iso9660 mounts in hv_vss_daemon
tools: hv: use FIFREEZE/FITHAW in hv_vss_daemon
tools: hv: use getmntent in hv_vss_daemon
Tools: hv: Fix a checkpatch warning
tools: hv: fix checks for origin of netlink message in hv_vss_daemon
Tools: hv: fix warnings in hv_vss_daemon
misc: mark spear13xx-pcie-gadget as broken
mei: fix krealloc() misuse in in mei_cl_irq_read_msg()
mei: reduce flow control only for completed messages
mei: reseting -> resetting
mei: fix reading large reposnes
mei: revamp mei_irq_read_client_message function
mei: revamp mei_amthif_irq_read_message
mei: revamp hbm state machine
Revert "drivers/scsi: use module_pcmcia_driver() in pcmcia drivers"
Revert "scsi: pcmcia: nsp_cs: remove module init/exit function prototypes"
scsi: pcmcia: nsp_cs: remove module init/exit function prototypes
mei: wd: fix line over 80 characters
misc: tsl2550: Use dev_pm_ops
...
Use the new vsprintf extension to avoid any possible
message interleaving.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
I think we could just move the full vm_iomap_memory() function into
util.h or similar, but I didn't get any reply from anybody actually
using nommu even to this trivial patch, so I'm not going to touch it any
more than required.
Here's the fairly minimal stub to make the nommu case at least
potentially work. It doesn't seem like anybody cares, though.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the error return value in kswapd_run(). The bug was introduced by
commit d5dc0ad928 ("mm/vmscan: fix error number for failed kthread").
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With applying the previous patch "hugetlbfs: stop setting VM_DONTDUMP in
initializing vma(VM_HUGETLB)" to reenable hugepage coredump, if a memory
error happens on a hugepage and the affected processes try to access the
error hugepage, we hit VM_BUG_ON(atomic_read(&page->_count) <= 0) in
get_page().
The reason for this bug is that coredump-related code doesn't recognise
"hugepage hwpoison entry" with which a pmd entry is replaced when a memory
error occurs on a hugepage.
In other words, physical address information is stored in different bit
layout between hugepage hwpoison entry and pmd entry, so
follow_hugetlb_page() which is called in get_dump_page() returns a wrong
page from a given address.
The expected behavior is like this:
absent is_swap_pte FOLL_DUMP Expected behavior
-------------------------------------------------------------------
true false false hugetlb_fault
false true false hugetlb_fault
false false false return page
true false true skip page (to avoid allocation)
false true true hugetlb_fault
false false true return page
With this patch, we can call hugetlb_fault() and take proper actions (we
wait for migration entries, fail with VM_FAULT_HWPOISON_LARGE for
hwpoisoned entries,) and as the result we can dump all hugepages except
for hwpoisoned ones.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org> [2.6.34+?]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Various drivers end up replicating the code to mmap() their memory
buffers into user space, and our core memory remapping function may be
very flexible but it is unnecessarily complicated for the common cases
to use.
Our internal VM uses pfn's ("page frame numbers") which simplifies
things for the VM, and allows us to pass physical addresses around in a
denser and more efficient format than passing a "phys_addr_t" around,
and having to shift it up and down by the page size. But it just means
that drivers end up doing that shifting instead at the interface level.
It also means that drivers end up mucking around with internal VM things
like the vma details (vm_pgoff, vm_start/end) way more than they really
need to.
So this just exports a function to map a certain physical memory range
into user space (using a phys_addr_t based interface that is much more
natural for a driver) and hides all the complexity from the driver.
Some drivers will still end up tweaking the vm_page_prot details for
things like prefetching or cacheability etc, but that's actually
relevant to the driver, rather than caring about what the page offset of
the mapping is into the particular IO memory region.
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Turn on use_hierarchy by default if sane_behavior is specified and
don't create .use_hierarchy file.
It is debatable whether to remove .use_hierarchy file or make it ro as
the former could make transition easier in certain cases; however, the
behavior changes which will be gated by sane_behavior are intensive
including changing basic meaning of certain control knobs in a few
controllers and I don't really think keeping this piece would make
things easier in any noticeable way, so let's remove it.
v2: Explain that mem_cgroup_bind() doesn't have to worry about
children as suggested by Michal Hocko.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
This patch attempts to fix:
https://bugzilla.kernel.org/show_bug.cgi?id=56461
The symptom is a crash and messages like this:
chrome: Corrupted page table at address 34a03000
*pdpt = 0000000000000000 *pde = 0000000000000000
Bad pagetable: 000f [#1] PREEMPT SMP
Ingo guesses this got introduced by commit 611ae8e3f5 ("x86/tlb:
enable tlb flush range support for x86") since that code started to free
unused pagetables.
On x86-32 PAE kernels, that new code has the potential to free an entire
PMD page and will clear one of the four page-directory-pointer-table
(aka pgd_t entries).
The hardware aggressively "caches" these top-level entries and invlpg
does not actually affect the CPU's copy. If we clear one we *HAVE* to
do a full TLB flush, otherwise we might continue using a freed pmd page.
(note, we do this properly on the population side in pud_populate()).
This patch tracks whenever we clear one of these entries in the 'struct
mmu_gather', and ensures that we follow up with a full tlb flush.
BTW, I disassembled and checked that:
if (tlb->fullmm == 0)
and
if (!tlb->fullmm && !tlb->need_flush_all)
generate essentially the same code, so there should be zero impact there
to the !PAE case.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Artem S Tashkinov <t.artem@mailcity.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As cgroup supports rename, it's unsafe to dereference dentry->d_name
without proper vfs locks. Fix this by using cgroup_name() rather than
dentry directly.
Also open code memcg_cache_name because it is called only from
kmem_cache_dup which frees the returned name right after
kmem_cache_create_memcg makes a copy of it. Such a short-lived
allocation doesn't make too much sense. So replace it by a static
buffer as kmem_cache_dup is called with memcg_cache_mutex.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Glauber Costa <glommer@parallels.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
As Steven Rostedt has pointer out: rescheduling could occur on a
different processor after the determination of the per cpu pointer and
before the tid is retrieved. This could result in allocation from the
wrong node in slab_alloc().
The effect is much more severe in slab_free() where we could free to the
freelist of the wrong page.
The window for something like that occurring is pretty small but it is
possible.
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
The variables accessed in slab_alloc are volatile and therefore
the page pointer passed to node_match can be NULL. The processing
of data in slab_alloc is tentative until either the cmpxhchg
succeeds or the __slab_alloc slowpath is invoked. Both are
able to perform the same allocation from the freelist.
Check for the NULL pointer in node_match.
A false positive will lead to a retry of the loop in __slab_alloc.
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
find_vma() can be called by multiple threads with read lock
held on mm->mmap_sem and any of them can update mm->mmap_cache.
Prevent compiler from re-fetching mm->mmap_cache, because other
readers could update it in the meantime:
thread 1 thread 2
|
find_vma() | find_vma()
struct vm_area_struct *vma = NULL; |
vma = mm->mmap_cache; |
if (!(vma && vma->vm_end > addr |
&& vma->vm_start <= addr)) { |
| mm->mmap_cache = vma;
return vma; |
^^ compiler may optimize this |
local variable out and re-read |
mm->mmap_cache |
This issue can be reproduced with gcc-4.8.0-1 on s390x by running
mallocstress testcase from LTP, which triggers:
kernel BUG at mm/rmap.c:1088!
Call Trace:
([<000003d100c57000>] 0x3d100c57000)
[<000000000023a1c0>] do_wp_page+0x2fc/0xa88
[<000000000023baae>] handle_pte_fault+0x41a/0xac8
[<000000000023d832>] handle_mm_fault+0x17a/0x268
[<000000000060507a>] do_protection_exception+0x1e2/0x394
[<0000000000603a04>] pgm_check_handler+0x138/0x13c
[<000003fffcf1f07a>] 0x3fffcf1f07a
Last Breaking-Event-Address:
[<000000000024755e>] page_add_new_anon_rmap+0xc2/0x168
Thanks to Jakub Jelinek for his insight on gcc and helping to
track this down.
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun writes:
-----
This is the pull request for the earlier patchset[1] with the same
name. It's only three patches (the first one was committed to
workqueue tree) but the merge strategy is a bit involved due to the
dependencies.
* Because the conversion needs features from wq/for-3.10,
block/for-3.10/core is based on rc3, and wq/for-3.10 has conflicts
with rc3, I pulled mainline (rc5) into wq/for-3.10 to prevent those
workqueue conflicts from flaring up in block tree.
* Resolving the issue that Jan and Dave raised about debugging
requires arch-wide changes. The patchset is being worked on[2] but
it'll have to go through -mm after these changes show up in -next,
and not included in this pull request.
The three commits are located in the following git branch.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git writeback-workqueue
Pulling it into block/for-3.10/core produces a conflict in
drivers/md/raid5.c between the following two commits.
e3620a3ad5 ("MD RAID5: Avoid accessing gendisk or queue structs when not available")
2f6db2a707 ("raid5: use bio_reset()")
The conflict is trivial - one removes an "if ()" conditional while the
other removes "rbi->bi_next = NULL" right above it. We just need to
remove both. The merged branch is available at
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git block-test-merge
so that you can use it for verification. The test merge commit has
proper merge description.
While these changes are a bit of pain to route, they make code simpler
and even have, while minute, measureable performance gain[3] even on a
workload which isn't particularly favorable to showing the benefits of
this conversion.
----
Fixed up the conflict.
Conflicts:
drivers/md/raid5.c
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After boot phase, 'n' always exist.
So add 'likely' macro for helping compiler.
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
There is a subtle bug when calculating a number of acquired objects.
Currently, we calculate "available = page->objects - page->inuse",
after acquire_slab() is called in get_partial_node().
In acquire_slab() with mode = 1, we always set new.inuse = page->objects.
So,
acquire_slab(s, n, page, object == NULL);
if (!object) {
c->page = page;
stat(s, ALLOC_FROM_PARTIAL);
object = t;
available = page->objects - page->inuse;
!!! availabe is always 0 !!!
...
Therfore, "available > s->cpu_partial / 2" is always false and
we always go to second iteration.
This patch correct this problem.
After that, we don't need return value of put_cpu_partial().
So remove it.
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
There are cases where userland wants to tweak the priority and
affinity of writeback flushers. Expose bdi_wq to userland by setting
WQ_SYSFS. It appears under /sys/bus/workqueue/devices/writeback/ and
allows adjusting maximum concurrency level, cpumask and nice level.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Writeback implements its own worker pool - each bdi can be associated
with a worker thread which is created and destroyed dynamically. The
worker thread for the default bdi is always present and serves as the
"forker" thread which forks off worker threads for other bdis.
there's no reason for writeback to implement its own worker pool when
using unbound workqueue instead is much simpler and more efficient.
This patch replaces custom worker pool implementation in writeback
with an unbound workqueue.
The conversion isn't too complicated but the followings are worth
mentioning.
* bdi_writeback->last_active, task and wakeup_timer are removed.
delayed_work ->dwork is added instead. Explicit timer handling is
no longer necessary. Everything works by either queueing / modding
/ flushing / canceling the delayed_work item.
* bdi_writeback_thread() becomes bdi_writeback_workfn() which runs off
bdi_writeback->dwork. On each execution, it processes
bdi->work_list and reschedules itself if there are more things to
do.
The function also handles low-mem condition, which used to be
handled by the forker thread. If the function is running off a
rescuer thread, it only writes out limited number of pages so that
the rescuer can serve other bdis too. This preserves the flusher
creation failure behavior of the forker thread.
* INIT_LIST_HEAD(&bdi->bdi_list) is used to tell
bdi_writeback_workfn() about on-going bdi unregistration so that it
always drains work_list even if it's running off the rescuer. Note
that the original code was broken in this regard. Under memory
pressure, a bdi could finish unregistration with non-empty
work_list.
* The default bdi is no longer special. It now is treated the same as
any other bdi and bdi_cap_flush_forker() is removed.
* BDI_pending is no longer used. Removed.
* Some tracepoints become non-applicable. The following TPs are
removed - writeback_nothread, writeback_wake_thread,
writeback_wake_forker_thread, writeback_thread_start,
writeback_thread_stop.
Everything, including devices coming and going away and rescuer
operation under simulated memory pressure, seems to work fine in my
test setup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
There's no user left. Remove it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Fengguang Wu <fengguang.wu@intel.com>
This symbol will be used in the Hyper-V balloon driver to support 2M
allocations.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit 1869305009 ("mm: introduce VM_POPULATE flag to
better deal with racy userspace programs").
VM_POPULATE only has any effect when userspace plays racy games with
vmas by trying to unmap and remap memory regions that mmap or mlock are
operating on.
Also, the only effect of VM_POPULATE when userspace plays such games is
that it avoids populating new memory regions that get remapped into the
address range that was being operated on by the original mmap or mlock
calls.
Let's remove VM_POPULATE as there isn't any strong argument to mandate a
new vm_flag.
Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
More prep work for immutable bvecs:
A few places in the code were either open coding or using the wrong
version - fix.
After we introduce the bvec iter, it'll no longer be possible to modify
the biovec through bio_for_each_segment_all() - it doesn't increment a
pointer to the current bvec, you pass in a struct bio_vec (not a
pointer) which is updated with what the current biovec would be (taking
into account bi_bvec_done and bi_size).
So because of that it's more worthwhile to be consistent about
bio_for_each_segment()/bio_for_each_segment_all() usage.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: NeilBrown <neilb@suse.de>
CC: Alasdair Kergon <agk@redhat.com>
CC: dm-devel@redhat.com
CC: Alexander Viro <viro@zeniv.linux.org.uk>
__bio_for_each_segment() iterates bvecs from the specified index
instead of bio->bv_idx. Currently, the only usage is to walk all the
bvecs after the bio has been advanced by specifying 0 index.
For immutable bvecs, we need to split these apart;
bio_for_each_segment() is going to have a different implementation.
This will also help document the intent of code that's using it -
bio_for_each_segment_all() is only legal to use for code that owns the
bio.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Neil Brown <neilb@suse.de>
CC: Boaz Harrosh <bharrosh@panasas.com>
A bunch of what __blk_queue_bounce() was doing was problematic for the
immutable bvec work; this cleans that up and the code is quite a bit
smaller, too.
The __bio_for_each_segment() in copy_to_high_bio_irq() was changed
because that one's looping over the original bio, not the bounce bio -
a later patch renames __bio_for_each_segment() ->
bio_for_each_segment_all(), and documents that
bio_for_each_segment_all() is only for code that owns the bio.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
For immutable bvecs, all bi_idx usage needs to be audited - so here
we're removing all the unnecessary uses.
Most of these are places where it was being initialized on a bio that
was just allocated, a few others are conversions to standard macros.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
zone->wait_table may be allocated from bootmem, it can not be freed.
Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hugetlb_total_pages is used for overcommit calculations but the current
implementation considers only the default hugetlb page size (which is
either the first defined hugepage size or the one specified by
default_hugepagesz kernel boot parameter).
If the system is configured for more than one hugepage size, which is
possible since commit a137e1cc6d ("hugetlbfs: per mount huge page
sizes") then the overcommit estimation done by __vm_enough_memory()
(resp. shown by meminfo_proc_show) is not precise - there is an
impression of more available/allowed memory. This can lead to an
unexpected ENOMEM/EFAULT resp. SIGSEGV when memory is accounted.
Testcase:
boot: hugepagesz=1G hugepages=1
the default overcommit ratio is 50
before patch:
egrep 'CommitLimit' /proc/meminfo
CommitLimit: 55434168 kB
after patch:
egrep 'CommitLimit' /proc/meminfo
CommitLimit: 54909880 kB
[akpm@linux-foundation.org: coding-style tweak]
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: <stable@vger.kernel.org> [3.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The vm_flags introduced in 6d7825b10d ("mm/fremap.c: fix oops on error
path") is supposed to avoid a compiler warning about unitialized
vm_flags without changing the generated code.
However I am concerned that this is going to be very brittle, and fail
with some compiler versions. The failure could be either of:
- compiler could actually load vma->vm_flags before checking for the
!vma condition, thus reintroducing the oops
- compiler could optimize out the !vma check, since the pointer just got
dereferenced shortly before (so the compiler knows it can't be NULL!)
I propose reversing this part of the change and initializing vm_flags to 0
just to avoid the bogus uninitialized use warning.
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Tommi Rantala <tt.rantala@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If find_vma() fails, sys_remap_file_pages() will dereference `vma', which
contains NULL. Fix it by checking the pointer.
(We could alternatively check for err==0, but this seems more direct)
(The vm_flags change is to squish a bogus used-uninitialised warning
without adding extra code).
Reported-by: Tommi Rantala <tt.rantala@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
remove_memory() calls walk_memory_range() with [start_pfn, end_pfn), where
end_pfn is exclusive in this range. Therefore, end_pfn needs to be set to
the next page of the end address.
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit 887cbce0ad ("arch Kconfig: centralise ARCH_NO_VIRT_TO_BUS")
I introduced the config sybmol HAVE_VIRT_TO_BUS and selected that where
needed. I am not sure what I was thinking. Instead, just directly
select VIRT_TO_BUS where it is needed.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Looking at mm/process_vm_access.c:process_vm_rw() and comparing it to
compat_process_vm_rw() shows that the compatibility code requires an
explicit "access_ok()" check before calling
compat_rw_copy_check_uvector(). The same difference seems to appear when
we compare fs/read_write.c:do_readv_writev() to
fs/compat.c:compat_do_readv_writev().
This subtle difference between the compat and non-compat requirements
should probably be debated, as it seems to be error-prone. In fact,
there are two others sites that use this function in the Linux kernel,
and they both seem to get it wrong:
Now shifting our attention to fs/aio.c, we see that aio_setup_iocb()
also ends up calling compat_rw_copy_check_uvector() through
aio_setup_vectored_rw(). Unfortunately, the access_ok() check appears to
be missing. Same situation for
security/keys/compat.c:compat_keyctl_instantiate_key_iov().
I propose that we add the access_ok() check directly into
compat_rw_copy_check_uvector(), so callers don't have to worry about it,
and it therefore makes the compat call code similar to its non-compat
counterpart. Place the access_ok() check in the same location where
copy_from_user() can trigger a -EFAULT error in the non-compat code, so
the ABI behaviors are alike on both compat and non-compat.
While we are here, fix compat_do_readv_writev() so it checks for
compat_rw_copy_check_uvector() negative return values.
And also, fix a memory leak in compat_keyctl_instantiate_key_iov() error
handling.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A bunch of RCU callbacks want to be able to do vfree() and end up with
rather kludgy schemes. Just let vfree() do the right thing - put the
victim on llist and schedule actual __vunmap() via schedule_work(), so
that it runs from non-interrupt context.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
A CONFIG_DISCONTIGMEM=y m68k config gave
mm/ksm.c: In function `get_kpfn_nid':
mm/ksm.c:492: error: implicit declaration of function `pfn_to_nid'
linux/mmzone.h declares it for CONFIG_SPARSEMEM and CONFIG_FLATMEM, but
expects the arch's asm/mmzone.h to declare it for CONFIG_DISCONTIGMEM
(see arch/mips/include/asm/mmzone.h for example).
Or perhaps it is only expected when CONFIG_NUMA=y: too much of a maze,
and m68k got away without it so far, so fix the build in mm/ksm.c.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Petr Holasek <pholasek@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
n->end is accessed in sp_insert(). Thus it should be update
before calling sp_insert(). This mistake may make kernel panic.
Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Dave Jones <davej@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Removed the following sparse warnings:
* mm/hugetlb.c:1764:6: warning: symbol
'hugetlb_unregister_node' was not declared.
Should it be static?
* mm/hugetlb.c:1808:6: warning: symbol
'hugetlb_register_node' was not declared.
Should it be static?
Signed-off-by: Claudiu Ghioc <claudiu.ghioc@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
The extern in sys_sparc_64.c was a rudiment of time when do_mremap()
used to exist in MMU case (it doesn't anymore). As for !MMU one,
nothing uses it outside of mm/nommu.c...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and convert a bunch of SYSCALL_DEFINE ones to SYSCALL_DEFINE<n>,
killing the boilerplate crap around them.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull more VFS bits from Al Viro:
"Unfortunately, it looks like xattr series will have to wait until the
next cycle ;-/
This pile contains 9p cleanups and fixes (races in v9fs_fid_add()
etc), fixup for nommu breakage in shmem.c, several cleanups and a bit
more file_inode() work"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
constify path_get/path_put and fs_struct.c stuff
fix nommu breakage in shmem.c
cache the value of file_inode() in struct file
9p: if v9fs_fid_lookup() gets to asking server, it'd better have hashed dentry
9p: make sure ->lookup() adds fid to the right dentry
9p: untangle ->lookup() a bit
9p: double iput() in ->lookup() if d_materialise_unique() fails
9p: v9fs_fid_add() can't fail now
v9fs: get rid of v9fs_dentry
9p: turn fid->dlist into hlist
9p: don't bother with private lock in ->d_fsdata; dentry->d_lock will do just fine
more file_inode() open-coded instances
selinux: opened file can't have NULL or negative ->f_path.dentry
(In the meantime, the hlist traversal macros have changed, so this
required a semantic conflict fixup for the newly hlistified fid->dlist)
Tim found:
WARNING: at arch/x86/kernel/smpboot.c:324 topology_sane.isra.2+0x6f/0x80()
Hardware name: S2600CP
sched: CPU #1's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
smpboot: Booting Node 1, Processors #1
Modules linked in:
Pid: 0, comm: swapper/1 Not tainted 3.9.0-0-generic #1
Call Trace:
set_cpu_sibling_map+0x279/0x449
start_secondary+0x11d/0x1e5
Don Morris reproduced on a HP z620 workstation, and bisected it to
commit e8d1955258 ("acpi, memory-hotplug: parse SRAT before memblock
is ready")
It turns out movable_map has some problems, and it breaks several things
1. numa_init is called several times, NOT just for srat. so those
nodes_clear(numa_nodes_parsed)
memset(&numa_meminfo, 0, sizeof(numa_meminfo))
can not be just removed. Need to consider sequence is: numaq, srat, amd, dummy.
and make fall back path working.
2. simply split acpi_numa_init to early_parse_srat.
a. that early_parse_srat is NOT called for ia64, so you break ia64.
b. for (i = 0; i < MAX_LOCAL_APIC; i++)
set_apicid_to_node(i, NUMA_NO_NODE)
still left in numa_init. So it will just clear result from early_parse_srat.
it should be moved before that....
c. it breaks ACPI_TABLE_OVERIDE...as the acpi table scan is moved
early before override from INITRD is settled.
3. that patch TITLE is total misleading, there is NO x86 in the title,
but it changes critical x86 code. It caused x86 guys did not
pay attention to find the problem early. Those patches really should
be routed via tip/x86/mm.
4. after that commit, following range can not use movable ram:
a. real_mode code.... well..funny, legacy Node0 [0,1M) could be hot-removed?
b. initrd... it will be freed after booting, so it could be on movable...
c. crashkernel for kdump...: looks like we can not put kdump kernel above 4G
anymore.
d. init_mem_mapping: can not put page table high anymore.
e. initmem_init: vmemmap can not be high local node anymore. That is
not good.
If node is hotplugable, the mem related range like page table and
vmemmap could be on the that node without problem and should be on that
node.
We have workaround patch that could fix some problems, but some can not
be fixed.
So just remove that offending commit and related ones including:
f7210e6c4a ("mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to
protect movablecore_map in memblock_overlaps_region().")
01a178a94e ("acpi, memory-hotplug: support getting hotplug info from
SRAT")
27168d38fa ("acpi, memory-hotplug: extend movablemem_map ranges to
the end of node")
e8d1955258 ("acpi, memory-hotplug: parse SRAT before memblock is
ready")
fb06bc8e5f ("page_alloc: bootmem limit with movablecore_map")
42f47e27e7 ("page_alloc: make movablemem_map have higher priority")
6981ec3114 ("page_alloc: introduce zone_movable_limit[] to keep
movable limit for nodes")
34b71f1e04 ("page_alloc: add movable_memmap kernel parameter")
4d59a75125 ("x86: get pg_data_t's memory from other node")
Later we should have patches that will make sure kernel put page table
and vmemmap on local node ram instead of push them down to node0. Also
need to find way to put other kernel used ram to local node ram.
Reported-by: Tim Gardner <tim.gardner@canonical.com>
Reported-by: Don Morris <don.morris@hp.com>
Bisected-by: Don Morris <don.morris@hp.com>
Tested-by: Don Morris <don.morris@hp.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull block IO core bits from Jens Axboe:
"Below are the core block IO bits for 3.9. It was delayed a few days
since my workstation kept crashing every 2-8h after pulling it into
current -git, but turns out it is a bug in the new pstate code (divide
by zero, will report separately). In any case, it contains:
- The big cfq/blkcg update from Tejun and and Vivek.
- Additional block and writeback tracepoints from Tejun.
- Improvement of the should sort (based on queues) logic in the plug
flushing.
- _io() variants of the wait_for_completion() interface, using
io_schedule() instead of schedule() to contribute to io wait
properly.
- Various little fixes.
You'll get two trivial merge conflicts, which should be easy enough to
fix up"
Fix up the trivial conflicts due to hlist traversal cleanups (commit
b67bfe0d42: "hlist: drop the node parameter from iterators").
* 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
block: remove redundant check to bd_openers()
block: use i_size_write() in bd_set_size()
cfq: fix lock imbalance with failed allocations
drivers/block/swim3.c: fix null pointer dereference
block: don't select PERCPU_RWSEM
block: account iowait time when waiting for completion of IO request
sched: add wait_for_completion_io[_timeout]
writeback: add more tracepoints
block: add block_{touch|dirty}_buffer tracepoint
buffer: make touch_buffer() an exported function
block: add @req to bio_{front|back}_merge tracepoints
block: add missing block_bio_complete() tracepoint
block: Remove should_sort judgement when flush blk_plug
block,elevator: use new hashtable implementation
cfq-iosched: add hierarchical cfq_group statistics
cfq-iosched: collect stats from dead cfqgs
cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
block: RCU free request_queue
blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
...
After we create a boot cache, we may allocate from it until it is bootstraped.
This will move the page from the partial list to the cpu slab list. If this
happens, the loop:
list_for_each_entry(p, &n->partial, lru)
that we use to scan for all partial pages will yield nothing, and the pages
will keep pointing to the boot cpu cache, which is of course, invalid. To do
that, we should flush the cache to make sure that the cpu slab is back to the
partial list.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reported-by: Steffen Michalke <StMichalke@web.de>
Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
I'm not sure why, but the hlist for each entry iterators were conceived
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.
The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change it to CONFIG_HAVE_VIRT_TO_BUS and set it in all architecures
that already provide virt_to_bus().
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: James Hogan <james.hogan@imgtec.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: H Hartley Sweeten <hartleys@visionengravers.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Vineet Gupta <Vineet.Gupta1@synopsys.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE
at a time. When munlocking THP pages (or the huge zero page), this
resulted in taking the mm->page_table_lock 512 times in a row.
We can do better by making use of the page_mask returned by
follow_page_mask (for the huge zero page case), or the size of the page
munlock_vma_page() operated on (for the true THP page case).
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The stack vma is designed to grow automatically (marked with VM_GROWSUP
or VM_GROWSDOWN depending on architecture) when an access is made beyond
the existing boundary. However, particularly if you have not limited
your stack at all ("ulimit -s unlimited"), this can cause the stack to
grow even if the access was really just one past *another* segment.
And that's wrong, especially since we first grow the segment, but then
immediately later enforce the stack guard page on the last page of the
segment. So _despite_ first growing the stack segment as a result of
the access, the kernel will then make the access cause a SIGSEGV anyway!
So do the same logic as the guard page check does, and consider an
access to within one page of the next segment to be a bad access, rather
than growing the stack to abut the next segment.
Reported-and-tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull vfs pile (part one) from Al Viro:
"Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
locking violations, etc.
The most visible changes here are death of FS_REVAL_DOT (replaced with
"has ->d_weak_revalidate()") and a new helper getting from struct file
to inode. Some bits of preparation to xattr method interface changes.
Misc patches by various people sent this cycle *and* ocfs2 fixes from
several cycles ago that should've been upstream right then.
PS: the next vfs pile will be xattr stuff."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
saner proc_get_inode() calling conventions
proc: avoid extra pde_put() in proc_fill_super()
fs: change return values from -EACCES to -EPERM
fs/exec.c: make bprm_mm_init() static
ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
ocfs2: fix possible use-after-free with AIO
ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
target: writev() on single-element vector is pointless
export kernel_write(), convert open-coded instances
fs: encode_fh: return FILEID_INVALID if invalid fid_type
kill f_vfsmnt
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
nfsd: handle vfs_getattr errors in acl protocol
switch vfs_getattr() to struct path
default SET_PERSONALITY() in linux/elf.h
ceph: prepopulate inodes only when request is aborted
d_hash_and_lookup(): export, switch open-coded instances
9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
9p: split dropping the acls from v9fs_set_create_acl()
...
This patch is a follow up on below patch:
[PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type
commit: 216b6cbdcb
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Vivek Trivedi <t.vivek@samsung.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Sage Weil <sage@inktank.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Note that provided ->d_dname() reproduces what we used to get for
those guys in e.g. /proc/self/maps; it might be a good idea to change
that to something less ugly, but for now let's keep the existing
user-visible behaviour
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull user namespace and namespace infrastructure changes from Eric W Biederman:
"This set of changes starts with a few small enhnacements to the user
namespace. reboot support, allowing more arbitrary mappings, and
support for mounting devpts, ramfs, tmpfs, and mqueuefs as just the
user namespace root.
I do my best to document that if you care about limiting your
unprivileged users that when you have the user namespace support
enabled you will need to enable memory control groups.
There is a minor bug fix to prevent overflowing the stack if someone
creates way too many user namespaces.
The bulk of the changes are a continuation of the kuid/kgid push down
work through the filesystems. These changes make using uids and gids
typesafe which ensures that these filesystems are safe to use when
multiple user namespaces are in use. The filesystems converted for
3.9 are ceph, 9p, afs, ocfs2, gfs2, ncpfs, nfs, nfsd, and cifs. The
changes for these filesystems were a little more involved so I split
the changes into smaller hopefully obviously correct changes.
XFS is the only filesystem that remains. I was hoping I could get
that in this release so that user namespace support would be enabled
with an allyesconfig or an allmodconfig but it looks like the xfs
changes need another couple of days before it they are ready."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (93 commits)
cifs: Enable building with user namespaces enabled.
cifs: Convert struct cifs_ses to use a kuid_t and a kgid_t
cifs: Convert struct cifs_sb_info to use kuids and kgids
cifs: Modify struct smb_vol to use kuids and kgids
cifs: Convert struct cifsFileInfo to use a kuid
cifs: Convert struct cifs_fattr to use kuid and kgids
cifs: Convert struct tcon_link to use a kuid.
cifs: Modify struct cifs_unix_set_info_args to hold a kuid_t and a kgid_t
cifs: Convert from a kuid before printing current_fsuid
cifs: Use kuids and kgids SID to uid/gid mapping
cifs: Pass GLOBAL_ROOT_UID and GLOBAL_ROOT_GID to keyring_alloc
cifs: Use BUILD_BUG_ON to validate uids and gids are the same size
cifs: Override unmappable incoming uids and gids
nfsd: Enable building with user namespaces enabled.
nfsd: Properly compare and initialize kuids and kgids
nfsd: Store ex_anon_uid and ex_anon_gid as kuids and kgids
nfsd: Modify nfsd4_cb_sec to use kuids and kgids
nfsd: Handle kuids and kgids in the nfs4acl to posix_acl conversion
nfsd: Convert nfsxdr to use kuids and kgids
nfsd: Convert nfs3xdr to use kuids and kgids
...
lockdep, but it's a mechanical change.
Cheers,
Rusty.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJRJAcuAAoJENkgDmzRrbjxsw0P/3eXb+LddYnx0V0uHYdKpCUf
4vdW7X0fX3Z+aUK69IWRL/6ahoO4TpaHYGHBDjEoivyQ0GDq14X7JNWsYYt3LdMf
3wmDgRc2cn/mZOJbFeVpNV8ox5l/xc0CUvV+iQ8tMjfQItXMXgWUFZKMECsXKSO6
eex3lrw9M2jAX2uL8LQPp9W8xtKu24nSZRC6tH5riE/8fCzi1cZPPAqfxP5c8Lee
ZXtbCRSyAFENZLpKyMe1PC7HvtJyi5NDn9xwOQiXULZV/VOlvP94DGBLIKCM/6dn
4QvZxpG0P0uOlpCgRAVLyh/z7g4XY4VF/fHopLCmEcqLsvgD+V2LQpQ9zWUalLPC
Z+pUpz2vu0gIddPU1nR8R6oGpEdJ8O12aJle62p/RSXWZGx12qUQ+Tamu0tgKcv1
AsiJfbUGNDYfxgU6sHsoQjl2f68LTVckCU1C1LqEbW/S104EIORtGx30CHM4LRiO
32kDC5TtgYDBKQAIqJ4bL48ZMh+9W3uX40p7xzOI5khHQjvswUKa3jcxupU0C1uv
lx8KXo7pn8WT33QGysWC782wJCgJuzSc2vRn+KQoqoynuHGM6agaEtR59gil3QWO
rQEcxH63BBRDgHlg4FM9IkJwwsnC3PWKL8gbX0uAWXAPMbgapJkuuGZAwt0WDGVK
+GszxsFkCjlW0mK0egTb
=tiSY
-----END PGP SIGNATURE-----
Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull module update from Rusty Russell:
"The sweeping change is to make add_taint() explicitly indicate whether
to disable lockdep, but it's a mechanical change."
* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
MODSIGN: Add option to not sign modules during modules_install
MODSIGN: Add -s <signature> option to sign-file
MODSIGN: Specify the hash algorithm on sign-file command line
MODSIGN: Simplify Makefile with a Kconfig helper
module: clean up load_module a little more.
modpost: Ignore ARC specific non-alloc sections
module: constify within_module_*
taint: add explicit flag to show whether lock dep is still OK.
module: printk message when module signature fail taints kernel.
It is a pity to have MAX_NUMNODES+MAX_NUMNODES tree roots statically
allocated, particularly when very few users will ever actually tune
merge_across_nodes 0 to use more than 1+1 of those trees. Not a big
deal (only 16kB wasted on each machine with CONFIG_MAXSMP), but a pity.
Start off with 1+1 statically allocated, then if merge_across_nodes is
ever tuned, allocate for nr_node_ids+nr_node_ids. Do not attempt to
free up the extra if it's tuned back, that would be a waste of effort.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I dislike the way in which "swapcache" gets used in do_swap_page():
there is always a page from swapcache there (even if maybe uncached by
the time we lock it), but tests are made according to "swapcache".
Rework that with "page != swapcache", as has been done in unuse_pte().
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Before establishing that KSM page migration was the cause of my
WARN_ON_ONCE(page_mapped(page))s, I suspected that they came from the
lack of a ksm_might_need_to_copy() in swapoff's unuse_pte() - which in
many respects is equivalent to faulting in a page.
In fact I've never caught that as the cause: but in theory it does at
least need the KSM_RUN_UNMERGE check in ksm_might_need_to_copy(), to
avoid bringing a KSM page back in when it's not supposed to be.
I intended to copy how it's done in do_swap_page(), but have a strong
aversion to how "swapcache" ends up being used there: rework it with
"page != swapcache".
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In "ksm: remove old stable nodes more thoroughly" I said that I'd never
seen its WARN_ON_ONCE(page_mapped(page)). True at the time of writing,
but it soon appeared once I tried fuller tests on the whole series.
It turned out to be due to the KSM page migration itself: unmerge_and_
remove_all_rmap_items() failed to locate and replace all the KSM pages,
because of that hiatus in page migration when old pte has been replaced
by migration entry, but not yet by new pte. follow_page() finds no page
at that instant, but a KSM page reappears shortly after, without a
fault.
Add FOLL_MIGRATION flag, so follow_page() can do migration_entry_wait()
for KSM's break_cow(). I'd have preferred to avoid another flag, and do
it every time, in case someone else makes the same easy mistake; but did
not find another transgressor (the common get_user_pages() is of course
safe), and cannot be sure that every follow_page() caller is prepared to
sleep - ia64's xencomm_vtop()? Now, THP's wait_split_huge_page() can
already sleep there, since anon_vma locking was changed to mutex, but
maybe that's somehow excluded.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Think of struct rmap_item as an extension of struct page (restricted to
MADV_MERGEABLE areas): there may be a lot of them, we need to keep them
small, especially on 32-bit architectures of limited lowmem.
Siting "int nid" after "unsigned int checksum" works nicely on 64-bit,
making no change to its 64-byte struct rmap_item; but bloats the 32-bit
struct rmap_item from (nicely cache-aligned) 32 bytes to 36 bytes, which
rounds up to 40 bytes once allocated from slab. We'd better avoid that.
Hey, I only just remembered that the anon_vma pointer in struct
rmap_item has no purpose until the rmap_item is hung from a stable tree
node (which has its own nid field); and rmap_item's nid field no purpose
than to say which tree root to tell rb_erase() when unlinking from an
unstable tree.
Double them up in a union. There's just one place where we set anon_vma
early (when we already hold mmap_sem): now we must remove tree_rmap_item
from its unstable tree there, before overwriting nid. No need to
spatter BUG()s around: we'd be seeing oopses if this were wrong.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An inconsistency emerged in reviewing the NUMA node changes to KSM: when
meeting a page from the wrong NUMA node in a stable tree, we say that
it's okay for comparisons, but not as a leaf for merging; whereas when
meeting a page from the wrong NUMA node in an unstable tree, we bail out
immediately.
Now, it might be that a wrong NUMA node in an unstable tree is more
likely to correlate with instablility (different content, with rbnode
now misplaced) than page migration; but even so, we are accustomed to
instablility in the unstable tree.
Without strong evidence for which strategy is generally better, I'd
rather be consistent with what's done in the stable tree: accept a page
from the wrong NUMA node for comparison, but not as a leaf for merging.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Added slightly more detail to the Documentation of merge_across_nodes, a
few comments in areas indicated by review, and renamed get_ksm_page()'s
argument from "locked" to "lock_it". No functional change.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix several mempolicy leaks in the tmpfs mount logic. These leaks are
slow - on the order of one object leaked per mount attempt.
Leak 1 (umount doesn't free mpol allocated in mount):
while true; do
mount -t tmpfs -o mpol=interleave,size=100M nodev /mnt
umount /mnt
done
Leak 2 (errors parsing remount options will leak mpol):
mount -t tmpfs -o size=100M nodev /mnt
while true; do
mount -o remount,mpol=interleave,size=x /mnt 2> /dev/null
done
umount /mnt
Leak 3 (multiple mpol per mount leak mpol):
while true; do
mount -t tmpfs -o mpol=interleave,mpol=interleave,size=100M nodev /mnt
umount /mnt
done
This patch fixes all of the above. I could have broken the patch into
three pieces but is seemed easier to review as one.
[akpm@linux-foundation.org: fix handling of mpol_parse_str() errors, per Hugh]
Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The tmpfs remount logic preserves filesystem mempolicy if the mpol=M
option is not specified in the remount request. A new policy can be
specified if mpol=M is given.
Before this patch remounting an mpol bound tmpfs without specifying
mpol= mount option in the remount request would set the filesystem's
mempolicy object to a freed mempolicy object.
To reproduce the problem boot a DEBUG_PAGEALLOC kernel and run:
# mkdir /tmp/x
# mount -t tmpfs -o size=100M,mpol=interleave nodev /tmp/x
# grep /tmp/x /proc/mounts
nodev /tmp/x tmpfs rw,relatime,size=102400k,mpol=interleave:0-3 0 0
# mount -o remount,size=200M nodev /tmp/x
# grep /tmp/x /proc/mounts
nodev /tmp/x tmpfs rw,relatime,size=204800k,mpol=??? 0 0
# note ? garbage in mpol=... output above
# dd if=/dev/zero of=/tmp/x/f count=1
# panic here
Panic:
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [< (null)>] (null)
[...]
Oops: 0010 [#1] SMP DEBUG_PAGEALLOC
Call Trace:
mpol_shared_policy_init+0xa5/0x160
shmem_get_inode+0x209/0x270
shmem_mknod+0x3e/0xf0
shmem_create+0x18/0x20
vfs_create+0xb5/0x130
do_last+0x9a1/0xea0
path_openat+0xb3/0x4d0
do_filp_open+0x42/0xa0
do_sys_open+0xfe/0x1e0
compat_sys_open+0x1b/0x20
cstar_dispatch+0x7/0x1f
Non-debug kernels will not crash immediately because referencing the
dangling mpol will not cause a fault. Instead the filesystem will
reference a freed mempolicy object, which will cause unpredictable
behavior.
The problem boils down to a dropped mpol reference below if
shmem_parse_options() does not allocate a new mpol:
config = *sbinfo
shmem_parse_options(data, &config, true)
mpol_put(sbinfo->mpol)
sbinfo->mpol = config.mpol /* BUG: saves unreferenced mpol */
This patch avoids the crash by not releasing the mempolicy if
shmem_parse_options() doesn't create a new mpol.
How far back does this issue go? I see it in both 2.6.36 and 3.3. I did
not look back further.
Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rob van der Heij reported the following (paraphrased) on private mail.
The scenario is that I want to avoid backups to fill up the page
cache and purge stuff that is more likely to be used again (this is
with s390x Linux on z/VM, so I don't give it as much memory that
we don't care anymore). So I have something with LD_PRELOAD that
intercepts the close() call (from tar, in this case) and issues
a posix_fadvise() just before closing the file.
This mostly works, except for small files (less than 14 pages)
that remains in page cache after the face.
Unfortunately Rob has not had a chance to test this exact patch but the
test program below should be reproducing the problem he described.
The issue is the per-cpu pagevecs for LRU additions. If the pages are
added by one CPU but fadvise() is called on another then the pages
remain resident as the invalidate_mapping_pages() only drains the local
pagevecs via its call to pagevec_release(). The user-visible effect is
that a program that uses fadvise() properly is not obeyed.
A possible fix for this is to put the necessary smarts into
invalidate_mapping_pages() to globally drain the LRU pagevecs if a
pagevec page could not be discarded. The downside with this is that an
inode cache shrink would send a global IPI and memory pressure
potentially causing global IPI storms is very undesirable.
Instead, this patch adds a check during fadvise(POSIX_FADV_DONTNEED) to
check if invalidate_mapping_pages() discarded all the requested pages.
If a subset of pages are discarded it drains the LRU pagevecs and tries
again. If the second attempt fails, it assumes it is due to the pages
being mapped, locked or dirty and does not care. With this patch, an
application using fadvise() correctly will be obeyed but there is a
downside that a malicious application can force the kernel to send
global IPIs and increase overhead.
If accepted, I would like this to be considered as a -stable candidate.
It's not an urgent issue but it's a system call that is not working as
advertised which is weak.
The following test program demonstrates the problem. It should never
report that pages are still resident but will without this patch. It
assumes that CPU 0 and 1 exist.
int main() {
int fd;
int pagesize = getpagesize();
ssize_t written = 0, expected;
char *buf;
unsigned char *vec;
int resident, i;
cpu_set_t set;
/* Prepare a buffer for writing */
expected = FILESIZE_PAGES * pagesize;
buf = malloc(expected + 1);
if (buf == NULL) {
printf("ENOMEM\n");
exit(EXIT_FAILURE);
}
buf[expected] = 0;
memset(buf, 'a', expected);
/* Prepare the mincore vec */
vec = malloc(FILESIZE_PAGES);
if (vec == NULL) {
printf("ENOMEM\n");
exit(EXIT_FAILURE);
}
/* Bind ourselves to CPU 0 */
CPU_ZERO(&set);
CPU_SET(0, &set);
if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
perror("sched_setaffinity");
exit(EXIT_FAILURE);
}
/* open file, unlink and write buffer */
fd = open("fadvise-test-file", O_CREAT|O_EXCL|O_RDWR);
if (fd == -1) {
perror("open");
exit(EXIT_FAILURE);
}
unlink("fadvise-test-file");
while (written < expected) {
ssize_t this_write;
this_write = write(fd, buf + written, expected - written);
if (this_write == -1) {
perror("write");
exit(EXIT_FAILURE);
}
written += this_write;
}
free(buf);
/*
* Force ourselves to another CPU. If fadvise only flushes the local
* CPUs pagevecs then the fadvise will fail to discard all file pages
*/
CPU_ZERO(&set);
CPU_SET(1, &set);
if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
perror("sched_setaffinity");
exit(EXIT_FAILURE);
}
/* sync and fadvise to discard the page cache */
fsync(fd);
if (posix_fadvise(fd, 0, expected, POSIX_FADV_DONTNEED) == -1) {
perror("posix_fadvise");
exit(EXIT_FAILURE);
}
/* map the file and use mincore to see which parts of it are resident */
buf = mmap(NULL, expected, PROT_READ, MAP_SHARED, fd, 0);
if (buf == NULL) {
perror("mmap");
exit(EXIT_FAILURE);
}
if (mincore(buf, expected, vec) == -1) {
perror("mincore");
exit(EXIT_FAILURE);
}
/* Check residency */
for (i = 0, resident = 0; i < FILESIZE_PAGES; i++) {
if (vec[i])
resident++;
}
if (resident != 0) {
printf("Nr unexpected pages resident: %d\n", resident);
exit(EXIT_FAILURE);
}
munmap(buf, expected);
close(fd);
free(vec);
exit(EXIT_SUCCESS);
}
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Rob van der Heij <rvdheij@gmail.com>
Tested-by: Rob van der Heij <rvdheij@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We at SGI have a need to address some very high physical address ranges
with our GRU (global reference unit), sometimes across partitioned
machine boundaries and sometimes with larger addresses than the cpu
supports. We do this with the aid of our own 'extended vma' module
which mimics the vma. When something (either unmap or exit) frees an
'extended vma' we use the mmu notifiers to clean them up.
We had been able to mimic the functions
__mmu_notifier_invalidate_range_start() and
__mmu_notifier_invalidate_range_end() by locking the per-mm lock and
walking the per-mm notifier list. But with the change to a global srcu
lock (static in mmu_notifier.c) we can no longer do that. Our module has
no access to that lock.
So we request that these two functions be exported.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Acked-by: Robin Holt <holt@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This change adds a follow_page_mask function which is equivalent to
follow_page, but with an extra page_mask argument.
follow_page_mask sets *page_mask to HPAGE_PMD_NR - 1 when it encounters
a THP page, and to 0 in other cases.
__get_user_pages() makes use of this in order to accelerate populating
THP ranges - that is, when both the pages and vmas arrays are NULL, we
don't need to iterate HPAGE_PMD_NR times to cover a single THP page (and
we also avoid taking mm->page_table_lock that many times).
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use long type for page counts in mm_populate() so as to avoid integer
overflow when running the following test code:
int main(void) {
void *p = mmap(NULL, 0x100000000000, PROT_READ,
MAP_PRIVATE | MAP_ANON, -1, 0);
printf("p: %p\n", p);
mlockall(MCL_CURRENT);
printf("done\n");
return 0;
}
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
nr_free_zone_pages(), nr_free_buffer_pages() and nr_free_pagecache_pages()
are horribly badly named, so accurately document them with code comments
in case of the misuse of them.
[akpm@linux-foundation.org: tweak comments]
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
error_states[] has two separate states "unevictable LRU page" and
"mlocked LRU page", and the former one has the higher priority now. But
because of that the latter one is rarely chosen because pages with
PageMlocked highly likely have PG_unevictable set. On the other hand,
PG_unevictable without PageMlocked is common for ramfs or SHM_LOCKed
shared memory, so reversing the priority of these two states helps us
clearly distinguish them.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Chen Gong <gong.chen@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memory_failure() can't handle memory errors on mlocked pages correctly,
because page_action() judges such errors as ones on "unknown pages"
instead of ones on "unevictable LRU page" or "mlocked LRU page". In
order to determine page_state page_action() checks page flags at the
timing of the judgement, but such page flags are not the same with those
just after memory_failure() is called, because memory_failure() does
unmapping of the error pages before doing page_action(). This unmapping
changes the page state, especially page_remove_rmap() (called from
try_to_unmap_one()) clears PG_mlocked, so page_action() can't catch
mlocked pages after that.
With this patch, we store the page flag of the error page before doing
unmap, and (only) if the first check with page flags at the time decided
the error page is unknown, we do the second check with the stored page
flag. This implementation doesn't change error handling for the page
types for which the first check can determine the page state correctly.
[akpm@linux-foundation.org: tweak comments]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Chen Gong <gong.chen@linux.intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Whilst I run the risk of a flogging for disloyalty to the Lord of Sealand,
I do have CONFIG_MEMCG=y CONFIG_MEMCG_KMEM not set, and grow tired of the
"mm/memcontrol.c:4972:12: warning: `memcg_propagate_kmem' defined but not
used [-Wunused-function]" seen in 3.8-rc: move the #ifdef outwards.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This variable is calculated from nr_free_pagecache_pages so
change its type to unsigned long.
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, the amount of RAM that functions nr_free_*_pages return is
held in unsigned int. But in machines with big memory (exceeding 16TB),
the amount may be incorrect because of overflow, so fix it.
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Simon Horman <horms@verge.net.au>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We should encourage all memcg controller initialization independent on a
specific mem_cgroup to be done here rather than exploit css_alloc
callback and assume that nothing happens before root cgroup is created.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <htejun@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memcg_stock are currently initialized during the root cgroup allocation
which is OK but it pointlessly pollutes memcg allocation code with
something that can be called when the memcg subsystem is initialized by
mem_cgroup_init along with other controller specific parts.
This patch wraps the current memcg_stock initialization code into a
helper calls it from the controller subsystem initialization code.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <htejun@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Per-node-zone soft limit tree is currently initialized when the root
cgroup is created which is OK but it pointlessly pollutes memcg
allocation code with something that can be called when the memcg
subsystem is initialized by mem_cgroup_init along with other controller
specific parts.
While we are at it let's make mem_cgroup_soft_limit_tree_init void
because it doesn't make much sense to report memory failure because if
we fail to allocate memory that early during the boot then we are
screwed anyway (this saves some code).
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <htejun@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Recently, Luigi reported there are lots of free swap space when OOM
happens. It's easily reproduced on zram-over-swap, where many instance
of memory hogs are running and laptop_mode is enabled. He said there
was no problem when he disabled laptop_mode. The problem when I
investigate problem is following as.
Assumption for easy explanation: There are no page cache page in system
because they all are already reclaimed.
1. try_to_free_pages disable may_writepage when laptop_mode is enabled.
2. shrink_inactive_list isolates victim pages from inactive anon lru list.
3. shrink_page_list adds them to swapcache via add_to_swap but it doesn't
pageout because sc->may_writepage is 0 so the page is rotated back into
inactive anon lru list. The add_to_swap made the page Dirty by SetPageDirty.
4. 3 couldn't reclaim any pages so do_try_to_free_pages increase priority and
retry reclaim with higher priority.
5. shrink_inactlive_list try to isolate victim pages from inactive anon lru list
but got failed because it try to isolate pages with ISOLATE_CLEAN mode but
inactive anon lru list is full of dirty pages by 3 so it just returns
without any reclaim progress.
6. do_try_to_free_pages doesn't set may_writepage due to zero total_scanned.
Because sc->nr_scanned is increased by shrink_page_list but we don't call
shrink_page_list in 5 due to short of isolated pages.
Above loop is continued until OOM happens.
The problem didn't happen before [1] was merged because old logic's
isolatation in shrink_inactive_list was successful and tried to call
shrink_page_list to pageout them but it still ends up failed to page out
by may_writepage. But important point is that sc->nr_scanned was
increased although we couldn't swap out them so do_try_to_free_pages
could set may_writepages.
Since commit f80c067361 ("mm: zone_reclaim: make isolate_lru_page()
filter-aware") was introduced, it's not a good idea any more to depends
on only the number of scanned pages for setting may_writepage. So this
patch adds new trigger point of setting may_writepage as below
DEF_PRIOIRTY - 2 which is used to show the significant memory pressure
in VM so it's good fit for our purpose which would be better to lose
power saving or clickety rather than OOM killing.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Luigi Semenzato <semenzato@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make a sweep through mm/ and convert code that uses -1 directly to using
the more appropriate NUMA_NO_NODE.
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a race condition between mmu_notifier_unregister() and
__mmu_notifier_release().
Assume two tasks, one calling mmu_notifier_unregister() as a result of a
filp_close() ->flush() callout (task A), and the other calling
mmu_notifier_release() from an mmput() (task B).
A B
t1 srcu_read_lock()
t2 if (!hlist_unhashed())
t3 srcu_read_unlock()
t4 srcu_read_lock()
t5 hlist_del_init_rcu()
t6 synchronize_srcu()
t7 srcu_read_unlock()
t8 hlist_del_rcu() <--- NULL pointer deref.
Additionally, the list traversal in __mmu_notifier_release() is not
protected by the by the mmu_notifier_mm->hlist_lock which can result in
callouts to the ->release() notifier from both mmu_notifier_unregister()
and __mmu_notifier_release().
-stable suggestions:
The stable trees prior to 3.7.y need commits 21a92735f6 and
70400303ce cherry-picked in that order prior to cherry-picking this
commit. The 3.7.y tree already has those two commits.
Signed-off-by: Robin Holt <holt@sgi.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Sagi Grimberg <sagig@mellanox.co.il>
Cc: Haggai Eran <haggaie@mellanox.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ensure_zone_is_initialized() checks if a zone is in a empty & not
initialized state (typically occuring after it is created in memory
hotplugging), and, if so, calls init_currently_empty_zone() to
initialize the zone.
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: David Hansen <dave@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a debug message which prints when a page is found outside of the
boundaries of the zone it should belong to. Format is:
"page $pfn outside zone [ $start_pfn - $end_pfn ]"
[akpm@linux-foundation.org: s/pr_debug/pr_err/]
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: David Hansen <dave@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Freeing pages to uninitialized zones is not handled by
__free_one_page(), and should never happen when the code is correct.
Ran into this while writing some code that dynamically onlines extra
zones.
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: David Hansen <dave@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add 2 helpers (zone_end_pfn() and zone_spans_pfn()) to reduce code
duplication.
This also switches to using them in compaction (where an additional
variable needed to be renamed), page_alloc, vmstat, memory_hotplug, and
kmemleak.
Note that in compaction.c I avoid calling zone_end_pfn() repeatedly
because I expect at some point the sycronization issues with start_pfn &
spanned_pages will need fixing, either by actually using the seqlock or
clever memory barrier usage.
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: David Hansen <dave@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The fact that mlock calls get_user_pages, and get_user_pages might call
mlock when expanding a stack looks like a potential recursion.
However, mlock makes sure the requested range is already contained
within a vma, so no stack expansion will actually happen from mlock.
Should this ever change: the stack expansion mlocks only the newly
expanded range and so will not result in recursive expansion.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An inactive file list is considered low when its active counterpart is
bigger, regardless of whether it is a global zone LRU list or a memcg
zone LRU list. The only difference is in how the LRU size is assessed.
get_lru_size() does the right thing for both global and memcg reclaim
situations.
Get rid of inactive_file_is_low_global() and
mem_cgroup_inactive_file_is_low() by using get_lru_size() and compare
the numbers in common code.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In shmem_find_get_pages_and_swap(), use the faster radix tree iterator
construct from commit 78c1d78488 ("radix-tree: introduce bit-optimized
iterator").
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Complaints are rare, but lockdep still does not understand the way
ksm_memory_callback(MEM_GOING_OFFLINE) takes ksm_thread_mutex, and holds
it until the ksm_memory_callback(MEM_OFFLINE): that appears to be a
problem because notifier callbacks are made under down_read of
blocking_notifier_head->rwsem (so first the mutex is taken while holding
the rwsem, then later the rwsem is taken while still holding the mutex);
but is not in fact a problem because mem_hotplug_mutex is held
throughout the dance.
There was an attempt to fix this with mutex_lock_nested(); but if that
happened to fool lockdep two years ago, apparently it does so no longer.
I had hoped to eradicate this issue in extending KSM page migration not
to need the ksm_thread_mutex. But then realized that although the page
migration itself is safe, we do still need to lock out ksmd and other
users of get_ksm_page() while offlining memory - at some point between
MEM_GOING_OFFLINE and MEM_OFFLINE, the struct pages themselves may
vanish, and get_ksm_page()'s accesses to them become a violation.
So, give up on holding ksm_thread_mutex itself from MEM_GOING_OFFLINE to
MEM_OFFLINE, and add a KSM_RUN_OFFLINE flag, and wait_while_offlining()
checks, to achieve the same lockout without being caught by lockdep.
This is less elegant for KSM, but it's more important to keep lockdep
useful to other users - and I apologize for how long it took to fix.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No functional change, but the only purpose of the offlining argument to
migrate_pages() etc, was to ensure that __unmap_and_move() could migrate a
KSM page for memory hotremove (which took ksm_thread_mutex) but not for
other callers. Now all cases are safe, remove the arg.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Migration of KSM pages is now safe: remove the PageKsm restrictions from
mempolicy.c and migrate.c.
But keep PageKsm out of __unmap_and_move()'s anon_vma contortions, which
are irrelevant to KSM: it looks as if that code was preventing hotremove
migration of KSM pages, unless they happened to be in swapcache.
There is some question as to whether enforcing a NUMA mempolicy migration
ought to migrate KSM pages, mapped into entirely unrelated processes; but
moving page_mapcount > 1 is only permitted with MPOL_MF_MOVE_ALL anyway,
and it seems reasonable to assume that you wouldn't set MADV_MERGEABLE on
any area where this is a worry.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The new KSM NUMA merge_across_nodes knob introduces a problem, when it's
set to non-default 0: if a KSM page is migrated to a different NUMA node,
how do we migrate its stable node to the right tree? And what if that
collides with an existing stable node?
ksm_migrate_page() can do no more than it's already doing, updating
stable_node->kpfn: the stable tree itself cannot be manipulated without
holding ksm_thread_mutex. So accept that a stable tree may temporarily
indicate a page belonging to the wrong NUMA node, leave updating until the
next pass of ksmd, just be careful not to merge other pages on to a
misplaced page. Note nid of holding tree in stable_node, and recognize
that it will not always match nid of kpfn.
A misplaced KSM page is discovered, either when ksm_do_scan() next comes
around to one of its rmap_items (we now have to go to cmp_and_merge_page
even on pages in a stable tree), or when stable_tree_search() arrives at a
matching node for another page, and this node page is found misplaced.
In each case, move the misplaced stable_node to a list of migrate_nodes
(and use the address of migrate_nodes as magic by which to identify them):
we don't need them in a tree. If stable_tree_search() finds no match for
a page, but it's currently exiled to this list, then slot its stable_node
right there into the tree, bringing all of its mappings with it; otherwise
they get migrated one by one to the original page of the colliding node.
stable_tree_search() is now modelled more like stable_tree_insert(), in
order to handle these insertions of migrated nodes.
remove_node_from_stable_tree(), remove_all_stable_nodes() and
ksm_check_stable_tree() have to handle the migrate_nodes list as well as
the stable tree itself. Less obviously, we do need to prune the list of
stale entries from time to time (scan_get_next_rmap_item() does it once
each full scan): whereas stale nodes in the stable tree get naturally
pruned as searches try to brush past them, these migrate_nodes may get
forgotten and accumulate.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KSM page migration is already supported in the case of memory hotremove,
which takes the ksm_thread_mutex across all its migrations to keep life
simple.
But the new KSM NUMA merge_across_nodes knob introduces a problem, when
it's set to non-default 0: if a KSM page is migrated to a different NUMA
node, how do we migrate its stable node to the right tree? And what if
that collides with an existing stable node?
So far there's no provision for that, and this patch does not attempt to
deal with it either. But how will I test a solution, when I don't know
how to hotremove memory? The best answer is to enable KSM page migration
in all cases now, and test more common cases. With THP and compaction
added since KSM came in, page migration is now mainstream, and it's a
shame that a KSM page can frustrate freeing a page block.
Without worrying about merge_across_nodes 0 for now, this patch gets KSM
page migration working reliably for default merge_across_nodes 1 (but
leave the patch enabling it until near the end of the series).
It's much simpler than I'd originally imagined, and does not require an
additional tier of locking: page migration relies on the page lock, KSM
page reclaim relies on the page lock, the page lock is enough for KSM page
migration too.
Almost all the care has to be in get_ksm_page(): that's the function which
worries about when a stable node is stale and should be freed, now it also
has to worry about the KSM page being migrated.
The only new overhead is an additional put/get/lock/unlock_page when
stable_tree_search() arrives at a matching node: to make sure migration
respects the raised page count, and so does not migrate the page while
we're busy with it here. That's probably avoidable, either by changing
internal interfaces from using kpage to stable_node, or by moving the
ksm_migrate_page() callsite into a page_freeze_refs() section (even if not
swapcache); but this works well, I've no urge to pull it apart now.
(Descents of the stable tree may pass through nodes whose KSM pages are
under migration: being unlocked, the raised page count does not prevent
that, nor need it: it's safe to memcmp against either old or new page.)
You might worry about mremap, and whether page migration's rmap_walk to
remove migration entries will find all the KSM locations where it inserted
earlier: that should already be handled, by the satisfyingly heavy hammer
of move_vma()'s call to ksm_madvise(,,,MADV_UNMERGEABLE,).
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Switching merge_across_nodes after running KSM is liable to oops on stale
nodes still left over from the previous stable tree. It's not something
that people will often want to do, but it would be lame to demand a reboot
when they're trying to determine which merge_across_nodes setting is best.
How can this happen? We only permit switching merge_across_nodes when
pages_shared is 0, and usually set run 2 to force that beforehand, which
ought to unmerge everything: yet oopses still occur when you then run 1.
Three causes:
1. The old stable tree (built according to the inverse
merge_across_nodes) has not been fully torn down. A stable node
lingers until get_ksm_page() notices that the page it references no
longer references it: but the page is not necessarily freed as soon as
expected, particularly when swapcache.
Fix this with a pass through the old stable tree, applying
get_ksm_page() to each of the remaining nodes (most found stale and
removed immediately), with forced removal of any left over. Unless the
page is still mapped: I've not seen that case, it shouldn't occur, but
better to WARN_ON_ONCE and EBUSY than BUG.
2. __ksm_enter() has a nice little optimization, to insert the new mm
just behind ksmd's cursor, so there's a full pass for it to stabilize
(or be removed) before ksmd addresses it. Nice when ksmd is running,
but not so nice when we're trying to unmerge all mms: we were missing
those mms forked and inserted behind the unmerge cursor. Easily fixed
by inserting at the end when KSM_RUN_UNMERGE.
3. It is possible for a KSM page to be faulted back from swapcache
into an mm, just after unmerge_and_remove_all_rmap_items() scanned past
it. Fix this by copying on fault when KSM_RUN_UNMERGE: but that is
private to ksm.c, so dissolve the distinction between
ksm_might_need_to_copy() and ksm_does_need_to_copy(), doing it all in
the one call into ksm.c.
A long outstanding, unrelated bugfix sneaks in with that third fix:
ksm_does_need_to_copy() would copy from a !PageUptodate page (implying I/O
error when read in from swap) to a page which it then marks Uptodate. Fix
this case by not copying, letting do_swap_page() discover the error.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In some places where get_ksm_page() is used, we need the page to be locked.
When KSM migration is fully enabled, we shall want that to make sure that
the page just acquired cannot be migrated beneath us (raised page count is
only effective when there is serialization to make sure migration
notices). Whereas when navigating through the stable tree, we certainly
do not want to lock each node (raised page count is enough to guarantee
the memcmps, even if page is migrated to another node).
Since we're about to add another use case, add the locked argument to
get_ksm_page() now.
Hmm, what's that rcu_read_lock() about? Complete misunderstanding, I
really got the wrong end of the stick on that! There's a configuration in
which page_cache_get_speculative() can do something cheaper than
get_page_unless_zero(), relying on its caller's rcu_read_lock() to have
disabled preemption for it. There's no need for rcu_read_lock() around
get_page_unless_zero() (and mapping checks) here. Cut out that silliness
before making this any harder to understand.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Memory hotremove's ksm_check_stable_tree() is pitifully inefficient
(restarting whenever it finds a stale node to remove), but rearrange so
that at least it does not needlessly restart from nid 0 each time. And
add a couple of comments: here is why we keep pfn instead of page.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add NUMA() and DO_NUMA() macros to minimize blight of #ifdef
CONFIG_NUMAs (but indeed we don't want to expand struct rmap_item by nid
when not NUMA). Add comment, remove "unsigned" from rmap_item->nid, as
"int nid" elsewhere. Define ksm_merge_across_nodes 1U when #ifndef NUMA
to help optimizing out. Use ?: in get_kpfn_nid(). Adjust a few
comments noticed in ongoing work.
Leave stable_tree_insert()'s rb_linkage until after the node has been
set up, as unstable_tree_search_insert() does: ksm_thread_mutex and page
lock make either way safe, but we're going to copy and I prefer this
precedent.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Here's a KSM series, based on mmotm 2013-01-23-17-04: starting with
Petr's v7 "KSM: numa awareness sysfs knob"; then fixing the two issues
we had with that, fully enabling KSM page migration on the way.
(A different kind of KSM/NUMA issue which I've certainly not begun to
address here: when KSM pages are unmerged, there's usually no sense in
preferring to allocate the new pages local to the caller's node.)
This patch:
Introduces new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes
which control merging pages across different numa nodes. When it is set
to zero only pages from the same node are merged, otherwise pages from
all nodes can be merged together (default behavior).
Typical use-case could be a lot of KVM guests on NUMA machine and cpus
from more distant nodes would have significant increase of access
latency to the merged ksm page. Sysfs knob was choosen for higher
variability when some users still prefers higher amount of saved
physical memory regardless of access latency.
Every numa node has its own stable & unstable trees because of faster
searching and inserting. Changing of merge_across_nodes value is
possible only when there are not any ksm shared pages in system.
I've tested this patch on numa machines with 2, 4 and 8 nodes and
measured speed of memory access inside of KVM guests with memory pinned
to one of nodes with this benchmark:
http://pholasek.fedorapeople.org/alloc_pg.c
Population standard deviations of access times in percentage of average
were following:
merge_across_nodes=1
2 nodes 1.4%
4 nodes 1.6%
8 nodes 1.7%
merge_across_nodes=0
2 nodes 1%
4 nodes 0.32%
8 nodes 0.018%
RFC: https://lkml.org/lkml/2011/11/30/91
v1: https://lkml.org/lkml/2012/1/23/46
v2: https://lkml.org/lkml/2012/6/29/105
v3: https://lkml.org/lkml/2012/9/14/550
v4: https://lkml.org/lkml/2012/9/23/137
v5: https://lkml.org/lkml/2012/12/10/540
v6: https://lkml.org/lkml/2012/12/23/154
v7: https://lkml.org/lkml/2012/12/27/225
Hugh notes that this patch brings two problems, whose solution needs
further support in mm/ksm.c, which follows in subsequent patches:
1) switching merge_across_nodes after running KSM is liable to oops
on stale nodes still left over from the previous stable tree;
2) memory hotremove may migrate KSM pages, but there is no provision
here for !merge_across_nodes to migrate nodes to the proper tree.
Signed-off-by: Petr Holasek <pholasek@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The function names page_xchg_last_nid(), page_last_nid() and
reset_page_last_nid() were judged to be inconsistent so rename them to a
struct_field_op style pattern. As it looked jarring to have
reset_page_mapcount() and page_nid_reset_last() beside each other in
memmap_init_zone(), this patch also renames reset_page_mapcount() to
page_mapcount_reset(). There are others like init_page_count() but as
it is used throughout the arch code a rename would likely cause more
conflicts than it is worth.
[akpm@linux-foundation.org: fix zcache]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When use_hierarchy is enabled, we acquire an extra reference count in
our parent during cgroup creation. We don't release it, though, if any
failure exist in the creation process.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reported-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We were deferring the kmemcg static branch increment to a later time,
due to a nasty dependency between the cpu_hotplug lock, taken by the
jump label update, and the cgroup_lock.
Now we no longer take the cgroup lock, and we can save ourselves the
trouble.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After the preparation work done in earlier patches, the cgroup_lock can
be trivially replaced with a memcg-specific lock. This is an automatic
translation at every site where the values involved were queried.
The sites where values are written, however, used to be naturally called
under cgroup_lock. This is the case for instance in the css_online
callback. For those, we now need to explicitly add the memcg lock.
With this, all the calls to cgroup_lock outside cgroup core are gone.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, we use cgroups' provided list of children to verify if it is
safe to proceed with any value change that is dependent on the cgroup
being empty.
This is less than ideal, because it enforces a dependency over cgroup
core that we would be better off without. The solution proposed here is
to iterate over the child cgroups and if any is found that is already
online, we bounce and return: we don't really care how many children we
have, only if we have any.
This is also made to be hierarchy aware. IOW, cgroups with hierarchy
disabled, while they still exist, will be considered for the purpose of
this interface as having no children.
[akpm@linux-foundation.org: tweak comments]
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is a preparatory work for later locking rework to get rid of
big cgroup lock from memory controller code.
The memory controller uses some tunables to adjust its operation. Those
tunables are inherited from parent to children upon children
intialization. For most of them, the value cannot be changed after the
parent has a new children.
cgroup core splits initialization in two phases: css_alloc and css_online.
After css_alloc, the memory allocation and basic initialization are done.
But the new group is not yet visible anywhere, not even for cgroup core
code. It is only somewhere between css_alloc and css_online that it is
inserted into the internal children lists. Copying tunable values in
css_alloc will lead to inconsistent values: the children will copy the old
parent values, that can change between the copy and the moment in which
the groups is linked to any data structure that can indicate the presence
of children.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In memcg, we use the cgroup_lock basically to synchronize against
attaching new children to a cgroup. We do this because we rely on
cgroup core to provide us with this information.
We need to guarantee that upon child creation, our tunables are
consistent. For those, the calls to cgroup_lock() all live in handlers
like mem_cgroup_hierarchy_write(), where we change a tunable in the
group that is hierarchy-related. For instance, the use_hierarchy flag
cannot be changed if the cgroup already have children.
Furthermore, those values are propagated from the parent to the child
when a new child is created. So if we don't lock like this, we can end
up with the following situation:
A B
memcg_css_alloc() mem_cgroup_hierarchy_write()
copy use hierarchy from parent change use hierarchy in parent
finish creation.
This is mainly because during create, we are still not fully connected
to the css tree. So all iterators and the such that we could use, will
fail to show that the group has children.
My observation is that all of creation can proceed in parallel with
those tasks, except value assignment. So what this patch series does is
to first move all value assignment that is dependent on parent values
from css_alloc to css_online, where the iterators all work, and then we
lock only the value assignment. This will guarantee that parent and
children always have consistent values. Together with an online test,
that can be derived from the observation that the refcount of an online
memcg can be made to be always positive, we should be able to
synchronize our side without the cgroup lock.
This patch:
Currently, we rely on the cgroup_lock() to prevent changes to
move_charge_at_immigrate during task migration. However, this is only
needed because the current strategy keeps checking this value throughout
the whole process. Since all we need is serialization, one needs only
to guarantee that whatever decision we made in the beginning of a
specific migration is respected throughout the process.
We can achieve this by just saving it in mc. By doing this, no kind of
locking is needed.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In order to maintain all the memcg bookkeeping, we need per-node
descriptors, which will in turn contain a per-zone descriptor.
Because we want to statically allocate those, this array ends up being
very big. Part of the reason is that we allocate something large enough
to hold MAX_NUMNODES, the compile time constant that holds the maximum
number of nodes we would ever consider.
However, we can do better in some cases if the firmware help us. This
is true for modern x86 machines; coincidentally one of the architectures
in which MAX_NUMNODES tends to be very big.
By using the firmware-provided maximum number of nodes instead of
MAX_NUMNODES, we can reduce the memory footprint of struct memcg
considerably. In the extreme case in which we have only one node, this
reduces the size of the structure from ~ 64k to ~2k. This is
particularly important because it means that we will no longer resort to
the vmalloc area for the struct memcg on defconfigs. We also have
enough room for an extra node and still be outside vmalloc.
One also has to keep in mind that with the industry's ability to fit
more processors in a die as fast as the FED prints money, a nodes = 2
configuration is already respectably big.
[akpm@linux-foundation.org: add check for invalid nid, remove inline]
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ying Han <yinghan@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Answering the question "how much space remains in the page->flags" is
time-consuming. mminit_loglevel can help answer the question but it
does not take last_nid information into account. This patch corrects it
and while there it corrects the messages related to page flag usage,
pgshifts and node/zone id. When applied the relevant output looks
something like this but will depend on the kernel configuration.
mminit::pageflags_layout_widths Section 0 Node 9 Zone 2 Lastnid 9 Flags 25
mminit::pageflags_layout_shifts Section 19 Node 9 Zone 2 Lastnid 9
mminit::pageflags_layout_pgshifts Section 0 Node 55 Zone 53 Lastnid 44
mminit::pageflags_layout_nodezoneid Node/Zone ID: 64 -> 53
mminit::pageflags_layout_usage location: 64 -> 44 layout 44 -> 25 unused 25 -> 0 page-flags
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton pointed out that page_xchg_last_nid() and
reset_page_last_nid() were "getting nuttily large" and asked that it be
investigated.
reset_page_last_nid() is on the page free path and it would be
unfortunate to make that path more expensive than it needs to be. Due
to the internal use of page_xchg_last_nid() it is already too expensive
but fortunately, it should also be impossible for the page->flags to be
updated in parallel when we call reset_page_last_nid(). Instead of
unlining the function, it uses a simplier implementation that assumes no
parallel updates and should now be sufficiently short for inlining.
page_xchg_last_nid() is called in paths that are already quite expensive
(splitting huge page, fault handling, migration) and it is reasonable to
uninline. There was not really a good place to place the function but
mm/mmzone.c was the closest fit IMO.
This patch saved 128 bytes of text in the vmlinux file for the kernel
configuration I used for testing automatic NUMA balancing.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Memcg swap accounting is currently enabled by enable_swap_cgroup when
the root cgroup is created. mem_cgroup_init acts as a memcg subsystem
initializer which sounds like a much better place for enable_swap_cgroup
as well. We already register memsw files from there so it makes a lot
of sense to merge those two into a single enable_swap_cgroup function.
This patch doesn't introduce any semantic changes.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Zhouping Liu <zliu@redhat.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: CAI Qian <caiqian@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Zhouping Liu has reported that memsw files are exported even though swap
accounting is runtime disabled if MEMCG_SWAP is enabled. This behavior
has been introduced by commit af36f906c0 ("memcg: always create memsw
files if CGROUP_MEM_RES_CTLR_SWAP") and it causes any attempt to open
the file to return EOPNOTSUPP. Although EOPNOTSUPP should say be clear
that memsw operations are not supported in the given configuration it is
fair to say that this behavior could be quite confusing.
Let's tear memsw files out of default cgroup files and add them only if
the swap accounting is really enabled (either by MEMCG_SWAP_ENABLED or
swapaccount=1 boot parameter). We can hook into mem_cgroup_init which
is called when the memcg subsystem is initialized and which happens
after boot command line is processed.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Zhouping Liu <zliu@redhat.com>
Tested-by: Zhouping Liu <zliu@redhat.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: CAI Qian <caiqian@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When calculating amount of dirtyable memory, min_free_kbytes should be
subtracted because it is not intended for dirty pages.
Addresses http://bugs.debian.org/695182
[akpm@linux-foundation.org: fix up min_free_kbytes extern declarations]
[akpm@linux-foundation.org: fix min() warning]
Signed-off-by: Paul Szabo <psz@maths.usyd.edu.au>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The comment in commit 4fc3f1d66b ("mm/rmap, migration: Make
rmap_walk_anon() and try_to_unmap_anon() more scalable") says:
| Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(),
| to make it clearer that it's an exclusive write-lock in
| that case - suggested by Rik van Riel.
But that commit renames only anon_vma_lock()
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
swap_lock is heavily contended when I test swap to 3 fast SSD (even
slightly slower than swap to 2 such SSD). The main contention comes
from swap_info_get(). This patch tries to fix the gap with adding a new
per-partition lock.
Global data like nr_swapfiles, total_swap_pages, least_priority and
swap_list are still protected by swap_lock.
nr_swap_pages is an atomic now, it can be changed without swap_lock. In
theory, it's possible get_swap_page() finds no swap pages but actually
there are free swap pages. But sounds not a big problem.
Accessing partition specific data (like scan_swap_map and so on) is only
protected by swap_info_struct.lock.
Changing swap_info_struct.flags need hold swap_lock and
swap_info_struct.lock, because scan_scan_map() will check it. read the
flags is ok with either the locks hold.
If both swap_lock and swap_info_struct.lock must be hold, we always hold
the former first to avoid deadlock.
swap_entry_free() can change swap_list. To delete that code, we add a
new highest_priority_index. Whenever get_swap_page() is called, we
check it. If it's valid, we use it.
It's a pity get_swap_page() still holds swap_lock(). But in practice,
swap_lock() isn't heavily contended in my test with this patch (or I can
say there are other much more heavier bottlenecks like TLB flush). And
BTW, looks get_swap_page() doesn't really need the lock. We never free
swap_info[] and we check SWAP_WRITEOK flag. The only risk without the
lock is we could swapout to some low priority swap, but we can quickly
recover after several rounds of swap, so sounds not a big deal to me.
But I'd prefer to fix this if it's a real problem.
"swap: make each swap partition have one address_space" improved the
swapout speed from 1.7G/s to 2G/s. This patch further improves the
speed to 2.3G/s, so around 15% improvement. It's a multi-process test,
so TLB flush isn't the biggest bottleneck before the patches.
[arnd@arndb.de: fix it for nommu]
[hughd@google.com: add missing unlock]
[minchan@kernel.org: get rid of lockdep whinge on sys_swapon]
Signed-off-by: Shaohua Li <shli@fusionio.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When I use several fast SSD to do swap, swapper_space.tree_lock is
heavily contended. This makes each swap partition have one
address_space to reduce the lock contention. There is an array of
address_space for swap. The swap entry type is the index to the array.
In my test with 3 SSD, this increases the swapout throughput 20%.
[akpm@linux-foundation.org: revert unneeded change to __add_to_swap_cache]
Signed-off-by: Shaohua Li <shli@fusionio.com>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When correcting commit 04fa5d6a65 ("mm: migrate: check page_count of
THP before migrating") Hugh Dickins noted that the control flow for
transhuge migration was difficult to follow. Unconditionally calling
put_page() in numamigrate_isolate_page() made the failure paths of both
migrate_misplaced_transhuge_page() and migrate_misplaced_page() more
complex that they should be. Further, he was extremely wary that an
unlock_page() should ever happen after a put_page() even if the
put_page() should never be the final put_page.
Hugh implemented the following cleanup to simplify the path by calling
putback_lru_page() inside numamigrate_isolate_page() if it failed to
isolate and always calling unlock_page() within
migrate_misplaced_transhuge_page().
There is no functional change after this patch is applied but the code
is easier to follow and unlock_page() always happens before put_page().
[mgorman@suse.de: changelog only]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Simon Jeons <simon.jeons@gmail.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
page->_last_nid fits into page->flags on 64-bit. The unlikely 32-bit
NUMA configuration with NUMA Balancing will still need an extra page
field. As Peter notes "Completely dropping 32bit support for
CONFIG_NUMA_BALANCING would simplify things, but it would also remove
the warning if we grow enough 64bit only page-flags to push the last-cpu
out."
[mgorman@suse.de: minor modifications]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Simon Jeons <simon.jeons@gmail.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wanpeng Li pointed out that numamigrate_isolate_page() assumes that only
one base page is being migrated when in fact it can also be checking
THP.
The consequences are that a migration will be attempted when a target
node is nearly full and fail later. It's unlikely to be user-visible
but it should be fixed. While we are there, migrate_balanced_pgdat()
should treat nr_migrate_pages as an unsigned long as it is treated as a
watermark.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Suggested-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Simon Jeons <simon.jeons@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch introduces PF_MEMALLOC_NOIO on process flag('flags' field of
'struct task_struct'), so that the flag can be set by one task to avoid
doing I/O inside memory allocation in the task's context.
The patch trys to solve one deadlock problem caused by block device, and
the problem may happen at least in the below situations:
- during block device runtime resume, if memory allocation with
GFP_KERNEL is called inside runtime resume callback of any one of its
ancestors(or the block device itself), the deadlock may be triggered
inside the memory allocation since it might not complete until the block
device becomes active and the involed page I/O finishes. The situation
is pointed out first by Alan Stern. It is not a good approach to
convert all GFP_KERNEL[1] in the path into GFP_NOIO because several
subsystems may be involved(for example, PCI, USB and SCSI may be
involved for usb mass stoarage device, network devices involved too in
the iSCSI case)
- during block device runtime suspend, because runtime resume need to
wait for completion of concurrent runtime suspend.
- during error handling of usb mass storage deivce, USB bus reset will
be put on the device, so there shouldn't have any memory allocation with
GFP_KERNEL during USB bus reset, otherwise the deadlock similar with
above may be triggered. Unfortunately, any usb device may include one
mass storage interface in theory, so it requires all usb interface
drivers to handle the situation. In fact, most usb drivers don't know
how to handle bus reset on the device and don't provide .pre_set() and
.post_reset() callback at all, so USB core has to unbind and bind driver
for these devices. So it is still not practical to resort to GFP_NOIO
for solving the problem.
Also the introduced solution can be used by block subsystem or block
drivers too, for example, set the PF_MEMALLOC_NOIO flag before doing
actual I/O transfer.
It is not a good idea to convert all these GFP_KERNEL in the affected
path into GFP_NOIO because these functions doing that may be implemented
as library and will be called in many other contexts.
In fact, memalloc_noio_flags() can convert some of current static
GFP_NOIO allocation into GFP_KERNEL back in other non-affected contexts,
at least almost all GFP_NOIO in USB subsystem can be converted into
GFP_KERNEL after applying the approach and make allocation with GFP_NOIO
only happen in runtime resume/bus reset/block I/O transfer contexts
generally.
[1], several GFP_KERNEL allocation examples in runtime resume path
- pci subsystem
acpi_os_allocate
<-acpi_ut_allocate
<-ACPI_ALLOCATE_ZEROED
<-acpi_evaluate_object
<-__acpi_bus_set_power
<-acpi_bus_set_power
<-acpi_pci_set_power_state
<-platform_pci_set_power_state
<-pci_platform_power_transition
<-__pci_complete_power_transition
<-pci_set_power_state
<-pci_restore_standard_config
<-pci_pm_runtime_resume
- usb subsystem
usb_get_status
<-finish_port_resume
<-usb_port_resume
<-generic_resume
<-usb_resume_device
<-usb_resume_both
<-usb_runtime_resume
- some individual usb drivers
usblp, uvc, gspca, most of dvb-usb-v2 media drivers, cpia2, az6007, ....
That is just what I have found. Unfortunately, this allocation can only
be found by human being now, and there should be many not found since
any function in the resume path(call tree) may allocate memory with
GFP_KERNEL.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Oliver Neukum <oneukum@suse.de>
Cc: Jiri Kosina <jiri.kosina@suse.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Greg KH <greg@kroah.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Decotigny <david.decotigny@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
From: Zlatko Calusic <zlatko.calusic@iskon.hr>
Commit 92df3a723f ("mm: vmscan: throttle reclaim if encountering too
many dirty pages under writeback") introduced waiting on congested zones
based on a sane algorithm in shrink_inactive_list().
What this means is that there's no more need for throttling and
additional heuristics in balance_pgdat(). So, let's remove it and tidy
up the code.
Signed-off-by: Zlatko Calusic <zlatko.calusic@iskon.hr>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
num_poisoned_pages counts up the number of pages isolated by memory
errors. But for thp, only one subpage is isolated because memory error
handler splits it, so it's wrong to add (1 << compound_trans_order).
[akpm@linux-foundation.org: tweak comment]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently soft_offline_page() is hard to maintain because it has many
return points and goto statements. All of this mess come from
get_any_page().
This function should only get page refcount as the name implies, but it
does some page isolating actions like SetPageHWPoison() and dequeuing
hugepage. This patch corrects it and introduces some internal
subroutines to make soft offlining code more readable and maintainable.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since MCE is an x86 concept, and this code is in mm/, it would be better
to use the name num_poisoned_pages instead of mce_bad_pages.
[akpm@linux-foundation.org: fix mm/sparse.c]
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Suggested-by: Borislav Petkov <bp@alien8.de>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are too many return points randomly intermingled with some "goto
done" return points. So adjust the function structure, one for the
success path, the other for the failure path. Use atomic_long_inc
instead of atomic_long_add.
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When doing
$ echo paddr > /sys/devices/system/memory/soft_offline_page
to offline a *free* page, the value of mce_bad_pages will be added, and
the page is set HWPoison flag, but it is still managed by page buddy
alocator.
$ cat /proc/meminfo | grep HardwareCorrupted
shows the value.
If we offline the same page, the value of mce_bad_pages will be added
*again*, this means the value is incorrect now. Assume the page is
still free during this short time.
soft_offline_page()
get_any_page()
"else if (is_free_buddy_page(p))" branch return 0
"goto done";
"atomic_long_add(1, &mce_bad_pages);"
This patch:
Move poisoned page check at the beginning of the function in order to
fix the error.
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Tested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Several functions test MIGRATE_ISOLATE and some of those are hotpath but
MIGRATE_ISOLATE is used only if we enable CONFIG_MEMORY_ISOLATION(ie,
CMA, memory-hotplug and memory-failure) which are not common config
option. So let's not add unnecessary overhead and code when we don't
enable CONFIG_MEMORY_ISOLATION.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Function put_page_bootmem() is used to free pages allocated by bootmem
allocator, so it should increase totalram_pages when freeing pages into
the buddy system.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Cc: Chris Clayton <chris2553@googlemail.com>
Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now all users of "number of pages managed by the buddy system" have been
converted to use zone->managed_pages, so set zone->present_pages to what
it should be:
present_pages = spanned_pages - absent_pages;
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Cc: Chris Clayton <chris2553@googlemail.com>
Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now we have zone->managed_pages for "pages managed by the buddy system
in the zone", so replace zone->present_pages with zone->managed_pages if
what the user really wants is number of allocatable pages.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Cc: Chris Clayton <chris2553@googlemail.com>
Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The definition of struct movablecore_map is protected by
CONFIG_HAVE_MEMBLOCK_NODE_MAP but its use in memblock_overlaps_region()
is not. So add CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect the use of
movablecore_map in memblock_overlaps_region().
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We now provide an option for users who don't want to specify physical
memory address in kernel commandline.
/*
* For movablemem_map=acpi:
*
* SRAT: |_____| |_____| |_________| |_________| ......
* node id: 0 1 1 2
* hotpluggable: n y y n
* movablemem_map: |_____| |_________|
*
* Using movablemem_map, we can prevent memblock from allocating memory
* on ZONE_MOVABLE at boot time.
*/
So user just specify movablemem_map=acpi, and the kernel will use
hotpluggable info in SRAT to determine which memory ranges should be set
as ZONE_MOVABLE.
If all the memory ranges in SRAT is hotpluggable, then no memory can be
used by kernel. But before parsing SRAT, memblock has already reserve
some memory ranges for other purposes, such as for kernel image, and so
on. We cannot prevent kernel from using these memory. So we need to
exclude these ranges even if these memory is hotpluggable.
Furthermore, there could be several memory ranges in the single node
which the kernel resides in. We may skip one range that have memory
reserved by memblock, but if the rest of memory is too small, then the
kernel will fail to boot. So, make the whole node which the kernel
resides in un-hotpluggable. Then the kernel has enough memory to use.
NOTE: Using this way will cause NUMA performance down because the
whole node will be set as ZONE_MOVABLE, and kernel cannot use memory
on it. If users don't want to lose NUMA performance, just don't use
it.
[akpm@linux-foundation.org: fix warning]
[akpm@linux-foundation.org: use strcmp()]
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Len Brown <lenb@kernel.org>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When implementing movablemem_map boot option, we introduced an array
movablemem_map.map[] to store the memory ranges to be set as
ZONE_MOVABLE.
Since ZONE_MOVABLE is the latst zone of a node, if user didn't specify
the whole node memory range, we need to extend it to the node end so
that we can use it to prevent memblock from allocating memory in the
ranges user didn't specify.
We now implement movablemem_map boot option like this:
/*
* For movablemem_map=nn[KMG]@ss[KMG]:
*
* SRAT: |_____| |_____| |_________| |_________| ......
* node id: 0 1 1 2
* user specified: |__| |___|
* movablemem_map: |___| |_________| |______| ......
*
* Using movablemem_map, we can prevent memblock from allocating memory
* on ZONE_MOVABLE at boot time.
*
* NOTE: In this case, SRAT info will be ingored.
*/
[akpm@linux-foundation.org: clean up code, fix build warning]
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Len Brown <lenb@kernel.org>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ensure the bootmem will not allocate memory from areas that may be
ZONE_MOVABLE. The map info is from movablecore_map boot option.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If kernelcore or movablecore is specified at the same time with
movablemem_map, movablemem_map will have higher priority to be
satisfied. This patch will make find_zone_movable_pfns_for_nodes()
calculate zone_movable_pfn[] with the limit from zone_movable_limit[].
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce a new array zone_movable_limit[] to store the ZONE_MOVABLE
limit from movablemem_map boot option for all nodes. The function
sanitize_zone_movable_limit() will find out to which node the ranges in
movable_map.map[] belongs, and calculates the low boundary of
ZONE_MOVABLE for each node.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Liu Jiang <jiang.liu@huawei.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add functions to parse movablemem_map boot option. Since the option
could be specified more then once, all the maps will be stored in the
global variable movablemem_map.map array.
And also, we keep the array in monotonic increasing order by start_pfn.
And merge all overlapped ranges.
[akpm@linux-foundation.org: improve comment]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: remove unneeded parens]
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the node is offlined, there is no memory/cpu on the node. If a
sleep task runs on a cpu of this node, it will be migrated to the cpu on
the other node. So we can clear cpu-to-node mapping.
[akpm@linux-foundation.org: numa_clear_node() and numa_set_node() can no longer be __cpuinit]
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
try_offline_node() will be needed in the tristate
drivers/acpi/processor_driver.c.
The node will be offlined when all memory/cpu on the node have been
hotremoved. So we need the function try_offline_node() in cpu-hotplug
path.
If the memory-hotplug is disabled, and cpu-hotplug is enabled
1. no memory no the node
we don't online the node, and cpu's node is the nearest node.
2. the node contains some memory
the node has been onlined, and cpu's node is still needed
to migrate the sleep task on the cpu to the same node.
So we do nothing in try_offline_node() in this case.
[rientjes@google.com: export the function try_offline_node() fix]
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
is_valid_nodemask() was introduced by commit 19770b3260 ("mm: filter
based on a nodemask as well as a gfp_mask"). but it does not match its
comments, because it does not check the zone which > policy_zone.
Also in commit b377fd3982 ("Apply memory policies to top two highest
zones when highest zone is ZONE_MOVABLE"), this commits told us, if
highest zone is ZONE_MOVABLE, we should also apply memory policies to
it. so ZONE_MOVABLE should be valid zone for policies.
is_valid_nodemask() need to be changed to match it.
Fix: check all zones, even its zoneid > policy_zone. Use
nodes_intersects() instead open code to check it.
Reported-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since there is no way to guarentee the address of pgdat/zone is not on
stack of any kernel threads or used by other kernel objects without
reference counting or other symchronizing method, we cannot reset
node_data and free pgdat when offlining a node. Just reset pgdat to 0
and reuse the memory when the node is online again.
The problem is suggested by Kamezawa Hiroyuki. The idea is from Wen
Congyang.
NOTE: If we don't reset pgdat to 0, the WARN_ON in free_area_init_node()
will be triggered.
[akpm@linux-foundation.org: fix warning when CONFIG_NEED_MULTIPLE_NODES=n]
[akpm@linux-foundation.org: fix the warning again again]
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We call hotadd_new_pgdat() to allocate memory to store node_data. So we
should free it when removing a node.
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce a new function try_offline_node() to remove sysfs file of node
when all memory sections of this node are removed. If some memory
sections of this node are not removed, this function does nothing.
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When memory is added, we update zone's and pgdat's start_pfn and
spanned_pages in __add_zone(). So we should revert them when the memory
is removed.
The patch adds a new function __remove_zone() to do this.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently __remove_section for SPARSEMEM_VMEMMAP does nothing. But even
if we use SPARSEMEM_VMEMMAP, we can unregister the memory_section.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce a new API vmemmap_free() to free and remove vmemmap
pagetables. Since pagetable implements are different, each architecture
has to provide its own version of vmemmap_free(), just like
vmemmap_populate().
Note: vmemmap_free() is not implemented for ia64, ppc, s390, and sparc.
[mhocko@suse.cz: fix implicit declaration of remove_pagetable]
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In __remove_section(), we locked pgdat_resize_lock when calling
sparse_remove_one_section(). This lock will disable irq. But we don't
need to lock the whole function. If we do some work to free pagetables
in free_section_usemap(), we need to call flush_tlb_all(), which need
irq enabled. Otherwise the WARN_ON_ONCE() in smp_call_function_many()
will be triggered.
If we lock the whole sparse_remove_one_section(), then we come to this call trace:
------------[ cut here ]------------
WARNING: at kernel/smp.c:461 smp_call_function_many+0xbd/0x260()
Hardware name: PRIMEQUEST 1800E
......
Call Trace:
smp_call_function_many+0xbd/0x260
smp_call_function+0x3b/0x50
on_each_cpu+0x3b/0xc0
flush_tlb_all+0x1c/0x20
remove_pagetable+0x14e/0x1d0
vmemmap_free+0x18/0x20
sparse_remove_one_section+0xf7/0x100
__remove_section+0xa2/0xb0
__remove_pages+0xa0/0xd0
arch_remove_memory+0x6b/0xc0
remove_memory+0xb8/0xf0
acpi_memory_device_remove+0x53/0x96
acpi_device_remove+0x90/0xb2
__device_release_driver+0x7c/0xf0
device_release_driver+0x2f/0x50
acpi_bus_remove+0x32/0x6d
acpi_bus_trim+0x91/0x102
acpi_bus_hot_remove_device+0x88/0x16b
acpi_os_execute_deferred+0x27/0x34
process_one_work+0x20e/0x5c0
worker_thread+0x12e/0x370
kthread+0xee/0x100
ret_from_fork+0x7c/0xb0
---[ end trace 25e85300f542aa01 ]---
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For removing memmap region of sparse-vmemmap which is allocated bootmem,
memmap region of sparse-vmemmap needs to be registered by
get_page_bootmem(). So the patch searches pages of virtual mapping and
registers the pages by get_page_bootmem().
NOTE: register_page_bootmem_memmap() is not implemented for ia64,
ppc, s390, and sparc. So introduce CONFIG_HAVE_BOOTMEM_INFO_NODE
and revert register_page_bootmem_info_node() when platform doesn't
support it.
It's implemented by adding a new Kconfig option named
CONFIG_HAVE_BOOTMEM_INFO_NODE, which will be automatically selected
by memory-hotplug feature fully supported archs(currently only on
x86_64).
Since we have 2 config options called MEMORY_HOTPLUG and
MEMORY_HOTREMOVE used for memory hot-add and hot-remove separately,
and codes in function register_page_bootmem_info_node() are only
used for collecting infomation for hot-remove, so reside it under
MEMORY_HOTREMOVE.
Besides page_isolation.c selected by MEMORY_ISOLATION under
MEMORY_HOTPLUG is also such case, move it too.
[mhocko@suse.cz: put register_page_bootmem_memmap inside CONFIG_MEMORY_HOTPLUG_SPARSE]
[linfeng@cn.fujitsu.com: introduce CONFIG_HAVE_BOOTMEM_INFO_NODE and revert register_page_bootmem_info_node()]
[mhocko@suse.cz: remove the arch specific functions without any implementation]
[linfeng@cn.fujitsu.com: mm/Kconfig: move auto selects from MEMORY_HOTPLUG to MEMORY_HOTREMOVE as needed]
[rientjes@google.com: fix defined but not used warning]
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wu Jianguo <wujianguo@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For removing memory, we need to remove page tables. But it depends on
architecture. So the patch introduce arch_remove_memory() for removing
page table. Now it only calls __remove_pages().
Note: __remove_pages() for some archtecuture is not implemented
(I don't know how to implement it for s390).
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When (hot)adding memory into system, /sys/firmware/memmap/X/{end, start,
type} sysfs files are created. But there is no code to remove these
files. This patch implements the function to remove them.
We cannot free firmware_map_entry which is allocated by bootmem because
there is no way to do so when the system is up. But we can at least
remember the address of that memory and reuse the storage when the
memory is added next time.
This patch also introduces a new list map_entries_bootmem to link the
map entries allocated by bootmem when they are removed, and a lock to
protect it. And these entries will be reused when the memory is
hot-added again.
The idea is suggestted by Andrew Morton.
NOTE: It is unsafe to return an entry pointer and release the
map_entries_lock. So we should not hold the map_entries_lock
separately in firmware_map_find_entry() and
firmware_map_remove_entry(). Hold the map_entries_lock across find
and remove /sys/firmware/memmap/X operation.
And also, users of these two functions need to be careful to
hold the lock when using these two functions.
[tangchen@cn.fujitsu.com: Hold spinlock across find|remove /sys operation]
[tangchen@cn.fujitsu.com: fix the wrong comments of map_entries]
[tangchen@cn.fujitsu.com: reuse the storage of /sys/firmware/memmap/X/ allocated by bootmem]
[tangchen@cn.fujitsu.com: fix section mismatch problem]
[tangchen@cn.fujitsu.com: fix the doc format in drivers/firmware/memmap.c]
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Julian Calaby <julian.calaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We remove the memory like this:
1. lock memory hotplug
2. offline a memory block
3. unlock memory hotplug
4. repeat 1-3 to offline all memory blocks
5. lock memory hotplug
6. remove memory(TODO)
7. unlock memory hotplug
All memory blocks must be offlined before removing memory. But we don't
hold the lock in the whole operation. So we should check whether all
memory blocks are offlined before step6. Otherwise, kernel maybe
panicked.
Offlining a memory block and removing a memory device can be two
different operations. Users can just offline some memory blocks without
removing the memory device. For this purpose, the kernel has held
lock_memory_hotplug() in __offline_pages(). To reuse the code for
memory hot-remove, we repeat step 1-3 to offline all the memory blocks,
repeatedly lock and unlock memory hotplug, but not hold the memory
hotplug lock in the whole operation.
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memory can't be offlined when CONFIG_MEMCG is selected. For example:
there is a memory device on node 1. The address range is [1G, 1.5G).
You will find 4 new directories memory8, memory9, memory10, and memory11
under the directory /sys/devices/system/memory/.
If CONFIG_MEMCG is selected, we will allocate memory to store page
cgroup when we online pages. When we online memory8, the memory stored
page cgroup is not provided by this memory device. But when we online
memory9, the memory stored page cgroup may be provided by memory8. So
we can't offline memory8 now. We should offline the memory in the
reversed order.
When the memory device is hotremoved, we will auto offline memory
provided by this memory device. But we don't know which memory is
onlined first, so offlining memory may fail. In such case, iterate
twice to offline the memory. 1st iterate: offline every non primary
memory block. 2nd iterate: offline primary (i.e. first added) memory
block.
This idea is suggested by KOSAKI Motohiro.
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
do_mmap_pgoff() rounds up the desired size to the next PAGE_SIZE
multiple, however there was no equivalent code in mm_populate(), which
caused issues.
This could be fixed by introduced the same rounding in mm_populate(),
however I think it's preferable to make do_mmap_pgoff() return populate
as a size rather than as a boolean, so we don't have to duplicate the
size rounding logic in mm_populate().
Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Tested-by: Andy Lutomirski <luto@amacapital.net>
Cc: Greg Ungerer <gregungerer@westnet.com.au>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The vm_populate() code populates user mappings without constantly
holding the mmap_sem. This makes it susceptible to racy userspace
programs: the user mappings may change while vm_populate() is running,
and in this case vm_populate() may end up populating the new mapping
instead of the old one.
In order to reduce the possibility of userspace getting surprised by
this behavior, this change introduces the VM_POPULATE vma flag which
gets set on vmas we want vm_populate() to work on. This way
vm_populate() may still end up populating the new mapping after such a
race, but only if the new mapping is also one that the user has
requested (using MAP_SHARED, MAP_LOCKED or mlock) to be populated.
Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Tested-by: Andy Lutomirski <luto@amacapital.net>
Cc: Greg Ungerer <gregungerer@westnet.com.au>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In find_extend_vma(), we don't need mlock_vma_pages_range() to verify
the vma type - we know we're working with a stack. So, we can call
directly into __mlock_vma_pages_range(), and remove the last
make_pages_present() call site.
Note that we don't use mm_populate() here, so we can't release the
mmap_sem while allocating new stack pages. This is deemed acceptable,
because the stack vmas grow by a bounded number of pages at a time, and
these are anon pages so we don't have to read from disk to populate
them.
Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Tested-by: Andy Lutomirski <luto@amacapital.net>
Cc: Greg Ungerer <gregungerer@westnet.com.au>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After the MAP_POPULATE handling has been moved to mmap_region() call
sites, the only remaining use of the flags argument is to pass the
MAP_NORESERVE flag. This can be just as easily handled by
do_mmap_pgoff(), so do that and remove the mmap_region() flags
parameter.
[akpm@linux-foundation.org: remove double parens]
Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Tested-by: Andy Lutomirski <luto@amacapital.net>
Cc: Greg Ungerer <gregungerer@westnet.com.au>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When creating new mappings using the MAP_POPULATE / MAP_LOCKED flags (or
with MCL_FUTURE in effect), we want to populate the pages within the
newly created vmas. This may take a while as we may have to read pages
from disk, so ideally we want to do this outside of the write-locked
mmap_sem region.
This change introduces mm_populate(), which is used to defer populating
such mappings until after the mmap_sem write lock has been released.
This is implemented as a generalization of the former do_mlock_pages(),
which accomplished the same task but was using during mlock() /
mlockall().
Signed-off-by: Michel Lespinasse <walken@google.com>
Reported-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Rik van Riel <riel@redhat.com>
Tested-by: Andy Lutomirski <luto@amacapital.net>
Cc: Greg Ungerer <gregungerer@westnet.com.au>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have many vma manipulation functions that are fast in the typical
case, but can optionally be instructed to populate an unbounded number
of ptes within the region they work on:
- mmap with MAP_POPULATE or MAP_LOCKED flags;
- remap_file_pages() with MAP_NONBLOCK not set or when working on a
VM_LOCKED vma;
- mmap_region() and all its wrappers when mlock(MCL_FUTURE) is in
effect;
- brk() when mlock(MCL_FUTURE) is in effect.
Current code handles these pte operations locally, while the
sourrounding code has to hold the mmap_sem write side since it's
manipulating vmas. This means we're doing an unbounded amount of pte
population work with mmap_sem held, and this causes problems as Andy
Lutomirski reported (we've hit this at Google as well, though it's not
entirely clear why people keep trying to use mlock(MCL_FUTURE) in the
first place).
I propose introducing a new mm_populate() function to do this pte
population work after the mmap_sem has been released. mm_populate()
does need to acquire the mmap_sem read side, but critically, it doesn't
need to hold it continuously for the entire duration of the operation -
it can drop it whenever things take too long (such as when hitting disk
for a file read) and re-acquire it later on.
The following patches are included
- Patches 1 fixes some issues I noticed while working on the existing code.
If needed, they could potentially go in before the rest of the patches.
- Patch 2 introduces the new mm_populate() function and changes
mmap_region() call sites to use it after they drop mmap_sem. This is
inspired from Andy Lutomirski's proposal and is built as an extension
of the work I had previously done for mlock() and mlockall() around
v2.6.38-rc1. I had tried doing something similar at the time but had
given up as there were so many do_mmap() call sites; the recent cleanups
by Linus and Viro are a tremendous help here.
- Patches 3-5 convert some of the less-obvious places doing unbounded
pte populates to the new mm_populate() mechanism.
- Patches 6-7 are code cleanups that are made possible by the
mm_populate() work. In particular, they remove more code than the
entire patch series added, which should be a good thing :)
- Patch 8 is optional to this entire series. It only helps to deal more
nicely with racy userspace programs that might modify their mappings
while we're trying to populate them. It adds a new VM_POPULATE flag
on the mappings we do want to populate, so that if userspace replaces
them with mappings it doesn't want populated, mm_populate() won't
populate those replacement mappings.
This patch:
Assorted small fixes. The first two are quite small:
- Move check for vma->vm_private_data && !(vma->vm_flags & VM_NONLINEAR)
within existing if (!(vma->vm_flags & VM_NONLINEAR)) block.
Purely cosmetic.
- In the VM_LOCKED case, when dropping PG_Mlocked for the over-mapped
range, make sure we own the mmap_sem write lock around the
munlock_vma_pages_range call as this manipulates the vma's vm_flags.
Last fix requires a longer explanation. remap_file_pages() can do its work
either through VM_NONLINEAR manipulation or by creating extra vmas.
These two cases were inconsistent with each other (and ultimately, both wrong)
as to exactly when did they fault in the newly mapped file pages:
- In the VM_NONLINEAR case, new file pages would be populated if
the MAP_NONBLOCK flag wasn't passed. If MAP_NONBLOCK was passed,
new file pages wouldn't be populated even if the vma is already
marked as VM_LOCKED.
- In the linear (emulated) case, the work is passed to the mmap_region()
function which would populate the pages if the vma is marked as
VM_LOCKED, and would not otherwise - regardless of the value of the
MAP_NONBLOCK flag, because MAP_POPULATE wasn't being passed to
mmap_region().
The desired behavior is that we want the pages to be populated and locked
if the vma is marked as VM_LOCKED, or to be populated if the MAP_NONBLOCK
flag is not passed to remap_file_pages().
Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Tested-by: Andy Lutomirski <luto@amacapital.net>
Cc: Greg Ungerer <gregungerer@westnet.com.au>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that balance_pgdat() is slightly tidied up, thanks to more capable
pgdat_balanced(), it's become obvious that pgdat_balanced() is called to
check the status, then break the loop if pgdat is balanced, just to be
immediately called again. The second call is completely unnecessary, of
course.
The patch introduces pgdat_is_balanced boolean, which helps resolve the
above suboptimal behavior, with the added benefit of slightly better
documenting one other place in the function where we jump and skip lots
of code.
Signed-off-by: Zlatko Calusic <zlatko.calusic@iskon.hr>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make madvise(MADV_WILLNEED) support swap file prefetch. If memory is
swapout, this syscall can do swapin prefetch. It has no impact if the
memory isn't swapout.
[akpm@linux-foundation.org: fix CONFIG_SWAP=n build]
[sasha.levin@oracle.com: fix BUG on madvise early failure]
Signed-off-by: Shaohua Li <shli@fusionio.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Targeted (hard resp soft) reclaim has traditionally tried to scan one
group with decreasing priority until nr_to_reclaim (SWAP_CLUSTER_MAX
pages) is reclaimed or all priorities are exhausted. The reclaim is
then retried until the limit is met.
This approach, however, doesn't work well with deeper hierarchies where
groups higher in the hierarchy do not have any or only very few pages
(this usually happens if those groups do not have any tasks and they
have only re-parented pages after some of their children is removed).
Those groups are reclaimed with decreasing priority pointlessly as there
is nothing to reclaim from them.
An easiest fix is to break out of the memcg iteration loop in
shrink_zone only if the whole hierarchy has been visited or sufficient
pages have been reclaimed. This is also more natural because the
reclaimer expects that the hierarchy under the given root is reclaimed.
As a result we can simplify the soft limit reclaim which does its own
iteration.
[yinghan@google.com: break out of the hierarchy loop only if nr_reclaimed exceeded nr_to_reclaim]
[akpm@linux-foundation.org: use conventional comparison order]
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Ying Han <yinghan@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Switch ksm to use the new hashtable implementation. This reduces the
amount of generic unrelated code in the ksm module.
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Switch hugemem to use the new hashtable implementation. This reduces
the amount of generic unrelated code in the hugemem.
This also removes the dymanic allocation of the hash table. The upside
is that we save a pointer dereference when accessing the hashtable, but
we lose 8KB if CONFIG_TRANSPARENT_HUGEPAGE is enabled but the processor
doesn't support hugepages.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Compaction uses the ALIGN macro incorrectly with the migrate scanner by
adding pageblock_nr_pages to a PFN. It happened to work when initially
implemented as the starting PFN was also aligned but with caching
restarts and isolating in smaller chunks this is no longer always true.
The impact is that the migrate scanner scans outside its current
pageblock. As pfn_valid() is still checked properly it does not cause
any failure and the impact of the bug is that in some cases it will scan
more than necessary when it crosses a page boundary but by no more than
COMPACT_CLUSTER_MAX. It is highly unlikely this is even measurable but
it's still wrong so this patch addresses the problem.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
`int' is an inappropriate type for a number-of-pages counter.
While we're there, use the clamp() macro.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Satoru Moriya <satoru.moriya@hds.com>
Cc: Simon Jeons <simon.jeons@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When ex-KSM pages are faulted from swap cache, the fault handler is not
capable of re-establishing anon_vma-spanning KSM pages. In this case, a
copy of the page is created instead, just like during a COW break.
These freshly made copies are known to be exclusive to the faulting VMA
and there is no reason to go look for this page in parent and sibling
processes during rmap operations.
Use page_add_new_anon_rmap() for these copies. This also puts them on
the proper LRU lists and marks them SwapBacked, so we can get rid of
doing this ad-hoc in the KSM copy code.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Simon Jeons <simon.jeons@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Satoru Moriya <satoru.moriya@hds.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The restart logic for when reclaim operates back to back with compaction
is currently applied on the lruvec level. But this does not make sense,
because the container of interest for compaction is a zone as a whole,
not the zone pages that are part of a certain memory cgroup.
Negative impact is bounded. For one, the code checks that the lruvec
has enough reclaim candidates, so it does not risk getting stuck on a
condition that can not be fulfilled. And the unfairness of hammering on
one particular memory cgroup to make progress in a zone will be
amortized by the round robin manner in which reclaim goes through the
memory cgroups. Still, this can lead to unnecessary allocation
latencies when the code elects to restart on a hard to reclaim or small
group when there are other, more reclaimable groups in the zone.
Move this logic to the zone level and restart reclaim for all memory
cgroups in a zone when compaction requires more free pages from it.
[akpm@linux-foundation.org: no need for min_t]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Satoru Moriya <satoru.moriya@hds.com>
Cc: Simon Jeons <simon.jeons@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reclaim pressure balance between anon and file pages is calculated
through a tuple of numerators and a shared denominator.
Exceptional cases that want to force-scan anon or file pages configure
the numerators and denominator such that one list is preferred, which is
not necessarily the most obvious way:
fraction[0] = 1;
fraction[1] = 0;
denominator = 1;
goto out;
Make this easier by making the force-scan cases explicit and use the
fractionals only in case they are calculated from reclaim history.
[akpm@linux-foundation.org: avoid using unintialized_var()]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Satoru Moriya <satoru.moriya@hds.com>
Cc: Simon Jeons <simon.jeons@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>