Since commit bda807d444 ("mm: migrate: support non-lru movable page
migration") isolate_migratepages_block) can isolate !PageLRU pages which
would acct_isolated account as NR_ISOLATED_*. Accounting these non-lru
pages NR_ISOLATED_{ANON,FILE} doesn't make any sense and it can misguide
heuristics based on those counters such as pgdat_reclaimable_pages resp.
too_many_isolated which would lead to unexpected stalls during the
direct reclaim without any good reason. Note that
__alloc_contig_migrate_range can isolate a lot of pages at once.
On mobile devices such as 512M ram android Phone, it may use a big zram
swap. In some cases zram(zsmalloc) uses too many non-lru but
migratedable pages, such as:
MemTotal: 468148 kB
Normal free:5620kB
Free swap:4736kB
Total swap:409596kB
ZRAM: 164616kB(zsmalloc non-lru pages)
active_anon:60700kB
inactive_anon:60744kB
active_file:34420kB
inactive_file:37532kB
Fix this by only accounting lru pages to NR_ISOLATED_* in
isolate_migratepages_block right after they were isolated and we still
know they were on LRU. Drop acct_isolated because it is called after
the fact and we've lost that information. Batching per-cpu counter
doesn't make much improvement anyway. Also make sure that we uncharge
only LRU pages when putting them back on the LRU in
putback_movable_pages resp. when unmap_and_move migrates the page.
[mhocko@suse.com: replace acct_isolated() with direct counting]
Fixes: bda807d444 ("mm: migrate: support non-lru movable page migration")
Link: http://lkml.kernel.org/r/20161019080240.9682-1-mhocko@kernel.org
Signed-off-by: Ming Ling <ming.ling@spreadtrum.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__GFP_THISNODE is documented to enforce the allocation to be satisified
from the requested node with no fallbacks or placement policy
enforcements. policy_zonelist seemingly breaks this semantic if the
current policy is MPOL_MBIND and instead of taking the node it will
fallback to the first node in the mask if the requested one is not in
the mask. This is confusing to say the least because it fact we
shouldn't ever go that path. First tasks shouldn't be scheduled on CPUs
with nodes outside of their mempolicy binding. And secondly
policy_zonelist is called only from 3 places:
- huge_zonelist - never should do __GFP_THISNODE when going this path
- alloc_pages_vma - which shouldn't depend on __GFP_THISNODE either
- alloc_pages_current - which uses default_policy id __GFP_THISNODE is
used
So we shouldn't even need to care about this possibility and can drop
the confusing code. Let's keep a WARN_ON_ONCE in place to catch
potential users and fix them up properly (aka use a different allocation
function which ignores mempolicy).
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20161013125958.32155-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While doing MADV_DONTNEED on a large area of thp memory, I noticed we
encountered many unlikely() branches in profiles for each backing
hugepage. This is because zap_pmd_range() would call split_huge_pmd(),
which rechecked the conditions that were already validated, but as part
of an unlikely() branch.
Avoid the unlikely() branch when in a context where pmd is known to be
good for __split_huge_pmd() directly.
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1610181600300.84525@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Many seq_file helpers exist for simplifying implementation of virtual
files especially, for /proc nodes. however, the helpers for iteration
over list_head are available but aren't adopted to implement
/proc/vmallocinfo currently.
Simplify /proc/vmallocinfo implementation by using existing seq_file
helpers.
Link: http://lkml.kernel.org/r/57FDF2E5.1000201@zoho.com
Signed-off-by: zijun_hu <zijun_hu@htc.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, unreserve_highatomic_pageblock bails out if it found
highatomic pageblock regardless of really moving free pages from the one
so that it could mitigate unreserve logic's goal which saves OOM of a
process.
This patch makes unreserve functions bail out only if it moves some
pages out of !highatomic free list to avoid such false positive.
Another potential problem is that by race between page freeing and
reserve highatomic function, pages could be in highatomic free list even
though the pageblock is !high atomic migratetype. In that case,
unreserve_highatomic_pageblock can be void if count of highatomic
reserve is less than pageblock_nr_pages. We could solve it simply via
draining all of reserved pages before the OOM. It would have a
safeguard role to exhuast reserved pages before converging to OOM.
Link: http://lkml.kernel.org/r/1476259429-18279-5-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sangseok Lee <sangseok.lee@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is race between page freeing and unreserved highatomic.
CPU 0 CPU 1
free_hot_cold_page
mt = get_pfnblock_migratetype
set_pcppage_migratetype(page, mt)
unreserve_highatomic_pageblock
spin_lock_irqsave(&zone->lock)
move_freepages_block
set_pageblock_migratetype(page)
spin_unlock_irqrestore(&zone->lock)
free_pcppages_bulk
__free_one_page(mt) <- mt is stale
By above race, a page on CPU 0 could go non-highorderatomic free list
since the pageblock's type is changed. By that, unreserve logic of
highorderatomic can decrease reserved count on a same pageblock severak
times and then it will make mismatch between nr_reserved_highatomic and
the number of reserved pageblock.
So, this patch verifies whether the pageblock is highatomic or not and
decrease the count only if the pageblock is highatomic.
Link: http://lkml.kernel.org/r/1476259429-18279-3-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sangseok Lee <sangseok.lee@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "use up highorder free pages before OOM", v3.
I got OOM report from production team with v4.4 kernel. It had enough
free memory but failed to allocate GFP_KERNEL order-0 page and finally
encountered OOM kill. It occured during QA process which launches
several apps, switching and so on. It happned rarely. IOW, In normal
situation, it was not a problem but if we are unluck so that several
apps uses peak memory at the same time, it can happen. If we manage to
pass the phase, the system can go working well.
I could reproduce it with my test(memory spike easily. Look at below.
The reason is free pages(19M) of DMA32 zone are reserved for
HIGHORDERATOMIC and doesn't unreserved before the OOM.
balloon invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
balloon cpuset=/ mems_allowed=0
CPU: 1 PID: 8473 Comm: balloon Tainted: G W OE 4.8.0-rc7-00219-g3f74c9559583-dirty #3161
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
dump_stack+0x63/0x90
dump_header+0x5c/0x1ce
oom_kill_process+0x22e/0x400
out_of_memory+0x1ac/0x210
__alloc_pages_nodemask+0x101e/0x1040
handle_mm_fault+0xa0a/0xbf0
__do_page_fault+0x1dd/0x4d0
trace_do_page_fault+0x43/0x130
do_async_page_fault+0x1a/0xa0
async_page_fault+0x28/0x30
Mem-Info:
active_anon:383949 inactive_anon:106724 isolated_anon:0
active_file:15 inactive_file:44 isolated_file:0
unevictable:0 dirty:0 writeback:24 unstable:0
slab_reclaimable:2483 slab_unreclaimable:3326
mapped:0 shmem:0 pagetables:1906 bounce:0
free:6898 free_pcp:291 free_cma:0
Node 0 active_anon:1535796kB inactive_anon:426896kB active_file:60kB inactive_file:176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:96kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1418 all_unreclaimable? no
DMA free:8188kB min:44kB low:56kB high:68kB active_anon:7648kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:20kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 1952 1952 1952
DMA32 free:19404kB min:5628kB low:7624kB high:9620kB active_anon:1528148kB inactive_anon:426896kB active_file:60kB inactive_file:420kB unevictable:0kB writepending:96kB present:2080640kB managed:2030092kB mlocked:0kB slab_reclaimable:9932kB slab_unreclaimable:13284kB kernel_stack:2496kB pagetables:7624kB bounce:0kB free_pcp:900kB local_pcp:112kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 2*4096kB (H) = 8192kB
DMA32: 7*4kB (H) 8*8kB (H) 30*16kB (H) 31*32kB (H) 14*64kB (H) 9*128kB (H) 2*256kB (H) 2*512kB (H) 4*1024kB (H) 5*2048kB (H) 0*4096kB = 19484kB
51131 total pagecache pages
50795 pages in swap cache
Swap cache stats: add 3532405601, delete 3532354806, find 124289150/1822712228
Free swap = 8kB
Total swap = 255996kB
524158 pages RAM
0 pages HighMem/MovableOnly
12658 pages reserved
0 pages cma reserved
0 pages hwpoisoned
Another example exceeded the limit by the race is
in:imklog: page allocation failure: order:0, mode:0x2280020(GFP_ATOMIC|__GFP_NOTRACK)
CPU: 0 PID: 476 Comm: in:imklog Tainted: G E 4.8.0-rc7-00217-g266ef83c51e5-dirty #3135
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
dump_stack+0x63/0x90
warn_alloc_failed+0xdb/0x130
__alloc_pages_nodemask+0x4d6/0xdb0
new_slab+0x339/0x490
___slab_alloc.constprop.74+0x367/0x480
__slab_alloc.constprop.73+0x20/0x40
__kmalloc+0x1a4/0x1e0
alloc_indirect.isra.14+0x1d/0x50
virtqueue_add_sgs+0x1c4/0x470
__virtblk_add_req+0xae/0x1f0
virtio_queue_rq+0x12d/0x290
__blk_mq_run_hw_queue+0x239/0x370
blk_mq_run_hw_queue+0x8f/0xb0
blk_mq_insert_requests+0x18c/0x1a0
blk_mq_flush_plug_list+0x125/0x140
blk_flush_plug_list+0xc7/0x220
blk_finish_plug+0x2c/0x40
__do_page_cache_readahead+0x196/0x230
filemap_fault+0x448/0x4f0
ext4_filemap_fault+0x36/0x50
__do_fault+0x75/0x140
handle_mm_fault+0x84d/0xbe0
__do_page_fault+0x1dd/0x4d0
trace_do_page_fault+0x43/0x130
do_async_page_fault+0x1a/0xa0
async_page_fault+0x28/0x30
Mem-Info:
active_anon:363826 inactive_anon:121283 isolated_anon:32
active_file:65 inactive_file:152 isolated_file:0
unevictable:0 dirty:0 writeback:46 unstable:0
slab_reclaimable:2778 slab_unreclaimable:3070
mapped:112 shmem:0 pagetables:1822 bounce:0
free:9469 free_pcp:231 free_cma:0
Node 0 active_anon:1455304kB inactive_anon:485132kB active_file:260kB inactive_file:608kB unevictable:0kB isolated(anon):128kB isolated(file):0kB mapped:448kB dirty:0kB writeback:184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:13641 all_unreclaimable? no
DMA free:7748kB min:44kB low:56kB high:68kB active_anon:7944kB inactive_anon:104kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:108kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 1952 1952 1952
DMA32 free:30128kB min:5628kB low:7624kB high:9620kB active_anon:1447360kB inactive_anon:485028kB active_file:260kB inactive_file:608kB unevictable:0kB writepending:184kB present:2080640kB managed:2030132kB mlocked:0kB slab_reclaimable:11112kB slab_unreclaimable:12172kB kernel_stack:2400kB pagetables:7284kB bounce:0kB free_pcp:924kB local_pcp:72kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
DMA: 7*4kB (UE) 3*8kB (UH) 1*16kB (M) 0*32kB 2*64kB (U) 1*128kB (M) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (U) 1*4096kB (H) = 7748kB
DMA32: 10*4kB (H) 3*8kB (H) 47*16kB (H) 38*32kB (H) 5*64kB (H) 1*128kB (H) 2*256kB (H) 3*512kB (H) 3*1024kB (H) 3*2048kB (H) 4*4096kB (H) = 30128kB
2775 total pagecache pages
2536 pages in swap cache
Swap cache stats: add 206786828, delete 206784292, find 7323106/106686077
Free swap = 108744kB
Total swap = 255996kB
524158 pages RAM
0 pages HighMem/MovableOnly
12648 pages reserved
0 pages cma reserved
0 pages hwpoisoned
During the investigation, I found some problems with highatomic so this
patch aims to solve the problems and the final goal is to unreserve
every highatomic free pages before the OOM kill.
This patch (of 4):
In page freeing path, migratetype is racy so that a highorderatomic page
could free into non-highorderatomic free list. If that page is
allocated, VM can change the pageblock from higorderatomic to something.
In that case, highatomic pageblock accounting is broken so it doesn't
work(e.g., VM cannot reserve highorderatomic pageblocks any more
although it doesn't reach 1% limit).
So, this patch prohibits the changing from highatomic to other type.
It's no problem because MIGRATE_HIGHATOMIC is not listed in fallback
array so stealing will only happen due to unexpected races which is
really rare. Also, such prohibiting keeps highatomic pageblock more
longer so it would be better for highorderatomic page allocation.
Link: http://lkml.kernel.org/r/1476259429-18279-2-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sangseok Lee <sangseok.lee@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Documentation/kmemleak.txt was moved to Documentation/dev-tools/kmemleak.rst,
this fixes the reference to the new location.
Link: http://lkml.kernel.org/r/1476544946-18804-1-git-send-email-andreas.platschek@opentech.at
Signed-off-by: Andreas Platschek <andreas.platschek@opentech.at>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We cannot use the pte value used in set_pte_at for pte_same comparison,
because archs like ppc64, filter/add new pte flag in set_pte_at.
Instead fetch the pte value inside hugetlb_cow. We are comparing pte
value to make sure the pte didn't change since we dropped the page table
lock. hugetlb_cow get called with page table lock held, and we can take
a copy of the pte value before we drop the page table lock.
With hugetlbfs, we optimize the MAP_PRIVATE write fault path with no
previous mapping (huge_pte_none entries), by forcing a cow in the fault
path. This avoid take an addition fault to covert a read-only mapping
to read/write. Here we were comparing a recently instantiated pte (via
set_pte_at) to the pte values from linux page table. As explained above
on ppc64 such pte_same check returned wrong result, resulting in us
taking an additional fault on ppc64.
Fixes: 6a119eae94 ("powerpc/mm: Add a _PAGE_PTE bit")
Link: http://lkml.kernel.org/r/20161018154245.18023-1-aneesh.kumar@linux.vnet.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Scott Wood <scottwood@freescale.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make vma_permits_fault() static as it is only used in mm/gup.c
This fixes a sparse warning.
Link: http://lkml.kernel.org/r/20161017122353.31598-1-tklauser@distanz.ch
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Our system uses significantly more slab memory with memcg enabled with
the latest kernel. With 3.10 kernel, slab uses 2G memory, while with
4.6 kernel, 6G memory is used. The shrinker has problem. Let's see we
have two memcg for one shrinker. In do_shrink_slab:
1. Check cg1. nr_deferred = 0, assume total_scan = 700. batch size
is 1024, then no memory is freed. nr_deferred = 700
2. Check cg2. nr_deferred = 700. Assume freeable = 20, then
total_scan = 10 or 40. Let's assume it's 10. No memory is freed.
nr_deferred = 10.
The deferred share of cg1 is lost in this case. kswapd will free no
memory even run above steps again and again.
The fix makes sure one memcg's deferred share isn't lost.
Link: http://lkml.kernel.org/r/2414be961b5d25892060315fbb56bb19d81d0c07.1476227351.git.shli@fb.com
Signed-off-by: Shaohua Li <shli@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: <stable@vger.kernel.org> [4.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We had some problems with pages getting unmapped in single threaded
affinitized processes. It was tracked down to NUMA scanning.
In this case it doesn't make any sense to unmap pages if the process is
single threaded and the page is already on the node the process is
running on.
Add a check for this case into the numa protection code, and skip
unmapping if true.
In theory the process could be migrated later, but we will eventually
rescan and unmap and migrate then.
In theory this could be made more fancy: remembering this state per
process or even whole mm. However that would need extra tracking and be
more complicated, and the simple check seems to work fine so far.
[ak@linux.intel.com: v3: Minor updates from Mel. Change code layout]
Link: http://lkml.kernel.org/r/1476382117-5440-1-git-send-email-andi@firstfloor.org
Link: http://lkml.kernel.org/r/1476288949-20970-1-git-send-email-andi@firstfloor.org
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rather than tracking the number of active slabs for each node, track the
total number of slabs. This is a minor improvement that avoids active
slab tracking when a slab goes from free to partial or partial to free.
For slab debugging, this also removes an explicit free count since it
can easily be inferred by the difference in number of total objects and
number of active objects.
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1612042020110.115755@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reading /proc/slabinfo or monitoring slabtop(1) can become very
expensive if there are many slab caches and if there are very lengthy
per-node partial and/or free lists.
Commit 07a63c41fa ("mm/slab: improve performance of gathering slabinfo
stats") addressed the per-node full lists which showed a significant
improvement when no objects were freed. This patch has the same
motivation and optimizes the remainder of the usecases where there are
very lengthy partial and free lists.
This patch maintains per-node active_slabs (full and partial) and
free_slabs rather than iterating the lists at runtime when reading
/proc/slabinfo.
When allocating 100GB of slab from a test cache where every slab page is
on the partial list, reading /proc/slabinfo (includes all other slab
caches on the system) takes ~247ms on average with 48 samples.
As a result of this patch, the same read takes ~0.856ms on average.
[rientjes@google.com: changelog]
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1611081505240.13403@chino.kir.corp.google.com
Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Verify that kmem_create_cache flags are not allocator specific. It is
done before removing flags that are not available with the current
configuration.
The current kmem_cache_create removes incorrect flags but do not
validate the callers are using them right. This change will ensure that
callers are not trying to create caches with flags that won't be used
because allocator specific.
Link: http://lkml.kernel.org/r/1478553075-120242-2-git-send-email-thgarnie@google.com
Signed-off-by: Thomas Garnier <thgarnie@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The slub allocator gives us some incorrect warnings when
CONFIG_PROFILE_ANNOTATED_BRANCHES is set, as the unlikely() macro
prevents it from seeing that the return code matches what it was before:
mm/slub.c: In function `kmem_cache_free_bulk':
mm/slub.c:262:23: error: `df.s' may be used uninitialized in this function [-Werror=maybe-uninitialized]
mm/slub.c:2943:3: error: `df.cnt' may be used uninitialized in this function [-Werror=maybe-uninitialized]
mm/slub.c:2933:4470: error: `df.freelist' may be used uninitialized in this function [-Werror=maybe-uninitialized]
mm/slub.c:2943:3: error: `df.tail' may be used uninitialized in this function [-Werror=maybe-uninitialized]
I have not been able to come up with a perfect way for dealing with
this, the three options I see are:
- add a bogus initialization, which would increase the runtime overhead
- replace unlikely() with unlikely_notrace()
- remove the unlikely() annotation completely
I checked the object code for a typical x86 configuration and the last
two cases produce the same result, so I went for the last one, which is
the simplest.
Link: http://lkml.kernel.org/r/20161024155704.3114445-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Laura Abbott <labbott@fedoraproject.org>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
synchronize_sched() is a heavy operation and calling it per each cache
owned by a memory cgroup being destroyed may take quite some time. What
is worse, it's currently called under the slab_mutex, stalling all works
doing cache creation/destruction.
Actually, there isn't much point in calling synchronize_sched() for each
cache - it's enough to call it just once - after setting cpu_partial for
all caches and before shrinking them. This way, we can also move it out
of the slab_mutex, which we have to hold for iterating over the slab
cache list.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=172991
Link: http://lkml.kernel.org/r/0a10d71ecae3db00fb4421bcd3f82bcc911f4be4.1475329751.git.vdavydov.dev@gmail.com
Signed-off-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Reported-by: Doug Smythies <dsmythies@telus.net>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Creating a lot of cgroups at the same time might stall all worker
threads with kmem cache creation works, because kmem cache creation is
done with the slab_mutex held. The problem was amplified by commits
801faf0db8 ("mm/slab: lockless decision to grow cache") in case of
SLAB and 81ae6d0395 ("mm/slub.c: replace kick_all_cpus_sync() with
synchronize_sched() in kmem_cache_shrink()") in case of SLUB, which
increased the maximal time the slab_mutex can be held.
To prevent that from happening, let's use a special ordered single
threaded workqueue for kmem cache creation. This shouldn't introduce
any functional changes regarding how kmem caches are created, as the
work function holds the global slab_mutex during its whole runtime
anyway, making it impossible to run more than one work at a time. By
using a single threaded workqueue, we just avoid creating a thread per
each work. Ordering is required to avoid a situation when a cgroup's
work is put off indefinitely because there are other cgroups to serve,
in other words to guarantee fairness.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=172981
Link: http://lkml.kernel.org/r/20161004131417.GC1862@esperanza
Signed-off-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Reported-by: Doug Smythies <dsmythies@telus.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull x86 asm updates from Ingo Molnar:
"The main changes in this development cycle were:
- a large number of call stack dumping/printing improvements: higher
robustness, better cross-context dumping, improved output, etc.
(Josh Poimboeuf)
- vDSO getcpu() performance improvement for future Intel CPUs with
the RDPID instruction (Andy Lutomirski)
- add two new Intel AVX512 features and the CPUID support
infrastructure for it: AVX512IFMA and AVX512VBMI. (Gayatri Kammela,
He Chen)
- more copy-user unification (Borislav Petkov)
- entry code assembly macro simplifications (Alexander Kuleshov)
- vDSO C/R support improvements (Dmitry Safonov)
- misc fixes and cleanups (Borislav Petkov, Paul Bolle)"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
scripts/decode_stacktrace.sh: Fix address line detection on x86
x86/boot/64: Use defines for page size
x86/dumpstack: Make stack name tags more comprehensible
selftests/x86: Add test_vdso to test getcpu()
x86/vdso: Use RDPID in preference to LSL when available
x86/dumpstack: Handle NULL stack pointer in show_trace_log_lvl()
x86/cpufeatures: Enable new AVX512 cpu features
x86/cpuid: Provide get_scattered_cpuid_leaf()
x86/cpuid: Cleanup cpuid_regs definitions
x86/copy_user: Unify the code by removing the 64-bit asm _copy_*_user() variants
x86/unwind: Ensure stack grows down
x86/vdso: Set vDSO pointer only after success
x86/prctl/uapi: Remove #ifdef for CHECKPOINT_RESTORE
x86/unwind: Detect bad stack return address
x86/dumpstack: Warn on stack recursion
x86/unwind: Warn on bad frame pointer
x86/decoder: Use stderr if insn sanity test fails
x86/decoder: Use stdout if insn decoder test is successful
mm/page_alloc: Remove kernel address exposure in free_reserved_area()
x86/dumpstack: Remove raw stack dump
...
Pull mm/PAT cleanup from Ingo Molnar:
"A single cleanup for a generic interface that was originally
introduced for PAT"
* 'mm-pat-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/pat, mm: Make track_pfn_insert() return void
The shmem hole punching with fallocate(FALLOC_FL_PUNCH_HOLE) does not
want to race with generating new pages by faulting them in.
However, the wait-queue used to delay the page faulting has a serious
problem: the wait queue head (in shmem_fallocate()) is allocated on the
stack, and the code expects that "wake_up_all()" will make sure that all
the queue entries are gone before the stack frame is de-allocated.
And that is not at all necessarily the case.
Yes, a normal wake-up sequence will remove the wait-queue entry that
caused the wakeup (see "autoremove_wake_function()"), but the key
wording there is "that caused the wakeup". When there are multiple
possible wakeup sources, the wait queue entry may well stay around.
And _particularly_ in a page fault path, we may be faulting in new pages
from user space while we also have other things going on, and there may
well be other pending wakeups.
So despite the "wake_up_all()", it's not at all guaranteed that all list
entries are removed from the wait queue head on the stack.
Fix this by introducing a new wakeup function that removes the list
entry unconditionally, even if the target process had already woken up
for other reasons. Use that "synchronous" function to set up the
waiters in shmem_fault().
This problem has never been seen in the wild afaik, but Dave Jones has
reported it on and off while running trinity. We thought we fixed the
stack corruption with the blk-mq rq_list locking fix (commit
7fe311302f: "blk-mq: update hardware and software queues for sleeping
alloc"), but it turns out there was _another_ stack corruptor hiding
in the trinity runs.
Vegard Nossum (also running trinity) was able to trigger this one fairly
consistently, and made us look once again at the shmem code due to the
faults often being in that area.
Reported-and-tested-by: Vegard Nossum <vegard.nossum@oracle.com>.
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Boris Zhmurov has reported RCU stalls during the kswapd reclaim:
INFO: rcu_sched detected stalls on CPUs/tasks:
23-...: (22 ticks this GP) idle=92f/140000000000000/0 softirq=2638404/2638404 fqs=23
(detected by 4, t=6389 jiffies, g=786259, c=786258, q=42115)
Task dump for CPU 23:
kswapd1 R running task 0 148 2 0x00000008
Call Trace:
shrink_node+0xd2/0x2f0
kswapd+0x2cb/0x6a0
mem_cgroup_shrink_node+0x160/0x160
kthread+0xbd/0xe0
__switch_to+0x1fa/0x5c0
ret_from_fork+0x1f/0x40
kthread_create_on_node+0x180/0x180
a closer code inspection has shown that we might indeed miss all the
scheduling points in the reclaim path if no pages can be isolated from
the LRU list. This is a pathological case but other reports from Donald
Buczek have shown that we might indeed hit such a path:
clusterd-989 [009] .... 118023.654491: mm_vmscan_direct_reclaim_end: nr_reclaimed=193
kswapd1-86 [001] dN.. 118023.987475: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239830 nr_taken=0 file=1
kswapd1-86 [001] dN.. 118024.320968: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239844 nr_taken=0 file=1
kswapd1-86 [001] dN.. 118024.654375: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239858 nr_taken=0 file=1
kswapd1-86 [001] dN.. 118024.987036: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239872 nr_taken=0 file=1
kswapd1-86 [001] dN.. 118025.319651: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239886 nr_taken=0 file=1
kswapd1-86 [001] dN.. 118025.652248: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239900 nr_taken=0 file=1
kswapd1-86 [001] dN.. 118025.984870: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239914 nr_taken=0 file=1
[...]
kswapd1-86 [001] dN.. 118084.274403: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4241133 nr_taken=0 file=1
this is minute long snapshot which didn't take a single page from the
LRU. It is not entirely clear why only 1303 pages have been scanned
during that time (maybe there was a heavy IRQ activity interfering).
In any case it looks like we can really hit long periods without
scheduling on non preemptive kernels so an explicit cond_resched() in
shrink_node_memcg which is independent on the reclaim operation is due.
Link: http://lkml.kernel.org/r/20161202095841.16648-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Boris Zhmurov <bb@kernelpanic.ru>
Tested-by: Boris Zhmurov <bb@kernelpanic.ru>
Reported-by: Donald Buczek <buczek@molgen.mpg.de>
Reported-by: "Christopher S. Aker" <caker@theshore.net>
Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Install the callbacks via the state machine. Should the hotplug init fail then
no threads are spawned.
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: linux-mm@kvack.org
Cc: rt@linutronix.de
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20161126231350.10321-15-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Install the callbacks via the state machine. Multi state is used to address the
per-pool notifier. Uppon adding of the intance the callback is invoked for all
online CPUs so the manual init can go.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: linux-mm@kvack.org
Cc: Seth Jennings <sjenning@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20161126231350.10321-13-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Install the callbacks via the state machine and let the core invoke
the callbacks on the already online CPUs.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: linux-mm@kvack.org
Cc: Minchan Kim <minchan@kernel.org>
Cc: rt@linutronix.de
Cc: Nitin Gupta <ngupta@vflare.org>
Link: http://lkml.kernel.org/r/20161126231350.10321-11-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Install the callbacks via the state machine, but do not invoke them as we
can initialize the node state without calling the callbacks on all online
CPUs.
start_shepherd_timer() is now called outside the get_online_cpus() block
which is safe as it only operates on cpu possible mask.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Cc: rt@linutronix.de
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20161129145221.ffc3kg3hd7lxiwj6@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Both iterations over online cpus can be replaced by the proper node
specific functions.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org
Cc: rt@linutronix.de
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20161129145113.fn3lw5aazjjvdrr3@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Both functions are called with protection against cpu hotplug already so
*_online_cpus() could be dropped.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org
Cc: rt@linutronix.de
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20161126231350.10321-8-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Hugetlb pages have ->index in size of the huge pages (PMD_SIZE or
PUD_SIZE), not in PAGE_SIZE as other types of pages. This means we
cannot user page_to_pgoff() to check whether we've got the right page
for the radix-tree index.
Let's introduce page_to_index() which would return radix-tree index for
given page.
We will be able to get rid of this once hugetlb will be switched to
multi-order entries.
Fixes: fc127da085 ("truncate: handle file thp")
Link: http://lkml.kernel.org/r/20161123093053.mjbnvn5zwxw5e6lk@black.fi.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Doug Nelson <doug.nelson@intel.com>
Tested-by: Doug Nelson <doug.nelson@intel.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org> [4.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Gcc revision 241896 implements use-after-scope detection. Will be
available in gcc 7. Support it in KASAN.
Gcc emits 2 new callbacks to poison/unpoison large stack objects when
they go in/out of scope. Implement the callbacks and add a test.
[dvyukov@google.com: v3]
Link: http://lkml.kernel.org/r/1479998292-144502-1-git-send-email-dvyukov@google.com
Link: http://lkml.kernel.org/r/1479226045-145148-1-git-send-email-dvyukov@google.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: <stable@vger.kernel.org> [4.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kasan_global struct is part of compiler/runtime ABI. gcc revision
241983 has added a new field to kasan_global struct. Update kernel
definition of kasan_global struct to include the new field.
Without this patch KASAN is broken with gcc 7.
Link: http://lkml.kernel.org/r/1479219743-28682-1-git-send-email-dvyukov@google.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: <stable@vger.kernel.org> [4.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The following program triggers BUG() in munlock_vma_pages_range():
// autogenerated by syzkaller (http://github.com/google/syzkaller)
#include <sys/mman.h>
int main()
{
mmap((void*)0x20105000ul, 0xc00000ul, 0x2ul, 0x2172ul, -1, 0);
mremap((void*)0x201fd000ul, 0x4000ul, 0xc00000ul, 0x3ul, 0x203f0000ul);
return 0;
}
The test-case constructs the situation when munlock_vma_pages_range()
finds PTE-mapped THP-head in the middle of page table and, by mistake,
skips HPAGE_PMD_NR pages after that.
As result, on the next iteration it hits the middle of PMD-mapped THP
and gets upset seeing mlocked tail page.
The solution is only skip HPAGE_PMD_NR pages if the THP was mlocked
during munlock_vma_page(). It would guarantee that the page is
PMD-mapped as we never mlock PTE-mapeed THPs.
Fixes: e90309c9f7 ("thp: allow mlocked THP again")
Link: http://lkml.kernel.org/r/20161115132703.7s7rrgmwttegcdh4@black.fi.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org> [4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit b46e756f5e ("thp: extract khugepaged from mm/huge_memory.c")
moved code from huge_memory.c to khugepaged.c. Some of this code should
be compiled only when CONFIG_SYSFS is enabled but the condition around
this code was not moved into khugepaged.c.
The result is a compilation error when CONFIG_SYSFS is disabled:
mm/built-in.o: In function `khugepaged_defrag_store': khugepaged.c:(.text+0x2d095): undefined reference to `single_hugepage_flag_store'
mm/built-in.o: In function `khugepaged_defrag_show': khugepaged.c:(.text+0x2d0ab): undefined reference to `single_hugepage_flag_show'
This commit adds the #ifdef CONFIG_SYSFS around the code related to
sysfs.
Link: http://lkml.kernel.org/r/20161114203448.24197-1-jeremy.lefaure@lse.epita.fr
Signed-off-by: Jérémy Lefaure <jeremy.lefaure@lse.epita.fr>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus found there still is a race in mremap after commit 5d1904204c
("mremap: fix race between mremap() and page cleanning").
As described by Linus:
"the issue is that another thread might make the pte be dirty (in the
hardware walker, so no locking of ours will make any difference)
*after* we checked whether it was dirty, but *before* we removed it
from the page tables"
Fix it by moving the check after we removed it from the page table.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Prior to 3.15, there was a race between zap_pte_range() and
page_mkclean() where writes to a page could be lost. Dave Hansen
discovered by inspection that there is a similar race between
move_ptes() and page_mkclean().
We've been able to reproduce the issue by enlarging the race window with
a msleep(), but have not been able to hit it without modifying the code.
So, we think it's a real issue, but is difficult or impossible to hit in
practice.
The zap_pte_range() issue is fixed by commit 1cf35d47712d("mm: split
'tlb_flush_mmu()' into tlb flushing and memory freeing parts"). And
this patch is to fix the race between page_mkclean() and mremap().
Here is one possible way to hit the race: suppose a process mmapped a
file with READ | WRITE and SHARED, it has two threads and they are bound
to 2 different CPUs, e.g. CPU1 and CPU2. mmap returned X, then thread
1 did a write to addr X so that CPU1 now has a writable TLB for addr X
on it. Thread 2 starts mremaping from addr X to Y while thread 1
cleaned the page and then did another write to the old addr X again.
The 2nd write from thread 1 could succeed but the value will get lost.
thread 1 thread 2
(bound to CPU1) (bound to CPU2)
1: write 1 to addr X to get a
writeable TLB on this CPU
2: mremap starts
3: move_ptes emptied PTE for addr X
and setup new PTE for addr Y and
then dropped PTL for X and Y
4: page laundering for N by doing
fadvise FADV_DONTNEED. When done,
pageframe N is deemed clean.
5: *write 2 to addr X
6: tlb flush for addr X
7: munmap (Y, pagesize) to make the
page unmapped
8: fadvise with FADV_DONTNEED again
to kick the page off the pagecache
9: pread the page from file to verify
the value. If 1 is there, it means
we have lost the written 2.
*the write may or may not cause segmentation fault, it depends on
if the TLB is still on the CPU.
Please note that this is only one specific way of how the race could
occur, it didn't mean that the race could only occur in exact the above
config, e.g. more than 2 threads could be involved and fadvise() could
be done in another thread, etc.
For anonymous pages, they could race between mremap() and page reclaim:
THP: a huge PMD is moved by mremap to a new huge PMD, then the new huge
PMD gets unmapped/splitted/pagedout before the flush tlb happened for
the old huge PMD in move_page_tables() and we could still write data to
it. The normal anonymous page has similar situation.
To fix this, check for any dirty PTE in move_ptes()/move_huge_pmd() and
if any, did the flush before dropping the PTL. If we did the flush for
every move_ptes()/move_huge_pmd() call then we do not need to do the
flush in move_pages_tables() for the whole range. But if we didn't, we
still need to do the whole range flush.
Alternatively, we can track which part of the range is flushed in
move_ptes()/move_huge_pmd() and which didn't to avoid flushing the whole
range in move_page_tables(). But that would require multiple tlb
flushes for the different sub-ranges and should be less efficient than
the single whole range flush.
KBuild test on my Sandybridge desktop doesn't show any noticeable change.
v4.9-rc4:
real 5m14.048s
user 32m19.800s
sys 4m50.320s
With this commit:
real 5m13.888s
user 32m19.330s
sys 4m51.200s
Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Limit the number of kmemleak false positives by including
.data.ro_after_init in memory scanning. To achieve this we need to add
symbols for start and end of the section to the linker scripts.
The problem was been uncovered by commit 56989f6d85 ("genetlink: mark
families as __ro_after_init").
Link: http://lkml.kernel.org/r/1478274173-15218-1-git-send-email-jakub.kicinski@netronome.com
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While testing OBJFREELIST_SLAB integration with pagealloc, we found a
bug where kmem_cache(sys) would be created with both CFLGS_OFF_SLAB &
CFLGS_OBJFREELIST_SLAB. When it happened, critical allocations needed
for loading drivers or creating new caches will fail.
The original kmem_cache is created early making OFF_SLAB not possible.
When kmem_cache(sys) is created, OFF_SLAB is possible and if pagealloc
is enabled it will try to enable it first under certain conditions.
Given kmem_cache(sys) reuses the original flag, you can have both flags
at the same time resulting in allocation failures and odd behaviors.
This fix discards allocator specific flags from memcg before calling
create_cache.
The bug exists since 4.6-rc1 and affects testing debug pagealloc
configurations.
Fixes: b03a017beb ("mm/slab: introduce new slab management type, OBJFREELIST_SLAB")
Link: http://lkml.kernel.org/r/1478553075-120242-1-git-send-email-thgarnie@google.com
Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Thomas Garnier <thgarnie@google.com>
Tested-by: Thomas Garnier <thgarnie@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Starting from 4.9-rc1 kernel, I started noticing some test failures of
sendfile(2) and splice(2) (sendfile0N and splice01 from LTP) when
testing on sub-page block size filesystems (tested both XFS and ext4),
these syscalls start to return EIO in the tests. e.g.
sendfile02 1 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 26, got: -1
sendfile02 2 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 24, got: -1
sendfile02 3 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 22, got: -1
sendfile02 4 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 20, got: -1
This is because that in sub-page block size cases, we don't need the
whole page to be uptodate, only the part we care about is uptodate is OK
(if fs has ->is_partially_uptodate defined).
But page_cache_pipe_buf_confirm() doesn't have the ability to check the
partially-uptodate case, it needs the whole page to be uptodate. So it
returns EIO in this case.
This is a regression introduced by commit 82c156f853 ("switch
generic_file_splice_read() to use of ->read_iter()"). Prior to the
change, generic_file_splice_read() doesn't allow partially-uptodate page
either, so it worked fine.
Fix it by skipping the partially-uptodate check if we're working on a
pipe in do_generic_file_read(), so we read the whole page from disk as
long as the page is not uptodate.
I think the other way to fix it is to add the ability to check & allow
partially-uptodate page to page_cache_pipe_buf_confirm(), but that is
much harder to do and seems gain little.
Link: http://lkml.kernel.org/r/1477986187-12717-1-git-send-email-guaneryu@gmail.com
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Error paths in hugetlb_cow() and hugetlb_no_page() may free a newly
allocated huge page.
If a reservation was associated with the huge page, alloc_huge_page()
consumed the reservation while allocating. When the newly allocated
page is freed in free_huge_page(), it will increment the global
reservation count. However, the reservation entry in the reserve map
will remain.
This is not an issue for shared mappings as the entry in the reserve map
indicates a reservation exists. But, an entry in a private mapping
reserve map indicates the reservation was consumed and no longer exists.
This results in an inconsistency between the reserve map and the global
reservation count. This 'leaks' a reserved huge page.
Create a new routine restore_reserve_on_error() to restore the reserve
entry in these specific error paths. This routine makes use of a new
function vma_add_reservation() which will add a reserve entry for a
specific address/page.
In general, these error paths were rarely (if ever) taken on most
architectures. However, powerpc contained arch specific code that that
resulted in an extra fault and execution of these error paths on all
private mappings.
Fixes: 67961f9db8 ("mm/hugetlb: fix huge page reserve accounting for private mappings)
Link: http://lkml.kernel.org/r/1476933077-23091-2-git-send-email-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kirill A . Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When memory_failure() runs on a thp tail page after pmd is split, we
trigger the following VM_BUG_ON_PAGE():
page:ffffd7cd819b0040 count:0 mapcount:0 mapping: (null) index:0x1
flags: 0x1fffc000400000(hwpoison)
page dumped because: VM_BUG_ON_PAGE(!page_count(p))
------------[ cut here ]------------
kernel BUG at /src/linux-dev/mm/memory-failure.c:1132!
memory_failure() passed refcount and page lock from tail page to head
page, which is not needed because we can pass any subpage to
split_huge_page().
Fixes: 61f5d698cc ("mm: re-enable THP")
Link: http://lkml.kernel.org/r/1477961577-7183-1-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org> [4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When root activates a swap partition whose header has the wrong
endianness, nr_badpages elements of badpages are swabbed before
nr_badpages has been checked, leading to a buffer overrun of up to 8GB.
This normally is not a security issue because it can only be exploited
by root (more specifically, a process with CAP_SYS_ADMIN or the ability
to modify a swap file/partition), and such a process can already e.g.
modify swapped-out memory of any other userspace process on the system.
Link: http://lkml.kernel.org/r/1477949533-2509-1-git-send-email-jann@thejh.net
Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CMA allocation request size is represented by size_t that gets truncated
when same is passed as int to bitmap_find_next_zero_area_off.
We observe that during fuzz testing when cma allocation request is too
high, bitmap_find_next_zero_area_off still returns success due to the
truncation. This leads to kernel crash, as subsequent code assumes that
requested memory is available.
Fail cma allocation in case the request breaches the corresponding cma
region size.
Link: http://lkml.kernel.org/r/1478189211-3467-1-git-send-email-shashim@codeaurora.org
Signed-off-by: Shiraz Hashim <shashim@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If shmem_alloc_page() does not set PageLocked and PageSwapBacked, then
shmem_replace_page() needs to do so for itself. Without this, it puts
newpage on the wrong lru, re-unlocks the unlocked newpage, and system
descends into "Bad page" reports and freeze; or if CONFIG_DEBUG_VM=y, it
hits an earlier VM_BUG_ON_PAGE(!PageLocked), depending on config.
But shmem_replace_page() is not a common path: it's only called when
swapin (or swapoff) finds the page was already read into an unsuitable
zone: usually all zones are suitable, but gem objects for a few drm
devices (gma500, omapdrm, crestline, broadwater) require zone DMA32 if
there's more than 4GB of ram.
Fixes: 800d8c63b2 ("shmem: add huge pages support")
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1611062003510.11253@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org> [4.8.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The flush_icache_range() API is meant to be used on kernel addresses
only as it may not have the infrastructure (exception entries) to handle
user memory faults.
The lkdtm execute_user_location() function tests the kernel execution of
user space addresses by mmap'ing an anonymous page, copying some code
together with cache maintenance and attempting to run it. However, the
cache maintenance step may fail because of the incorrect API usage
described above. The patch changes lkdtm to use access_process_vm() for
copying the code into user space which would take care of the necessary
cache maintenance.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
[kees: export access_process_vm() for module use]
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Note in the bdi_writeback structure whenever a task ends up sleeping
waiting for progress. We can use that information in the lower layers
to increase the priority of writes.
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking. This patch allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled using the new struct
iomap based fault handlers.
There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
mappings that have an associated block allocation, and 4k DAX empty
entries. The empty entries exist to provide locking for the duration of a
given page fault.
This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP)
entries, PMD DAX entries that have associated block allocations, and 2 MiB
DAX empty entries.
Unlike the 4k case where we insert a struct page* into the radix tree for
4k zero pages, for HZP we insert a DAX exceptional entry with the new
RADIX_DAX_HZP flag set. This is because we use a single 2 MiB zero page in
every 2MiB hole mapping, and it doesn't make sense to have that same struct
page* with multiple entries in multiple trees. This would cause contention
on the single page lock for the one Huge Zero Page, and it would break the
page->index and page->mapping associations that are assumed to be valid in
many other places in the kernel.
One difficult use case is when one thread is trying to use 4k entries in
radix tree for a given offset, and another thread is using 2 MiB entries
for that same offset. The current code handles this by making the 2 MiB
user fall back to 4k entries for most cases. This was done because it is
the simplest solution, and because the use of 2MiB pages is already
opportunistic.
If we were to try to upgrade from 4k pages to 2MiB pages for a given range,
we run into the problem of how we lock out 4k page faults for the entire
2MiB range while we clean out the radix tree so we can insert the 2MiB
entry. We can solve this problem if we need to, but I think that the cases
where both 2MiB entries and 4K entries are being used for the same range
will be rare enough and the gain small enough that it probably won't be
worth the complexity.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
DAX radix tree locking currently locks entries based on the unique
combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
This works for PTEs, but as we move to PMDs we will need to have all the
offsets within the range covered by the PMD to map to the same bit lock.
To accomplish this, for ranges covered by a PMD entry we will instead lock
based on the page offset of the beginning of the PMD entry. The 'mapping'
pointer is still used in the same way.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Add wbc_to_write_flags(), which returns the write modifier flags to use,
based on a struct writeback_control. No functional changes in this
patch, but it prepares us for factoring other wbc fields for write type.
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The stack frame size could grow too large when the plugin used long long
on 32-bit architectures when the given function had too many basic blocks.
The gcc warning was:
drivers/pci/hotplug/ibmphp_ebda.c: In function 'ibmphp_access_ebda':
drivers/pci/hotplug/ibmphp_ebda.c:409:1: warning: the frame size of 1108 bytes is larger than 1024 bytes [-Wframe-larger-than=]
This switches latent_entropy from u64 to unsigned long.
Thanks to PaX Team and Emese Revfy for the patch.
Signed-off-by: Kees Cook <keescook@chromium.org>
Merge misc fixes from Andrew Morton:
"20 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
drivers/misc/sgi-gru/grumain.c: remove bogus 0x prefix from printk
cris/arch-v32: cryptocop: print a hex number after a 0x prefix
ipack: print a hex number after a 0x prefix
block: DAC960: print a hex number after a 0x prefix
fs: exofs: print a hex number after a 0x prefix
lib/genalloc.c: start search from start of chunk
mm: memcontrol: do not recurse in direct reclaim
CREDITS: update credit information for Martin Kepplinger
proc: fix NULL dereference when reading /proc/<pid>/auxv
mm: kmemleak: ensure that the task stack is not freed during scanning
lib/stackdepot.c: bump stackdepot capacity from 16MB to 128MB
latent_entropy: raise CONFIG_FRAME_WARN by default
kconfig.h: remove config_enabled() macro
ipc: account for kmem usage on mqueue and msg
mm/slab: improve performance of gathering slabinfo stats
mm: page_alloc: use KERN_CONT where appropriate
mm/list_lru.c: avoid error-path NULL pointer deref
h8300: fix syscall restarting
kcov: properly check if we are in an interrupt
mm/slab: fix kmemcg cache creation delayed issue
On 4.0, we saw a stack corruption from a page fault entering direct
memory cgroup reclaim, calling into btrfs_releasepage(), which then
tried to allocate an extent and recursed back into a kmem charge ad
nauseam:
[...]
btrfs_releasepage+0x2c/0x30
try_to_release_page+0x32/0x50
shrink_page_list+0x6da/0x7a0
shrink_inactive_list+0x1e5/0x510
shrink_lruvec+0x605/0x7f0
shrink_zone+0xee/0x320
do_try_to_free_pages+0x174/0x440
try_to_free_mem_cgroup_pages+0xa7/0x130
try_charge+0x17b/0x830
memcg_charge_kmem+0x40/0x80
new_slab+0x2d9/0x5a0
__slab_alloc+0x2fd/0x44f
kmem_cache_alloc+0x193/0x1e0
alloc_extent_state+0x21/0xc0
__clear_extent_bit+0x2b5/0x400
try_release_extent_mapping+0x1a3/0x220
__btrfs_releasepage+0x31/0x70
btrfs_releasepage+0x2c/0x30
try_to_release_page+0x32/0x50
shrink_page_list+0x6da/0x7a0
shrink_inactive_list+0x1e5/0x510
shrink_lruvec+0x605/0x7f0
shrink_zone+0xee/0x320
do_try_to_free_pages+0x174/0x440
try_to_free_mem_cgroup_pages+0xa7/0x130
try_charge+0x17b/0x830
mem_cgroup_try_charge+0x65/0x1c0
handle_mm_fault+0x117f/0x1510
__do_page_fault+0x177/0x420
do_page_fault+0xc/0x10
page_fault+0x22/0x30
On later kernels, kmem charging is opt-in rather than opt-out, and that
particular kmem allocation in btrfs_releasepage() is no longer being
charged and won't recurse and overrun the stack anymore.
But it's not impossible for an accounted allocation to happen from the
memcg direct reclaim context, and we needed to reproduce this crash many
times before we even got a useful stack trace out of it.
Like other direct reclaimers, mark tasks in memcg reclaim PF_MEMALLOC to
avoid recursing into any other form of direct reclaim. Then let
recursive charges from PF_MEMALLOC contexts bypass the cgroup limit.
Link: http://lkml.kernel.org/r/20161025141050.GA13019@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 68f24b08ee ("sched/core: Free the stack early if
CONFIG_THREAD_INFO_IN_TASK") may cause the task->stack to be freed
during kmemleak_scan() execution, leading to either a NULL pointer fault
(if task->stack is NULL) or kmemleak accessing already freed memory.
This patch uses the new try_get_task_stack() API to ensure that the task
stack is not freed during kmemleak stack scanning.
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=173901.
Fixes: 68f24b08ee ("sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK")
Link: http://lkml.kernel.org/r/1476266223-14325-1-git-send-email-catalin.marinas@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: CAI Qian <caiqian@redhat.com>
Tested-by: CAI Qian <caiqian@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: CAI Qian <caiqian@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On large systems, when some slab caches grow to millions of objects (and
many gigabytes), running 'cat /proc/slabinfo' can take up to 1-2
seconds. During this time, interrupts are disabled while walking the
slab lists (slabs_full, slabs_partial, and slabs_free) for each node,
and this sometimes causes timeouts in other drivers (for instance,
Infiniband).
This patch optimizes 'cat /proc/slabinfo' by maintaining a counter for
total number of allocated slabs per node, per cache. This counter is
updated when a slab is created or destroyed. This enables us to skip
traversing the slabs_full list while gathering slabinfo statistics, and
since slabs_full tends to be the biggest list when the cache is large,
it results in a dramatic performance improvement. Getting slabinfo
statistics now only requires walking the slabs_free and slabs_partial
lists, and those lists are usually much smaller than slabs_full.
We tested this after growing the dentry cache to 70GB, and the
performance improved from 2s to 5ms.
Link: http://lkml.kernel.org/r/1472517876-26814-1-git-send-email-aruna.ramakrishna@oracle.com
Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As described in https://bugzilla.kernel.org/show_bug.cgi?id=177821:
After some analysis it seems to be that the problem is in alloc_super().
In case list_lru_init_memcg() fails it goes into destroy_super(), which
calls list_lru_destroy().
And in list_lru_init() we see that in case memcg_init_list_lru() fails,
lru->node is freed, but not set NULL, which then leads list_lru_destroy()
to believe it is initialized and call memcg_destroy_list_lru().
memcg_destroy_list_lru() in turn can access lru->node[i].memcg_lrus,
which is NULL.
[akpm@linux-foundation.org: add comment]
Signed-off-by: Alexander Polakov <apolyakov@beget.ru>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a bug report that SLAB makes extreme load average due to over
2000 kworker thread.
https://bugzilla.kernel.org/show_bug.cgi?id=172981
This issue is caused by kmemcg feature that try to create new set of
kmem_caches for each memcg. Recently, kmem_cache creation is slowed by
synchronize_sched() and futher kmem_cache creation is also delayed since
kmem_cache creation is synchronized by a global slab_mutex lock. So,
the number of kworker that try to create kmem_cache increases quietly.
synchronize_sched() is for lockless access to node's shared array but
it's not needed when a new kmem_cache is created. So, this patch rules
out that case.
Fixes: 801faf0db8 ("mm/slab: lockless decision to grow cache")
Link: http://lkml.kernel.org/r/1475734855-4837-1-git-send-email-iamjoonsoo.kim@lge.com
Reported-by: Doug Smythies <dsmythies@telus.net>
Tested-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No, KASAN may not be able to co-exist with HOTPLUG_MEMORY at runtime,
but for build testing there is no reason not to allow them together.
This hopefully means better build coverage and fewer embarrasing silly
problems like the one fixed by commit 9db4f36e82 ("mm: remove unused
variable in memory hotplug") in the future.
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When I removed the per-zone bitlock hashed waitqueues in commit
9dcb8b685f ("mm: remove per-zone hashtable of bitlock waitqueues"), I
removed all the magic hotplug memory initialization of said waitqueues
too.
But when I actually _tested_ the resulting build, I stupidly assumed
that "allmodconfig" would enable memory hotplug. And it doesn't,
because it enables KASAN instead, which then disables hotplug memory
support.
As a result, my build test of the per-zone waitqueues was totally
broken, and I didn't notice that the compiler warns about the now unused
iterator variable 'i'.
I guess I should be happy that that seems to be the worst breakage from
my clearly horribly failed test coverage.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The per-zone waitqueues exist because of a scalability issue with the
page waitqueues on some NUMA machines, but it turns out that they hurt
normal loads, and now with the vmalloced stacks they also end up
breaking gfs2 that uses a bit_wait on a stack object:
wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)
where 'gh' can be a reference to the local variable 'mount_gh' on the
stack of fill_super().
The reason the per-zone hash table breaks for this case is that there is
no "zone" for virtual allocations, and trying to look up the physical
page to get at it will fail (with a BUG_ON()).
It turns out that I actually complained to the mm people about the
per-zone hash table for another reason just a month ago: the zone lookup
also hurts the regular use of "unlock_page()" a lot, because the zone
lookup ends up forcing several unnecessary cache misses and generates
horrible code.
As part of that earlier discussion, we had a much better solution for
the NUMA scalability issue - by just making the page lock have a
separate contention bit, the waitqueue doesn't even have to be looked at
for the normal case.
Peter Zijlstra already has a patch for that, but let's see if anybody
even notices. In the meantime, let's fix the actual gfs2 breakage by
simplifying the bitlock waitqueues and removing the per-zone issue.
Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
Tested-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus suggested we try to remove some of the low-hanging fruit related
to kernel address exposure in dmesg. The only leaks I see on my local
system are:
Freeing SMP alternatives memory: 32K (ffffffff9e309000 - ffffffff9e311000)
Freeing initrd memory: 10588K (ffffa0b736b42000 - ffffa0b737599000)
Freeing unused kernel memory: 3592K (ffffffff9df87000 - ffffffff9e309000)
Freeing unused kernel memory: 1352K (ffffa0b7288ae000 - ffffa0b728a00000)
Freeing unused kernel memory: 632K (ffffa0b728d62000 - ffffa0b728e00000)
Linus says:
"I suspect we should just remove [the addresses in the 'Freeing'
messages]. I'm sure they are useful in theory, but I suspect they
were more useful back when the whole "free init memory" was
originally done.
These days, if we have a use-after-free, I suspect the init-mem
situation is the easiest situation by far. Compared to all the dynamic
allocations which are much more likely to show it anyway. So having
debug output for that case is likely not all that productive."
With this patch the freeing messages now look like this:
Freeing SMP alternatives memory: 32K
Freeing initrd memory: 10588K
Freeing unused kernel memory: 3592K
Freeing unused kernel memory: 1352K
Freeing unused kernel memory: 632K
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/6836ff90c45b71d38e5d4405aec56fa9e5d1d4b2.1477405374.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch unexports the low-level __get_user_pages() function.
Recent refactoring of the get_user_pages* functions allow flags to be
passed through get_user_pages() which eliminates the need for access to
this function from its one user, kvm.
We can see that the two calls to get_user_pages() which replace
__get_user_pages() in kvm_main.c are equivalent by examining their call
stacks:
get_user_page_nowait():
get_user_pages(start, 1, flags, page, NULL)
__get_user_pages_locked(current, current->mm, start, 1, page, NULL, NULL,
false, flags | FOLL_TOUCH)
__get_user_pages(current, current->mm, start, 1,
flags | FOLL_TOUCH | FOLL_GET, page, NULL, NULL)
check_user_page_hwpoison():
get_user_pages(addr, 1, flags, NULL, NULL)
__get_user_pages_locked(current, current->mm, addr, 1, NULL, NULL, NULL,
false, flags | FOLL_TOUCH)
__get_user_pages(current, current->mm, addr, 1, flags | FOLL_TOUCH, NULL,
NULL, NULL)
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull vmap stack fixes from Ingo Molnar:
"This is fallout from CONFIG_HAVE_ARCH_VMAP_STACK=y on x86: stack
accesses that used to be just somewhat questionable are now totally
buggy.
These changes try to do it without breaking the ABI: the fields are
left there, they are just reporting zero, or reporting narrower
information (the maps file change)"
* 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
mm: Change vm_is_stack_for_task() to vm_is_stack_for_current()
fs/proc: Stop trying to report thread stacks
fs/proc: Stop reporting eip and esp in /proc/PID/stat
mm/numa: Remove duplicated include from mprotect.c
Asking for a non-current task's stack can't be done without races
unless the task is frozen in kernel mode. As far as I know,
vm_is_stack_for_task() never had a safe non-current use case.
The __unused annotation is because some KSTK_ESP implementations
ignore their parameter, which IMO is further justification for this
patch.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Jann Horn <jann@thejh.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux API <linux-api@vger.kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tycho Andersen <tycho.andersen@canonical.com>
Link: http://lkml.kernel.org/r/4c3f68f426e6c061ca98b4fc7ef85ffbb0a25b0c.1475257877.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The percpu allocator expectedly assumes that the requested alignment
is power of two but hasn't been veryfing the input. If the specified
alignment isn't power of two, the allocator can malfunction. Add the
sanity check.
The following is detailed analysis of the effects of alignments which
aren't power of two.
The alignment must be a even at least since the LSB of a chunk->map
element is used as free/in-use flag of a area; besides, the alignment
must be a power of 2 too since ALIGN() doesn't work well for other
alignment always but is adopted by pcpu_fit_in_area(). IOW, the
current allocator only works well for a power of 2 aligned area
allocation.
See below opposite example for why an odd alignment doesn't work.
Let's assume area [16, 36) is free but its previous one is in-use, we
want to allocate a @size == 8 and @align == 7 area. The larger area
[16, 36) is split to three areas [16, 21), [21, 29), [29, 36)
eventually. However, due to the usage for a chunk->map element, the
actual offset of the aim area [21, 29) is 21 but is recorded in
relevant element as 20; moreover, the residual tail free area [29,
36) is mistook as in-use and is lost silently
Unlike macro roundup(), ALIGN(x, a) doesn't work if @a isn't a power
of 2 for example, roundup(10, 6) == 12 but ALIGN(10, 6) == 10, and
the latter result isn't desired obviously.
tj: Code style and patch description updates.
Signed-off-by: zijun_hu <zijun_hu@htc.com>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Merge the gup_flags cleanups from Lorenzo Stoakes:
"This patch series adjusts functions in the get_user_pages* family such
that desired FOLL_* flags are passed as an argument rather than
implied by flags.
The purpose of this change is to make the use of FOLL_FORCE explicit
so it is easier to grep for and clearer to callers that this flag is
being used. The use of FOLL_FORCE is an issue as it overrides missing
VM_READ/VM_WRITE flags for the VMA whose pages we are reading
from/writing to, which can result in surprising behaviour.
The patch series came out of the discussion around commit 38e0885465
("mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing"),
which addressed a BUG_ON() being triggered when a page was faulted in
with PROT_NONE set but having been overridden by FOLL_FORCE.
do_numa_page() was run on the assumption the page _must_ be one marked
for NUMA node migration as an actual PROT_NONE page would have been
dealt with prior to this code path, however FOLL_FORCE introduced a
situation where this assumption did not hold.
See
https://marc.info/?l=linux-mm&m=147585445805166
for the patch proposal"
Additionally, there's a fix for an ancient bug related to FOLL_FORCE and
FOLL_WRITE by me.
[ This branch was rebased recently to add a few more acked-by's and
reviewed-by's ]
* gup_flag-cleanups:
mm: replace access_process_vm() write parameter with gup_flags
mm: replace access_remote_vm() write parameter with gup_flags
mm: replace __access_remote_vm() write parameter with gup_flags
mm: replace get_user_pages_remote() write/force parameters with gup_flags
mm: replace get_user_pages() write/force parameters with gup_flags
mm: replace get_vaddr_frames() write/force parameters with gup_flags
mm: replace get_user_pages_locked() write/force parameters with gup_flags
mm: replace get_user_pages_unlocked() write/force parameters with gup_flags
mm: remove write/force parameters from __get_user_pages_unlocked()
mm: remove write/force parameters from __get_user_pages_locked()
mm: remove gup_flags FOLL_WRITE games from __get_user_pages()
This removes the 'write' argument from access_process_vm() and replaces
it with 'gup_flags' as use of this function previously silently implied
FOLL_FORCE, whereas after this patch callers explicitly pass this flag.
We make this explicit as use of FOLL_FORCE can result in surprising
behaviour (and hence bugs) within the mm subsystem.
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This removes the 'write' argument from access_remote_vm() and replaces
it with 'gup_flags' as use of this function previously silently implied
FOLL_FORCE, whereas after this patch callers explicitly pass this flag.
We make this explicit as use of FOLL_FORCE can result in surprising
behaviour (and hence bugs) within the mm subsystem.
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This removes the 'write' argument from __access_remote_vm() and replaces
it with 'gup_flags' as use of this function previously silently implied
FOLL_FORCE, whereas after this patch callers explicitly pass this flag.
We make this explicit as use of FOLL_FORCE can result in surprising
behaviour (and hence bugs) within the mm subsystem.
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This removes the 'write' and 'force' from get_user_pages_remote() and
replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in
callers as use of this flag can result in surprising behaviour (and
hence bugs) within the mm subsystem.
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This removes the 'write' and 'force' from get_user_pages() and replaces
them with 'gup_flags' to make the use of FOLL_FORCE explicit in callers
as use of this flag can result in surprising behaviour (and hence bugs)
within the mm subsystem.
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This removes the 'write' and 'force' from get_vaddr_frames() and
replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in
callers as use of this flag can result in surprising behaviour (and
hence bugs) within the mm subsystem.
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This removes the 'write' and 'force' use from get_user_pages_locked()
and replaces them with 'gup_flags' to make the use of FOLL_FORCE
explicit in callers as use of this flag can result in surprising
behaviour (and hence bugs) within the mm subsystem.
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This removes the 'write' and 'force' use from get_user_pages_unlocked()
and replaces them with 'gup_flags' to make the use of FOLL_FORCE
explicit in callers as use of this flag can result in surprising
behaviour (and hence bugs) within the mm subsystem.
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This removes the redundant 'write' and 'force' parameters from
__get_user_pages_unlocked() to make the use of FOLL_FORCE explicit in
callers as use of this flag can result in surprising behaviour (and
hence bugs) within the mm subsystem.
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This removes the redundant 'write' and 'force' parameters from
__get_user_pages_locked() to make the use of FOLL_FORCE explicit in
callers as use of this flag can result in surprising behaviour (and
hence bugs) within the mm subsystem.
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is an ancient bug that was actually attempted to be fixed once
(badly) by me eleven years ago in commit 4ceb5db975 ("Fix
get_user_pages() race for write access") but that was then undone due to
problems on s390 by commit f33ea7f404 ("fix get_user_pages bug").
In the meantime, the s390 situation has long been fixed, and we can now
fix it by checking the pte_dirty() bit properly (and do it better). The
s390 dirty bit was implemented in abf09bed3c ("s390/mm: implement
software dirty bits") which made it into v3.9. Earlier kernels will
have to look at the page state itself.
Also, the VM has become more scalable, and what used a purely
theoretical race back then has become easier to trigger.
To fix it, we introduce a new internal FOLL_COW flag to mark the "yes,
we already did a COW" rather than play racy games with FOLL_WRITE that
is very fundamental, and then use the pte dirty flag to validate that
the FOLL_COW flag is still valid.
Reported-and-tested-by: Phil "not Paul" Oester <kernel@linuxace.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
extract as much possible uncertainty from a running system at boot time as
possible, hoping to capitalize on any possible variation in CPU operation
(due to runtime data differences, hardware differences, SMP ordering,
thermal timing variation, cache behavior, etc).
At the very least, this plugin is a much more comprehensive example for
how to manipulate kernel code using the gcc plugin internals.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Kees Cook <kees@outflux.net>
iQIcBAABCgAGBQJX/BAFAAoJEIly9N/cbcAmzW8QALFbCs7EFFkML+M/M/9d8zEk
1QbUs/z8covJTTT1PjSdw7JUrAMulI3S00owpcQVd/PcWjRPU80QwfsXBgIB0tvC
Kub2qxn6Oaf+kTB646zwjFgjdCecw/USJP+90nfcu2+LCnE8ReclKd1aUee+Bnhm
iDEUyH2ONIoWq6ta2Z9sA7+E4y2ZgOlmW0iga3Mnf+OcPtLE70fWPoe5E4g9DpYk
B+kiPDrD9ql5zsHaEnKG1ldjiAZ1L6Grk8rGgLEXmbOWtTOFmnUhR+raK5NA/RCw
MXNuyPay5aYPpqDHFm+OuaWQAiPWfPNWM3Ett4k0d9ZWLixTcD1z68AciExwk7aW
SEA8b1Jwbg05ZNYM7NJB6t6suKC4dGPxWzKFOhmBicsh2Ni5f+Az0BQL6q8/V8/4
8UEqDLuFlPJBB50A3z5ngCVeYJKZe8Bg/Swb4zXl6mIzZ9darLzXDEV6ystfPXxJ
e1AdBb41WC+O2SAI4l64yyeswkGo3Iw2oMbXG5jmFl6wY/xGp7dWxw7gfnhC6oOh
afOT54p2OUDfSAbJaO0IHliWoIdmE5ZYdVYVU9Ek+uWyaIwcXhNmqRg+Uqmo32jf
cP5J9x2kF3RdOcbSHXmFp++fU+wkhBtEcjkNpvkjpi4xyA47IWS7lrVBBebrCq9R
pa/A7CNQwibIV6YD8+/p
=1dUK
-----END PGP SIGNATURE-----
Merge tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull gcc plugins update from Kees Cook:
"This adds a new gcc plugin named "latent_entropy". It is designed to
extract as much possible uncertainty from a running system at boot
time as possible, hoping to capitalize on any possible variation in
CPU operation (due to runtime data differences, hardware differences,
SMP ordering, thermal timing variation, cache behavior, etc).
At the very least, this plugin is a much more comprehensive example
for how to manipulate kernel code using the gcc plugin internals"
* tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
latent_entropy: Mark functions with __latent_entropy
gcc-plugins: Add latent_entropy plugin
Pull percpu updates from Tejun Heo:
- Nick improved generic implementations of percpu operations which
modify the variable and return so that they calculate the physical
address only once.
- percpu_ref percpu <-> atomic mode switching improvements. The
patchset was originally posted about a year ago but fell through the
crack.
- misc non-critical fixes.
* 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
mm/percpu.c: fix potential memory leakage for pcpu_embed_first_chunk()
mm/percpu.c: correct max_distance calculation for pcpu_embed_first_chunk()
percpu: eliminate two sparse warnings
percpu: improve generic percpu modify-return implementation
percpu-refcount: init ->confirm_switch member properly
percpu_ref: allow operation mode switching operations to be called concurrently
percpu_ref: restructure operation mode switching
percpu_ref: unify staggered atomic switching wait behavior
percpu_ref: reorganize __percpu_ref_switch_to_atomic() and relocate percpu_ref_switch_to_atomic()
percpu_ref: remove unnecessary RCU grace period for staggered atomic switching confirmation
This affectively reverts commit 377ccbb483 ("Makefile: Mute warning
for __builtin_return_address(>0) for tracing only") because it turns out
that it really isn't tracing only - it's all over the tree.
We already also had the warning disabled separately for mm/usercopy.c
(which this commit also removes), and it turns out that we will also
want to disable it for get_lock_parent_ip(), that is used for at least
TRACE_IRQFLAGS. Which (when enabled) ends up being all over the tree.
Steven Rostedt had a patch that tried to limit it to just the config
options that actually triggered this, but quite frankly, the extra
complexity and abstraction just isn't worth it. We have never actually
had a case where the warning is actually useful, so let's just disable
it globally and not worry about it.
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Anvin <hpa@zytor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some of the kmemleak_*() callbacks in memblock, bootmem, CMA convert a
physical address to a virtual one using __va(). However, such physical
addresses may sometimes be located in highmem and using __va() is
incorrect, leading to inconsistent object tracking in kmemleak.
The following functions have been added to the kmemleak API and they take
a physical address as the object pointer. They only perform the
corresponding action if the address has a lowmem mapping:
kmemleak_alloc_phys
kmemleak_free_part_phys
kmemleak_not_leak_phys
kmemleak_ignore_phys
The affected calling places have been updated to use the new kmemleak
API.
Link: http://lkml.kernel.org/r/1471531432-16503-1-git-send-email-catalin.marinas@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Vignesh R <vigneshr@ti.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull more vfs updates from Al Viro:
">rename2() work from Miklos + current_time() from Deepa"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: Replace current_fs_time() with current_time()
fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
fs: Replace CURRENT_TIME with current_time() for inode timestamps
fs: proc: Delete inode time initializations in proc_alloc_inode()
vfs: Add current_time() api
vfs: add note about i_op->rename changes to porting
fs: rename "rename2" i_op to "rename"
vfs: remove unused i_op->rename
fs: make remaining filesystems use .rename2
libfs: support RENAME_NOREPLACE in simple_rename()
fs: support RENAME_NOREPLACE for local filesystems
ncpfs: fix unused variable warning