This function is getting full of weird tricks to avoid word-wrapping.
Use a goto to eliminate a tab stop then use the new space
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Xeon E7 v3 based systems supports Address Range Mirroring and UEFI BIOS
complied with UEFI spec 2.5 can notify which ranges are mirrored
(reliable) via EFI memory map. Now Linux kernel utilize its information
and allocates boot time memory from reliable region.
My requirement is:
- allocate kernel memory from mirrored region
- allocate user memory from non-mirrored region
In order to meet my requirement, ZONE_MOVABLE is useful. By arranging
non-mirrored range into ZONE_MOVABLE, mirrored memory is used for kernel
allocations.
My idea is to extend existing "kernelcore" option and introduces
kernelcore=mirror option. By specifying "mirror" instead of specifying
the amount of memory, non-mirrored region will be arranged into
ZONE_MOVABLE.
Earlier discussions are at:
https://lkml.org/lkml/2015/10/9/24https://lkml.org/lkml/2015/10/15/9https://lkml.org/lkml/2015/11/27/18https://lkml.org/lkml/2015/12/8/836
For example, suppose 2-nodes system with the following memory range:
node 0 [mem 0x0000000000001000-0x000000109fffffff]
node 1 [mem 0x00000010a0000000-0x000000209fffffff]
and the following ranges are marked as reliable (mirrored):
[0x0000000000000000-0x0000000100000000]
[0x0000000100000000-0x0000000180000000]
[0x0000000800000000-0x0000000880000000]
[0x00000010a0000000-0x0000001120000000]
[0x00000017a0000000-0x0000001820000000]
If you specify kernelcore=mirror, ZONE_NORMAL and ZONE_MOVABLE are
arranged like bellow:
- node 0:
ZONE_NORMAL : [0x0000000100000000-0x00000010a0000000]
ZONE_MOVABLE: [0x0000000180000000-0x00000010a0000000]
- node 1:
ZONE_NORMAL : [0x00000010a0000000-0x00000020a0000000]
ZONE_MOVABLE: [0x0000001120000000-0x00000020a0000000]
In overlapped range, pages to be ZONE_MOVABLE in ZONE_NORMAL are treated
as absent pages, and vice versa.
This patch (of 2):
Currently each zone's zone_start_pfn is calculated at
free_area_init_core(). However zone's range is fixed at the time when
invoking zone_spanned_pages_in_node().
This patch changes how each zone->zone_start_pfn is calculated in
zone_spanned_pages_in_node().
Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
SLUB already has a redzone debugging feature. But it is only positioned
at the end of object (aka right redzone) so it cannot catch left oob.
Although current object's right redzone acts as left redzone of next
object, first object in a slab cannot take advantage of this effect.
This patch explicitly adds a left red zone to each object to detect left
oob more precisely.
Background:
Someone complained to me that left OOB doesn't catch even if KASAN is
enabled which does page allocation debugging. That page is out of our
control so it would be allocated when left OOB happens and, in this
case, we can't find OOB. Moreover, SLUB debugging feature can be
enabled without page allocator debugging and, in this case, we will miss
that OOB.
Before trying to implement, I expected that changes would be too
complex, but, it doesn't look that complex to me now. Almost changes
are applied to debug specific functions so I feel okay.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When debug options are enabled, cmpxchg on the page is disabled. This
is because the page must be locked to ensure there are no false
positives when performing consistency checks. Some debug options such
as poisoning and red zoning only act on the object itself. There is no
need to protect other CPUs from modification on only the object. Allow
cmpxchg to happen with poisoning and red zoning are set on a slab.
Credit to Mathias Krause for the original work which inspired this
series
Signed-off-by: Laura Abbott <labbott@fedoraproject.org>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
SLAB_DEBUG_FREE allows expensive consistency checks at free to be turned
on or off. Expand its use to be able to turn off all consistency
checks. This gives a nice speed up if you only want features such as
poisoning or tracing.
Credit to Mathias Krause for the original work which inspired this
series
Signed-off-by: Laura Abbott <labbott@fedoraproject.org>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 19c7ff9ecd ("slub: Take node lock during object free
checks") check_object has been incorrectly returning success as it
follows the out label which just returns the node.
Thanks to refactoring, the out and fail paths are now basically the
same. Combine the two into one and just use a single label.
Credit to Mathias Krause for the original work which inspired this
series
Signed-off-by: Laura Abbott <labbott@fedoraproject.org>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This series takes the suggestion of Christoph Lameter and only focuses
on optimizing the slow path where the debug processing runs. The two
main optimizations in this series are letting the consistency checks be
skipped and relaxing the cmpxchg restrictions when we are not doing
consistency checks. With hackbench -g 20 -l 1000 averaged over 100
runs:
Before slub_debug=P
mean 15.607
variance .086
stdev .294
After slub_debug=P
mean 10.836
variance .155
stdev .394
This still isn't as fast as what is in grsecurity unfortunately so there's
still work to be done. Profiling ___slab_alloc shows that 25-50% of time
is spent in deactivate_slab. I haven't looked too closely to see if this
is something that can be optimized. My plan for now is to focus on
getting all of this merged (if appropriate) before digging in to another
task.
This patch (of 4):
Currently, free_debug_processing has a comment "Keep node_lock to preserve
integrity until the object is actually freed". In actuallity, the lock is
dropped immediately in __slab_free. Rather than wait until __slab_free
and potentially throw off the unlikely marking, just drop the lock in
__slab_free. This also lets free_debug_processing take its own copy of
the spinlock flags rather than trying to share the ones from __slab_free.
Since there is no use for the node afterwards, change the return type of
free_debug_processing to return an int like alloc_debug_processing.
Credit to Mathias Krause for the original work which inspired this series
[akpm@linux-foundation.org: fix build]
Signed-off-by: Laura Abbott <labbott@fedoraproject.org>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Current implementation of pfmemalloc handling in SLAB has some problems.
1) pfmemalloc_active is set to true when there is just one or more
pfmemalloc slabs in the system, but it is cleared when there is no
pfmemalloc slab in one arbitrary kmem_cache. So, pfmemalloc_active
could be wrongly cleared.
2) Search to partial and free list doesn't happen when non-pfmemalloc
object are not found in cpu cache. Instead, allocating new slab
happens and it is not optimal.
3) Even after sk_memalloc_socks() is disabled, cpu cache would keep
pfmemalloc objects tagged with SLAB_OBJ_PFMEMALLOC. It isn't cleared
if sk_memalloc_socks() is disabled so it could cause problem.
4) If cpu cache is filled with pfmemalloc objects, it would cause slow
down non-pfmemalloc allocation.
To me, current pointer tagging approach looks complex and fragile so this
patch re-implement whole thing instead of fixing problems one by one.
Design principle for new implementation is that
1) Don't disrupt non-pfmemalloc allocation in fast path even if
sk_memalloc_socks() is enabled. It's more likely case than pfmemalloc
allocation.
2) Ensure that pfmemalloc slab is used only for pfmemalloc allocation.
3) Don't consider performance of pfmemalloc allocation in memory
deficiency state.
As a result, all pfmemalloc alloc/free in memory tight state will be
handled in slow-path. If there is non-pfmemalloc free object, it will be
returned first even for pfmemalloc user in fast-path so that performance
of pfmemalloc user isn't affected in normal case and pfmemalloc objects
will be kept as long as possible.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Tested-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Returing values by reference is bad practice. Instead, just use
function return value.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Suggested-by: Christoph Lameter <cl@linux.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
SLAB needs an array to manage freed objects in a slab. It is only used
if some objects are freed so we can use free object itself as this
array. This requires additional branch in somewhat critical lock path
to check if it is first freed object or not but that's all we need.
Benefits is that we can save extra memory usage and reduce some
computational overhead by allocating a management array when new slab is
created.
Code change is rather complex than what we can expect from the idea, in
order to handle debugging feature efficiently. If you want to see core
idea only, please remove '#if DEBUG' block in the patch.
Although this idea can apply to all caches whose size is larger than
management array size, it isn't applied to caches which have a
constructor. If such cache's object is used for management array,
constructor should be called for it before that object is returned to
user. I guess that overhead overwhelm benefit in that case so this idea
doesn't applied to them at least now.
For summary, from now on, slab management type is determined by
following logic.
1) if management array size is smaller than object size and no ctor, it
becomes OBJFREELIST_SLAB.
2) if management array size is smaller than leftover, it becomes
NORMAL_SLAB which uses leftover as a array.
3) if OFF_SLAB help to save memory than way 4), it becomes OFF_SLAB.
It allocate a management array from the other cache so memory waste
happens.
4) others become NORMAL_SLAB. It uses dedicated internal memory in a
slab as a management array so it causes memory waste.
In my system, without enabling CONFIG_DEBUG_SLAB, Almost caches become
OBJFREELIST_SLAB and NORMAL_SLAB (using leftover) which doesn't waste
memory. Following is the result of number of caches with specific slab
management type.
TOTAL = OBJFREELIST + NORMAL(leftover) + NORMAL + OFF
/Before/
126 = 0 + 60 + 25 + 41
/After/
126 = 97 + 12 + 15 + 2
Result shows that number of caches that doesn't waste memory increase
from 60 to 109.
I did some benchmarking and it looks that benefit are more than loss.
Kmalloc: Repeatedly allocate then free test
/Before/
[ 0.286809] 1. Kmalloc: Repeatedly allocate then free test
[ 1.143674] 100000 times kmalloc(32) -> 116 cycles kfree -> 78 cycles
[ 1.441726] 100000 times kmalloc(64) -> 121 cycles kfree -> 80 cycles
[ 1.815734] 100000 times kmalloc(128) -> 168 cycles kfree -> 85 cycles
[ 2.380709] 100000 times kmalloc(256) -> 287 cycles kfree -> 95 cycles
[ 3.101153] 100000 times kmalloc(512) -> 370 cycles kfree -> 117 cycles
[ 3.942432] 100000 times kmalloc(1024) -> 413 cycles kfree -> 156 cycles
[ 5.227396] 100000 times kmalloc(2048) -> 622 cycles kfree -> 248 cycles
[ 7.519793] 100000 times kmalloc(4096) -> 1102 cycles kfree -> 452 cycles
/After/
[ 1.205313] 100000 times kmalloc(32) -> 117 cycles kfree -> 78 cycles
[ 1.510526] 100000 times kmalloc(64) -> 124 cycles kfree -> 81 cycles
[ 1.827382] 100000 times kmalloc(128) -> 130 cycles kfree -> 84 cycles
[ 2.226073] 100000 times kmalloc(256) -> 177 cycles kfree -> 92 cycles
[ 2.814747] 100000 times kmalloc(512) -> 286 cycles kfree -> 112 cycles
[ 3.532952] 100000 times kmalloc(1024) -> 344 cycles kfree -> 141 cycles
[ 4.608777] 100000 times kmalloc(2048) -> 519 cycles kfree -> 210 cycles
[ 6.350105] 100000 times kmalloc(4096) -> 789 cycles kfree -> 391 cycles
In fact, I tested another idea implementing OBJFREELIST_SLAB with
extendable linked array through another freed object. It can remove
memory waste completely but it causes more computational overhead in
critical lock path and it seems that overhead outweigh benefit. So, this
patch doesn't include it.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cache_init_objs() will be changed in following patch and current form
doesn't fit well for that change. So, before doing it, this patch
separates debugging initialization. This would cause two loop iteration
when debugging is enabled, but, this overhead seems too light than debug
feature itself so effect may not be visible. This patch will greatly
simplify changes in cache_init_objs() in following patch.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Slab list should be fixed up after object is detached from the slab and
this happens at two places. They do exactly same thing. They will be
changed in the following patch, so, to reduce code duplication, this
patch factor out them and make it common function.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To become an off slab, there are some constraints to avoid bootstrapping
problem and recursive call. This can be avoided differently by simply
checking that corresponding kmalloc cache is ready and it's not a off
slab. It would be more robust because static size checking can be
affected by cache size change or architecture type but dynamic checking
isn't.
One check 'freelist_cache->size > cachep->size / 2' is added to check
benefit of choosing off slab, because, now, there is no size constraint
which ensures enough advantage when selecting off slab.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can fail to setup off slab in some conditions. Even in this case,
debug pagealloc increases cache size to PAGE_SIZE in advance and it is
waste because debug pagealloc cannot work for it when it isn't the off
slab. To improve this situation, this patch checks first that this
cache with increased size is suitable for off slab. It actually
increases cache size when it is suitable for off-slab, so possible waste
is removed.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Current cache type determination code is open-code and looks not
understandable. Following patch will introduce one more cache type and
it would make code more complex. So, before it happens, this patch
abstracts these codes.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Finding suitable OFF_SLAB candidate is more related to aligned cache
size rather than original size. Same reasoning can be applied to the
debug pagealloc candidate. So, this patch moves up alignment fixup to
proper position. From that point, size is aligned so we can remove some
alignment fixups.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, the freelist is at the front of slab page. This requires
extra space to meet object alignment requirement. If we put the
freelist at the end of a slab page, objects could start at page boundary
and will be at correct alignment. This is possible because freelist has
no alignment constraint itself.
This gives us two benefits: It removes extra memory space for the
freelist alignment and remove complex calculation at cache
initialization step. I can't think notable drawback here.
I mentioned that this would reduce extra memory space, but, this benefit
is rather theoretical because it can be applied to very few cases.
Following is the example cache type that can get benefit from this
change.
size align num before after
32 8 124 4100 4092
64 8 63 4103 4095
88 8 46 4102 4094
272 8 15 4103 4095
408 8 10 4098 4090
32 16 124 4108 4092
64 16 63 4111 4095
32 32 124 4124 4092
64 32 63 4127 4095
96 32 42 4106 4074
before means whole size for objects and aligned freelist before applying
patch and after shows the result of this patch.
Since before is more than 4096, number of object should decrease and
memory waste happens.
Anyway, this patch removes complex calculation so looks beneficial to
me.
[akpm@linux-foundation.org: fix kerneldoc]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now, we don't use object status buffer in any setup. Remove it.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
DEBUG_SLAB_LEAK is a debug option. It's current implementation requires
status buffer so we need more memory to use it. And, it cause
kmem_cache initialization step more complex.
To remove this extra memory usage and to simplify initialization step,
this patch implement this feature with another way.
When user requests to get slab object owner information, it marks that
getting information is started. And then, all free objects in caches
are flushed to corresponding slab page. Now, we can distinguish all
freed object so we can know all allocated objects, too. After
collecting slab object owner information on allocated objects, mark is
checked that there is no free during the processing. If true, we can be
sure that our information is correct so information is returned to user.
Although this way is rather complex, it has two important benefits
mentioned above. So, I think it is worth changing.
There is one drawback that it takes more time to get slab object owner
information but it is just a debug option so it doesn't matter at all.
To help review, this patch implements new way only. Following patch
will remove useless code.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, open code for checking DEBUG_PAGEALLOC cache is spread to
some sites. It makes code unreadable and hard to change.
This patch cleans up this code. The following patch will change the
criteria for DEBUG_PAGEALLOC cache so this clean-up will help it, too.
[akpm@linux-foundation.org: fix build with CONFIG_DEBUG_PAGEALLOC=n]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
debug_pagealloc debugging is related to SLAB_POISON flag rather than
FORCED_DEBUG option, although FORCED_DEBUG option will enable
SLAB_POISON. Fix it.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some of "#if DEBUG" are for reporting slab implementation bug rather
than user usecase bug. It's not really needed because slab is stable
for a quite long time and it makes code too dirty. This patch remove
it.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is obsolete so remove it.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patchset implements a new freed object management way, that is,
OBJFREELIST_SLAB. Purpose of it is to reduce memory overhead in SLAB.
SLAB needs a array to manage freed objects in a slab. If there is
leftover after objects are packed into a slab, we can use it as a
management array, and, in this case, there is no memory waste. But, in
the other cases, we need to allocate extra memory for a management array
or utilize dedicated internal memory in a slab for it. Both cases
causes memory waste so it's not good.
With this patchset, freed object itself can be used for a management
array. So, memory waste could be reduced. Detailed idea and numbers
are described in last patch's commit description. Please refer it.
In fact, I tested another idea implementing OBJFREELIST_SLAB with
extendable linked array through another freed object. It can remove
memory waste completely but it causes more computational overhead in
critical lock path and it seems that overhead outweigh benefit. So,
this patchset doesn't include it. I will attach prototype just for a
reference.
This patch (of 16):
We use freelist_idx_t type for free object management whose size would be
smaller than size of unsigned int. Fix it.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix up trivial spelling errors, noticed while reading the code.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the call to cache_alloc_debugcheck_after() outside the IRQ disabled
section in kmem_cache_alloc_bulk().
When CONFIG_DEBUG_SLAB is disabled the compiler should remove this code.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch implements the alloc side of bulk API for the SLAB allocator.
Further optimization are still possible by changing the call to
__do_cache_alloc() into something that can return multiple objects.
This optimization is left for later, given end results already show in
the area of 80% speedup.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewers notice that the order in slab_post_alloc_hook() of
kmemcheck_slab_alloc() and kmemleak_alloc_recursive() gets swapped
compared to slab.c / SLAB allocator.
Also notice memset now occurs before calling kmemcheck_slab_alloc() and
kmemleak_alloc_recursive().
I assume this reordering of kmemcheck, kmemleak and memset is okay
because this is the order they are used by the SLUB allocator.
This patch completes the sharing of alloc_hook's between SLUB and SLAB.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the SLAB allocator kmemcheck_slab_alloc() is guarded against being
called in case the object is NULL. In SLUB allocator this NULL pointer
invocation can happen, which seems like an oversight.
Move the NULL pointer check into kmemcheck code (kmemcheck_slab_alloc)
so the check gets moved out of the fastpath, when not compiled with
CONFIG_KMEMCHECK.
This is a step towards sharing post_alloc_hook between SLUB and SLAB,
because slab_post_alloc_hook() does not perform this check before
calling kmemcheck_slab_alloc().
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Deduplicate code in SLAB allocator functions slab_alloc() and
slab_alloc_node() by using the slab_pre_alloc_hook() call, which is now
shared between SLUB and SLAB.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the SLAB specific function slab_should_failslab(), by moving the
check against fault-injection for the bootstrap slab, into the shared
function should_failslab() (used by both SLAB and SLUB).
This is a step towards sharing alloc_hook's between SLUB and SLAB.
This bootstrap slab "kmem_cache" is used for allocating struct
kmem_cache objects to the allocator itself.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
First step towards sharing alloc_hook's between SLUB and SLAB
allocators. Move the SLUB allocators *_alloc_hook to the common
mm/slab.h for internal slab definitions.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This change is primarily an attempt to make it easier to realize the
optimizations the compiler performs in-case CONFIG_MEMCG_KMEM is not
enabled.
Performance wise, even when CONFIG_MEMCG_KMEM is compiled in, the
overhead is zero. This is because, as long as no process have enabled
kmem cgroups accounting, the assignment is replaced by asm-NOP
operations. This is possible because memcg_kmem_enabled() uses a
static_key_false() construct.
It also helps readability as it avoid accessing the p[] array like:
p[size - 1] which "expose" that the array is processed backwards inside
helper function build_detached_freelist().
Lastly this also makes the code more robust, in error case like passing
NULL pointers in the array. Which were previously handled before commit
033745189b ("slub: add missing kmem cgroup support to
kmem_cache_free_bulk").
Fixes: 033745189b ("slub: add missing kmem cgroup support to kmem_cache_free_bulk")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
gcc-6.0 found an ancient bug in the paride driver, which had a
"module_param(verbose, bool, 0);" since before 2.6.12, but actually uses
it to accept '0', '1' or '2' as arguments:
drivers/block/paride/pd.c: In function 'pd_init_dev_parms':
drivers/block/paride/pd.c:298:29: warning: comparison of constant '1' with boolean expression is always false [-Wbool-compare]
#define DBMSG(msg) ((verbose>1)?(msg):NULL)
In 2012, Rusty did a cleanup patch that also changed the type of the
variable to 'bool', which introduced what is now a gcc warning.
This changes the type back to 'int' and adapts the module_param() line
instead, so it should work as documented in case anyone ever cares about
running the ancient driver with debugging.
Fixes: 90ab5ee941 ("module_param: make bool parameters really bool (drivers & misc)")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Rusty Russell <rusty@rustcorp.com.au>
Cc: Tim Waugh <tim@cyberelk.net>
Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch has been carried in the Android tree for quite some time and
is one of the few patches required to get a mainline kernel up and
running with an exsiting Android userspace. So I wanted to submit it
for review and consideration if it should be merged.
For partitions, add new uevent parameters 'PARTN' which specifies the
partitions index in the table, and 'PARTNAME', which specifies PARTNAME
specifices the partition name of a partition device.
Android's userspace uses this for creating device node links from the
partition name and number, ie:
/dev/block/platform/soc/by-name/system
or
/dev/block/platform/soc/by-num/p1
One can see its usage here:
https://android.googlesource.com/platform/system/core/+/master/init/devices.cpp#355
and
https://android.googlesource.com/platform/system/core/+/master/init/devices.cpp#494
[john.stultz@linaro.org: dropped NPARTS and reworded commit message for context]
Signed-off-by: Dima Zavin <dima@android.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Rom Lemarchand <romlem@google.com>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: <harald@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In dlm_send_join_cancels(), node is defined with type unsigned int, but
initialized with -1, this will lead variable overflow. Although this
won't cause any runtime problem, the code looks a little uncoordinated.
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
when o2hb detect a node down, it first set the dead node to recovery map
and create ocfs2rec which will replay journal for dead node. o2hb
thread then call dlm_do_local_recovery_cleanup() to delete the lock for
dead node. After the lock of dead node is gone, locks for other nodes
can be granted and may modify the meta data without replaying journal of
the dead node. The detail is described as follows.
N1 N2 N3(master)
modify the extent tree of
inode, and commit
dirty metadata to journal,
then goes down.
o2hb thread detects
N1 goes down, set
recovery map and
delete the lock of N1.
dlm_thread flush ast
for the lock of N2.
do not detect the death
of N1, so recovery map is
empty.
read inode from disk
without replaying
the journal of N1 and
modify the extent tree
of the inode that N1
had modified.
ocfs2rec recover the
journal of N1.
The modification of N2
is lost.
The modification of N1 and N2 are not serial, and it will lead to
read-only file system. We can set recovery_waiting flag to the lock
resource after delete the lock for dead node to prevent other node from
getting the lock before dlm recovery. After dlm recovery, the recovery
map on N2 is not empty, ocfs2_inode_lock_full_nested() will wait for ocfs2
recovery.
Signed-off-by: Jiufei Xue <xuejiufei@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If master migrate this lock resource to node when it happened to purge
it, a new lock resource will be created and inserted into hash list. If
then master goes down, the lock resource being purged is recovered, so
there exist two lock resource with different owner. So return error to
master if the lock resource is in DROPPING state, master will retry to
migrate this lock resource.
Signed-off-by: xuejiufei <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the master goes down after return in-progress for deref message. The
lock resource on non-master node can not be purged. Clear the
DROPPING_REF flag and recovery it.
Signed-off-by: xuejiufei <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Master returns in-progress to non-master node when it can not clear the
refmap bit right now. And non-master node will not purge the lock
resource until receiving deref done message.
Signed-off-by: xuejiufei <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This series of patches is to fix the dis-order issue of setting/clearing
refmap bit described below.
Node 1 Node 2(master)
dlmlock
dlm_do_master_request
dlm_master_request_handler
-> dlm_lockres_set_refmap_bit
dlmlock succeed
dlmunlock succeed
dlm_purge_lockres
dlm_deref_handler
-> find lock resource is in
DLM_LOCK_RES_SETREF_INPROG state,
so dispatch a deref work
dlm_purge_lockres succeed.
call dlmlock again
dlm_do_master_request
dlm_master_request_handler
-> dlm_lockres_set_refmap_bit
deref work trigger, call
dlm_lockres_clear_refmap_bit
to clear Node 1 from refmap
dlm_purge_lockres succeed
dlm_send_remote_lock_request
return DLM_IVLOCKID because
the lockres is not exist
BUG if the lockres is $RECOVERY
This series of patches add a new message to keep the order of set and
clear. Other nodes can purge the lock resource only after the refmap bit
on master is cleared.
This patch is to add DEREF_DONE message and corresponding handler. Node
can purge the lock resource after receiving this message. As a new
message is added, so increase the minor number of dlm protocol version.
Signed-off-by: xuejiufei <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Refer to cluster/tcp.h, NET_MAX_PAYLOAD_BYTES is a typo for
O2NET_MAX_PAYLOAD_BYTES.
Since currently DLM_MIG_LOCKRES_RESERVED is not actually used, it won't
cause any problem. But we'd better correct it for further use.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit a75e9ccabd ("ocfs2: use spinlock irqsave for downconvert lock")
missed an unmodified place in ocfs2_osb_dump(), so it still exists a
deadlock scenario.
ocfs2_wake_downconvert_thread
ocfs2_rw_unlock
ocfs2_dio_end_io
dio_complete
.....
bio_endio
req_bio_endio
....
scsi_io_completion
blk_done_softirq
__do_softirq
do_softirq
irq_exit
do_IRQ
ocfs2_osb_dump
cat /sys/kernel/debug/ocfs2/${uuid}/fs_state
This patch still uses spin_lock_irqsave() - replace spin_lock() to solve
this situation.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There actually no hardware or software interrupts in the context which
using o2hb_live_lock, so we don't need to worry about race conditions
caused by irq/softirq with spinlock held. Turning off irq is not good
for system performance after all. Just replace them with a non
interrupt safe function.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
blackfin allmodconfig build fails with the error:
../sound/core/pcm_native.c: In function 'snd_pcm_lib_default_mmap':
../sound/core/pcm_native.c:3386:24: error: implicit declaration of function 'pgprot_writecombine' [-Werror=implicit-function-declaration]
area->vm_page_prot = pgprot_writecombine(area->vm_page_prot);
^
../sound/core/pcm_native.c:3386:22: error: incompatible types when assigning to type 'pgprot_t {aka struct <anonymous>}' from type 'int'
area->vm_page_prot = pgprot_writecombine(area->vm_page_prot);
^
When !MMU, asm-generic will not define default pgprot_writecombine, so
blackfin needs to define it by itself.
The patch idea is from commit 65b9ab888c ("arch/c6x/include/asm/pgtable.h:
define dummy pgprot_writecombine for !MMU")
Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
Cc: Steven Miao <realmz6@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>