linux/mm
Roman Gushchin c03914b7aa mm: memcg/slab: postpone kmem_cache memcg pointer initialization to memcg_link_cache()
Patch series "mm: reparent slab memory on cgroup removal", v7.

# Why do we need this?

We've noticed that the number of dying cgroups is steadily growing on most
of our hosts in production.  The following investigation revealed an issue
in the userspace memory reclaim code [1], accounting of kernel stacks [2],
and also the main reason: slab objects.

The underlying problem is quite simple: any page charged to a cgroup holds
a reference to it, so the cgroup can't be reclaimed unless all charged
pages are gone.  If a slab object is actively used by other cgroups, it
won't be reclaimed, and will prevent the origin cgroup from being
reclaimed.

Slab objects, and first of all vfs cache, is shared between cgroups, which
are using the same underlying fs, and what's even more important, it's
shared between multiple generations of the same workload.  So if something
is running periodically every time in a new cgroup (like how systemd
works), we do accumulate multiple dying cgroups.

Strictly speaking pagecache isn't different here, but there is a key
difference: we disable protection and apply some extra pressure on LRUs of
dying cgroups, and these LRUs contain all charged pages.  My experiments
show that with the disabled kernel memory accounting the number of dying
cgroups stabilizes at a relatively small number (~100, depends on memory
pressure and cgroup creation rate), and with kernel memory accounting it
grows pretty steadily up to several thousands.

Memory cgroups are quite complex and big objects (mostly due to percpu
stats), so it leads to noticeable memory losses.  Memory occupied by dying
cgroups is measured in hundreds of megabytes.  I've even seen a host with
more than 100Gb of memory wasted for dying cgroups.  It leads to a
degradation of performance with the uptime, and generally limits the usage
of cgroups.

My previous attempt [3] to fix the problem by applying extra pressure on
slab shrinker lists caused a regressions with xfs and ext4, and has been
reverted [4].  The following attempts to find the right balance [5, 6]
were not successful.

So instead of trying to find a maybe non-existing balance, let's do
reparent accounted slab caches to the parent cgroup on cgroup removal.

# Implementation approach

There is however a significant problem with reparenting of slab memory:
there is no list of charged pages.  Some of them are in shrinker lists,
but not all.  Introducing of a new list is really not an option.

But fortunately there is a way forward: every slab page has a stable
pointer to the corresponding kmem_cache.  So the idea is to reparent
kmem_caches instead of slab pages.

It's actually simpler and cheaper, but requires some underlying changes:
1) Make kmem_caches to hold a single reference to the memory cgroup,
   instead of a separate reference per every slab page.
2) Stop setting page->mem_cgroup pointer for memcg slab pages and use
   page->kmem_cache->memcg indirection instead. It's used only on
   slab page release, so performance overhead shouldn't be a big issue.
3) Introduce a refcounter for non-root slab caches. It's required to
   be able to destroy kmem_caches when they become empty and release
   the associated memory cgroup.

There is a bonus: currently we release all memcg kmem_caches all together
with the memory cgroup itself.  This patchset allows individual
kmem_caches to be released as soon as they become inactive and free.

Some additional implementation details are provided in corresponding
commit messages.

# Results

Below is the average number of dying cgroups on two groups of our
production hosts.  They do run some sort of web frontend workload, the
memory pressure is moderate.  As we can see, with the kernel memory
reparenting the number stabilizes in 60s range; however with the original
version it grows almost linearly and doesn't show any signs of plateauing.
The difference in slab and percpu usage between patched and unpatched
versions also grows linearly.  In 7 days it exceeded 200Mb.

day           0    1    2    3    4    5    6    7
original     56  362  628  752 1070 1250 1490 1560
patched      23   46   51   55   60   57   67   69
mem diff(Mb) 22   74  123  152  164  182  214  241

# Links

[1]: commit 68600f623d ("mm: don't miss the last page because of round-off error")
[2]: commit 9b6f7e163c ("mm: rework memcg kernel stack accounting")
[3]: commit 172b06c32b ("mm: slowly shrink slabs with a relatively small number of objects")
[4]: commit a9a238e83f ("Revert "mm: slowly shrink slabs with a relatively small number of objects")
[5]: https://lkml.org/lkml/2019/1/28/1865
[6]: https://marc.info/?l=linux-mm&m=155064763626437&w=2

This patch (of 10):

Initialize kmem_cache->memcg_params.memcg pointer in memcg_link_cache()
rather than in init_memcg_params().

Once kmem_cache will hold a reference to the memory cgroup, it will
simplify the refcounting.

For non-root kmem_caches memcg_link_cache() is always called before the
kmem_cache becomes visible to a user, so it's safe.

Link: http://lkml.kernel.org/r/20190611231813.3148843-2-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12 11:05:43 -07:00
..
kasan mm/kasan: change kasan_check_{read,write} to return boolean 2019-07-12 11:05:42 -07:00
backing-dev.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
balloon_compaction.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
cleancache.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 482 2019-06-19 17:09:52 +02:00
cma_debug.c mm/cma_debug.c: fix the break condition in cma_maxchunk_get() 2019-05-14 09:47:45 -07:00
cma.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 98 2019-05-24 17:37:54 +02:00
cma.h License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
compaction.c mm, compaction: make sure we isolate a valid PFN 2019-06-01 15:51:32 -07:00
debug_page_ref.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
debug.c mm: update references to page _refcount 2019-05-14 19:52:47 -07:00
dmapool.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 403 2019-06-05 17:37:13 +02:00
early_ioremap.c mm/early_ioremap: Fix boot hang with earlyprintk=efi,keep 2017-12-11 14:54:44 +01:00
fadvise.c vfs: implement readahead(2) using POSIX_FADV_WILLNEED 2018-08-30 20:01:32 +02:00
failslab.c mm/failslab.c: by default, do not fail allocations with direct reclaim only 2019-07-12 11:05:43 -07:00
filemap.c mm/filemap.c: correct the comment about VM_FAULT_RETRY 2019-07-12 11:05:43 -07:00
frame_vector.c mm/frame_vector.c: release a semaphore in 'get_vaddr_frames()' 2017-12-14 16:00:48 -08:00
frontswap.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 482 2019-06-19 17:09:52 +02:00
gup_benchmark.c mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERM 2019-05-14 09:47:45 -07:00
gup.c mm/gup.c: make follow_page_mask() static 2019-07-12 11:05:42 -07:00
highmem.c mm: convert totalram_pages and totalhigh_pages variables to atomic 2018-12-28 12:11:47 -08:00
hmm.c mm/devm_memremap_pages: fix final page put race 2019-06-13 17:34:56 -10:00
huge_memory.c Revert "mm: page cache: store only head pages in i_pages" 2019-07-05 19:55:18 -07:00
hugetlb_cgroup.c mm: rename page_counter's count/limit into usage/max 2018-06-07 17:34:35 -07:00
hugetlb.c mm: hugetlb: soft-offline: dissolve_free_huge_page() return zero on !PageHuge 2019-06-29 16:43:45 +08:00
hwpoison-inject.c treewide: Add SPDX license identifier for more missed files 2019-05-21 10:50:45 +02:00
init-mm.c mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids 2018-07-17 09:35:30 +02:00
internal.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
interval_tree.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 248 2019-06-19 17:09:08 +02:00
Kconfig Linux 5.2-rc4 2019-06-14 14:18:53 -06:00
Kconfig.debug mm, debug_pagealloc: use a page type instead of page_ext flag 2019-07-12 11:05:43 -07:00
khugepaged.c Revert "mm: page cache: store only head pages in i_pages" 2019-07-05 19:55:18 -07:00
kmemleak-test.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 333 2019-06-05 17:37:06 +02:00
kmemleak.c mm/kmemleak.c: change error at _write when kmemleak is disabled 2019-07-12 11:05:42 -07:00
ksm.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 482 2019-06-19 17:09:52 +02:00
list_lru.c mm/list_lru.c: fix memory leak in __memcg_init_list_lru_node 2019-06-13 17:34:56 -10:00
maccess.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
madvise.c mm/mmu_notifier: use correct mmu_notifier events for each invalidation 2019-05-14 09:47:49 -07:00
Makefile mm: shuffle initial free memory to improve memory-side-cache utilization 2019-05-14 19:52:48 -07:00
memblock.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
memcontrol.c mm: memcontrol: dump memory.stat during cgroup OOM 2019-07-12 11:05:43 -07:00
memfd.c Revert "mm: page cache: store only head pages in i_pages" 2019-07-05 19:55:18 -07:00
memory_hotplug.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
memory-failure.c Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2019-07-08 21:48:15 -07:00
memory.c mm, swap: fix race between swapoff and some swap operations 2019-07-12 11:05:43 -07:00
mempolicy.c mm/mempolicy.c: fix an incorrect rebind node in mpol_rebind_nodemask 2019-06-29 16:43:44 +08:00
mempool.c docs/core-api/mm: fix return value descriptions in mm/ 2019-03-05 21:07:20 -08:00
memtest.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
migrate.c Revert "mm: page cache: store only head pages in i_pages" 2019-07-05 19:55:18 -07:00
mincore.c mm/mincore.c: fix race between swapoff and mincore 2019-07-12 11:05:43 -07:00
mlock.c mm/mlock.c: change count_mm_mlocked_page_nr return type 2019-06-13 17:34:56 -10:00
mm_init.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
mmap.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
mmu_context.c
mmu_gather.c mm: mmu_gather: remove __tlb_reset_range() for force flush 2019-06-13 17:34:56 -10:00
mmu_notifier.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 499 2019-06-19 17:09:53 +02:00
mmzone.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
mprotect.c mm/mprotect.c: fix compilation warning because of unused 'mm' variable 2019-05-14 09:47:51 -07:00
mremap.c mm/mmu_notifier: contextual information for event triggering invalidation 2019-05-14 09:47:49 -07:00
msync.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
nommu.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
oom_kill.c mm/oom_kill.c: fix uninitialized oc->constraint 2019-06-29 16:43:45 +08:00
page_alloc.c mm, debug_pagealloc: use a page type instead of page_ext flag 2019-07-12 11:05:43 -07:00
page_counter.c memcg: introduce memory.min 2018-06-07 17:34:36 -07:00
page_ext.c mm, debug_pagealloc: use a page type instead of page_ext flag 2019-07-12 11:05:43 -07:00
page_idle.c mm/page_idle.c: fix oops because end_pfn is larger than max_pfn 2019-06-29 16:43:45 +08:00
page_io.c mm, swap: use rbtree for swap_extent 2019-07-12 11:05:43 -07:00
page_isolation.c mm/page_isolation.c: change the prototype of undo_isolate_page_range() 2019-07-12 11:05:43 -07:00
page_owner.c mm/page_owner: Simplify stack trace handling 2019-04-29 12:37:50 +02:00
page_poison.c page_poison: play nicely with KASAN 2019-03-05 21:07:13 -08:00
page_vma_mapped.c mm/rmap: map_pte() was not handling private ZONE_DEVICE page properly 2018-10-31 08:54:11 -07:00
page-writeback.c mm: remove the account_page_dirtied export 2019-07-12 11:05:42 -07:00
pagewalk.c mm: kernel-doc: add missing parameter descriptions 2018-04-05 21:36:27 -07:00
percpu-internal.h percpu: convert chunk hints to be based on pcpu_block_md 2019-03-13 12:25:31 -07:00
percpu-km.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 428 2019-06-05 17:37:16 +02:00
percpu-stats.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 428 2019-06-05 17:37:16 +02:00
percpu-vm.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 428 2019-06-05 17:37:16 +02:00
percpu.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 428 2019-06-05 17:37:16 +02:00
pgtable-generic.c x86/mm: Page size aware flush_tlb_mm_range() 2018-10-09 16:51:11 +02:00
process_vm_access.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
quicklist.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
readahead.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
rmap.c mm/rmap.c: use the pra.mapcount to do the check 2019-05-14 09:47:49 -07:00
rodata_test.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 441 2019-06-05 17:37:17 +02:00
shmem.c Revert "mm: page cache: store only head pages in i_pages" 2019-07-05 19:55:18 -07:00
shuffle.c mm: maintain randomization of page free lists 2019-05-14 19:52:48 -07:00
shuffle.h mm: maintain randomization of page free lists 2019-05-14 19:52:48 -07:00
slab_common.c mm: memcg/slab: postpone kmem_cache memcg pointer initialization to memcg_link_cache() 2019-07-12 11:05:43 -07:00
slab.c mm: memcg/slab: postpone kmem_cache memcg pointer initialization to memcg_link_cache() 2019-07-12 11:05:43 -07:00
slab.h mm: memcg/slab: postpone kmem_cache memcg pointer initialization to memcg_link_cache() 2019-07-12 11:05:43 -07:00
slob.c mm/slab: refactor common ksize KASAN logic into slab_common.c 2019-07-12 11:05:42 -07:00
slub.c mm: memcg/slab: postpone kmem_cache memcg pointer initialization to memcg_link_cache() 2019-07-12 11:05:43 -07:00
sparse-vmemmap.c mm: remove include/linux/bootmem.h 2018-10-31 08:54:16 -07:00
sparse.c mm/sparse.c: clean up obsolete code comment 2019-05-14 09:47:48 -07:00
swap_cgroup.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
swap_slots.c mm, swap, get_swap_pages: use entry_size instead of cluster in parameter 2018-08-22 10:52:44 -07:00
swap_state.c mm/swap_state.c: simplify total_swapcache_pages() with get_swap_device() 2019-07-12 11:05:43 -07:00
swap.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
swapfile.c mm, swap: use rbtree for swap_extent 2019-07-12 11:05:43 -07:00
truncate.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
usercopy.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 2019-06-19 17:09:55 +02:00
userfaultfd.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 499 2019-06-19 17:09:53 +02:00
util.c prctl_set_mm: downgrade mmap_sem to read lock 2019-06-01 15:51:31 -07:00
vmacache.c mm: get rid of vmacache_flush_all() entirely 2018-09-13 15:18:04 -10:00
vmalloc.c arm64 updates for 5.3: 2019-07-08 09:54:55 -07:00
vmpressure.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 2019-06-19 17:09:55 +02:00
vmscan.c mm: vmscan: scan anonymous pages on file refaults 2019-07-12 11:05:39 -07:00
vmstat.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
workingset.c mm: memcontrol: make cgroup stats and events query API explicitly local 2019-05-14 19:52:53 -07:00
z3fold.c mm/z3fold.c: lock z3fold page before __SetPageMovable() 2019-07-12 11:05:40 -07:00
zbud.c treewide: Add SPDX license identifier for more missed files 2019-05-21 10:50:45 +02:00
zpool.c treewide: Add SPDX license identifier for more missed files 2019-05-21 10:50:45 +02:00
zsmalloc.c mm/zsmalloc.c: fix fall-through annotation 2018-10-26 16:26:35 -07:00
zswap.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 157 2019-05-30 11:26:37 -07:00