linux/mm
Muchun Song 6a6b7b77cc mm: list_lru: transpose the array of per-node per-memcg lru lists
Patch series "Optimize list lru memory consumption", v6.

On our server, we found a suspected memory leak: the kmalloc-32 slab
cache consumes more than 6GB of memory, while each of the other
kmem_caches consumes less than 2GB.

After in-depth analysis, we found that the kmalloc-32 consumption is
caused by list_lru_one allocations.

  crash> p memcg_nr_cache_ids
  memcg_nr_cache_ids = $2 = 24574

memcg_nr_cache_ids is very large, and the memory consumption of each
list_lru can be calculated with the following formula.

  num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)

There are 4 numa nodes in our system, so each list_lru consumes ~3MB.
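
Plugging in the numbers from this machine:

  4 (nodes) * 24574 (memcg_nr_cache_ids) * 32 bytes = 3,145,472 bytes ≈ 3MB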

  crash> list super_blocks | wc -l
  952

Every mount registers two list_lrus, one for inodes and one for
dentries, and there are 952 super_blocks, so the total memory is 952 *
2 * 3MB (~5.6GB).  Yet the number of memory cgroups currently on the
machine is less than 500, so I guess more than 12286 memory cgroups
were created on this machine at some point (I do not know why there
were so many cgroups; it may be a user bug, or the user may really have
wanted to do that).  Because memcg_nr_cache_ids is never reduced back
to a suitable value afterwards, a lot of memory is wasted, and the only
way to reduce memcg_nr_cache_ids today is to *reboot* the server.  This
is not what we want.

I previously posted a patchset [1] to reduce memcg_nr_cache_ids, but it
did not fundamentally solve the problem.

We currently allocate scope for every memcg to be tracked on every
superblock instantiated in the system, regardless of whether that
superblock is even accessible to that memcg.

These huge memcg counts come from container hosts where each memcg is
confined to just a small subset of the superblocks instantiated at any
given point in time.

For these systems with huge container counts, list_lru does not need the
capability of tracking every memcg on every superblock.

What it comes down to is that a list_lru is only needed for a given
memcg if that memcg is instantiating and freeing objects on that
list_lru.

As Dave said, "Which makes me think we should be moving more towards 'add
the memcg to the list_lru at the first insert' model rather than
'instantiate all at memcg init time just in case'."
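
In pseudo-C, that model looks roughly like the sketch below.  The
helpers memcg_lru_lookup() and memcg_lru_alloc() are purely
illustrative names, not APIs this series adds, and the real
list_lru_add() derives the nid and memcg from the object itself:

  /*
   * Hypothetical sketch of the "add the memcg to the list_lru at the
   * first insert" model: allocate the per-memcg lists lazily, on
   * first use, instead of for every memcg at init time.
   */
  bool list_lru_add_lazy(struct list_lru *lru, struct list_head *item,
                         int nid, struct mem_cgroup *memcg)
  {
          struct list_lru_one *l;

          l = memcg_lru_lookup(lru, nid, memcg);
          if (!l) {
                  /* First insert from this memcg: allocate its lists now. */
                  l = memcg_lru_alloc(lru, nid, memcg);
                  if (!l)
                          return false;
          }

          list_add_tail(item, &l->list);
          l->nr_items++;
          return true;
  }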

This patchset aims to optimize the list lru memory consumption from
different aspects.

I ran a simple test to demonstrate the optimization: create 10k memory
cgroups and mount 10k filesystems, then use the free command to show
how much memory the system consumes after this operation (there are 2
NUMA nodes in the test system).

        +-----------------------+------------------------+
        |      condition        |   memory consumption   |
        +-----------------------+------------------------+
        | without this patchset |        24464 MB        |
        +-----------------------+------------------------+
        |     after patch 1     |        21957 MB        | <--------+
        +-----------------------+------------------------+          |
        |     after patch 10    |         6895 MB        |          |
        +-----------------------+------------------------+          |
        |     after patch 12    |         4367 MB        |          |
        +-----------------------+------------------------+          |
                                                                    |
        The more nodes in the system, the greater the effect--------+

BTW, there was a recent discussion [2] on the same issue.

[1] https://lore.kernel.org/all/20210428094949.43579-1-songmuchun@bytedance.com/
[2] https://lore.kernel.org/all/20210405054848.GA1077931@in.ibm.com/

This series not only optimizes the memory usage of list_lru but also
simplifies the code.

This patch (of 16):

The current scheme of maintaining per-node per-memcg lru lists looks like:
  struct list_lru {
    struct list_lru_node *node;           (for each node)
      struct list_lru_memcg *memcg_lrus;
        struct list_lru_one *lru[];       (for each memcg)
  }

By effectively transposing the two-dimensional array of list_lru_one
structures (per-node per-memcg => per-memcg per-node), it is possible to
save some memory and simplify the alloc/dealloc paths. The new scheme looks like:
  struct list_lru {
    struct list_lru_memcg *mlrus;
      struct list_lru_per_memcg *mlru[];  (for each memcg)
        struct list_lru_one node[0];      (for each node)
  }
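
In terms of the fields in the diagrams above, the lookup of a given
memcg's list on a given node is transposed roughly as follows (the
actual code in mm/list_lru.c differs in detail, e.g. RCU dereferences):

  /* before: index by node first, then by memcg */
  l = lru->node[nid].memcg_lrus->lru[memcg_idx];

  /* after: index by memcg first, then by node */
  l = &lru->mlrus->mlru[memcg_idx]->node[nid];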

The memory savings come not only from eliminating 'struct rcu_head' but
also from the pointer arrays used to store pointers to 'struct
list_lru_one'.  There is one such array per node, and its size is 8
bytes (one pointer) * num_memcgs, so the total size of the arrays is
8 * num_nodes * memcg_nr_cache_ids.  After this patch, the size becomes
8 * memcg_nr_cache_ids.
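
With the numbers from the machine above (4 NUMA nodes,
memcg_nr_cache_ids = 24574), the pointer-array overhead per list_lru is:

  before: 8 * 4 * 24574 = 786,368 bytes (~768KB)
  after:  8 * 24574     = 196,592 bytes (~192KB)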

Link: https://lkml.kernel.org/r/20220228122126.37293-1-songmuchun@bytedance.com
Link: https://lkml.kernel.org/r/20220228122126.37293-2-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Alex Shi <alexs@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Kari Argillander <kari.argillander@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22 15:57:03 -07:00
damon mm/damon: hide kernel pointer from tracepoint event 2022-01-15 16:30:33 +02:00
kasan lib/stackdepot: always do filter_irq_stacks() in stack_depot_save() 2022-01-22 08:33:38 +02:00
kfence kfence: make test case compatible with run time set sample interval 2022-02-11 17:55:00 -08:00
backing-dev.c remove congestion tracking framework 2022-03-22 15:57:01 -07:00
balloon_compaction.c
bootmem_info.c bootmem: Use page->index instead of page->freelist 2022-01-06 12:27:03 +01:00
cma_debug.c
cma_sysfs.c
cma.c memblock: rename memblock_free to memblock_phys_free 2021-11-06 13:30:41 -07:00
cma.h
compaction.c mm: compaction: fix the migration stats in trace_mm_compaction_migratepages() 2022-01-15 16:30:30 +02:00
debug_page_ref.c
debug_vm_pgtable.c mm/debug_vm_pgtable: remove pte entry from the page table 2022-02-04 09:25:04 -08:00
debug.c mm,fs: split dump_mapping() out from dump_page() 2022-01-15 16:30:26 +02:00
dmapool.c mm/dmapool.c: revert "make dma pool to use kmalloc_node" 2022-01-15 16:30:28 +02:00
early_ioremap.c mm/early_ioremap.c: remove redundant early_ioremap_shutdown() 2021-09-08 11:50:24 -07:00
fadvise.c remove inode_congested() 2022-03-22 15:57:01 -07:00
failslab.c
filemap.c tmpfs: do not allocate pages on read 2022-03-22 15:57:02 -07:00
folio-compat.c filemap: Add filemap_release_folio() 2022-01-04 13:15:34 -05:00
frontswap.c frontswap: remove support for multiple ops 2022-01-22 08:33:38 +02:00
gup_test.c
gup_test.h
gup.c mm/gup: remove unused get_user_pages_locked() 2022-03-22 15:57:01 -07:00
highmem.c Fixes for 5.16 folios: 2021-11-25 10:13:56 -08:00
hmm.c mm/hmm.c: allow VM_MIXEDMAP to work with hmm_range_fault 2022-01-15 16:30:31 +02:00
huge_memory.c Merge branch 'akpm' (patches from Andrew) 2022-01-15 20:37:06 +02:00
hugetlb_cgroup.c hugetlb: add hugetlb.*.numa_stat file 2022-01-15 16:30:29 +02:00
hugetlb_vmemmap.c mm: hugetlb: introduce CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON 2021-06-30 20:47:26 -07:00
hugetlb_vmemmap.h mm: hugetlb: introduce nr_free_vmemmap_pages in the struct hstate 2021-06-30 20:47:25 -07:00
hugetlb.c hugetlbfs: fix a truncation issue in hugepages parameter 2022-02-26 09:51:17 -08:00
hwpoison-inject.c mm: hwpoison: don't drop slab caches for offlining non-LRU page 2021-09-03 09:58:15 -07:00
init-mm.c mm: add setup_initial_init_mm() helper 2021-07-08 11:48:21 -07:00
internal.h Merge branch 'akpm' (patches from Andrew) 2022-01-15 20:37:06 +02:00
interval_tree.c
io-mapping.c
ioremap.c mm: move ioremap_page_range to vmalloc.c 2021-09-08 11:50:24 -07:00
Kconfig mm: hide the FRONTSWAP Kconfig symbol 2022-01-22 08:33:38 +02:00
Kconfig.debug mm: page table check 2022-01-15 16:30:28 +02:00
khugepaged.c mm/page_table_check: check entries at pmd levels 2022-02-04 09:25:04 -08:00
kmemleak.c mm/kmemleak: avoid scanning potential huge holes 2022-02-04 09:25:05 -08:00
ksm.c mm: ksm: fix use-after-free kasan report in ksm_might_need_to_copy 2022-01-15 16:30:31 +02:00
list_lru.c mm: list_lru: transpose the array of per-node per-memcg lru lists 2022-03-22 15:57:03 -07:00
maccess.c ARM: 9115/1: mm/maccess: fix unaligned copy_{from,to}_kernel_nofault 2021-08-20 11:39:25 +01:00
madvise.c mm: fix use-after-free when anon vma name is used after vma is freed 2022-03-05 11:08:32 -08:00
Makefile mm: remove cleancache 2022-01-22 08:33:38 +02:00
mapping_dirty_helpers.c mm: move tlb_flush_pending inline helpers to mm_inline.h 2022-01-15 16:30:27 +02:00
memblock.c memblock: use kfree() to release kmalloced memblock regions 2022-02-20 08:45:39 +02:00
memcontrol.c mm/memcg: disable migration instead of preemption in drain_all_stock(). 2022-03-22 15:57:03 -07:00
memfd.c memfd: fix F_SEAL_WRITE after shmem huge page allocated 2022-03-05 11:08:32 -08:00
memory_hotplug.c treewide: Add missing includes masked by cgroup -> bpf dependency 2021-12-03 10:58:13 -08:00
memory-failure.c memory-failure: fetch compound_head after pgmap_pfn_valid() 2022-01-30 09:56:58 +02:00
memory.c Merge branch 'akpm' (patches from Andrew) 2022-01-20 10:41:01 +02:00
mempolicy.c mm: change lookup_node() to use get_user_pages_fast() 2022-03-22 15:57:01 -07:00
mempool.c mm: remove spurious blkdev.h includes 2021-10-18 06:17:01 -06:00
memremap.c mm/memremap: avoid calling kasan_remove_zero_shadow() for device private memory 2022-03-22 15:57:01 -07:00
memtest.c
migrate.c mm/gup: follow_pfn_pte(): -EEXIST cleanup 2022-03-22 15:57:01 -07:00
mincore.c
mlock.c mm: refactor vm_area_struct::anon_vma_name usage code 2022-03-05 11:08:32 -08:00
mm_init.c
mmap_lock.c mm: mmap_lock: fix disabling preemption directly 2021-07-23 17:43:28 -07:00
mmap.c mm: refactor vm_area_struct::anon_vma_name usage code 2022-03-05 11:08:32 -08:00
mmu_gather.c mm: move tlb_flush_pending inline helpers to mm_inline.h 2022-01-15 16:30:27 +02:00
mmu_notifier.c
mmzone.c
mprotect.c mm: refactor vm_area_struct::anon_vma_name usage code 2022-03-05 11:08:32 -08:00
mremap.c mm, hugepages: add mremap() support for hugepage backed vma 2021-11-06 13:30:39 -07:00
msync.c
nommu.c Merge branch 'akpm' (patches from Andrew) 2021-11-06 14:08:17 -07:00
oom_kill.c Merge branch 'signal-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2022-01-17 05:49:30 +02:00
page_alloc.c Merge branch 'akpm' (patches from Andrew) 2022-01-20 10:41:01 +02:00
page_counter.c mm/page_counter: remove an incorrect call to propagate_protected_usage() 2022-01-15 16:30:27 +02:00
page_ext.c mm: make some vars and functions static or __init 2022-01-15 16:30:31 +02:00
page_idle.c mm/idle_page_tracking: make PG_idle reusable 2021-09-08 11:50:24 -07:00
page_io.c delayacct: support swapin delay accounting for swapping without blkio 2022-01-20 08:52:55 +02:00
page_isolation.c Revert "mm/page_isolation: unset migratetype directly for non Buddy page" 2022-02-04 09:25:04 -08:00
page_owner.c lib/stackdepot: allow optional init and stack_table allocation by kvmalloc() 2022-01-22 08:33:37 +02:00
page_poison.c
page_reporting.c mm/page_reporting: allow driver to specify reporting order 2021-06-29 10:53:47 -07:00
page_reporting.h
page_table_check.c mm/page_table_check: check entries at pmd levels 2022-02-04 09:25:04 -08:00
page_vma_mapped.c mm: device exclusive memory access 2021-07-01 11:06:03 -07:00
page-writeback.c mm/writeback: minor clean up for highmem_dirtyable_memory 2022-03-22 15:57:01 -07:00
pagewalk.c mm: pagewalk: fix walk for hugepage tables 2021-06-29 10:53:49 -07:00
percpu-internal.h mm: memcg/percpu: account extra objcg space to memory cgroups 2022-01-15 16:30:31 +02:00
percpu-km.c percpu: flush tlb in pcpu_reclaim_populated() 2021-07-04 18:30:17 +00:00
percpu-stats.c
percpu-vm.c percpu: flush tlb in pcpu_reclaim_populated() 2021-07-04 18:30:17 +00:00
percpu.c bitmap patches for 5.17-rc1 2022-01-23 06:20:44 +02:00
pgalloc-track.h
pgtable-generic.c mm: move tlb_flush_pending inline helpers to mm_inline.h 2022-01-15 16:30:27 +02:00
process_vm_access.c
ptdump.c
readahead.c remove inode_congested() 2022-03-22 15:57:01 -07:00
rmap.c mm/rmap: fix potential batched TLB flush race 2022-01-15 16:30:31 +02:00
rodata_test.c
secretmem.c mm/secretmem: avoid letting secretmem_users drop to zero 2021-10-28 17:18:55 -07:00
shmem.c mm: shmem: use helper macro __ATTR_RW 2022-03-22 15:57:02 -07:00
shuffle.c
shuffle.h
slab_common.c Merge branch 'akpm' (patches from Andrew) 2022-01-15 20:37:06 +02:00
slab.c mm/kasan: Convert to struct folio and struct slab 2022-01-06 12:26:14 +01:00
slab.h slab changes for 5.17 - part 2 2022-01-18 06:40:47 +02:00
slob.c mm/slob: Remove unnecessary page_mapcount_reset() function call 2022-01-06 12:27:28 +01:00
slub.c mm/slub: Define struct slab fields for CONFIG_SLUB_CPU_PARTIAL only when enabled 2022-01-06 12:26:53 +01:00
sparse-vmemmap.c mm: remove redundant smp_wmb() 2021-11-06 13:30:36 -07:00
sparse.c bootmem: Use page->index instead of page->freelist 2022-01-06 12:27:03 +01:00
swap_cgroup.c
swap_slots.c treewide: Add missing includes masked by cgroup -> bpf dependency 2021-12-03 10:58:13 -08:00
swap_state.c mm: swap: get rid of livelock in swapin readahead 2022-03-17 11:02:13 -07:00
swap.c mm/swap: fix confusing comment in folio_mark_accessed 2022-03-22 15:57:01 -07:00
swapfile.c mm: mark swap_lock and swap_active_head static 2022-01-22 08:33:38 +02:00
truncate.c mm: remove cleancache 2022-01-22 08:33:38 +02:00
usercopy.c mm: Convert check_heap_object() to use struct slab 2022-01-06 12:25:51 +01:00
userfaultfd.c mm: shmem: don't truncate page if memory failure happens 2022-01-15 16:30:26 +02:00
util.c mm: Consider __GFP_NOWARN flag for oversized kvmalloc() calls 2022-03-04 10:00:37 -08:00
vmacache.c
vmalloc.c mm/vmalloc: be more explicit about supported gfp flags. 2022-01-15 16:30:28 +02:00
vmpressure.c mm/vmpressure: fix data-race with memcg->socket_pressure 2021-11-06 13:30:40 -07:00
vmscan.c remove bdi_congested() and wb_congested() and related functions 2022-03-22 15:57:01 -07:00
vmstat.c mm/vmstat: add events for THP max_ptes_* exceeds 2022-01-15 16:30:29 +02:00
workingset.c Merge branch 'akpm' (patches from Andrew) 2021-11-09 10:11:53 -08:00
z3fold.c mm/z3fold: add kerneldoc fields for z3fold_pool 2021-07-01 11:06:03 -07:00
zbud.c mm/zbud: add kerneldoc fields for zbud_pool 2021-07-01 11:06:03 -07:00
zpool.c zpool: remove the list of pools_head 2022-01-15 16:30:31 +02:00
zsmalloc.c zsmalloc: replace get_cpu_var with local_lock 2022-01-22 08:33:37 +02:00
zswap.c frontswap: remove support for multiple ops 2022-01-22 08:33:38 +02:00