linux/mm
Roman Gushchin 4a87e2a25d mm: memcg/slab: fix percpu slab vmstats flushing
Currently slab percpu vmstats are flushed twice: during memcg
offlining and just before freeing the memcg structure.  Each time the
percpu counters are summed, added to their atomic counterparts and
propagated up the cgroup tree.

The second flushing is required by how recursive vmstats are
implemented: counters are batched in percpu variables at the local
level, and once a percpu value crosses a predefined threshold, it
spills over into the atomic values at the local and each ancestor
level.  This means that without flushing, numbers cached in percpu
variables are dropped on the floor each time a cgroup is destroyed,
and over time the error at upper levels can become noticeable.
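
To illustrate, below is a minimal userspace model of this batching
scheme (the struct layout, BATCH and NR_CPUS values and function
names are illustrative; the kernel's real update path is
__mod_memcg_state() in mm/memcontrol.c, batched by
MEMCG_CHARGE_BATCH):

    #include <stdio.h>
    #include <stdlib.h>

    #define BATCH   32              /* models MEMCG_CHARGE_BATCH */
    #define NR_CPUS 4               /* illustrative cpu count */

    struct cgroup {
            struct cgroup *parent;
            long atomic_count;      /* models the atomic vmstat counter */
            long percpu[NR_CPUS];   /* models the percpu batch */
    };

    /* Batched update: the cached value spills over to this and every
     * ancestor level only once it crosses the threshold. */
    static void mod_state(struct cgroup *cg, int cpu, long val)
    {
            long x = cg->percpu[cpu] + val;

            if (labs(x) > BATCH) {
                    for (struct cgroup *c = cg; c; c = c->parent)
                            c->atomic_count += x;
                    x = 0;
            }
            cg->percpu[cpu] = x;
    }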

The first flushing aims to make counters at ancestor levels more
precise.  Dying cgroups may remain in the dying state for a long time.
After the kmem_cache reparenting performed during offlining, slab
counters of the dying cgroup have no chance of being updated, because
all slab operations happen at the parent level.  This means the
inaccuracy caused by percpu batching will not decrease until the final
destruction of the cgroup.  The original idea was that flushing slab
counters during offlining would minimize the visible inaccuracy of
slab counters at the parent level.

The problem is that the percpu counters are not zeroed after the first
flushing, so every cached percpu value is summed twice.  This creates
a small error (up to 32 pages per cpu, but usually less) which
accumulates at the parent cgroup level.  After creating and destroying
thousands of child cgroups, the slab counter at the parent level can
be way off from the real value.
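
Continuing the userspace model above, a sketch of the buggy double
summing (flush() and main() are illustrative, not the kernel code):

    /* Flush: push cached percpu values to this and each ancestor
     * level.  Zeroing is the step the buggy version skipped. */
    static void flush(struct cgroup *cg, int zero)
    {
            int cpu;

            for (cpu = 0; cpu < NR_CPUS; cpu++) {
                    for (struct cgroup *c = cg; c; c = c->parent)
                            c->atomic_count += cg->percpu[cpu];
                    if (zero)
                            cg->percpu[cpu] = 0;
            }
    }

    int main(void)
    {
            struct cgroup parent = { 0 };
            struct cgroup child = { .parent = &parent };

            mod_state(&child, 0, 10);  /* below BATCH: stays cached */
            flush(&child, 0);          /* offlining: parent += 10 */
            flush(&child, 0);          /* final free: parent += 10 again */
            printf("parent: %ld, expected 10\n", parent.atomic_count);
            return 0;
    }

This prints 20: each child cgroup leaks up to BATCH extra pages per
cpu into the parent on every create-and-destroy cycle.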

For now, let's just stop flushing slab counters on memcg offlining.
It can't be done correctly without scheduling work on each cpu:
reading and zeroing a percpu counter during css offlining can race
with an asynchronous update on another cpu, which doesn't expect the
value to change underneath it.
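
The race, sketched (assuming the updater does a non-atomic
read-modify-write of its local slot, as __mod_memcg_state() does via
__this_cpu_read()/__this_cpu_write()):

    CPU 0 (css offlining)           CPU 1 (slab update)
    ---------------------           -------------------
                                    x = percpu[1];      /* x = 10 */
    sum += percpu[1];   /* 10 */
    percpu[1] = 0;
                                    percpu[1] = x + 1;  /* 11: the zeroing
                                                           is lost and the
                                                           10 is counted
                                                           again later */

Making the flush run locally on each cpu (via scheduled work) would
close this race, at the performance cost discussed below.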

With this change, slab counters at the parent level become eventually
consistent: once all dying children are gone, the values are correct.
Until then, the error is capped at 32 * NR_CPUS pages per dying
cgroup.
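
For scale, with the 32-page batch and 4 KiB pages, a hypothetical
128-cpu machine gives:

    32 pages/cpu * 128 cpus * 4 KiB/page = 16 MiB

of worst-case slab counter error per dying cgroup; crucially, the
error no longer accumulates across created-and-destroyed cgroups.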

It's not perfect: because slabs are reparented, any updates after the
reparenting happen at the parent level.  This means that if a slab
page was allocated and a counter at the child level was bumped, then
the page was reparented and freed, the annihilation of the positive
and negative counter values will not happen until the child cgroup is
released.  That makes slab counters different from the others, and may
push us to implement flushing in a correct form again.  But it's also
a question of performance: scheduling work on each cpu isn't free, and
it's an open question whether the benefit of more accurate counters is
worth it.

We might also consider flushing all counters on offlining, not only slab
counters.

So let's fix the main problem now: make the slab counters eventually
consistent, so that at least the error won't grow with uptime (or,
more precisely, with the number of created and destroyed cgroups), and
think about counter accuracy separately.

Link: http://lkml.kernel.org/r/20191220042728.1045881-1-guro@fb.com
Fixes: bee07b33db ("mm: memcontrol: flush percpu slab vmstats on kmem offlining")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-13 18:19:02 -08:00
kasan kasan: use apply_to_existing_page_range() for releasing vmalloc shadow 2019-12-17 20:59:59 -08:00
backing-dev.c bdi: Do not use freezable workqueue 2019-10-06 09:11:35 -06:00
balloon_compaction.c mm/balloon_compaction: suppress allocation warnings 2019-09-04 07:42:01 -04:00
cleancache.c Driver Core and debugfs changes for 5.3-rc1 2019-07-12 12:24:03 -07:00
cma_debug.c mm/cma_debug.c: use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs fops 2019-12-01 12:59:09 -08:00
cma.c mm/cma.c: switch to bitmap_zalloc() for cma bitmap allocation 2019-12-01 12:59:09 -08:00
cma.h
compaction.c mm, compaction: fix wrong pfn handling in __reset_isolation_pfn() 2019-10-14 15:04:01 -07:00
debug_page_ref.c
debug.c mm/debug.c: PageAnon() is true for PageKsm() pages 2019-11-15 18:34:00 -08:00
dmapool.c mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options 2019-07-12 11:05:46 -07:00
early_ioremap.c
fadvise.c fs: Export generic_fadvise() 2019-08-30 22:43:58 -07:00
failslab.c mm/failslab.c: by default, do not fail allocations with direct reclaim only 2019-07-12 11:05:43 -07:00
filemap.c mm: drop mmap_sem before calling balance_dirty_pages() in write fault 2019-12-01 06:29:18 -08:00
frame_vector.c mm: untag user pointers in get_vaddr_frames 2019-09-25 17:51:41 -07:00
frontswap.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 482 2019-06-19 17:09:52 +02:00
gup_benchmark.c mm/gup: fix memory leak in __gup_benchmark_ioctl 2020-01-04 13:55:09 -08:00
gup.c mm/gup.c: fix comments of __get_user_pages() and get_user_pages_remote() 2019-12-01 06:29:18 -08:00
highmem.c mm: convert totalram_pages and totalhigh_pages variables to atomic 2018-12-28 12:11:47 -08:00
hmm.c mm/hmm: remove hmm_range_dma_map and hmm_range_dma_unmap 2019-11-23 19:56:45 -04:00
huge_memory.c mm/huge_memory.c: thp: fix conflict of above-47bit hint address and PMD alignment 2020-01-13 18:19:01 -08:00
hugetlb_cgroup.c mm: hugetlb: switch to css_tryget() in hugetlb_cgroup_charge_cgroup() 2019-11-15 18:34:00 -08:00
hugetlb.c mm/hugetlb: defer freeing of huge pages if in non-task context 2020-01-04 13:55:09 -08:00
hwpoison-inject.c mm/hwpoison-inject: use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs fops 2019-12-01 12:59:09 -08:00
init-mm.c mm/init-mm.c: include <linux/mman.h> for vm_committed_as_batch 2019-10-19 06:32:32 -04:00
internal.h mm, pcpu: make zone pcp updates and reset internal to the mm 2019-12-01 12:59:06 -08:00
interval_tree.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 248 2019-06-19 17:09:08 +02:00
Kconfig mm/Kconfig: fix trivial help text punctuation 2019-12-01 12:59:10 -08:00
Kconfig.debug mm, page_owner, debug_pagealloc: save and dump freeing stack trace 2019-09-24 15:54:08 -07:00
khugepaged.c mm/thp: flush file for !is_shmem PageDirty() case in collapse_file() 2019-12-01 12:59:09 -08:00
kmemleak-test.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 333 2019-06-05 17:37:06 +02:00
kmemleak.c kmemleak: Do not corrupt the object_list during clean-up 2019-10-14 08:56:16 -07:00
ksm.c * PPC secure guest support 2019-12-04 11:08:30 -08:00
list_lru.c mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages 2019-07-12 11:05:44 -07:00
maccess.c uaccess: Add strict non-pagefault kernel-space read function 2019-11-02 12:39:12 -07:00
madvise.c mm/madvise.c: use PAGE_ALIGN[ED] for range checking 2019-12-01 12:59:09 -08:00
Makefile mm: Add write-protect and clean utilities for address space ranges 2019-11-06 13:03:36 +01:00
mapping_dirty_helpers.c mm: Add write-protect and clean utilities for address space ranges 2019-11-06 13:03:36 +01:00
memblock.c mm: support memblock alloc on the exact node for sparse_buffer_init() 2019-12-01 12:59:08 -08:00
memcontrol.c mm: memcg/slab: fix percpu slab vmstats flushing 2020-01-13 18:19:02 -08:00
memfd.c mm: page cache: store only head pages in i_pages 2019-09-24 15:54:08 -07:00
memory_hotplug.c mm/memory_hotplug: shrink zones when offlining memory 2020-01-04 13:55:08 -08:00
memory-failure.c mm/memory-failure.c: use page_shift() in add_to_kill() 2019-12-01 12:59:04 -08:00
memory.c mm/memory.c: add apply_to_existing_page_range() helper 2019-12-17 20:59:59 -08:00
mempolicy.c mm, thp: tweak reclaim/compaction effort of local-only and all-node allocations 2020-01-13 18:19:01 -08:00
mempool.c docs/core-api/mm: fix return value descriptions in mm/ 2019-03-05 21:07:20 -08:00
memremap.c mm/memory_hotplug: shrink zones when offlining memory 2020-01-04 13:55:08 -08:00
memtest.c
migrate.c mm: move_pages: return valid node id in status if the page is already on the target node 2020-01-04 13:55:09 -08:00
mincore.c mm: untag user pointers passed to memory syscalls 2019-09-25 17:51:41 -07:00
mlock.c mm: untag user pointers passed to memory syscalls 2019-09-25 17:51:41 -07:00
mm_init.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
mmap.c arm64: Revert support for execute-only user mappings 2020-01-06 10:10:07 -08:00
mmu_context.c
mmu_gather.c mm: remove quicklist page table caches 2019-09-24 15:54:09 -07:00
mmu_notifier.c hmm related patches for 5.5 2019-11-30 10:33:14 -08:00
mmzone.c
mprotect.c autonuma: reduce cache footprint when scanning page tables 2019-12-01 12:59:09 -08:00
mremap.c mm/mmap.c: use IS_ERR_VALUE to check return value of get_unmapped_area 2019-12-01 06:29:19 -08:00
msync.c mm: untag user pointers passed to memory syscalls 2019-09-25 17:51:41 -07:00
nommu.c mm/mmap.c: rb_parent is not necessary in __vma_link_list() 2019-12-01 06:29:19 -08:00
oom_kill.c mm/oom: fix pgtables units mismatch in Killed process message 2020-01-04 13:55:09 -08:00
page_alloc.c mm, thp: tweak reclaim/compaction effort of local-only and all-node allocations 2020-01-13 18:19:01 -08:00
page_counter.c
page_ext.c mm, page_owner: fix off-by-one error in __set_page_owner_handle() 2019-10-14 15:04:00 -07:00
page_idle.c mm/page_idle.c: fix oops because end_pfn is larger than max_pfn 2019-06-29 16:43:45 +08:00
page_io.c mm/page_io.c: annotate refault stalls from swap_readpage 2019-12-01 12:59:11 -08:00
page_isolation.c mm/page_isolation.c: convert SKIP_HWPOISON to MEMORY_OFFLINE 2019-12-01 12:59:04 -08:00
page_owner.c mm/page_owner: don't access uninitialized memmaps when reading /proc/pagetypeinfo 2019-10-19 06:32:31 -04:00
page_poison.c mm/page_poison.c: fix a typo in a comment 2019-09-24 15:54:08 -07:00
page_vma_mapped.c mm: introduce page_size() 2019-09-24 15:54:08 -07:00
page-writeback.c writeback, memcg: Implement foreign dirty flushing 2019-08-27 09:22:38 -06:00
pagewalk.c mm: Add a walk_page_mapping() function to the pagewalk code 2019-11-06 13:02:43 +01:00
percpu-internal.h percpu: convert chunk hints to be based on pcpu_block_md 2019-03-13 12:25:31 -07:00
percpu-km.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 428 2019-06-05 17:37:16 +02:00
percpu-stats.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 428 2019-06-05 17:37:16 +02:00
percpu-vm.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 428 2019-06-05 17:37:16 +02:00
percpu.c percpu: Use struct_size() helper 2019-09-04 13:40:49 -07:00
pgtable-generic.c asm-generic/mm: stub out p{4,u}d_clear_bad() if __PAGETABLE_P{4,U}D_FOLDED 2019-12-01 06:29:19 -08:00
process_vm_access.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
readahead.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
rmap.c mm, thp: do not queue fully unmapped pages for deferred split 2019-12-01 12:59:09 -08:00
rodata_test.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 441 2019-06-05 17:37:17 +02:00
shmem.c mm/shmem.c: thp, shmem: fix conflict of above-47bit hint address and PMD alignment 2020-01-13 18:19:01 -08:00
shuffle.c mm: fix -Wmissing-prototypes warnings 2019-10-07 15:47:19 -07:00
shuffle.h mm: maintain randomization of page free lists 2019-05-14 19:52:48 -07:00
slab_common.c mm: memcg/slab: wait for !root kmem_cache refcnt killing on root kmem_cache destruction 2019-12-04 19:44:11 -08:00
slab.c mm, slab: remove unused kmalloc_size() 2019-12-01 06:29:17 -08:00
slab.h mm: clean up and clarify lruvec lookup procedure 2019-12-01 12:59:06 -08:00
slob.c mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two) 2019-10-07 15:47:20 -07:00
slub.c mm/slub.c: clean up validate_slab() 2019-12-01 06:29:18 -08:00
sparse-vmemmap.c mm/sparsemem: convert kmalloc_section_memmap() to populate_section_memmap() 2019-07-18 17:08:07 -07:00
sparse.c mm/memory_hotplug: don't free usage map when removing a re-added early section 2020-01-13 18:19:01 -08:00
swap_cgroup.c
swap_slots.c
swap_state.c mm: page cache: store only head pages in i_pages 2019-09-24 15:54:08 -07:00
swap.c mm/swap.c: piggyback lru_add_drain_all() calls 2019-12-01 06:29:19 -08:00
swapfile.c mm, swap: disallow swapon() on zoned block devices 2019-12-01 06:29:18 -08:00
truncate.c mm/thp: allow dropping THP from page cache 2019-10-19 06:32:33 -04:00
usercopy.c usercopy: Avoid HIGHMEM pfn warning 2019-09-17 15:20:17 -07:00
userfaultfd.c mm: fix typos in comments when calling __SetPageUptodate() 2019-12-01 12:59:10 -08:00
util.c mm/mmap.c: rb_parent is not necessary in __vma_link_list() 2019-12-01 06:29:19 -08:00
vmacache.c
vmalloc.c kasan: don't assume percpu shadow allocations will succeed 2019-12-17 20:59:59 -08:00
vmpressure.c mm/vmpressure.c: fix a signedness bug in vmpressure_register_event() 2019-10-07 15:47:19 -07:00
vmscan.c mm: vmscan: protect shrinker idr replace with CONFIG_MEMCG 2019-12-17 20:59:59 -08:00
vmstat.c mm/memcontrol: use vmstat names for printing statistics 2019-12-04 19:44:11 -08:00
workingset.c mm: vmscan: detect file thrashing at the reclaim root 2019-12-01 12:59:07 -08:00
z3fold.c mm/z3fold.c: add inter-page compaction 2019-12-01 12:59:07 -08:00
zbud.c treewide: Add SPDX license identifier for more missed files 2019-05-21 10:50:45 +02:00
zpool.c zpool: add malloc_support_movable to zpool_driver 2019-09-24 15:54:12 -07:00
zsmalloc.c mm/zsmalloc.c: fix the migrated zspage statistics. 2020-01-04 13:55:09 -08:00
zswap.c zswap: do not map same object twice 2019-09-24 15:54:12 -07:00