linux/mm/percpu-internal.h

289 lines
7.7 KiB
C
Raw Normal View History

License cleanup: add SPDX GPL-2.0 license identifier to files with no license Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 14:07:57 +00:00
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _MM_PERCPU_INTERNAL_H
#define _MM_PERCPU_INTERNAL_H
#include <linux/types.h>
#include <linux/percpu.h>
#include <linux/memcontrol.h>
/*
* pcpu_block_md is the metadata block struct.
* Each chunk's bitmap is split into a number of full blocks.
* All units are in terms of bits.
*
* The scan hint is the largest known contiguous area before the contig hint.
* It is not necessarily the actual largest contig hint though. There is an
* invariant that the scan_hint_start > contig_hint_start iff
* scan_hint == contig_hint. This is necessary because when scanning forward,
* we don't know if a new contig hint would be better than the current one.
*/
struct pcpu_block_md {
int scan_hint; /* scan hint for block */
int scan_hint_start; /* block relative starting
position of the scan hint */
int contig_hint; /* contig hint for block */
int contig_hint_start; /* block relative starting
position of the contig hint */
int left_free; /* size of free space along
the left side of the block */
int right_free; /* size of free space along
the right side of the block */
int first_free; /* block position of first free */
int nr_bits; /* total bits responsible for */
};
struct pcpuobj_ext {
#ifdef CONFIG_MEMCG
struct obj_cgroup *cgroup;
#endif
#ifdef CONFIG_MEM_ALLOC_PROFILING
union codetag_ref tag;
#endif
};
#if defined(CONFIG_MEMCG) || defined(CONFIG_MEM_ALLOC_PROFILING)
#define NEED_PCPUOBJ_EXT
#endif
struct pcpu_chunk {
#ifdef CONFIG_PERCPU_STATS
int nr_alloc; /* # of allocations */
size_t max_alloc_size; /* largest allocation size */
#endif
struct list_head list; /* linked to pcpu_slot lists */
percpu: replace area map allocator with bitmap The percpu memory allocator is experiencing scalability issues when allocating and freeing large numbers of counters as in BPF. Additionally, there is a corner case where iteration is triggered over all chunks if the contig_hint is the right size, but wrong alignment. This patch replaces the area map allocator with a basic bitmap allocator implementation. Each subsequent patch will introduce new features and replace full scanning functions with faster non-scanning options when possible. Implementation: This patchset removes the area map allocator in favor of a bitmap allocator backed by metadata blocks. The primary goal is to provide consistency in performance and memory footprint with a focus on small allocations (< 64 bytes). The bitmap removes the heavy memmove from the freeing critical path and provides a consistent memory footprint. The metadata blocks provide a bound on the amount of scanning required by maintaining a set of hints. In an effort to make freeing fast, the metadata is updated on the free path if the new free area makes a page free, a block free, or spans across blocks. This causes the chunk's contig hint to potentially be smaller than what it could allocate by up to the smaller of a page or a block. If the chunk's contig hint is contained within a block, a check occurs and the hint is kept accurate. Metadata is always kept accurate on allocation, so there will not be a situation where a chunk has a later contig hint than available. Evaluation: I have primarily done testing against a simple workload of allocation of 1 million objects (2^20) of varying size. Deallocation was done by in order, alternating, and in reverse. These numbers were collected after rebasing ontop of a80099a152. I present the worst-case numbers here: Area Map Allocator: Object Size | Alloc Time (ms) | Free Time (ms) ---------------------------------------------- 4B | 310 | 4770 16B | 557 | 1325 64B | 436 | 273 256B | 776 | 131 1024B | 3280 | 122 Bitmap Allocator: Object Size | Alloc Time (ms) | Free Time (ms) ---------------------------------------------- 4B | 490 | 70 16B | 515 | 75 64B | 610 | 80 256B | 950 | 100 1024B | 3520 | 200 This data demonstrates the inability for the area map allocator to handle less than ideal situations. In the best case of reverse deallocation, the area map allocator was able to perform within range of the bitmap allocator. In the worst case situation, freeing took nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator dramatically improves the consistency of the free path. The small allocations performed nearly identical regardless of the freeing pattern. While it does add to the allocation latency, the allocation scenario here is optimal for the area map allocator. The area map allocator runs into trouble when it is allocating in chunks where the latter half is full. It is difficult to replicate this, so I present a variant where the pages are second half filled. Freeing was done sequentially. Below are the numbers for this scenario: Area Map Allocator: Object Size | Alloc Time (ms) | Free Time (ms) ---------------------------------------------- 4B | 4118 | 4892 16B | 1651 | 1163 64B | 598 | 285 256B | 771 | 158 1024B | 3034 | 160 Bitmap Allocator: Object Size | Alloc Time (ms) | Free Time (ms) ---------------------------------------------- 4B | 481 | 67 16B | 506 | 69 64B | 636 | 75 256B | 892 | 90 1024B | 3262 | 147 The data shows a parabolic curve of performance for the area map allocator. This is due to the memmove operation being the dominant cost with the lower object sizes as more objects are packed in a chunk and at higher object sizes, the traversal of the chunk slots is the dominating cost. The bitmap allocator suffers this problem as well. The above data shows the inability to scale for the allocation path with the area map allocator and that the bitmap allocator demonstrates consistent performance in general. The second problem of additional scanning can result in the area map allocator completing in 52 minutes when trying to allocate 1 million 4-byte objects with 8-byte alignment. The same workload takes approximately 16 seconds to complete for the bitmap allocator. V2: Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap using bytes instead of bits. Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-12 18:27:32 +00:00
int free_bytes; /* free bytes in the chunk */
struct pcpu_block_md chunk_md;
percpu-internal/pcpu_chunk: re-layout pcpu_chunk structure to reduce false sharing When running UnixBench/Execl throughput case, false sharing is observed due to frequent read on base_addr and write on free_bytes, chunk_md. UnixBench/Execl represents a class of workload where bash scripts are spawned frequently to do some short jobs. It will do system call on execl frequently, and execl will call mm_init to initialize mm_struct of the process. mm_init will call __percpu_counter_init for percpu_counters initialization. Then pcpu_alloc is called to read the base_addr of pcpu_chunk for memory allocation. Inside pcpu_alloc, it will call pcpu_alloc_area to allocate memory from a specified chunk. This function will update "free_bytes" and "chunk_md" to record the rest free bytes and other meta data for this chunk. Correspondingly, pcpu_free_area will also update these 2 members when free memory. Call trace from perf is as below: + 57.15% 0.01% execl [kernel.kallsyms] [k] __percpu_counter_init + 57.13% 0.91% execl [kernel.kallsyms] [k] pcpu_alloc - 55.27% 54.51% execl [kernel.kallsyms] [k] osq_lock - 53.54% 0x654278696e552f34 main __execve entry_SYSCALL_64_after_hwframe do_syscall_64 __x64_sys_execve do_execveat_common.isra.47 alloc_bprm mm_init __percpu_counter_init pcpu_alloc - __mutex_lock.isra.17 In current pcpu_chunk layout, `base_addr' is in the same cache line with `free_bytes' and `chunk_md', and `base_addr' is at the last 8 bytes. This patch moves `bound_map' up to `base_addr', to let `base_addr' locate in a new cacheline. With this change, on Intel Sapphire Rapids 112C/224T platform, based on v6.4-rc4, the 160 parallel score improves by 24%. The pcpu_chunk struct is a backing data structure per chunk, so the additional memory should not be dramatic. A chunk covers ballpark between 64kb and 512kb memory depending on some config and boot time stuff, so I believe the additional memory used here is nominal at best. Working the #s on my desktop: Percpu: 58624 kB 28 cores -> ~2.1MB of percpu memory. At say ~128KB per chunk -> 33 chunks, generously 40 chunks. Adding alignment might bump the chunk size ~64 bytes, so in total ~2KB of overhead? I believe we can do a little better to avoid eating that full padding, so likely less than that. [dennis@kernel.org: changelog details] Link: https://lkml.kernel.org/r/20230610030730.110074-1-yu.ma@intel.com Signed-off-by: Yu Ma <yu.ma@intel.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10 03:07:30 +00:00
unsigned long *bound_map; /* boundary map */
/*
* base_addr is the base address of this chunk.
* To reduce false sharing, current layout is optimized to make sure
* base_addr locate in the different cacheline with free_bytes and
* chunk_md.
*/
void *base_addr ____cacheline_aligned_in_smp;
percpu: replace area map allocator with bitmap The percpu memory allocator is experiencing scalability issues when allocating and freeing large numbers of counters as in BPF. Additionally, there is a corner case where iteration is triggered over all chunks if the contig_hint is the right size, but wrong alignment. This patch replaces the area map allocator with a basic bitmap allocator implementation. Each subsequent patch will introduce new features and replace full scanning functions with faster non-scanning options when possible. Implementation: This patchset removes the area map allocator in favor of a bitmap allocator backed by metadata blocks. The primary goal is to provide consistency in performance and memory footprint with a focus on small allocations (< 64 bytes). The bitmap removes the heavy memmove from the freeing critical path and provides a consistent memory footprint. The metadata blocks provide a bound on the amount of scanning required by maintaining a set of hints. In an effort to make freeing fast, the metadata is updated on the free path if the new free area makes a page free, a block free, or spans across blocks. This causes the chunk's contig hint to potentially be smaller than what it could allocate by up to the smaller of a page or a block. If the chunk's contig hint is contained within a block, a check occurs and the hint is kept accurate. Metadata is always kept accurate on allocation, so there will not be a situation where a chunk has a later contig hint than available. Evaluation: I have primarily done testing against a simple workload of allocation of 1 million objects (2^20) of varying size. Deallocation was done by in order, alternating, and in reverse. These numbers were collected after rebasing ontop of a80099a152. I present the worst-case numbers here: Area Map Allocator: Object Size | Alloc Time (ms) | Free Time (ms) ---------------------------------------------- 4B | 310 | 4770 16B | 557 | 1325 64B | 436 | 273 256B | 776 | 131 1024B | 3280 | 122 Bitmap Allocator: Object Size | Alloc Time (ms) | Free Time (ms) ---------------------------------------------- 4B | 490 | 70 16B | 515 | 75 64B | 610 | 80 256B | 950 | 100 1024B | 3520 | 200 This data demonstrates the inability for the area map allocator to handle less than ideal situations. In the best case of reverse deallocation, the area map allocator was able to perform within range of the bitmap allocator. In the worst case situation, freeing took nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator dramatically improves the consistency of the free path. The small allocations performed nearly identical regardless of the freeing pattern. While it does add to the allocation latency, the allocation scenario here is optimal for the area map allocator. The area map allocator runs into trouble when it is allocating in chunks where the latter half is full. It is difficult to replicate this, so I present a variant where the pages are second half filled. Freeing was done sequentially. Below are the numbers for this scenario: Area Map Allocator: Object Size | Alloc Time (ms) | Free Time (ms) ---------------------------------------------- 4B | 4118 | 4892 16B | 1651 | 1163 64B | 598 | 285 256B | 771 | 158 1024B | 3034 | 160 Bitmap Allocator: Object Size | Alloc Time (ms) | Free Time (ms) ---------------------------------------------- 4B | 481 | 67 16B | 506 | 69 64B | 636 | 75 256B | 892 | 90 1024B | 3262 | 147 The data shows a parabolic curve of performance for the area map allocator. This is due to the memmove operation being the dominant cost with the lower object sizes as more objects are packed in a chunk and at higher object sizes, the traversal of the chunk slots is the dominating cost. The bitmap allocator suffers this problem as well. The above data shows the inability to scale for the allocation path with the area map allocator and that the bitmap allocator demonstrates consistent performance in general. The second problem of additional scanning can result in the area map allocator completing in 52 minutes when trying to allocate 1 million 4-byte objects with 8-byte alignment. The same workload takes approximately 16 seconds to complete for the bitmap allocator. V2: Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap using bytes instead of bits. Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-12 18:27:32 +00:00
unsigned long *alloc_map; /* allocation map */
struct pcpu_block_md *md_blocks; /* metadata blocks */
void *data; /* chunk data */
bool immutable; /* no [de]population allowed */
percpu: implement partial chunk depopulation From Roman ("percpu: partial chunk depopulation"): In our [Facebook] production experience the percpu memory allocator is sometimes struggling with returning the memory to the system. A typical example is a creation of several thousands memory cgroups (each has several chunks of the percpu data used for vmstats, vmevents, ref counters etc). Deletion and complete releasing of these cgroups doesn't always lead to a shrinkage of the percpu memory, so that sometimes there are several GB's of memory wasted. The underlying problem is the fragmentation: to release an underlying chunk all percpu allocations should be released first. The percpu allocator tends to top up chunks to improve the utilization. It means new small-ish allocations (e.g. percpu ref counters) are placed onto almost filled old-ish chunks, effectively pinning them in memory. This patchset solves this problem by implementing a partial depopulation of percpu chunks: chunks with many empty pages are being asynchronously depopulated and the pages are returned to the system. To illustrate the problem the following script can be used: -- cd /sys/fs/cgroup mkdir percpu_test echo "+memory" > percpu_test/cgroup.subtree_control cat /proc/meminfo | grep Percpu for i in `seq 1 1000`; do mkdir percpu_test/cg_"${i}" for j in `seq 1 10`; do mkdir percpu_test/cg_"${i}"_"${j}" done done cat /proc/meminfo | grep Percpu for i in `seq 1 1000`; do for j in `seq 1 10`; do rmdir percpu_test/cg_"${i}"_"${j}" done done sleep 10 cat /proc/meminfo | grep Percpu for i in `seq 1 1000`; do rmdir percpu_test/cg_"${i}" done rmdir percpu_test -- It creates 11000 memory cgroups and removes every 10 out of 11. It prints the initial size of the percpu memory, the size after creating all cgroups and the size after deleting most of them. Results: vanilla: ./percpu_test.sh Percpu: 7488 kB Percpu: 481152 kB Percpu: 481152 kB with this patchset applied: ./percpu_test.sh Percpu: 7488 kB Percpu: 481408 kB Percpu: 135552 kB The total size of the percpu memory was reduced by more than 3.5 times. This patch: This patch implements partial depopulation of percpu chunks. As of now, a chunk can be depopulated only as a part of the final destruction, if there are no more outstanding allocations. However to minimize a memory waste it might be useful to depopulate a partially filed chunk, if a small number of outstanding allocations prevents the chunk from being fully reclaimed. This patch implements the following depopulation process: it scans over the chunk pages, looks for a range of empty and populated pages and performs the depopulation. To avoid races with new allocations, the chunk is previously isolated. After the depopulation the chunk is sidelined to a special list or freed. New allocations prefer using active chunks to sidelined chunks. If a sidelined chunk is used, it is reintegrated to the active lists. The depopulation is scheduled on the free path if the chunk is all of the following: 1) has more than 1/4 of total pages free and populated 2) the system has enough free percpu pages aside of this chunk 3) isn't the reserved chunk 4) isn't the first chunk If it's already depopulated but got free populated pages, it's a good target too. The chunk is moved to a special slot, pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work item is scheduled. On isolation, these pages are removed from the pcpu_nr_empty_pop_pages. It is constantly replaced to the to_depopulate_slot when it meets these qualifications. pcpu_reclaim_populated() iterates over the to_depopulate_slot until it becomes empty. The depopulation is performed in the reverse direction to keep populated pages close to the beginning. Depopulated chunks are sidelined to preferentially avoid them for new allocations. When no active chunk can suffice a new allocation, sidelined chunks are first checked before creating a new chunk. Signed-off-by: Roman Gushchin <guro@fb.com> Co-developed-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Dennis Zhou <dennis@kernel.org> Tested-by: Pratik Sampat <psampat@linux.ibm.com> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 03:57:36 +00:00
bool isolated; /* isolated from active chunk
slots */
int start_offset; /* the overlap with the previous
region to have a page aligned
base_addr */
int end_offset; /* additional area required to
have the region end page
aligned */
#ifdef NEED_PCPUOBJ_EXT
struct pcpuobj_ext *obj_exts; /* vector of object cgroups */
mm: memcg/percpu: account percpu memory to memory cgroups Percpu memory is becoming more and more widely used by various subsystems, and the total amount of memory controlled by the percpu allocator can make a good part of the total memory. As an example, bpf maps can consume a lot of percpu memory, and they are created by a user. Also, some cgroup internals (e.g. memory controller statistics) can be quite large. On a machine with many CPUs and big number of cgroups they can consume hundreds of megabytes. So the lack of memcg accounting is creating a breach in the memory isolation. Similar to the slab memory, percpu memory should be accounted by default. To implement the perpcu accounting it's possible to take the slab memory accounting as a model to follow. Let's introduce two types of percpu chunks: root and memcg. What makes memcg chunks different is an additional space allocated to store memcg membership information. If __GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used. If it's possible to charge the corresponding size to the target memory cgroup, allocation is performed, and the memcg ownership data is recorded. System-wide allocations are performed using root chunks, so there is no additional memory overhead. To implement a fast reparenting of percpu memory on memcg removal, we don't store mem_cgroup pointers directly: instead we use obj_cgroup API, introduced for slab accounting. [akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning] [akpm@linux-foundation.org: move unreachable code, per Roman] [cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning] Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Bixuan Cui <cuibixuan@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Tobin C. Harding <tobin@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: Bixuan Cui <cuibixuan@huawei.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 01:30:17 +00:00
#endif
int nr_pages; /* # of pages served by this chunk */
int nr_populated; /* # of populated pages */
int nr_empty_pop_pages; /* # of empty populated pages */
unsigned long populated[]; /* populated bitmap */
};
static inline bool need_pcpuobj_ext(void)
{
if (IS_ENABLED(CONFIG_MEM_ALLOC_PROFILING))
return true;
if (!mem_cgroup_kmem_disabled())
return true;
return false;
}
extern spinlock_t pcpu_lock;
mm: memcg/percpu: account percpu memory to memory cgroups Percpu memory is becoming more and more widely used by various subsystems, and the total amount of memory controlled by the percpu allocator can make a good part of the total memory. As an example, bpf maps can consume a lot of percpu memory, and they are created by a user. Also, some cgroup internals (e.g. memory controller statistics) can be quite large. On a machine with many CPUs and big number of cgroups they can consume hundreds of megabytes. So the lack of memcg accounting is creating a breach in the memory isolation. Similar to the slab memory, percpu memory should be accounted by default. To implement the perpcu accounting it's possible to take the slab memory accounting as a model to follow. Let's introduce two types of percpu chunks: root and memcg. What makes memcg chunks different is an additional space allocated to store memcg membership information. If __GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used. If it's possible to charge the corresponding size to the target memory cgroup, allocation is performed, and the memcg ownership data is recorded. System-wide allocations are performed using root chunks, so there is no additional memory overhead. To implement a fast reparenting of percpu memory on memcg removal, we don't store mem_cgroup pointers directly: instead we use obj_cgroup API, introduced for slab accounting. [akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning] [akpm@linux-foundation.org: move unreachable code, per Roman] [cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning] Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Bixuan Cui <cuibixuan@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Tobin C. Harding <tobin@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: Bixuan Cui <cuibixuan@huawei.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 01:30:17 +00:00
extern struct list_head *pcpu_chunk_lists;
extern int pcpu_nr_slots;
percpu: implement partial chunk depopulation From Roman ("percpu: partial chunk depopulation"): In our [Facebook] production experience the percpu memory allocator is sometimes struggling with returning the memory to the system. A typical example is a creation of several thousands memory cgroups (each has several chunks of the percpu data used for vmstats, vmevents, ref counters etc). Deletion and complete releasing of these cgroups doesn't always lead to a shrinkage of the percpu memory, so that sometimes there are several GB's of memory wasted. The underlying problem is the fragmentation: to release an underlying chunk all percpu allocations should be released first. The percpu allocator tends to top up chunks to improve the utilization. It means new small-ish allocations (e.g. percpu ref counters) are placed onto almost filled old-ish chunks, effectively pinning them in memory. This patchset solves this problem by implementing a partial depopulation of percpu chunks: chunks with many empty pages are being asynchronously depopulated and the pages are returned to the system. To illustrate the problem the following script can be used: -- cd /sys/fs/cgroup mkdir percpu_test echo "+memory" > percpu_test/cgroup.subtree_control cat /proc/meminfo | grep Percpu for i in `seq 1 1000`; do mkdir percpu_test/cg_"${i}" for j in `seq 1 10`; do mkdir percpu_test/cg_"${i}"_"${j}" done done cat /proc/meminfo | grep Percpu for i in `seq 1 1000`; do for j in `seq 1 10`; do rmdir percpu_test/cg_"${i}"_"${j}" done done sleep 10 cat /proc/meminfo | grep Percpu for i in `seq 1 1000`; do rmdir percpu_test/cg_"${i}" done rmdir percpu_test -- It creates 11000 memory cgroups and removes every 10 out of 11. It prints the initial size of the percpu memory, the size after creating all cgroups and the size after deleting most of them. Results: vanilla: ./percpu_test.sh Percpu: 7488 kB Percpu: 481152 kB Percpu: 481152 kB with this patchset applied: ./percpu_test.sh Percpu: 7488 kB Percpu: 481408 kB Percpu: 135552 kB The total size of the percpu memory was reduced by more than 3.5 times. This patch: This patch implements partial depopulation of percpu chunks. As of now, a chunk can be depopulated only as a part of the final destruction, if there are no more outstanding allocations. However to minimize a memory waste it might be useful to depopulate a partially filed chunk, if a small number of outstanding allocations prevents the chunk from being fully reclaimed. This patch implements the following depopulation process: it scans over the chunk pages, looks for a range of empty and populated pages and performs the depopulation. To avoid races with new allocations, the chunk is previously isolated. After the depopulation the chunk is sidelined to a special list or freed. New allocations prefer using active chunks to sidelined chunks. If a sidelined chunk is used, it is reintegrated to the active lists. The depopulation is scheduled on the free path if the chunk is all of the following: 1) has more than 1/4 of total pages free and populated 2) the system has enough free percpu pages aside of this chunk 3) isn't the reserved chunk 4) isn't the first chunk If it's already depopulated but got free populated pages, it's a good target too. The chunk is moved to a special slot, pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work item is scheduled. On isolation, these pages are removed from the pcpu_nr_empty_pop_pages. It is constantly replaced to the to_depopulate_slot when it meets these qualifications. pcpu_reclaim_populated() iterates over the to_depopulate_slot until it becomes empty. The depopulation is performed in the reverse direction to keep populated pages close to the beginning. Depopulated chunks are sidelined to preferentially avoid them for new allocations. When no active chunk can suffice a new allocation, sidelined chunks are first checked before creating a new chunk. Signed-off-by: Roman Gushchin <guro@fb.com> Co-developed-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Dennis Zhou <dennis@kernel.org> Tested-by: Pratik Sampat <psampat@linux.ibm.com> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 03:57:36 +00:00
extern int pcpu_sidelined_slot;
extern int pcpu_to_depopulate_slot;
extern int pcpu_nr_empty_pop_pages;
extern struct pcpu_chunk *pcpu_first_chunk;
extern struct pcpu_chunk *pcpu_reserved_chunk;
/**
* pcpu_chunk_nr_blocks - converts nr_pages to # of md_blocks
* @chunk: chunk of interest
*
* This conversion is from the number of physical pages that the chunk
* serves to the number of bitmap blocks used.
*/
static inline int pcpu_chunk_nr_blocks(struct pcpu_chunk *chunk)
{
return chunk->nr_pages * PAGE_SIZE / PCPU_BITMAP_BLOCK_SIZE;
}
percpu: replace area map allocator with bitmap The percpu memory allocator is experiencing scalability issues when allocating and freeing large numbers of counters as in BPF. Additionally, there is a corner case where iteration is triggered over all chunks if the contig_hint is the right size, but wrong alignment. This patch replaces the area map allocator with a basic bitmap allocator implementation. Each subsequent patch will introduce new features and replace full scanning functions with faster non-scanning options when possible. Implementation: This patchset removes the area map allocator in favor of a bitmap allocator backed by metadata blocks. The primary goal is to provide consistency in performance and memory footprint with a focus on small allocations (< 64 bytes). The bitmap removes the heavy memmove from the freeing critical path and provides a consistent memory footprint. The metadata blocks provide a bound on the amount of scanning required by maintaining a set of hints. In an effort to make freeing fast, the metadata is updated on the free path if the new free area makes a page free, a block free, or spans across blocks. This causes the chunk's contig hint to potentially be smaller than what it could allocate by up to the smaller of a page or a block. If the chunk's contig hint is contained within a block, a check occurs and the hint is kept accurate. Metadata is always kept accurate on allocation, so there will not be a situation where a chunk has a later contig hint than available. Evaluation: I have primarily done testing against a simple workload of allocation of 1 million objects (2^20) of varying size. Deallocation was done by in order, alternating, and in reverse. These numbers were collected after rebasing ontop of a80099a152. I present the worst-case numbers here: Area Map Allocator: Object Size | Alloc Time (ms) | Free Time (ms) ---------------------------------------------- 4B | 310 | 4770 16B | 557 | 1325 64B | 436 | 273 256B | 776 | 131 1024B | 3280 | 122 Bitmap Allocator: Object Size | Alloc Time (ms) | Free Time (ms) ---------------------------------------------- 4B | 490 | 70 16B | 515 | 75 64B | 610 | 80 256B | 950 | 100 1024B | 3520 | 200 This data demonstrates the inability for the area map allocator to handle less than ideal situations. In the best case of reverse deallocation, the area map allocator was able to perform within range of the bitmap allocator. In the worst case situation, freeing took nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator dramatically improves the consistency of the free path. The small allocations performed nearly identical regardless of the freeing pattern. While it does add to the allocation latency, the allocation scenario here is optimal for the area map allocator. The area map allocator runs into trouble when it is allocating in chunks where the latter half is full. It is difficult to replicate this, so I present a variant where the pages are second half filled. Freeing was done sequentially. Below are the numbers for this scenario: Area Map Allocator: Object Size | Alloc Time (ms) | Free Time (ms) ---------------------------------------------- 4B | 4118 | 4892 16B | 1651 | 1163 64B | 598 | 285 256B | 771 | 158 1024B | 3034 | 160 Bitmap Allocator: Object Size | Alloc Time (ms) | Free Time (ms) ---------------------------------------------- 4B | 481 | 67 16B | 506 | 69 64B | 636 | 75 256B | 892 | 90 1024B | 3262 | 147 The data shows a parabolic curve of performance for the area map allocator. This is due to the memmove operation being the dominant cost with the lower object sizes as more objects are packed in a chunk and at higher object sizes, the traversal of the chunk slots is the dominating cost. The bitmap allocator suffers this problem as well. The above data shows the inability to scale for the allocation path with the area map allocator and that the bitmap allocator demonstrates consistent performance in general. The second problem of additional scanning can result in the area map allocator completing in 52 minutes when trying to allocate 1 million 4-byte objects with 8-byte alignment. The same workload takes approximately 16 seconds to complete for the bitmap allocator. V2: Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap using bytes instead of bits. Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-12 18:27:32 +00:00
/**
* pcpu_nr_pages_to_map_bits - converts the pages to size of bitmap
* @pages: number of physical pages
*
* This conversion is from physical pages to the number of bits
* required in the bitmap.
*/
static inline int pcpu_nr_pages_to_map_bits(int pages)
{
return pages * PAGE_SIZE / PCPU_MIN_ALLOC_SIZE;
}
/**
* pcpu_chunk_map_bits - helper to convert nr_pages to size of bitmap
* @chunk: chunk of interest
*
* This conversion is from the number of physical pages that the chunk
* serves to the number of bits in the bitmap.
*/
static inline int pcpu_chunk_map_bits(struct pcpu_chunk *chunk)
{
return pcpu_nr_pages_to_map_bits(chunk->nr_pages);
}
/**
* pcpu_obj_full_size - helper to calculate size of each accounted object
* @size: size of area to allocate in bytes
*
* For each accounted object there is an extra space which is used to store
* obj_cgroup membership if kmemcg is not disabled. Charge it too.
*/
static inline size_t pcpu_obj_full_size(size_t size)
{
size_t extra_size = 0;
#ifdef CONFIG_MEMCG
if (!mem_cgroup_kmem_disabled())
extra_size += size / PCPU_MIN_ALLOC_SIZE * sizeof(struct obj_cgroup *);
#endif
return size * num_possible_cpus() + extra_size;
}
#ifdef CONFIG_PERCPU_STATS
#include <linux/spinlock.h>
struct percpu_stats {
u64 nr_alloc; /* lifetime # of allocations */
u64 nr_dealloc; /* lifetime # of deallocations */
u64 nr_cur_alloc; /* current # of allocations */
u64 nr_max_alloc; /* max # of live allocations */
u32 nr_chunks; /* current # of live chunks */
u32 nr_max_chunks; /* max # of live chunks */
size_t min_alloc_size; /* min allocation size */
size_t max_alloc_size; /* max allocation size */
};
extern struct percpu_stats pcpu_stats;
extern struct pcpu_alloc_info pcpu_stats_ai;
/*
* For debug purposes. We don't care about the flexible array.
*/
static inline void pcpu_stats_save_ai(const struct pcpu_alloc_info *ai)
{
memcpy(&pcpu_stats_ai, ai, sizeof(struct pcpu_alloc_info));
/* initialize min_alloc_size to unit_size */
pcpu_stats.min_alloc_size = pcpu_stats_ai.unit_size;
}
/*
* pcpu_stats_area_alloc - increment area allocation stats
* @chunk: the location of the area being allocated
* @size: size of area to allocate in bytes
*
* CONTEXT:
* pcpu_lock.
*/
static inline void pcpu_stats_area_alloc(struct pcpu_chunk *chunk, size_t size)
{
lockdep_assert_held(&pcpu_lock);
pcpu_stats.nr_alloc++;
pcpu_stats.nr_cur_alloc++;
pcpu_stats.nr_max_alloc =
max(pcpu_stats.nr_max_alloc, pcpu_stats.nr_cur_alloc);
pcpu_stats.min_alloc_size =
min(pcpu_stats.min_alloc_size, size);
pcpu_stats.max_alloc_size =
max(pcpu_stats.max_alloc_size, size);
chunk->nr_alloc++;
chunk->max_alloc_size = max(chunk->max_alloc_size, size);
}
/*
* pcpu_stats_area_dealloc - decrement allocation stats
* @chunk: the location of the area being deallocated
*
* CONTEXT:
* pcpu_lock.
*/
static inline void pcpu_stats_area_dealloc(struct pcpu_chunk *chunk)
{
lockdep_assert_held(&pcpu_lock);
pcpu_stats.nr_dealloc++;
pcpu_stats.nr_cur_alloc--;
chunk->nr_alloc--;
}
/*
* pcpu_stats_chunk_alloc - increment chunk stats
*/
static inline void pcpu_stats_chunk_alloc(void)
{
unsigned long flags;
spin_lock_irqsave(&pcpu_lock, flags);
pcpu_stats.nr_chunks++;
pcpu_stats.nr_max_chunks =
max(pcpu_stats.nr_max_chunks, pcpu_stats.nr_chunks);
spin_unlock_irqrestore(&pcpu_lock, flags);
}
/*
* pcpu_stats_chunk_dealloc - decrement chunk stats
*/
static inline void pcpu_stats_chunk_dealloc(void)
{
unsigned long flags;
spin_lock_irqsave(&pcpu_lock, flags);
pcpu_stats.nr_chunks--;
spin_unlock_irqrestore(&pcpu_lock, flags);
}
#else
static inline void pcpu_stats_save_ai(const struct pcpu_alloc_info *ai)
{
}
static inline void pcpu_stats_area_alloc(struct pcpu_chunk *chunk, size_t size)
{
}
static inline void pcpu_stats_area_dealloc(struct pcpu_chunk *chunk)
{
}
static inline void pcpu_stats_chunk_alloc(void)
{
}
static inline void pcpu_stats_chunk_dealloc(void)
{
}
#endif /* !CONFIG_PERCPU_STATS */
#endif