mm/lru: revise the comments of lru_lock

Since we changed the pgdat->lru_lock to lruvec->lru_lock, it's time to fix
the incorrect comments in the code. Also fix some ancient zone->lru_lock
comment errors.

I struggled to understand the comment above move_pages_to_lru() (surely
it never calls page_referenced()), and eventually realized that most of
it had got separated from shrink_active_list(): move that comment back.

Link: https://lkml.kernel.org/r/1604566549-62481-20-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Jann Horn <jannh@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Chen, Rong A" <rong.a.chen@intel.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mika Penttilä <mika.penttila@nextfour.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit 15b4473617
parent 2a5e4e340b

@@ -133,18 +133,9 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 
 8. LRU
 ======
-Each memcg has its own private LRU. Now, its handling is under global
-VM's control (means that it's handled under global pgdat->lru_lock).
-Almost all routines around memcg's LRU is called by global LRU's
-list management functions under pgdat->lru_lock.
-
-A special function is mem_cgroup_isolate_pages(). This scans
-memcg's private LRU and call __isolate_lru_page() to extract a page
-from LRU.
-
-(By __isolate_lru_page(), the page is removed from both of global and
-private LRU.)
-
+Each memcg has its own vector of LRUs (inactive anon, active anon,
+inactive file, active file, unevictable) of pages from each node,
+each LRU handled under a single lru_lock for that memcg and node.
 
 9. Typical Tests.
 =================

@@ -287,20 +287,17 @@ When oom event notifier is registered, event will be delivered.
 2.6 Locking
 -----------
 
-lock_page_cgroup()/unlock_page_cgroup() should not be called under
-the i_pages lock.
+Lock order is as follows:
 
-Other lock order is following:
+Page lock (PG_locked bit of page->flags)
+mm->page_table_lock or split pte_lock
+lock_page_memcg (memcg->move_lock)
+mapping->i_pages lock
+lruvec->lru_lock.
 
-PG_locked.
-mm->page_table_lock
-pgdat->lru_lock
-lock_page_cgroup.
-
-In many cases, just lock_page_cgroup() is called.
-
-per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
-pgdat->lru_lock, it has no lock of its own.
+Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by
+lruvec->lru_lock; PG_lru bit of page->flags is cleared before
+isolating a page from its LRU under lruvec->lru_lock.
 
 2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
 -----------------------------------------------

@@ -69,7 +69,7 @@ When pages are freed in batch, the also mm_page_free_batched is triggered.
 Broadly speaking, pages are taken off the LRU lock in bulk and
 freed in batch with a page list. Significant amounts of activity here could
 indicate that the system is under memory pressure and can also indicate
-contention on the zone->lru_lock.
+contention on the lruvec->lru_lock.
 
 4. Per-CPU Allocator Activity
 =============================

@@ -33,7 +33,7 @@ reclaim in Linux. The problems have been observed at customer sites on large
 memory x86_64 systems.
 
 To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of
-main memory will have over 32 million 4k pages in a single zone. When a large
+main memory will have over 32 million 4k pages in a single node. When a large
 fraction of these pages are not evictable for any reason [see below], vmscan
 will spend a lot of time scanning the LRU lists looking for the small fraction
 of pages that are evictable. This can result in a situation where all CPUs are
@@ -55,7 +55,7 @@ unevictable, either by definition or by circumstance, in the future.
 The Unevictable Page List
 -------------------------
 
-The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list
+The Unevictable LRU infrastructure consists of an additional, per-node, LRU list
 called the "unevictable" list and an associated page flag, PG_unevictable, to
 indicate that the page is being managed on the unevictable list.
 
@@ -84,15 +84,9 @@ The unevictable list does not differentiate between file-backed and anonymous,
 swap-backed pages. This differentiation is only important while the pages are,
 in fact, evictable.
 
-The unevictable list benefits from the "arrayification" of the per-zone LRU
+The unevictable list benefits from the "arrayification" of the per-node LRU
 lists and statistics originally proposed and posted by Christoph Lameter.
 
-The unevictable list does not use the LRU pagevec mechanism. Rather,
-unevictable pages are placed directly on the page's zone's unevictable list
-under the zone lru_lock. This allows us to prevent the stranding of pages on
-the unevictable list when one task has the page isolated from the LRU and other
-tasks are changing the "evictability" state of the page.
-
 
 Memory Control Group Interaction
 --------------------------------
@@ -101,8 +95,8 @@ The unevictable LRU facility interacts with the memory control group [aka
 memory controller; see Documentation/admin-guide/cgroup-v1/memory.rst] by extending the
 lru_list enum.
 
-The memory controller data structure automatically gets a per-zone unevictable
-list as a result of the "arrayification" of the per-zone LRU lists (one per
+The memory controller data structure automatically gets a per-node unevictable
+list as a result of the "arrayification" of the per-node LRU lists (one per
 lru_list enum element). The memory controller tracks the movement of pages to
 and from the unevictable list.
 
@@ -196,7 +190,7 @@ for the sake of expediency, to leave a unevictable page on one of the regular
 active/inactive LRU lists for vmscan to deal with. vmscan checks for such
 pages in all of the shrink_{active|inactive|page}_list() functions and will
 "cull" such pages that it encounters: that is, it diverts those pages to the
-unevictable list for the zone being scanned.
+unevictable list for the node being scanned.
 
 There may be situations where a page is mapped into a VM_LOCKED VMA, but the
 page is not marked as PG_mlocked. Such pages will make it all the way to
@@ -328,7 +322,7 @@ If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the
 page from the LRU, as it is likely on the appropriate active or inactive list
 at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will put
 back the page - by calling putback_lru_page() - which will notice that the page
-is now mlocked and divert the page to the zone's unevictable list. If
+is now mlocked and divert the page to the node's unevictable list. If
 mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
 it later if and when it attempts to reclaim the page.
 
@@ -603,7 +597,7 @@ Some examples of these unevictable pages on the LRU lists are:
 unevictable list in mlock_vma_page().
 
 shrink_inactive_list() also diverts any unevictable pages that it finds on the
-inactive lists to the appropriate zone's unevictable list.
+inactive lists to the appropriate node's unevictable list.
 
 shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
 after shrink_active_list() had moved them to the inactive list, or pages mapped

@@ -79,7 +79,7 @@ struct page {
 struct { /* Page cache and anonymous pages */
 /**
  * @lru: Pageout list, eg. active_list protected by
- * pgdat->lru_lock. Sometimes used as a generic list
+ * lruvec->lru_lock. Sometimes used as a generic list
  * by the page owner.
  */
 struct list_head lru;

@@ -113,8 +113,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
 struct pglist_data;
 
 /*
- * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
- * So add a wild amount of padding here to ensure that they fall into separate
+ * Add a wild amount of padding here to ensure datas fall into separate
  * cachelines. There are very few zone structures in the machine, so space
  * consumption is not a concern here.
  */

@@ -102,8 +102,8 @@
  * ->swap_lock (try_to_unmap_one)
  * ->private_lock (try_to_unmap_one)
  * ->i_pages lock (try_to_unmap_one)
- * ->pgdat->lru_lock (follow_page->mark_page_accessed)
- * ->pgdat->lru_lock (check_pte_range->isolate_lru_page)
+ * ->lruvec->lru_lock (follow_page->mark_page_accessed)
+ * ->lruvec->lru_lock (check_pte_range->isolate_lru_page)
  * ->private_lock (page_remove_rmap->set_page_dirty)
  * ->i_pages lock (page_remove_rmap->set_page_dirty)
  * bdi.wb->list_lock (page_remove_rmap->set_page_dirty)

@@ -28,12 +28,12 @@
  * hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
  * anon_vma->rwsem
  * mm->page_table_lock or pte_lock
- * pgdat->lru_lock (in mark_page_accessed, isolate_lru_page)
  * swap_lock (in swap_duplicate, swap_info_get)
  * mmlist_lock (in mmput, drain_mmlist and others)
  * mapping->private_lock (in __set_page_dirty_buffers)
- * mem_cgroup_{begin,end}_page_stat (memcg->move_lock)
+ * lock_page_memcg move_lock (in __set_page_dirty_buffers)
  * i_pages lock (widely used)
+ * lruvec->lru_lock (in lock_page_lruvec_irq)
  * inode->i_lock (in set_page_dirty's __mark_inode_dirty)
  * bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
  * sb_lock (within inode_lock in fs/fs-writeback.c)

mm/vmscan.c | 41
@@ -1613,14 +1613,16 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec,
 }
 
 /**
- * pgdat->lru_lock is heavily contended. Some of the functions that
+ * Isolating page from the lruvec to fill in @dst list by nr_to_scan times.
+ *
+ * lruvec->lru_lock is heavily contended. Some of the functions that
  * shrink the lists perform better by taking out a batch of pages
  * and working on them outside the LRU lock.
  *
  * For pagecache intensive workloads, this function is the hottest
  * spot in the kernel (apart from copy_*_user functions).
  *
- * Appropriate locks must be held before calling this function.
+ * Lru_lock must be held before calling this function.
  *
  * @nr_to_scan: The number of eligible pages to look through on the list.
  * @lruvec: The LRU vector to pull pages from.
@@ -1814,25 +1816,11 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 }
 
 /*
- * This moves pages from @list to corresponding LRU list.
- *
- * We move them the other way if the page is referenced by one or more
- * processes, from rmap.
- *
- * If the pages are mostly unmapped, the processing is fast and it is
- * appropriate to hold zone_lru_lock across the whole operation. But if
- * the pages are mapped, the processing is slow (page_referenced()) so we
- * should drop zone_lru_lock around each page. It's impossible to balance
- * this, so instead we remove the pages from the LRU while processing them.
- * It is safe to rely on PG_active against the non-LRU pages in here because
- * nobody will play with that bit on a non-LRU page.
- *
- * The downside is that we have to touch page->_refcount against each page.
- * But we had to alter page->flags anyway.
+ * move_pages_to_lru() moves pages from private @list to appropriate LRU list.
+ * On return, @list is reused as a list of pages to be freed by the caller.
  *
  * Returns the number of pages moved to the given lruvec.
  */
-
 static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 	struct list_head *list)
 {
@@ -2010,6 +1998,23 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 return nr_reclaimed;
 }
 
+/*
+ * shrink_active_list() moves pages from the active LRU to the inactive LRU.
+ *
+ * We move them the other way if the page is referenced by one or more
+ * processes.
+ *
+ * If the pages are mostly unmapped, the processing is fast and it is
+ * appropriate to hold lru_lock across the whole operation. But if
+ * the pages are mapped, the processing is slow (page_referenced()), so
+ * we should drop lru_lock around each page. It's impossible to balance
+ * this, so instead we remove the pages from the LRU while processing them.
+ * It is safe to rely on PG_active against the non-LRU pages in here because
+ * nobody will play with that bit on a non-LRU page.
+ *
+ * The downside is that we have to touch page->_refcount against each page.
+ * But we had to alter page->flags anyway.
+ */
 static void shrink_active_list(unsigned long nr_to_scan,
 	struct lruvec *lruvec,
 	struct scan_control *sc,
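
Note (not part of the commit): the comments restored above describe a pattern of taking a batch of pages off the LRU under lru_lock, doing the slow per-page work with the lock dropped, and then putting the pages back under the lock. The following is a minimal user-space sketch of that pattern only. Every name in it (`struct item`, `struct toy_lruvec`, `isolate_batch()`, `toy_shrink_active()`) is hypothetical, the lock is a C11 `atomic_flag` spinlock standing in for `lruvec->lru_lock`, and the `referenced` field stands in for the result of a `page_referenced()`-style check. It is not kernel code and does not reproduce the kernel's data structures.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a page on an LRU list (not struct page). */
struct item {
	struct item *next;
	int referenced;		/* stand-in for a page_referenced()-style result */
};

/* Hypothetical stand-in for a lruvec with two lists and one lock. */
struct toy_lruvec {
	atomic_flag lru_lock;	/* toy spinlock, stand-in for lruvec->lru_lock */
	struct item *active;
	struct item *inactive;
};

static void toy_lock(struct toy_lruvec *lv)
{
	while (atomic_flag_test_and_set_explicit(&lv->lru_lock,
						 memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void toy_unlock(struct toy_lruvec *lv)
{
	atomic_flag_clear_explicit(&lv->lru_lock, memory_order_release);
}

/* Detach up to nr_to_scan items from *src. Caller must hold lru_lock. */
static struct item *isolate_batch(struct item **src, size_t nr_to_scan)
{
	struct item *batch = NULL;

	while (nr_to_scan-- && *src) {
		struct item *it = *src;

		*src = it->next;
		it->next = batch;
		batch = it;
	}
	return batch;
}

/* Returns the number of items moved to the inactive list. */
static size_t toy_shrink_active(struct toy_lruvec *lv, size_t nr_to_scan)
{
	struct item *batch, *keep = NULL, *demote = NULL;
	size_t moved = 0;

	/* Short critical section: only pull the batch off the list. */
	toy_lock(lv);
	batch = isolate_batch(&lv->active, nr_to_scan);
	toy_unlock(lv);

	/* Slow per-item work runs on the private batch, lock dropped. */
	while (batch) {
		struct item *it = batch;

		batch = it->next;
		if (it->referenced) {
			it->referenced = 0;	/* referenced: stays active */
			it->next = keep;
			keep = it;
		} else {
			it->next = demote;	/* unreferenced: demote */
			demote = it;
			moved++;
		}
	}

	/* Second short critical section: put the items back. */
	toy_lock(lv);
	while (keep) {
		struct item *it = keep;

		keep = it->next;
		it->next = lv->active;
		lv->active = it;
	}
	while (demote) {
		struct item *it = demote;

		demote = it->next;
		it->next = lv->inactive;
		lv->inactive = it;
	}
	toy_unlock(lv);
	return moved;
}
```

A caller would build a `struct toy_lruvec` with `ATOMIC_FLAG_INIT` for the lock and call `toy_shrink_active(&lv, n)`. The point of the design is that the lock is held only for the two list splices, never across the per-item check.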