forked from Minki/linux
Merge branch 'for-4.2/writeback' of git://git.kernel.dk/linux-block
Pull cgroup writeback support from Jens Axboe:
 "This is the big pull request for adding cgroup writeback support.

  This code has been in development for a long time, and it has been
  simmering in for-next for a good chunk of this cycle too. This is one
  of those problems that has been talked about for at least half a
  decade, finally there's a solution and code to go with it.

  Also see last weeks writeup on LWN:

        http://lwn.net/Articles/648292/"

* 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
  writeback, blkio: add documentation for cgroup writeback support
  vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
  writeback: do foreign inode detection iff cgroup writeback is enabled
  v9fs: fix error handling in v9fs_session_init()
  bdi: fix wrong error return value in cgwb_create()
  buffer: remove unusued 'ret' variable
  writeback: disassociate inodes from dying bdi_writebacks
  writeback: implement foreign cgroup inode bdi_writeback switching
  writeback: add lockdep annotation to inode_to_wb()
  writeback: use unlocked_inode_to_wb transaction in inode_congested()
  writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
  writeback: implement [locked_]inode_to_wb_and_lock_list()
  writeback: implement foreign cgroup inode detection
  writeback: make writeback_control track the inode being written back
  writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
  mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
  writeback: implement memcg writeback domain based throttling
  writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
  writeback: implement memcg wb_domain
  writeback: update wb_over_bg_thresh() to use wb_domain aware operations
  ...
commit
e4bc13adfd
@ -387,8 +387,81 @@ groups and put applications in that group which are not driving enough
IO to keep disk busy. In that case set group_idle=0, and CFQ will not idle
on individual groups and throughput should improve.

What works
==========
- Currently only sync IO queues are support. All the buffered writes are
  still system wide and not per group. Hence we will not see service
  differentiation between buffered writes between groups.
Writeback
=========

Page cache is dirtied through buffered writes and shared mmaps and
written asynchronously to the backing filesystem by the writeback
mechanism. Writeback sits between the memory and IO domains and
regulates the proportion of dirty memory by balancing dirtying and
write IOs.

On traditional cgroup hierarchies, relationships between different
controllers cannot be established making it impossible for writeback
to operate accounting for cgroup resource restrictions and all
writeback IOs are attributed to the root cgroup.

If both the blkio and memory controllers are used on the v2 hierarchy
and the filesystem supports cgroup writeback, writeback operations
correctly follow the resource restrictions imposed by both memory and
blkio controllers.

Writeback examines both system-wide and per-cgroup dirty memory status
and enforces the more restrictive of the two. Also, writeback control
parameters which are absolute values - vm.dirty_bytes and
vm.dirty_background_bytes - are distributed across cgroups according
to their current writeback bandwidth.
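
The proportional split described above can be pictured with a small userspace sketch. This is an illustration only, not the kernel's implementation: the function names cgroup_dirty_share() and effective_dirty_limit() are hypothetical, bandwidth values are assumed to fit in 32 bits, and "more restrictive" is read here simply as the smaller of two byte limits.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): give one cgroup its slice of
 * an absolute system-wide limit such as vm.dirty_bytes, in proportion to the
 * cgroup's share of the total writeback bandwidth.
 */
uint64_t cgroup_dirty_share(uint64_t limit_bytes, uint64_t cg_bw,
			    uint64_t total_bw)
{
	if (total_bw == 0 || cg_bw > total_bw)
		return 0;	/* no usable bandwidth data in this sketch */

	/* Split the division so limit_bytes * cg_bw is never formed. */
	return (limit_bytes / total_bw) * cg_bw +
	       (limit_bytes % total_bw) * cg_bw / total_bw;
}

/* Hypothetical helper: apply the more restrictive of two byte limits. */
uint64_t effective_dirty_limit(uint64_t system_limit, uint64_t cgroup_limit)
{
	return system_limit < cgroup_limit ? system_limit : cgroup_limit;
}
```

With a 1000-byte system limit, a cgroup contributing 30 of 100 bandwidth units would get a 300-byte share under this model.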

There's a peculiarity stemming from the discrepancy in ownership
granularity between memory controller and writeback. While memory
controller tracks ownership per page, writeback operates on inode
basis. cgroup writeback bridges the gap by tracking ownership by
inode but migrating ownership if too many foreign pages, pages which
don't match the current inode ownership, have been encountered while
writing back the inode.

This is a conscious design choice as writeback operations are
inherently tied to inodes making strictly following page ownership
complicated and inefficient. The only use case which suffers from
this compromise is multiple cgroups concurrently dirtying disjoint
regions of the same inode, which is an unlikely use case and decided
to be unsupported. Note that as memory controller assigns page
ownership on the first use and doesn't update it until the page is
released, even if cgroup writeback strictly follows page ownership,
multiple cgroups dirtying overlapping areas wouldn't work as expected.
In general, write-sharing an inode across multiple cgroups is not well
supported.
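
A minimal userspace sketch of the "migrate ownership when too many foreign pages are seen" idea follows. It is a deliberately simplified model and not the detection algorithm used in the kernel: the struct, the function names, the MAX_CGROUPS bound and the "strict majority of bytes in one round" rule are all assumptions made for illustration.

```c
#include <stddef.h>

#define MAX_CGROUPS 8	/* arbitrary bound for this sketch */

/* Hypothetical per-inode bookkeeping for one writeback round. */
struct inode_wb_owner {
	int owner;				/* cgroup id owning the inode */
	unsigned long written[MAX_CGROUPS];	/* bytes written per page owner */
};

/* Record that @bytes belonging to @page_cgroup were written back. */
void account_page(struct inode_wb_owner *o, int page_cgroup,
		  unsigned long bytes)
{
	if (page_cgroup >= 0 && page_cgroup < MAX_CGROUPS)
		o->written[page_cgroup] += bytes;
}

/*
 * End of a writeback round: if a single foreign cgroup accounts for a
 * strict majority of the bytes written, hand the inode over to it.
 * Returns 1 if ownership changed, 0 otherwise.
 */
int finish_round(struct inode_wb_owner *o)
{
	unsigned long total = 0, best_bytes = 0;
	int best = o->owner, switched = 0;
	size_t i;

	for (i = 0; i < MAX_CGROUPS; i++) {
		total += o->written[i];
		if (o->written[i] > best_bytes) {
			best_bytes = o->written[i];
			best = (int)i;
		}
	}

	if (best != o->owner && best_bytes > total - best_bytes) {
		o->owner = best;
		switched = 1;
	}

	for (i = 0; i < MAX_CGROUPS; i++)
		o->written[i] = 0;	/* start the next round clean */

	return switched;
}
```

In this model an inode owned by cgroup 0 switches to cgroup 1 once cgroup 1's pages make up most of a round, and stays put when the round is evenly split.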

Filesystem support for cgroup writeback
---------------------------------------

A filesystem can make writeback IOs cgroup-aware by updating
address_space_operations->writepage[s]() to annotate bio's using the
following two functions.

* wbc_init_bio(@wbc, @bio)

  Should be called for each bio carrying writeback data and associates
  the bio with the inode's owner cgroup. Can be called anytime
  between bio allocation and submission.

* wbc_account_io(@wbc, @page, @bytes)

  Should be called for each data segment being written out. While
  this function doesn't care exactly when it's called during the
  writeback session, it's the easiest and most natural to call it as
  data segments are added to a bio.

With writeback bio's annotated, cgroup support can be enabled per
super_block by setting MS_CGROUPWB in ->s_flags. This allows for
selective disabling of cgroup writeback support which is helpful when
certain filesystem features, e.g. journaled data mode, are
incompatible.

wbc_init_bio() binds the specified bio to its cgroup. Depending on
the configuration, the bio may be executed at a lower priority and if
the writeback session is holding shared resources, e.g. a journal
entry, may lead to priority inversion. There is no one easy solution
for the problem. Filesystems can try to work around specific problem
cases by skipping wbc_init_bio() or using bio_associate_blkcg()
directly.
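
The call order described above can be sketched in userspace as follows. Everything here is a stand-in: the structs are toy versions of the kernel types, the two wbc_*() functions are mock bodies that only mirror the argument lists given in the text, and example_writepages() is a hypothetical name for a filesystem's writeback loop.

```c
#include <stddef.h>

/* Toy stand-ins so the sketch compiles outside the kernel. */
struct page { int id; };
struct bio { int owner_cgroup; size_t bytes; int submitted; };
struct writeback_control { int inode_owner_cgroup; size_t accounted; };

/* Mock of wbc_init_bio(@wbc, @bio): tie the bio to the inode's owner. */
void wbc_init_bio(struct writeback_control *wbc, struct bio *bio)
{
	bio->owner_cgroup = wbc->inode_owner_cgroup;
}

/* Mock of wbc_account_io(@wbc, @page, @bytes): count bytes written out. */
void wbc_account_io(struct writeback_control *wbc, struct page *page,
		    size_t bytes)
{
	(void)page;
	wbc->accounted += bytes;
}

/* Hypothetical writepages-style loop showing where the two calls go. */
void example_writepages(struct writeback_control *wbc, struct page *pages,
			size_t nr_pages, size_t page_size, struct bio *bio)
{
	size_t i;

	bio->bytes = 0;
	bio->submitted = 0;

	/* Any point between bio allocation and submission is acceptable. */
	wbc_init_bio(wbc, bio);

	for (i = 0; i < nr_pages; i++) {
		bio->bytes += page_size;	/* stands in for adding a segment */
		wbc_account_io(wbc, &pages[i], page_size);
	}

	bio->submitted = 1;			/* stands in for submitting the bio */
}
```

Accounting each segment at the moment it is added to the bio follows the text's remark that this is the most natural place for the call.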
@ -493,6 +493,7 @@ pgpgin - # of charging events to the memory cgroup. The charging
pgpgout - # of uncharging events to the memory cgroup. The uncharging
	event happens each time a page is unaccounted from the cgroup.
swap - # of bytes of swap usage
dirty - # of bytes that are waiting to get written back to the disk.
writeback - # of bytes of file/anon cache that are queued for syncing to
	disk.
inactive_anon - # of bytes of anonymous and swap cache memory on inactive
35
block/bio.c
@ -1988,6 +1988,28 @@ struct bio_set *bioset_create_nobvec(unsigned int pool_size, unsigned int front_
EXPORT_SYMBOL(bioset_create_nobvec);

#ifdef CONFIG_BLK_CGROUP

/**
 * bio_associate_blkcg - associate a bio with the specified blkcg
 * @bio: target bio
 * @blkcg_css: css of the blkcg to associate
 *
 * Associate @bio with the blkcg specified by @blkcg_css. Block layer will
 * treat @bio as if it were issued by a task which belongs to the blkcg.
 *
 * This function takes an extra reference of @blkcg_css which will be put
 * when @bio is released. The caller must own @bio and is responsible for
 * synchronizing calls to this function.
 */
int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css)
{
	if (unlikely(bio->bi_css))
		return -EBUSY;
	css_get(blkcg_css);
	bio->bi_css = blkcg_css;
	return 0;
}

/**
 * bio_associate_current - associate a bio with %current
 * @bio: target bio
@ -2004,26 +2026,17 @@ EXPORT_SYMBOL(bioset_create_nobvec);
int bio_associate_current(struct bio *bio)
{
	struct io_context *ioc;
	struct cgroup_subsys_state *css;

	if (bio->bi_ioc)
	if (bio->bi_css)
		return -EBUSY;

	ioc = current->io_context;
	if (!ioc)
		return -ENOENT;

	/* acquire active ref on @ioc and associate */
	get_io_context_active(ioc);
	bio->bi_ioc = ioc;

	/* associate blkcg if exists */
	rcu_read_lock();
	css = task_css(current, blkio_cgrp_id);
	if (css && css_tryget_online(css))
		bio->bi_css = css;
	rcu_read_unlock();

	bio->bi_css = task_get_css(current, blkio_cgrp_id);
	return 0;
}

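
The contract stated in the bio_associate_blkcg() comment in block/bio.c (refuse a second association with -EBUSY, take one reference that is dropped when the bio is released) can be modelled in userspace as below. This is a sketch under assumptions: struct css and struct bio are toy stand-ins, the reference count is a plain integer, and bio_release_css() is a hypothetical name for the release path.

```c
#include <errno.h>
#include <stddef.h>

/* Toy stand-ins; the kernel's css and bio carry far more state. */
struct css { int refcnt; };
struct bio { struct css *bi_css; };

/*
 * Model of the documented behaviour: fail with -EBUSY if @bio already has
 * an association, otherwise take a reference and record it.
 */
int bio_associate_blkcg(struct bio *bio, struct css *blkcg_css)
{
	if (bio->bi_css)
		return -EBUSY;
	blkcg_css->refcnt++;		/* stands in for css_get() */
	bio->bi_css = blkcg_css;
	return 0;
}

/* Hypothetical release hook: drop the reference taken above. */
void bio_release_css(struct bio *bio)
{
	if (bio->bi_css) {
		bio->bi_css->refcnt--;	/* stands in for css_put() */
		bio->bi_css = NULL;
	}
}
```

Because the failed second call returns before touching the reference count, the caller never has to undo anything on -EBUSY.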
@ -19,11 +19,12 @@
#include <linux/module.h>
#include <linux/err.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/slab.h>
#include <linux/genhd.h>
#include <linux/delay.h>
#include <linux/atomic.h>
#include "blk-cgroup.h"
#include <linux/blk-cgroup.h>
#include "blk.h"

#define MAX_KEY_LEN 100
@ -33,6 +34,8 @@ static DEFINE_MUTEX(blkcg_pol_mutex);
struct blkcg blkcg_root;
EXPORT_SYMBOL_GPL(blkcg_root);

struct cgroup_subsys_state * const blkcg_root_css = &blkcg_root.css;

static struct blkcg_policy *blkcg_policy[BLKCG_MAX_POLS];

static bool blkcg_policy_enabled(struct request_queue *q,
@ -182,6 +185,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
				    struct blkcg_gq *new_blkg)
{
	struct blkcg_gq *blkg;
	struct bdi_writeback_congested *wb_congested;
	int i, ret;

	WARN_ON_ONCE(!rcu_read_lock_held());
@ -193,22 +197,30 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
		goto err_free_blkg;
	}

	wb_congested = wb_congested_get_create(&q->backing_dev_info,
					       blkcg->css.id, GFP_ATOMIC);
	if (!wb_congested) {
		ret = -ENOMEM;
		goto err_put_css;
	}

	/* allocate */
	if (!new_blkg) {
		new_blkg = blkg_alloc(blkcg, q, GFP_ATOMIC);
		if (unlikely(!new_blkg)) {
			ret = -ENOMEM;
			goto err_put_css;
			goto err_put_congested;
		}
	}
	blkg = new_blkg;
	blkg->wb_congested = wb_congested;

	/* link parent */
	if (blkcg_parent(blkcg)) {
		blkg->parent = __blkg_lookup(blkcg_parent(blkcg), q, false);
		if (WARN_ON_ONCE(!blkg->parent)) {
			ret = -EINVAL;
			goto err_put_css;
			goto err_put_congested;
		}
		blkg_get(blkg->parent);
	}
@ -238,18 +250,15 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
	blkg->online = true;
	spin_unlock(&blkcg->lock);

	if (!ret) {
		if (blkcg == &blkcg_root) {
			q->root_blkg = blkg;
			q->root_rl.blkg = blkg;
		}
	if (!ret)
		return blkg;
	}

	/* @blkg failed fully initialized, use the usual release path */
	blkg_put(blkg);
	return ERR_PTR(ret);

err_put_congested:
	wb_congested_put(wb_congested);
err_put_css:
	css_put(&blkcg->css);
err_free_blkg:
@ -342,15 +351,6 @@ static void blkg_destroy(struct blkcg_gq *blkg)
	if (rcu_access_pointer(blkcg->blkg_hint) == blkg)
		rcu_assign_pointer(blkcg->blkg_hint, NULL);

	/*
	 * If root blkg is destroyed. Just clear the pointer since root_rl
	 * does not take reference on root blkg.
	 */
	if (blkcg == &blkcg_root) {
		blkg->q->root_blkg = NULL;
		blkg->q->root_rl.blkg = NULL;
	}

	/*
	 * Put the reference taken at the time of creation so that when all
	 * queues are gone, group can be destroyed.
@ -405,6 +405,8 @@ void __blkg_release_rcu(struct rcu_head *rcu_head)
	if (blkg->parent)
		blkg_put(blkg->parent);

	wb_congested_put(blkg->wb_congested);

	blkg_free(blkg);
}
EXPORT_SYMBOL_GPL(__blkg_release_rcu);
@ -812,6 +814,8 @@ static void blkcg_css_offline(struct cgroup_subsys_state *css)
	}

	spin_unlock_irq(&blkcg->lock);

	wb_blkcg_offline(blkcg);
}

static void blkcg_css_free(struct cgroup_subsys_state *css)
@ -868,7 +872,9 @@ done:
	spin_lock_init(&blkcg->lock);
	INIT_RADIX_TREE(&blkcg->blkg_tree, GFP_ATOMIC);
	INIT_HLIST_HEAD(&blkcg->blkg_list);

#ifdef CONFIG_CGROUP_WRITEBACK
	INIT_LIST_HEAD(&blkcg->cgwb_list);
#endif
	return &blkcg->css;

free_pd_blkcg:
@ -892,9 +898,45 @@ free_blkcg:
 */
int blkcg_init_queue(struct request_queue *q)
{
	might_sleep();
	struct blkcg_gq *new_blkg, *blkg;
	bool preloaded;
	int ret;

	return blk_throtl_init(q);
	new_blkg = blkg_alloc(&blkcg_root, q, GFP_KERNEL);
	if (!new_blkg)
		return -ENOMEM;

	preloaded = !radix_tree_preload(GFP_KERNEL);

	/*
	 * Make sure the root blkg exists and count the existing blkgs. As
	 * @q is bypassing at this point, blkg_lookup_create() can't be
	 * used. Open code insertion.
	 */
	rcu_read_lock();
	spin_lock_irq(q->queue_lock);
	blkg = blkg_create(&blkcg_root, q, new_blkg);
	spin_unlock_irq(q->queue_lock);
	rcu_read_unlock();

	if (preloaded)
		radix_tree_preload_end();

	if (IS_ERR(blkg)) {
		kfree(new_blkg);
		return PTR_ERR(blkg);
	}

	q->root_blkg = blkg;
	q->root_rl.blkg = blkg;

	ret = blk_throtl_init(q);
	if (ret) {
		spin_lock_irq(q->queue_lock);
		blkg_destroy_all(q);
		spin_unlock_irq(q->queue_lock);
	}
	return ret;
}

/**
@ -996,50 +1038,19 @@ int blkcg_activate_policy(struct request_queue *q,
{
	LIST_HEAD(pds);
	LIST_HEAD(cpds);
	struct blkcg_gq *blkg, *new_blkg;
	struct blkcg_gq *blkg;
	struct blkg_policy_data *pd, *nd;
	struct blkcg_policy_data *cpd, *cnd;
	int cnt = 0, ret;
	bool preloaded;

	if (blkcg_policy_enabled(q, pol))
		return 0;

	/* preallocations for root blkg */
	new_blkg = blkg_alloc(&blkcg_root, q, GFP_KERNEL);
	if (!new_blkg)
		return -ENOMEM;

	/* count and allocate policy_data for all existing blkgs */
	blk_queue_bypass_start(q);

	preloaded = !radix_tree_preload(GFP_KERNEL);

	/*
	 * Make sure the root blkg exists and count the existing blkgs. As
	 * @q is bypassing at this point, blkg_lookup_create() can't be
	 * used. Open code it.
	 */
	spin_lock_irq(q->queue_lock);

	rcu_read_lock();
	blkg = __blkg_lookup(&blkcg_root, q, false);
	if (blkg)
		blkg_free(new_blkg);
	else
		blkg = blkg_create(&blkcg_root, q, new_blkg);
	rcu_read_unlock();

	if (preloaded)
		radix_tree_preload_end();

	if (IS_ERR(blkg)) {
		ret = PTR_ERR(blkg);
		goto out_unlock;
	}

	list_for_each_entry(blkg, &q->blkg_list, q_node)
		cnt++;

	spin_unlock_irq(q->queue_lock);

	/*
@ -1140,10 +1151,6 @@ void blkcg_deactivate_policy(struct request_queue *q,

	__clear_bit(pol->plid, q->blkcg_pols);

	/* if no policy is left, no need for blkgs - shoot them down */
	if (bitmap_empty(q->blkcg_pols, BLKCG_MAX_POLS))
		blkg_destroy_all(q);

	list_for_each_entry(blkg, &q->blkg_list, q_node) {
		/* grab blkcg lock too while removing @pd from @blkg */
		spin_lock(&blkg->blkcg->lock);
@ -32,12 +32,12 @@
#include <linux/delay.h>
#include <linux/ratelimit.h>
#include <linux/pm_runtime.h>
#include <linux/blk-cgroup.h>

#define CREATE_TRACE_POINTS
#include <trace/events/block.h>

#include "blk.h"
#include "blk-cgroup.h"
#include "blk-mq.h"

EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
@ -63,6 +63,31 @@ struct kmem_cache *blk_requestq_cachep;
 */
static struct workqueue_struct *kblockd_workqueue;

static void blk_clear_congested(struct request_list *rl, int sync)
{
#ifdef CONFIG_CGROUP_WRITEBACK
	clear_wb_congested(rl->blkg->wb_congested, sync);
#else
	/*
	 * If !CGROUP_WRITEBACK, all blkg's map to bdi->wb and we shouldn't
	 * flip its congestion state for events on other blkcgs.
	 */
	if (rl == &rl->q->root_rl)
		clear_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
#endif
}

static void blk_set_congested(struct request_list *rl, int sync)
{
#ifdef CONFIG_CGROUP_WRITEBACK
	set_wb_congested(rl->blkg->wb_congested, sync);
#else
	/* see blk_clear_congested() */
	if (rl == &rl->q->root_rl)
		set_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
#endif
}

void blk_queue_congestion_threshold(struct request_queue *q)
{
	int nr;
@ -623,8 +648,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)

	q->backing_dev_info.ra_pages =
			(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
	q->backing_dev_info.state = 0;
	q->backing_dev_info.capabilities = 0;
	q->backing_dev_info.capabilities = BDI_CAP_CGROUP_WRITEBACK;
	q->backing_dev_info.name = "block";
	q->node = node_id;

@ -847,13 +871,8 @@ static void __freed_request(struct request_list *rl, int sync)
{
	struct request_queue *q = rl->q;

	/*
	 * bdi isn't aware of blkcg yet. As all async IOs end up root
	 * blkcg anyway, just use root blkcg state.
	 */
	if (rl == &q->root_rl &&
	    rl->count[sync] < queue_congestion_off_threshold(q))
		blk_clear_queue_congested(q, sync);
	if (rl->count[sync] < queue_congestion_off_threshold(q))
		blk_clear_congested(rl, sync);

	if (rl->count[sync] + 1 <= q->nr_requests) {
		if (waitqueue_active(&rl->wait[sync]))
@ -886,25 +905,25 @@ static void freed_request(struct request_list *rl, unsigned int flags)
int blk_update_nr_requests(struct request_queue *q, unsigned int nr)
{
	struct request_list *rl;
	int on_thresh, off_thresh;

	spin_lock_irq(q->queue_lock);
	q->nr_requests = nr;
	blk_queue_congestion_threshold(q);

	/* congestion isn't cgroup aware and follows root blkcg for now */
	rl = &q->root_rl;

	if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
		blk_set_queue_congested(q, BLK_RW_SYNC);
	else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
		blk_clear_queue_congested(q, BLK_RW_SYNC);

	if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
		blk_set_queue_congested(q, BLK_RW_ASYNC);
	else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
		blk_clear_queue_congested(q, BLK_RW_ASYNC);
	on_thresh = queue_congestion_on_threshold(q);
	off_thresh = queue_congestion_off_threshold(q);

	blk_queue_for_each_rl(rl, q) {
		if (rl->count[BLK_RW_SYNC] >= on_thresh)
			blk_set_congested(rl, BLK_RW_SYNC);
		else if (rl->count[BLK_RW_SYNC] < off_thresh)
			blk_clear_congested(rl, BLK_RW_SYNC);

		if (rl->count[BLK_RW_ASYNC] >= on_thresh)
			blk_set_congested(rl, BLK_RW_ASYNC);
		else if (rl->count[BLK_RW_ASYNC] < off_thresh)
			blk_clear_congested(rl, BLK_RW_ASYNC);

		if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
			blk_set_rl_full(rl, BLK_RW_SYNC);
		} else {
@ -1014,12 +1033,7 @@ static struct request *__get_request(struct request_list *rl, int rw_flags,
				}
			}
		}
		/*
		 * bdi isn't aware of blkcg yet. As all async IOs end up
		 * root blkcg anyway, just use root blkcg state.
		 */
		if (rl == &q->root_rl)
			blk_set_queue_congested(q, is_sync);
		blk_set_congested(rl, is_sync);
	}

	/*
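
The per-request-list congestion updates in blk_update_nr_requests() use two thresholds, so the congested flag behaves as a hysteresis: it is set at or above the "on" threshold, cleared below the "off" threshold, and left alone in between. A minimal userspace model is sketched below; struct rl_model and update_congestion() are hypothetical names, and the thresholds are passed in rather than derived from a queue.

```c
#include <stdbool.h>

/* Hypothetical per-request-list state for one direction (sync or async). */
struct rl_model {
	int count;		/* requests currently allocated */
	bool congested;		/* current congestion flag */
};

/* Set at or above on_thresh, clear below off_thresh, else keep the state. */
void update_congestion(struct rl_model *rl, int on_thresh, int off_thresh)
{
	if (rl->count >= on_thresh)
		rl->congested = true;
	else if (rl->count < off_thresh)
		rl->congested = false;
	/* between the two thresholds the previous state is kept */
}
```

Keeping a gap between the two thresholds avoids flipping the flag on every request when the count hovers near a single cut-off.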
@ -21,6 +21,7 @@
 */

#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/mempool.h>
#include <linux/bio.h>
#include <linux/scatterlist.h>
@ -6,11 +6,12 @@
#include <linux/module.h>
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/blktrace_api.h>
#include <linux/blk-mq.h>
#include <linux/blk-cgroup.h>

#include "blk.h"
#include "blk-cgroup.h"
#include "blk-mq.h"

struct queue_sysfs_entry {
@ -9,7 +9,7 @@
#include <linux/blkdev.h>
#include <linux/bio.h>
#include <linux/blktrace_api.h>
#include "blk-cgroup.h"
#include <linux/blk-cgroup.h>
#include "blk.h"

/* Max dispatch from a group in 1 round */
@ -13,6 +13,7 @@
#include <linux/pagemap.h>
#include <linux/mempool.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/init.h>
#include <linux/hash.h>
#include <linux/highmem.h>
@ -14,8 +14,8 @@
#include <linux/rbtree.h>
#include <linux/ioprio.h>
#include <linux/blktrace_api.h>
#include <linux/blk-cgroup.h>
#include "blk.h"
#include "blk-cgroup.h"

/*
 * tunables
@ -35,11 +35,11 @@
#include <linux/hash.h>
#include <linux/uaccess.h>
#include <linux/pm_runtime.h>
#include <linux/blk-cgroup.h>

#include <trace/events/block.h>

#include "blk.h"
#include "blk-cgroup.h"

static DEFINE_SPINLOCK(elv_list_lock);
static LIST_HEAD(elv_list);
@ -8,6 +8,7 @@
#include <linux/kdev_t.h>
#include <linux/kernel.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/init.h>
#include <linux/spinlock.h>
#include <linux/proc_fs.h>
@ -38,6 +38,7 @@
#include <linux/mutex.h>
#include <linux/major.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/genhd.h>
#include <linux/idr.h>
#include <net/tcp.h>
@ -2359,7 +2359,7 @@ static void drbd_cleanup(void)
 * @congested_data: User data
 * @bdi_bits: Bits the BDI flusher thread is currently interested in
 *
 * Returns 1<<BDI_async_congested and/or 1<<BDI_sync_congested if we are congested.
 * Returns 1<<WB_async_congested and/or 1<<WB_sync_congested if we are congested.
 */
static int drbd_congested(void *congested_data, int bdi_bits)
{
@ -2376,14 +2376,14 @@ static int drbd_congested(void *congested_data, int bdi_bits)
	}

	if (test_bit(CALLBACK_PENDING, &first_peer_device(device)->connection->flags)) {
		r |= (1 << BDI_async_congested);
		r |= (1 << WB_async_congested);
		/* Without good local data, we would need to read from remote,
		 * and that would need the worker thread as well, which is
		 * currently blocked waiting for that usermode helper to
		 * finish.
		 */
		if (!get_ldev_if_state(device, D_UP_TO_DATE))
			r |= (1 << BDI_sync_congested);
			r |= (1 << WB_sync_congested);
		else
			put_ldev(device);
		r &= bdi_bits;
@ -2399,9 +2399,9 @@ static int drbd_congested(void *congested_data, int bdi_bits)
		reason = 'b';
	}

	if (bdi_bits & (1 << BDI_async_congested) &&
	if (bdi_bits & (1 << WB_async_congested) &&
	    test_bit(NET_CONGESTED, &first_peer_device(device)->connection->flags)) {
		r |= (1 << BDI_async_congested);
		r |= (1 << WB_async_congested);
		reason = reason == 'b' ? 'a' : 'n';
	}

@ -61,6 +61,7 @@
#include <linux/freezer.h>
#include <linux/mutex.h>
#include <linux/slab.h>
#include <linux/backing-dev.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_ioctl.h>
#include <scsi/scsi.h>
@ -12,6 +12,7 @@
#include <linux/fs.h>
#include <linux/major.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/module.h>
#include <linux/raw.h>
#include <linux/capability.h>
@ -15,6 +15,7 @@
#include <linux/module.h>
#include <linux/hash.h>
#include <linux/random.h>
#include <linux/backing-dev.h>

#include <trace/events/bcache.h>

@ -2080,7 +2080,7 @@ static int dm_any_congested(void *congested_data, int bdi_bits)
			 * the query about congestion status of request_queue
			 */
			if (dm_request_based(md))
				r = md->queue->backing_dev_info.state &
				r = md->queue->backing_dev_info.wb.state &
				    bdi_bits;
			else
				r = dm_table_any_congested(map, bdi_bits);
@ -14,6 +14,7 @@
#include <linux/device-mapper.h>
#include <linux/list.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/hdreg.h>
#include <linux/completion.h>
#include <linux/kobject.h>
@ -16,6 +16,7 @@
#define _MD_MD_H

#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/kobject.h>
#include <linux/list.h>
#include <linux/mm.h>
@ -745,7 +745,7 @@ static int raid1_congested(struct mddev *mddev, int bits)
	struct r1conf *conf = mddev->private;
	int i, ret = 0;

	if ((bits & (1 << BDI_async_congested)) &&
	if ((bits & (1 << WB_async_congested)) &&
	    conf->pending_count >= max_queued_requests)
		return 1;

@ -760,7 +760,7 @@ static int raid1_congested(struct mddev *mddev, int bits)
			/* Note the '|| 1' - when read_balance prefers
			 * non-congested targets, it can be removed
			 */
			if ((bits & (1<<BDI_async_congested)) || 1)
			if ((bits & (1 << WB_async_congested)) || 1)
				ret |= bdi_congested(&q->backing_dev_info, bits);
			else
				ret &= bdi_congested(&q->backing_dev_info, bits);
@ -914,7 +914,7 @@ static int raid10_congested(struct mddev *mddev, int bits)
	struct r10conf *conf = mddev->private;
	int i, ret = 0;

	if ((bits & (1 << BDI_async_congested)) &&
	if ((bits & (1 << WB_async_congested)) &&
	    conf->pending_count >= max_queued_requests)
		return 1;

@ -20,6 +20,7 @@
#include <linux/delay.h>
#include <linux/fs.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/bio.h>
#include <linux/pagemap.h>
#include <linux/list.h>
@ -55,9 +55,7 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
	if (PagePrivate(page))
		page->mapping->a_ops->invalidatepage(page, 0, PAGE_CACHE_SIZE);

	if (TestClearPageDirty(page))
		account_page_cleaned(page, mapping);

	cancel_dirty_page(page);
	ClearPageMappedToDisk(page);
	ll_delete_from_page_cache(page);
}
50
fs/9p/v9fs.c
@ -320,31 +320,21 @@ fail_option_alloc:
struct p9_fid *v9fs_session_init(struct v9fs_session_info *v9ses,
		  const char *dev_name, char *data)
{
	int retval = -EINVAL;
	struct p9_fid *fid;
	int rc;
	int rc = -ENOMEM;

	v9ses->uname = kstrdup(V9FS_DEFUSER, GFP_KERNEL);
	if (!v9ses->uname)
		return ERR_PTR(-ENOMEM);
		goto err_names;

	v9ses->aname = kstrdup(V9FS_DEFANAME, GFP_KERNEL);
	if (!v9ses->aname) {
		kfree(v9ses->uname);
		return ERR_PTR(-ENOMEM);
	}
	if (!v9ses->aname)
		goto err_names;
	init_rwsem(&v9ses->rename_sem);

	rc = bdi_setup_and_register(&v9ses->bdi, "9p");
	if (rc) {
		kfree(v9ses->aname);
		kfree(v9ses->uname);
		return ERR_PTR(rc);
	}

	spin_lock(&v9fs_sessionlist_lock);
	list_add(&v9ses->slist, &v9fs_sessionlist);
	spin_unlock(&v9fs_sessionlist_lock);
	if (rc)
		goto err_names;

	v9ses->uid = INVALID_UID;
	v9ses->dfltuid = V9FS_DEFUID;
@ -352,10 +342,9 @@ struct p9_fid *v9fs_session_init(struct v9fs_session_info *v9ses,

	v9ses->clnt = p9_client_create(dev_name, data);
	if (IS_ERR(v9ses->clnt)) {
		retval = PTR_ERR(v9ses->clnt);
		v9ses->clnt = NULL;
		rc = PTR_ERR(v9ses->clnt);
		p9_debug(P9_DEBUG_ERROR, "problem initializing 9p client\n");
		goto error;
		goto err_bdi;
	}

	v9ses->flags = V9FS_ACCESS_USER;
@ -368,10 +357,8 @@ struct p9_fid *v9fs_session_init(struct v9fs_session_info *v9ses,
	}

	rc = v9fs_parse_options(v9ses, data);
	if (rc < 0) {
		retval = rc;
		goto error;
	}
	if (rc < 0)
		goto err_clnt;

	v9ses->maxdata = v9ses->clnt->msize - P9_IOHDRSZ;

@ -405,10 +392,9 @@ struct p9_fid *v9fs_session_init(struct v9fs_session_info *v9ses,
	fid = p9_client_attach(v9ses->clnt, NULL, v9ses->uname, INVALID_UID,
			       v9ses->aname);
	if (IS_ERR(fid)) {
		retval = PTR_ERR(fid);
		fid = NULL;
		rc = PTR_ERR(fid);
		p9_debug(P9_DEBUG_ERROR, "cannot attach\n");
		goto error;
		goto err_clnt;
	}

	if ((v9ses->flags & V9FS_ACCESS_MASK) == V9FS_ACCESS_SINGLE)
@ -420,12 +406,20 @@ struct p9_fid *v9fs_session_init(struct v9fs_session_info *v9ses,
	/* register the session for caching */
	v9fs_cache_session_get_cookie(v9ses);
#endif
	spin_lock(&v9fs_sessionlist_lock);
	list_add(&v9ses->slist, &v9fs_sessionlist);
	spin_unlock(&v9fs_sessionlist_lock);

	return fid;

error:
err_clnt:
	p9_client_destroy(v9ses->clnt);
err_bdi:
	bdi_destroy(&v9ses->bdi);
	return ERR_PTR(retval);
err_names:
	kfree(v9ses->uname);
	kfree(v9ses->aname);
	return ERR_PTR(rc);
}

/**
@ -130,11 +130,7 @@ static struct dentry *v9fs_mount(struct file_system_type *fs_type, int flags,
	fid = v9fs_session_init(v9ses, dev_name, data);
	if (IS_ERR(fid)) {
		retval = PTR_ERR(fid);
		/*
		 * we need to call session_close to tear down some
		 * of the data structure setup by session_init
		 */
		goto close_session;
		goto free_session;
	}

	sb = sget(fs_type, NULL, v9fs_set_super, flags, v9ses);
@ -195,8 +191,8 @@ static struct dentry *v9fs_mount(struct file_system_type *fs_type, int flags,

clunk_fid:
	p9_client_clunk(fid);
close_session:
	v9fs_session_close(v9ses);
free_session:
	kfree(v9ses);
	return ERR_PTR(retval);

@ -14,6 +14,7 @@
#include <linux/device_cgroup.h>
#include <linux/highmem.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/module.h>
#include <linux/blkpg.h>
#include <linux/magic.h>
@ -546,7 +547,8 @@ static struct file_system_type bd_type = {
|
||||
.kill_sb = kill_anon_super,
|
||||
};
|
||||
|
||||
static struct super_block *blockdev_superblock __read_mostly;
|
||||
struct super_block *blockdev_superblock __read_mostly;
|
||||
EXPORT_SYMBOL_GPL(blockdev_superblock);
|
||||
|
||||
void __init bdev_cache_init(void)
|
||||
{
|
||||
@ -687,11 +689,6 @@ static struct block_device *bd_acquire(struct inode *inode)
|
||||
return bdev;
|
||||
}
|
||||
|
||||
int sb_is_blkdev_sb(struct super_block *sb)
|
||||
{
|
||||
return sb == blockdev_superblock;
|
||||
}
|
||||
|
||||
/* Call when you free inode */
|
||||
|
||||
void bd_forget(struct inode *inode)
|
||||
|
64
fs/buffer.c
@@ -30,6 +30,7 @@
 #include <linux/quotaops.h>
 #include <linux/highmem.h>
 #include <linux/export.h>
+#include <linux/backing-dev.h>
 #include <linux/writeback.h>
 #include <linux/hash.h>
 #include <linux/suspend.h>
@@ -44,6 +45,9 @@
 #include <trace/events/block.h>
 
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
+static int submit_bh_wbc(int rw, struct buffer_head *bh,
+			 unsigned long bio_flags,
+			 struct writeback_control *wbc);
 
 #define BH_ENTRY(list) list_entry((list), struct buffer_head, b_assoc_buffers)
 
@@ -623,21 +627,22 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
  *
  * If warn is true, then emit a warning if the page is not uptodate and has
  * not been truncated.
+ *
+ * The caller must hold mem_cgroup_begin_page_stat() lock.
  */
-static void __set_page_dirty(struct page *page,
-		struct address_space *mapping, int warn)
+static void __set_page_dirty(struct page *page, struct address_space *mapping,
+			     struct mem_cgroup *memcg, int warn)
 {
 	unsigned long flags;
 
 	spin_lock_irqsave(&mapping->tree_lock, flags);
 	if (page->mapping) {	/* Race with truncate? */
 		WARN_ON_ONCE(warn && !PageUptodate(page));
-		account_page_dirtied(page, mapping);
+		account_page_dirtied(page, mapping, memcg);
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 	}
 	spin_unlock_irqrestore(&mapping->tree_lock, flags);
-	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 }
 
 /*
@@ -668,6 +673,7 @@ static void __set_page_dirty(struct page *page,
 int __set_page_dirty_buffers(struct page *page)
 {
 	int newly_dirty;
+	struct mem_cgroup *memcg;
 	struct address_space *mapping = page_mapping(page);
 
 	if (unlikely(!mapping))
@@ -683,11 +689,22 @@ int __set_page_dirty_buffers(struct page *page)
 			bh = bh->b_this_page;
 		} while (bh != head);
 	}
+	/*
+	 * Use mem_cgroup_begin_page_stat() to keep PageDirty synchronized with
+	 * per-memcg dirty page counters.
+	 */
+	memcg = mem_cgroup_begin_page_stat(page);
 	newly_dirty = !TestSetPageDirty(page);
 	spin_unlock(&mapping->private_lock);
 
 	if (newly_dirty)
-		__set_page_dirty(page, mapping, 1);
+		__set_page_dirty(page, mapping, memcg, 1);
+
+	mem_cgroup_end_page_stat(memcg);
+
+	if (newly_dirty)
+		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+
 	return newly_dirty;
 }
 EXPORT_SYMBOL(__set_page_dirty_buffers);
@@ -1158,11 +1175,18 @@ void mark_buffer_dirty(struct buffer_head *bh)
 
 	if (!test_set_buffer_dirty(bh)) {
 		struct page *page = bh->b_page;
+		struct address_space *mapping = NULL;
+		struct mem_cgroup *memcg;
+
+		memcg = mem_cgroup_begin_page_stat(page);
 		if (!TestSetPageDirty(page)) {
-			struct address_space *mapping = page_mapping(page);
+			mapping = page_mapping(page);
 			if (mapping)
-				__set_page_dirty(page, mapping, 0);
+				__set_page_dirty(page, mapping, memcg, 0);
 		}
+		mem_cgroup_end_page_stat(memcg);
+		if (mapping)
+			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 	}
 }
 EXPORT_SYMBOL(mark_buffer_dirty);
@@ -1684,8 +1708,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
 	struct buffer_head *bh, *head;
 	unsigned int blocksize, bbits;
 	int nr_underway = 0;
-	int write_op = (wbc->sync_mode == WB_SYNC_ALL ?
-			WRITE_SYNC : WRITE);
+	int write_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
 
 	head = create_page_buffers(page, inode,
 					(1 << BH_Dirty)|(1 << BH_Uptodate));
@@ -1774,7 +1797,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
 	do {
 		struct buffer_head *next = bh->b_this_page;
 		if (buffer_async_write(bh)) {
-			submit_bh(write_op, bh);
+			submit_bh_wbc(write_op, bh, 0, wbc);
 			nr_underway++;
 		}
 		bh = next;
@@ -1828,7 +1851,7 @@ recover:
 		struct buffer_head *next = bh->b_this_page;
 		if (buffer_async_write(bh)) {
 			clear_buffer_dirty(bh);
-			submit_bh(write_op, bh);
+			submit_bh_wbc(write_op, bh, 0, wbc);
 			nr_underway++;
 		}
 		bh = next;
@@ -2993,7 +3016,8 @@ void guard_bio_eod(int rw, struct bio *bio)
 	}
 }
 
-int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
+static int submit_bh_wbc(int rw, struct buffer_head *bh,
+			 unsigned long bio_flags, struct writeback_control *wbc)
 {
 	struct bio *bio;
 
@@ -3015,6 +3039,11 @@ int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
 	 */
 	bio = bio_alloc(GFP_NOIO, 1);
 
+	if (wbc) {
+		wbc_init_bio(wbc, bio);
+		wbc_account_io(wbc, bh->b_page, bh->b_size);
+	}
+
 	bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
 	bio->bi_bdev = bh->b_bdev;
 	bio->bi_io_vec[0].bv_page = bh->b_page;
@@ -3039,11 +3068,16 @@ int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
 	submit_bio(rw, bio);
 	return 0;
 }
+
+int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
+{
+	return submit_bh_wbc(rw, bh, bio_flags, NULL);
+}
 EXPORT_SYMBOL_GPL(_submit_bh);
 
 int submit_bh(int rw, struct buffer_head *bh)
 {
-	return _submit_bh(rw, bh, 0);
+	return submit_bh_wbc(rw, bh, 0, NULL);
 }
 EXPORT_SYMBOL(submit_bh);
|
||||
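Review note: the submit_bh hunks above keep the two exported entry points unchanged and route them through one static helper that takes an optional writeback context, where NULL means "do not account this IO". Below is a minimal user-space sketch of that wrapper pattern. `struct wb_ctl`, `struct buf` and the `submit_buf*()` functions are hypothetical stand-ins, far simpler than the kernel's `writeback_control` and `buffer_head`.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures. */
struct wb_ctl {
	size_t bytes_accounted;	/* IO charged to this writeback context */
};

struct buf {
	size_t size;
	int submitted;
};

/* Core path: account the IO only when a writeback context is supplied. */
static int submit_buf_wbc(struct buf *b, struct wb_ctl *wbc)
{
	if (wbc)
		wbc->bytes_accounted += b->size;
	b->submitted = 1;
	return 0;
}

/* Existing entry point: same behaviour as before, no accounting. */
int submit_buf(struct buf *b)
{
	return submit_buf_wbc(b, NULL);
}

/* Writeback entry point: the IO is charged to the caller's context. */
int submit_buf_for_writeback(struct buf *b, struct wb_ctl *wbc)
{
	return submit_buf_wbc(b, wbc);
}
```

Keeping the context parameter optional means existing callers need no changes, and only the writeback paths that actually hold a context pay for the accounting.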
@@ -3232,8 +3266,8 @@ int try_to_free_buffers(struct page *page)
 	 * to synchronise against __set_page_dirty_buffers and prevent the
 	 * dirty bit from being lost.
 	 */
-	if (ret && TestClearPageDirty(page))
-		account_page_cleaned(page, mapping);
+	if (ret)
+		cancel_dirty_page(page);
 	spin_unlock(&mapping->private_lock);
 out:
 	if (buffers_to_free) {
|
@@ -882,6 +882,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 	sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
 		((EXT2_SB(sb)->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ?
 		 MS_POSIXACL : 0);
+	sb->s_iflags |= SB_I_CGROUPWB;
 
 	if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV &&
 	    (EXT2_HAS_COMPAT_FEATURE(sb, ~0U) ||
|
@@ -39,6 +39,7 @@
 #include <linux/slab.h>
 #include <asm/uaccess.h>
 #include <linux/fiemap.h>
+#include <linux/backing-dev.h>
 #include "ext4_jbd2.h"
 #include "ext4_extents.h"
 #include "xattr.h"
|
@@ -26,6 +26,7 @@
 #include <linux/log2.h>
 #include <linux/module.h>
 #include <linux/slab.h>
+#include <linux/backing-dev.h>
 #include <trace/events/ext4.h>
 
 #ifdef CONFIG_EXT4_DEBUG
|
@@ -24,6 +24,7 @@
 #include <linux/slab.h>
 #include <linux/init.h>
 #include <linux/blkdev.h>
+#include <linux/backing-dev.h>
 #include <linux/parser.h>
 #include <linux/buffer_head.h>
 #include <linux/exportfs.h>
|
@@ -53,7 +53,7 @@ bool available_free_memory(struct f2fs_sb_info *sbi, int type)
 							PAGE_CACHE_SHIFT;
 		res = mem_size < ((avail_ram * nm_i->ram_thresh / 100) >> 2);
 	} else if (type == DIRTY_DENTS) {
-		if (sbi->sb->s_bdi->dirty_exceeded)
+		if (sbi->sb->s_bdi->wb.dirty_exceeded)
 			return false;
 		mem_size = get_pages(sbi, F2FS_DIRTY_DENTS);
 		res = mem_size < ((avail_ram * nm_i->ram_thresh / 100) >> 1);
@@ -70,7 +70,7 @@ bool available_free_memory(struct f2fs_sb_info *sbi, int type)
 				sizeof(struct extent_node)) >> PAGE_CACHE_SHIFT;
 		res = mem_size < ((avail_ram * nm_i->ram_thresh / 100) >> 1);
 	} else {
-		if (sbi->sb->s_bdi->dirty_exceeded)
+		if (sbi->sb->s_bdi->wb.dirty_exceeded)
 			return false;
 	}
 	return res;
|
@@ -9,6 +9,7 @@
  * published by the Free Software Foundation.
  */
 #include <linux/blkdev.h>
+#include <linux/backing-dev.h>
 
 /* constant macro */
 #define NULL_SEGNO			((unsigned int)(~0))
@@ -714,7 +715,7 @@ static inline unsigned int max_hw_blocks(struct f2fs_sb_info *sbi)
  */
 static inline int nr_pages_to_skip(struct f2fs_sb_info *sbi, int type)
 {
-	if (sbi->sb->s_bdi->dirty_exceeded)
+	if (sbi->sb->s_bdi->wb.dirty_exceeded)
 		return 0;
 
 	if (type == DATA)
|
@@ -11,6 +11,7 @@
 #include <linux/compat.h>
 #include <linux/mount.h>
 #include <linux/blkdev.h>
+#include <linux/backing-dev.h>
 #include <linux/fsnotify.h>
 #include <linux/security.h>
 #include "fat.h"
|
@@ -18,6 +18,7 @@
 #include <linux/parser.h>
 #include <linux/uio.h>
 #include <linux/blkdev.h>
+#include <linux/backing-dev.h>
 #include <asm/unaligned.h>
 #include "fat.h"
 
|
1167
fs/fs-writeback.c
File diff suppressed because it is too large
@@ -1445,9 +1445,9 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
 
 	list_del(&req->writepages_entry);
 	for (i = 0; i < req->num_pages; i++) {
-		dec_bdi_stat(bdi, BDI_WRITEBACK);
+		dec_wb_stat(&bdi->wb, WB_WRITEBACK);
 		dec_zone_page_state(req->pages[i], NR_WRITEBACK_TEMP);
-		bdi_writeout_inc(bdi);
+		wb_writeout_inc(&bdi->wb);
 	}
 	wake_up(&fi->page_waitq);
 }
@@ -1634,7 +1634,7 @@ static int fuse_writepage_locked(struct page *page)
 	req->end = fuse_writepage_end;
 	req->inode = inode;
 
-	inc_bdi_stat(inode_to_bdi(inode), BDI_WRITEBACK);
+	inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
 	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
 
 	spin_lock(&fc->lock);
@@ -1749,9 +1749,9 @@ static bool fuse_writepage_in_flight(struct fuse_req *new_req,
 		copy_highpage(old_req->pages[0], page);
 		spin_unlock(&fc->lock);
 
-		dec_bdi_stat(bdi, BDI_WRITEBACK);
+		dec_wb_stat(&bdi->wb, WB_WRITEBACK);
 		dec_zone_page_state(page, NR_WRITEBACK_TEMP);
-		bdi_writeout_inc(bdi);
+		wb_writeout_inc(&bdi->wb);
 		fuse_writepage_free(fc, new_req);
 		fuse_request_free(new_req);
 		goto out;
@@ -1848,7 +1848,7 @@ static int fuse_writepages_fill(struct page *page,
 	req->page_descs[req->num_pages].offset = 0;
 	req->page_descs[req->num_pages].length = PAGE_SIZE;
 
-	inc_bdi_stat(inode_to_bdi(inode), BDI_WRITEBACK);
+	inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
 	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
 
 	err = 0;
|
@@ -748,7 +748,7 @@ static int gfs2_write_inode(struct inode *inode, struct writeback_control *wbc)
 
 	if (wbc->sync_mode == WB_SYNC_ALL)
 		gfs2_log_flush(GFS2_SB(inode), ip->i_gl, NORMAL_FLUSH);
-	if (bdi->dirty_exceeded)
+	if (bdi->wb.dirty_exceeded)
 		gfs2_ail1_flush(sdp, wbc);
 	else
 		filemap_fdatawrite(metamapping);
|
@@ -14,6 +14,7 @@
 
 #include <linux/module.h>
 #include <linux/blkdev.h>
+#include <linux/backing-dev.h>
 #include <linux/mount.h>
 #include <linux/init.h>
 #include <linux/nls.h>
|
@@ -11,6 +11,7 @@
 #include <linux/init.h>
 #include <linux/pagemap.h>
 #include <linux/blkdev.h>
+#include <linux/backing-dev.h>
 #include <linux/fs.h>
 #include <linux/slab.h>
 #include <linux/vfs.h>
|
@@ -224,6 +224,7 @@ EXPORT_SYMBOL(free_inode_nonrcu);
 void __destroy_inode(struct inode *inode)
 {
 	BUG_ON(inode_has_buffers(inode));
+	inode_detach_wb(inode);
 	security_inode_free(inode);
 	fsnotify_inode_delete(inode);
 	locks_free_lock_context(inode->i_flctx);
|
@@ -605,6 +605,8 @@ alloc_new:
 				bio_get_nr_vecs(bdev), GFP_NOFS|__GFP_HIGH);
 		if (bio == NULL)
 			goto confused;
+
+		wbc_init_bio(wbc, bio);
 	}
 
 	/*
@@ -612,6 +614,7 @@ alloc_new:
 	 * the confused fail path above (OOM) will be very confused when
 	 * it finds all bh marked clean (i.e. it will not write anything)
 	 */
+	wbc_account_io(wbc, page, PAGE_SIZE);
 	length = first_unmapped << blkbits;
 	if (bio_add_page(bio, page, length, 0) < length) {
 		bio = mpage_bio_submit(WRITE, bio);
|
@@ -32,6 +32,7 @@
 #include <linux/nfs_fs.h>
 #include <linux/nfs_page.h>
 #include <linux/module.h>
+#include <linux/backing-dev.h>
 
 #include <linux/sunrpc/metrics.h>
 
|
@@ -607,7 +607,7 @@ void nfs_mark_page_unstable(struct page *page)
 	struct inode *inode = page_file_mapping(page)->host;
 
 	inc_zone_page_state(page, NR_UNSTABLE_NFS);
-	inc_bdi_stat(inode_to_bdi(inode), BDI_RECLAIMABLE);
+	inc_wb_stat(&inode_to_bdi(inode)->wb, WB_RECLAIMABLE);
 	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
 }
 
|
@@ -853,7 +853,8 @@ static void
 nfs_clear_page_commit(struct page *page)
 {
 	dec_zone_page_state(page, NR_UNSTABLE_NFS);
-	dec_bdi_stat(inode_to_bdi(page_file_mapping(page)->host), BDI_RECLAIMABLE);
+	dec_wb_stat(&inode_to_bdi(page_file_mapping(page)->host)->wb,
+		    WB_RECLAIMABLE);
 }
 
 /* Called holding inode (/cinfo) lock */
|
@@ -37,6 +37,7 @@
 #include <linux/falloc.h>
 #include <linux/quotaops.h>
 #include <linux/blkdev.h>
+#include <linux/backing-dev.h>
 
 #include <cluster/masklog.h>
 
|
@@ -21,6 +21,7 @@
 #include "xattr.h"
 #include <linux/init.h>
 #include <linux/blkdev.h>
+#include <linux/backing-dev.h>
 #include <linux/buffer_head.h>
 #include <linux/exportfs.h>
 #include <linux/quotaops.h>
|
@@ -80,6 +80,7 @@
 #include <linux/stat.h>
 #include <linux/string.h>
 #include <linux/blkdev.h>
+#include <linux/backing-dev.h>
 #include <linux/init.h>
 #include <linux/parser.h>
 #include <linux/buffer_head.h>
|
@@ -1873,6 +1873,7 @@ xfs_vm_set_page_dirty(
 	loff_t			end_offset;
 	loff_t			offset;
 	int			newly_dirty;
+	struct mem_cgroup	*memcg;
 
 	if (unlikely(!mapping))
 		return !TestSetPageDirty(page);
@@ -1892,6 +1893,11 @@ xfs_vm_set_page_dirty(
 			offset += 1 << inode->i_blkbits;
 		} while (bh != head);
 	}
+	/*
+	 * Use mem_cgroup_begin_page_stat() to keep PageDirty synchronized with
+	 * per-memcg dirty page counters.
+	 */
+	memcg = mem_cgroup_begin_page_stat(page);
 	newly_dirty = !TestSetPageDirty(page);
 	spin_unlock(&mapping->private_lock);
 
@@ -1902,13 +1908,15 @@ xfs_vm_set_page_dirty(
 		spin_lock_irqsave(&mapping->tree_lock, flags);
 		if (page->mapping) {	/* Race with truncate? */
 			WARN_ON_ONCE(!PageUptodate(page));
-			account_page_dirtied(page, mapping);
+			account_page_dirtied(page, mapping, memcg);
 			radix_tree_tag_set(&mapping->page_tree,
 					page_index(page), PAGECACHE_TAG_DIRTY);
 		}
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
-		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 	}
+	mem_cgroup_end_page_stat(memcg);
+	if (newly_dirty)
+		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 	return newly_dirty;
 }
 
|
@@ -41,6 +41,7 @@
 #include <linux/dcache.h>
 #include <linux/falloc.h>
 #include <linux/pagevec.h>
+#include <linux/backing-dev.h>
 
 static const struct vm_operations_struct xfs_file_vm_ops;
 
|
255
include/linux/backing-dev-defs.h
Normal file
@@ -0,0 +1,255 @@
+#ifndef __LINUX_BACKING_DEV_DEFS_H
+#define __LINUX_BACKING_DEV_DEFS_H
+
+#include <linux/list.h>
+#include <linux/radix-tree.h>
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/percpu_counter.h>
+#include <linux/percpu-refcount.h>
+#include <linux/flex_proportions.h>
+#include <linux/timer.h>
+#include <linux/workqueue.h>
+
+struct page;
+struct device;
+struct dentry;
+
+/*
+ * Bits in bdi_writeback.state
+ */
+enum wb_state {
+	WB_registered,		/* bdi_register() was done */
+	WB_writeback_running,	/* Writeback is in progress */
+	WB_has_dirty_io,	/* Dirty inodes on ->b_{dirty|io|more_io} */
+};
+
+enum wb_congested_state {
+	WB_async_congested,	/* The async (write) queue is getting full */
+	WB_sync_congested,	/* The sync queue is getting full */
+};
+
+typedef int (congested_fn)(void *, int);
+
+enum wb_stat_item {
+	WB_RECLAIMABLE,
+	WB_WRITEBACK,
+	WB_DIRTIED,
+	WB_WRITTEN,
+	NR_WB_STAT_ITEMS
+};
+
+#define WB_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
+
+/*
+ * For cgroup writeback, multiple wb's may map to the same blkcg. Those
+ * wb's can operate mostly independently but should share the congested
+ * state. To facilitate such sharing, the congested state is tracked using
+ * the following struct which is created on demand, indexed by blkcg ID on
+ * its bdi, and refcounted.
+ */
+struct bdi_writeback_congested {
+	unsigned long state;		/* WB_[a]sync_congested flags */
+
+#ifdef CONFIG_CGROUP_WRITEBACK
+	struct backing_dev_info *bdi;	/* the associated bdi */
+	atomic_t refcnt;		/* nr of attached wb's and blkg */
+	int blkcg_id;			/* ID of the associated blkcg */
+	struct rb_node rb_node;		/* on bdi->cgwb_congested_tree */
+#endif
+};
+
+/*
+ * Each wb (bdi_writeback) can perform writeback operations, is measured
+ * and throttled, independently. Without cgroup writeback, each bdi
+ * (backing_dev_info) is served by its embedded bdi->wb.
+ *
+ * On the default hierarchy, blkcg implicitly enables memcg. This allows
+ * using memcg's page ownership for attributing writeback IOs, and every
+ * memcg - blkcg combination can be served by its own wb by assigning a
+ * dedicated wb to each memcg, which enables isolation across different
+ * cgroups and propagation of IO back pressure down from the IO layer up to
+ * the tasks which are generating the dirty pages to be written back.
+ *
+ * A cgroup wb is indexed on its bdi by the ID of the associated memcg,
+ * refcounted with the number of inodes attached to it, and pins the memcg
+ * and the corresponding blkcg. As the corresponding blkcg for a memcg may
+ * change as blkcg is disabled and enabled higher up in the hierarchy, a wb
+ * is tested for blkcg after lookup and removed from index on mismatch so
+ * that a new wb for the combination can be created.
+ */
+struct bdi_writeback {
+	struct backing_dev_info *bdi;	/* our parent bdi */
+
+	unsigned long state;		/* Always use atomic bitops on this */
+	unsigned long last_old_flush;	/* last old data flush */
+
+	struct list_head b_dirty;	/* dirty inodes */
+	struct list_head b_io;		/* parked for writeback */
+	struct list_head b_more_io;	/* parked for more writeback */
+	struct list_head b_dirty_time;	/* time stamps are dirty */
+	spinlock_t list_lock;		/* protects the b_* lists */
+
+	struct percpu_counter stat[NR_WB_STAT_ITEMS];
+
+	struct bdi_writeback_congested *congested;
+
+	unsigned long bw_time_stamp;	/* last time write bw is updated */
+	unsigned long dirtied_stamp;
+	unsigned long written_stamp;	/* pages written at bw_time_stamp */
+	unsigned long write_bandwidth;	/* the estimated write bandwidth */
+	unsigned long avg_write_bandwidth; /* further smoothed write bw, > 0 */
+
+	/*
+	 * The base dirty throttle rate, recalculated every 200ms.
+	 * All the bdi tasks' dirty rate will be curbed under it.
+	 * @dirty_ratelimit tracks the estimated @balanced_dirty_ratelimit
+	 * in small steps and is much more smooth/stable than the latter.
+	 */
+	unsigned long dirty_ratelimit;
+	unsigned long balanced_dirty_ratelimit;
+
+	struct fprop_local_percpu completions;
+	int dirty_exceeded;
+
+	spinlock_t work_lock;		/* protects work_list & dwork scheduling */
+	struct list_head work_list;
+	struct delayed_work dwork;	/* work item used for writeback */
+
+#ifdef CONFIG_CGROUP_WRITEBACK
+	struct percpu_ref refcnt;	/* used only for !root wb's */
+	struct fprop_local_percpu memcg_completions;
+	struct cgroup_subsys_state *memcg_css; /* the associated memcg */
+	struct cgroup_subsys_state *blkcg_css; /* and blkcg */
+	struct list_head memcg_node;	/* anchored at memcg->cgwb_list */
+	struct list_head blkcg_node;	/* anchored at blkcg->cgwb_list */
+
+	union {
+		struct work_struct release_work;
+		struct rcu_head rcu;
+	};
+#endif
+};
+
+struct backing_dev_info {
+	struct list_head bdi_list;
+	unsigned long ra_pages;	/* max readahead in PAGE_CACHE_SIZE units */
+	unsigned int capabilities; /* Device capabilities */
+	congested_fn *congested_fn; /* Function pointer if device is md/dm */
+	void *congested_data;	/* Pointer to aux data for congested func */
+
+	char *name;
+
+	unsigned int min_ratio;
+	unsigned int max_ratio, max_prop_frac;
+
+	/*
+	 * Sum of avg_write_bw of wbs with dirty inodes. > 0 if there are
+	 * any dirty wbs, which bdi_has_dirty_io() depends upon.
+	 */
+	atomic_long_t tot_write_bandwidth;
+
+	struct bdi_writeback wb;  /* the root writeback info for this bdi */
+	struct bdi_writeback_congested wb_congested; /* its congested state */
+#ifdef CONFIG_CGROUP_WRITEBACK
+	struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */
+	struct rb_root cgwb_congested_tree; /* their congested states */
+	atomic_t usage_cnt; /* counts both cgwbs and cgwb_congested's */
+#endif
+	wait_queue_head_t wb_waitq;
+
+	struct device *dev;
+
+	struct timer_list laptop_mode_wb_timer;
+
+#ifdef CONFIG_DEBUG_FS
+	struct dentry *debug_dir;
+	struct dentry *debug_stats;
+#endif
+};
+
+enum {
+	BLK_RW_ASYNC	= 0,
+	BLK_RW_SYNC	= 1,
+};
+
+void clear_wb_congested(struct bdi_writeback_congested *congested, int sync);
+void set_wb_congested(struct bdi_writeback_congested *congested, int sync);
+
+static inline void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
+{
+	clear_wb_congested(bdi->wb.congested, sync);
+}
+
+static inline void set_bdi_congested(struct backing_dev_info *bdi, int sync)
+{
+	set_wb_congested(bdi->wb.congested, sync);
+}
+
+#ifdef CONFIG_CGROUP_WRITEBACK
+
+/**
+ * wb_tryget - try to increment a wb's refcount
+ * @wb: bdi_writeback to get
+ */
+static inline bool wb_tryget(struct bdi_writeback *wb)
+{
+	if (wb != &wb->bdi->wb)
+		return percpu_ref_tryget(&wb->refcnt);
+	return true;
+}
+
+/**
+ * wb_get - increment a wb's refcount
+ * @wb: bdi_writeback to get
+ */
+static inline void wb_get(struct bdi_writeback *wb)
+{
+	if (wb != &wb->bdi->wb)
+		percpu_ref_get(&wb->refcnt);
+}
+
+/**
+ * wb_put - decrement a wb's refcount
+ * @wb: bdi_writeback to put
+ */
+static inline void wb_put(struct bdi_writeback *wb)
+{
+	if (wb != &wb->bdi->wb)
+		percpu_ref_put(&wb->refcnt);
+}
+
+/**
+ * wb_dying - is a wb dying?
+ * @wb: bdi_writeback of interest
+ *
+ * Returns whether @wb is unlinked and being drained.
+ */
+static inline bool wb_dying(struct bdi_writeback *wb)
+{
+	return percpu_ref_is_dying(&wb->refcnt);
+}
+
+#else	/* CONFIG_CGROUP_WRITEBACK */
+
+static inline bool wb_tryget(struct bdi_writeback *wb)
+{
+	return true;
+}
+
+static inline void wb_get(struct bdi_writeback *wb)
+{
+}
+
+static inline void wb_put(struct bdi_writeback *wb)
+{
+}
+
+static inline bool wb_dying(struct bdi_writeback *wb)
+{
+	return false;
+}
+
+#endif	/* CONFIG_CGROUP_WRITEBACK */
+
+#endif	/* __LINUX_BACKING_DEV_DEFS_H */
|
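Review note: wb_tryget(), wb_get() and wb_put() in the new header skip refcounting for the root wb, which is embedded in its bdi and therefore lives exactly as long as the bdi does. Only cgroup wbs are counted. The following single-threaded user-space model sketches that rule under simplifying assumptions: a plain `long` stands in for `percpu_ref`, and every `*_model` name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

struct bdi_model;

/* Simplified wb: a plain counter stands in for the kernel's percpu_ref. */
struct wb_model {
	struct bdi_model *bdi;
	long refcnt;
};

struct bdi_model {
	struct wb_model wb;	/* embedded root wb, never refcounted */
};

static bool wb_model_is_root(const struct wb_model *wb)
{
	return wb == &wb->bdi->wb;
}

/* Take a reference unless the count already dropped to zero. */
bool wb_model_tryget(struct wb_model *wb)
{
	if (wb_model_is_root(wb))
		return true;
	if (wb->refcnt == 0)
		return false;
	wb->refcnt++;
	return true;
}

/* Drop a reference; a no-op for the root wb. */
void wb_model_put(struct wb_model *wb)
{
	if (wb_model_is_root(wb))
		return;
	assert(wb->refcnt > 0);
	wb->refcnt--;
}
```

Because the root check lives inside the helpers, callers can get and put any wb uniformly without first checking whether it is the embedded one.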
@@ -8,106 +8,13 @@
 #ifndef _LINUX_BACKING_DEV_H
 #define _LINUX_BACKING_DEV_H
 
-#include <linux/percpu_counter.h>
-#include <linux/log2.h>
-#include <linux/flex_proportions.h>
 #include <linux/kernel.h>
 #include <linux/fs.h>
 #include <linux/sched.h>
-#include <linux/timer.h>
+#include <linux/blkdev.h>
 #include <linux/writeback.h>
-#include <linux/atomic.h>
-#include <linux/sysctl.h>
-#include <linux/workqueue.h>
-
-struct page;
-struct device;
-struct dentry;
-
-/*
- * Bits in backing_dev_info.state
- */
-enum bdi_state {
-	BDI_async_congested,	/* The async (write) queue is getting full */
-	BDI_sync_congested,	/* The sync queue is getting full */
-	BDI_registered,		/* bdi_register() was done */
-	BDI_writeback_running,	/* Writeback is in progress */
-};
-
-typedef int (congested_fn)(void *, int);
-
-enum bdi_stat_item {
-	BDI_RECLAIMABLE,
-	BDI_WRITEBACK,
-	BDI_DIRTIED,
-	BDI_WRITTEN,
-	NR_BDI_STAT_ITEMS
-};
-
-#define BDI_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
-
-struct bdi_writeback {
-	struct backing_dev_info *bdi;	/* our parent bdi */
-
-	unsigned long last_old_flush;	/* last old data flush */
-
-	struct delayed_work dwork;	/* work item used for writeback */
-	struct list_head b_dirty;	/* dirty inodes */
-	struct list_head b_io;		/* parked for writeback */
-	struct list_head b_more_io;	/* parked for more writeback */
-	struct list_head b_dirty_time;	/* time stamps are dirty */
-	spinlock_t list_lock;		/* protects the b_* lists */
-};
-
-struct backing_dev_info {
-	struct list_head bdi_list;
-	unsigned long ra_pages;	/* max readahead in PAGE_CACHE_SIZE units */
-	unsigned long state;	/* Always use atomic bitops on this */
-	unsigned int capabilities; /* Device capabilities */
-	congested_fn *congested_fn; /* Function pointer if device is md/dm */
-	void *congested_data;	/* Pointer to aux data for congested func */
-
-	char *name;
-
-	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
-
-	unsigned long bw_time_stamp;	/* last time write bw is updated */
-	unsigned long dirtied_stamp;
-	unsigned long written_stamp;	/* pages written at bw_time_stamp */
-	unsigned long write_bandwidth;	/* the estimated write bandwidth */
-	unsigned long avg_write_bandwidth; /* further smoothed write bw */
-
-	/*
-	 * The base dirty throttle rate, re-calculated on every 200ms.
-	 * All the bdi tasks' dirty rate will be curbed under it.
-	 * @dirty_ratelimit tracks the estimated @balanced_dirty_ratelimit
-	 * in small steps and is much more smooth/stable than the latter.
-	 */
-	unsigned long dirty_ratelimit;
-	unsigned long balanced_dirty_ratelimit;
-
-	struct fprop_local_percpu completions;
-	int dirty_exceeded;
-
-	unsigned int min_ratio;
-	unsigned int max_ratio, max_prop_frac;
-
-	struct bdi_writeback wb;  /* default writeback info for this bdi */
-	spinlock_t wb_lock;	  /* protects work_list & wb.dwork scheduling */
-
-	struct list_head work_list;
-
-	struct device *dev;
-
-	struct timer_list laptop_mode_wb_timer;
-
-#ifdef CONFIG_DEBUG_FS
-	struct dentry *debug_dir;
-	struct dentry *debug_stats;
-#endif
-};
-
-struct backing_dev_info *inode_to_bdi(struct inode *inode);
+#include <linux/blk-cgroup.h>
+#include <linux/backing-dev-defs.h>
 
 int __must_check bdi_init(struct backing_dev_info *bdi);
 void bdi_destroy(struct backing_dev_info *bdi);
@@ -117,97 +24,99 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
 		const char *fmt, ...);
 int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev);
 int __must_check bdi_setup_and_register(struct backing_dev_info *, char *);
-void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
-			enum wb_reason reason);
-void bdi_start_background_writeback(struct backing_dev_info *bdi);
-void bdi_writeback_workfn(struct work_struct *work);
-int bdi_has_dirty_io(struct backing_dev_info *bdi);
-void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
+void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
+			bool range_cyclic, enum wb_reason reason);
+void wb_start_background_writeback(struct bdi_writeback *wb);
+void wb_workfn(struct work_struct *work);
+void wb_wakeup_delayed(struct bdi_writeback *wb);
 
 extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
 
 extern struct workqueue_struct *bdi_wq;
 
-static inline int wb_has_dirty_io(struct bdi_writeback *wb)
+static inline bool wb_has_dirty_io(struct bdi_writeback *wb)
 {
-	return !list_empty(&wb->b_dirty) ||
-	       !list_empty(&wb->b_io) ||
-	       !list_empty(&wb->b_more_io);
+	return test_bit(WB_has_dirty_io, &wb->state);
 }
 
-static inline void __add_bdi_stat(struct backing_dev_info *bdi,
-		enum bdi_stat_item item, s64 amount)
+static inline bool bdi_has_dirty_io(struct backing_dev_info *bdi)
 {
-	__percpu_counter_add(&bdi->bdi_stat[item], amount, BDI_STAT_BATCH);
+	/*
+	 * @bdi->tot_write_bandwidth is guaranteed to be > 0 if there are
+	 * any dirty wbs. See wb_update_write_bandwidth().
+	 */
+	return atomic_long_read(&bdi->tot_write_bandwidth);
 }
 
-static inline void __inc_bdi_stat(struct backing_dev_info *bdi,
-		enum bdi_stat_item item)
+static inline void __add_wb_stat(struct bdi_writeback *wb,
+				 enum wb_stat_item item, s64 amount)
 {
-	__add_bdi_stat(bdi, item, 1);
+	__percpu_counter_add(&wb->stat[item], amount, WB_STAT_BATCH);
 }
 
-static inline void inc_bdi_stat(struct backing_dev_info *bdi,
-		enum bdi_stat_item item)
+static inline void __inc_wb_stat(struct bdi_writeback *wb,
+				 enum wb_stat_item item)
+{
+	__add_wb_stat(wb, item, 1);
+}
+
+static inline void inc_wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
 {
 	unsigned long flags;
 
 	local_irq_save(flags);
-	__inc_bdi_stat(bdi, item);
+	__inc_wb_stat(wb, item);
 	local_irq_restore(flags);
 }
 
-static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
-		enum bdi_stat_item item)
+static inline void __dec_wb_stat(struct bdi_writeback *wb,
+				 enum wb_stat_item item)
 {
-	__add_bdi_stat(bdi, item, -1);
+	__add_wb_stat(wb, item, -1);
 }
 
-static inline void dec_bdi_stat(struct backing_dev_info *bdi,
-		enum bdi_stat_item item)
+static inline void dec_wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
 {
 	unsigned long flags;
 
 	local_irq_save(flags);
-	__dec_bdi_stat(bdi, item);
+	__dec_wb_stat(wb, item);
 	local_irq_restore(flags);
 }
 
-static inline s64 bdi_stat(struct backing_dev_info *bdi,
-		enum bdi_stat_item item)
+static inline s64 wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
 {
-	return percpu_counter_read_positive(&bdi->bdi_stat[item]);
+	return percpu_counter_read_positive(&wb->stat[item]);
 }
 
-static inline s64 __bdi_stat_sum(struct backing_dev_info *bdi,
-		enum bdi_stat_item item)
+static inline s64 __wb_stat_sum(struct bdi_writeback *wb,
+				enum wb_stat_item item)
 {
-	return percpu_counter_sum_positive(&bdi->bdi_stat[item]);
+	return percpu_counter_sum_positive(&wb->stat[item]);
|
||||
}
|
||||
|
||||
static inline s64 bdi_stat_sum(struct backing_dev_info *bdi,
|
||||
enum bdi_stat_item item)
|
||||
static inline s64 wb_stat_sum(struct bdi_writeback *wb, enum wb_stat_item item)
|
||||
{
|
||||
s64 sum;
|
||||
unsigned long flags;
|
||||
|
||||
local_irq_save(flags);
|
||||
sum = __bdi_stat_sum(bdi, item);
|
||||
sum = __wb_stat_sum(wb, item);
|
||||
local_irq_restore(flags);
|
||||
|
||||
return sum;
|
||||
}
|
||||
|
||||
extern void bdi_writeout_inc(struct backing_dev_info *bdi);
|
||||
extern void wb_writeout_inc(struct bdi_writeback *wb);
|
||||
|
||||
/*
|
||||
* maximal error of a stat counter.
|
||||
*/
|
||||
static inline unsigned long bdi_stat_error(struct backing_dev_info *bdi)
|
||||
static inline unsigned long wb_stat_error(struct bdi_writeback *wb)
|
||||
{
|
||||
#ifdef CONFIG_SMP
|
||||
return nr_cpu_ids * BDI_STAT_BATCH;
|
||||
return nr_cpu_ids * WB_STAT_BATCH;
|
||||
#else
|
||||
return 1;
|
||||
#endif
|
||||
@ -231,50 +140,57 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
|
||||
* BDI_CAP_NO_WRITEBACK: Don't write pages back
|
||||
* BDI_CAP_NO_ACCT_WB: Don't automatically account writeback pages
|
||||
* BDI_CAP_STRICTLIMIT: Keep number of dirty pages below bdi threshold.
|
||||
*
|
||||
* BDI_CAP_CGROUP_WRITEBACK: Supports cgroup-aware writeback.
|
||||
*/
|
||||
#define BDI_CAP_NO_ACCT_DIRTY 0x00000001
|
||||
#define BDI_CAP_NO_WRITEBACK 0x00000002
|
||||
#define BDI_CAP_NO_ACCT_WB 0x00000004
|
||||
#define BDI_CAP_STABLE_WRITES 0x00000008
|
||||
#define BDI_CAP_STRICTLIMIT 0x00000010
|
||||
#define BDI_CAP_CGROUP_WRITEBACK 0x00000020
|
||||
|
||||
#define BDI_CAP_NO_ACCT_AND_WRITEBACK \
|
||||
(BDI_CAP_NO_WRITEBACK | BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_ACCT_WB)
|
||||
|
||||
extern struct backing_dev_info noop_backing_dev_info;
|
||||
|
||||
int writeback_in_progress(struct backing_dev_info *bdi);
|
||||
|
||||
static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
|
||||
/**
|
||||
* writeback_in_progress - determine whether there is writeback in progress
|
||||
* @wb: bdi_writeback of interest
|
||||
*
|
||||
* Determine whether there is writeback waiting to be handled against a
|
||||
* bdi_writeback.
|
||||
*/
|
||||
static inline bool writeback_in_progress(struct bdi_writeback *wb)
|
||||
{
|
||||
return test_bit(WB_writeback_running, &wb->state);
|
||||
}
|
||||
|
||||
static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
|
||||
{
|
||||
struct super_block *sb;
|
||||
|
||||
if (!inode)
|
||||
return &noop_backing_dev_info;
|
||||
|
||||
sb = inode->i_sb;
|
||||
#ifdef CONFIG_BLOCK
|
||||
if (sb_is_blkdev_sb(sb))
|
||||
return blk_get_backing_dev_info(I_BDEV(inode));
|
||||
#endif
|
||||
return sb->s_bdi;
|
||||
}
|
||||
|
||||
static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
|
||||
{
|
||||
struct backing_dev_info *bdi = wb->bdi;
|
||||
|
||||
if (bdi->congested_fn)
|
||||
return bdi->congested_fn(bdi->congested_data, bdi_bits);
|
||||
return (bdi->state & bdi_bits);
|
||||
return bdi->congested_fn(bdi->congested_data, cong_bits);
|
||||
return wb->congested->state & cong_bits;
|
||||
}
|
||||
|
||||
static inline int bdi_read_congested(struct backing_dev_info *bdi)
|
||||
{
|
||||
return bdi_congested(bdi, 1 << BDI_sync_congested);
|
||||
}
|
||||
|
||||
static inline int bdi_write_congested(struct backing_dev_info *bdi)
|
||||
{
|
||||
return bdi_congested(bdi, 1 << BDI_async_congested);
|
||||
}
|
||||
|
||||
static inline int bdi_rw_congested(struct backing_dev_info *bdi)
|
||||
{
|
||||
return bdi_congested(bdi, (1 << BDI_sync_congested) |
|
||||
(1 << BDI_async_congested));
|
||||
}
|
||||
|
||||
enum {
|
||||
BLK_RW_ASYNC = 0,
|
||||
BLK_RW_SYNC = 1,
|
||||
};
|
||||
|
||||
void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
|
||||
void set_bdi_congested(struct backing_dev_info *bdi, int sync);
|
||||
long congestion_wait(int sync, long timeout);
|
||||
long wait_iff_congested(struct zone *zone, int sync, long timeout);
|
||||
int pdflush_proc_obsolete(struct ctl_table *table, int write,
|
||||
@ -318,4 +234,333 @@ static inline int bdi_sched_wait(void *word)
|
||||
return 0;
|
||||
}
|
||||
|
||||
#endif /* _LINUX_BACKING_DEV_H */
|
||||
#ifdef CONFIG_CGROUP_WRITEBACK
|
||||
|
||||
struct bdi_writeback_congested *
|
||||
wb_congested_get_create(struct backing_dev_info *bdi, int blkcg_id, gfp_t gfp);
|
||||
void wb_congested_put(struct bdi_writeback_congested *congested);
|
||||
struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
|
||||
struct cgroup_subsys_state *memcg_css,
|
||||
gfp_t gfp);
|
||||
void wb_memcg_offline(struct mem_cgroup *memcg);
|
||||
void wb_blkcg_offline(struct blkcg *blkcg);
|
||||
int inode_congested(struct inode *inode, int cong_bits);
|
||||
|
||||
/**
|
||||
* inode_cgwb_enabled - test whether cgroup writeback is enabled on an inode
|
||||
* @inode: inode of interest
|
||||
*
|
||||
* cgroup writeback requires support from both the bdi and filesystem.
|
||||
* Test whether @inode has both.
|
||||
*/
|
||||
static inline bool inode_cgwb_enabled(struct inode *inode)
|
||||
{
|
||||
struct backing_dev_info *bdi = inode_to_bdi(inode);
|
||||
|
||||
return bdi_cap_account_dirty(bdi) &&
|
||||
(bdi->capabilities & BDI_CAP_CGROUP_WRITEBACK) &&
|
||||
(inode->i_sb->s_iflags & SB_I_CGROUPWB);
|
||||
}
|
||||
|
||||
/**
|
||||
* wb_find_current - find wb for %current on a bdi
|
||||
* @bdi: bdi of interest
|
||||
*
|
||||
* Find the wb of @bdi which matches both the memcg and blkcg of %current.
|
||||
* Must be called under rcu_read_lock() which protects the returned wb.
|
||||
* NULL if not found.
|
||||
*/
|
||||
static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi)
|
||||
{
|
||||
struct cgroup_subsys_state *memcg_css;
|
||||
struct bdi_writeback *wb;
|
||||
|
||||
memcg_css = task_css(current, memory_cgrp_id);
|
||||
if (!memcg_css->parent)
|
||||
return &bdi->wb;
|
||||
|
||||
wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id);
|
||||
|
||||
/*
|
||||
* %current's blkcg equals the effective blkcg of its memcg. No
|
||||
* need to use the relatively expensive cgroup_get_e_css().
|
||||
*/
|
||||
if (likely(wb && wb->blkcg_css == task_css(current, blkio_cgrp_id)))
|
||||
return wb;
|
||||
return NULL;
|
||||
}
|
||||
|
||||
/**
|
||||
* wb_get_create_current - get or create wb for %current on a bdi
|
||||
* @bdi: bdi of interest
|
||||
* @gfp: allocation mask
|
||||
*
|
||||
* Equivalent to wb_get_create() on %current's memcg. This function is
|
||||
* called from a relatively hot path and optimizes the common cases using
|
||||
* wb_find_current().
|
||||
*/
|
||||
static inline struct bdi_writeback *
|
||||
wb_get_create_current(struct backing_dev_info *bdi, gfp_t gfp)
|
||||
{
|
||||
struct bdi_writeback *wb;
|
||||
|
||||
rcu_read_lock();
|
||||
wb = wb_find_current(bdi);
|
||||
if (wb && unlikely(!wb_tryget(wb)))
|
||||
wb = NULL;
|
||||
rcu_read_unlock();
|
||||
|
||||
if (unlikely(!wb)) {
|
||||
struct cgroup_subsys_state *memcg_css;
|
||||
|
||||
memcg_css = task_get_css(current, memory_cgrp_id);
|
||||
wb = wb_get_create(bdi, memcg_css, gfp);
|
||||
css_put(memcg_css);
|
||||
}
|
||||
return wb;
|
||||
}
|
||||
|
||||
/**
|
||||
* inode_to_wb_is_valid - test whether an inode has a wb associated
|
||||
* @inode: inode of interest
|
||||
*
|
||||
* Returns %true if @inode has a wb associated. May be called without any
|
||||
* locking.
|
||||
*/
|
||||
static inline bool inode_to_wb_is_valid(struct inode *inode)
|
||||
{
|
||||
return inode->i_wb;
|
||||
}
|
||||
|
||||
/**
|
||||
* inode_to_wb - determine the wb of an inode
|
||||
* @inode: inode of interest
|
||||
*
|
||||
* Returns the wb @inode is currently associated with. The caller must be
|
||||
* holding either @inode->i_lock, @inode->i_mapping->tree_lock, or the
|
||||
* associated wb's list_lock.
|
||||
*/
|
||||
static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
|
||||
{
|
||||
#ifdef CONFIG_LOCKDEP
|
||||
WARN_ON_ONCE(debug_locks &&
|
||||
(!lockdep_is_held(&inode->i_lock) &&
|
||||
!lockdep_is_held(&inode->i_mapping->tree_lock) &&
|
||||
!lockdep_is_held(&inode->i_wb->list_lock)));
|
||||
#endif
|
||||
return inode->i_wb;
|
||||
}
|
||||
|
||||
/**
|
||||
* unlocked_inode_to_wb_begin - begin unlocked inode wb access transaction
|
||||
* @inode: target inode
|
||||
* @lockedp: temp bool output param, to be passed to the end function
|
||||
*
|
||||
* The caller wants to access the wb associated with @inode but isn't
|
||||
* holding inode->i_lock, mapping->tree_lock or wb->list_lock. This
|
||||
* function determines the wb associated with @inode and ensures that the
|
||||
* association doesn't change until the transaction is finished with
|
||||
* unlocked_inode_to_wb_end().
|
||||
*
|
||||
* The caller must call unlocked_inode_to_wb_end() with *@lockedp
|
||||
* afterwards and can't sleep during transaction. IRQ may or may not be
|
||||
* disabled on return.
|
||||
*/
|
||||
static inline struct bdi_writeback *
|
||||
unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
|
||||
{
|
||||
rcu_read_lock();
|
||||
|
||||
/*
|
||||
* Paired with store_release in inode_switch_wb_work_fn() and
|
||||
* ensures that we see the new wb if we see cleared I_WB_SWITCH.
|
||||
*/
|
||||
*lockedp = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;
|
||||
|
||||
if (unlikely(*lockedp))
|
||||
spin_lock_irq(&inode->i_mapping->tree_lock);
|
||||
|
||||
/*
|
||||
* Protected by either !I_WB_SWITCH + rcu_read_lock() or tree_lock.
|
||||
* inode_to_wb() will bark. Deref directly.
|
||||
*/
|
||||
return inode->i_wb;
|
||||
}
|
||||
|
||||
/**
|
||||
* unlocked_inode_to_wb_end - end inode wb access transaction
|
||||
* @inode: target inode
|
||||
* @locked: *@lockedp from unlocked_inode_to_wb_begin()
|
||||
*/
|
||||
static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
|
||||
{
|
||||
if (unlikely(locked))
|
||||
spin_unlock_irq(&inode->i_mapping->tree_lock);
|
||||
|
||||
rcu_read_unlock();
|
||||
}
|
||||
|
||||
struct wb_iter {
|
||||
int start_blkcg_id;
|
||||
struct radix_tree_iter tree_iter;
|
||||
void **slot;
|
||||
};
|
||||
|
||||
static inline struct bdi_writeback *__wb_iter_next(struct wb_iter *iter,
|
||||
struct backing_dev_info *bdi)
|
||||
{
|
||||
struct radix_tree_iter *titer = &iter->tree_iter;
|
||||
|
||||
WARN_ON_ONCE(!rcu_read_lock_held());
|
||||
|
||||
if (iter->start_blkcg_id >= 0) {
|
||||
iter->slot = radix_tree_iter_init(titer, iter->start_blkcg_id);
|
||||
iter->start_blkcg_id = -1;
|
||||
} else {
|
||||
iter->slot = radix_tree_next_slot(iter->slot, titer, 0);
|
||||
}
|
||||
|
||||
if (!iter->slot)
|
||||
iter->slot = radix_tree_next_chunk(&bdi->cgwb_tree, titer, 0);
|
||||
if (iter->slot)
|
||||
return *iter->slot;
|
||||
return NULL;
|
||||
}
|
||||
|
||||
static inline struct bdi_writeback *__wb_iter_init(struct wb_iter *iter,
|
||||
struct backing_dev_info *bdi,
|
||||
int start_blkcg_id)
|
||||
{
|
||||
iter->start_blkcg_id = start_blkcg_id;
|
||||
|
||||
if (start_blkcg_id)
|
||||
return __wb_iter_next(iter, bdi);
|
||||
else
|
||||
return &bdi->wb;
|
||||
}
|
||||
|
||||
/**
|
||||
* bdi_for_each_wb - walk all wb's of a bdi in ascending blkcg ID order
|
||||
* @wb_cur: cursor struct bdi_writeback pointer
|
||||
* @bdi: bdi to walk wb's of
|
||||
* @iter: pointer to struct wb_iter to be used as iteration buffer
|
||||
* @start_blkcg_id: blkcg ID to start iteration from
|
||||
*
|
||||
* Iterate @wb_cur through the wb's (bdi_writeback's) of @bdi in ascending
|
||||
* blkcg ID order starting from @start_blkcg_id. @iter is struct wb_iter
|
||||
* to be used as temp storage during iteration. rcu_read_lock() must be
|
||||
* held throughout iteration.
|
||||
*/
|
||||
#define bdi_for_each_wb(wb_cur, bdi, iter, start_blkcg_id) \
|
||||
for ((wb_cur) = __wb_iter_init(iter, bdi, start_blkcg_id); \
|
||||
(wb_cur); (wb_cur) = __wb_iter_next(iter, bdi))
|
||||
|
||||
#else /* CONFIG_CGROUP_WRITEBACK */
|
||||
|
||||
static inline bool inode_cgwb_enabled(struct inode *inode)
|
||||
{
|
||||
return false;
|
||||
}
|
||||
|
||||
static inline struct bdi_writeback_congested *
|
||||
wb_congested_get_create(struct backing_dev_info *bdi, int blkcg_id, gfp_t gfp)
|
||||
{
|
||||
return bdi->wb.congested;
|
||||
}
|
||||
|
||||
static inline void wb_congested_put(struct bdi_writeback_congested *congested)
|
||||
{
|
||||
}
|
||||
|
||||
static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi)
|
||||
{
|
||||
return &bdi->wb;
|
||||
}
|
||||
|
||||
static inline struct bdi_writeback *
|
||||
wb_get_create_current(struct backing_dev_info *bdi, gfp_t gfp)
|
||||
{
|
||||
return &bdi->wb;
|
||||
}
|
||||
|
||||
static inline bool inode_to_wb_is_valid(struct inode *inode)
|
||||
{
|
||||
return true;
|
||||
}
|
||||
|
||||
static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
|
||||
{
|
||||
return &inode_to_bdi(inode)->wb;
|
||||
}
|
||||
|
||||
static inline struct bdi_writeback *
|
||||
unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
|
||||
{
|
||||
return inode_to_wb(inode);
|
||||
}
|
||||
|
||||
static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
|
||||
{
|
||||
}
|
||||
|
||||
static inline void wb_memcg_offline(struct mem_cgroup *memcg)
|
||||
{
|
||||
}
|
||||
|
||||
static inline void wb_blkcg_offline(struct blkcg *blkcg)
|
||||
{
|
||||
}
|
||||
|
||||
struct wb_iter {
|
||||
int next_id;
|
||||
};
|
||||
|
||||
#define bdi_for_each_wb(wb_cur, bdi, iter, start_blkcg_id) \
|
||||
for ((iter)->next_id = (start_blkcg_id); \
|
||||
({ (wb_cur) = !(iter)->next_id++ ? &(bdi)->wb : NULL; }); )
|
||||
|
||||
static inline int inode_congested(struct inode *inode, int cong_bits)
|
||||
{
|
||||
return wb_congested(&inode_to_bdi(inode)->wb, cong_bits);
|
||||
}
|
||||
|
||||
#endif /* CONFIG_CGROUP_WRITEBACK */
|
||||
|
||||
static inline int inode_read_congested(struct inode *inode)
|
||||
{
|
||||
return inode_congested(inode, 1 << WB_sync_congested);
|
||||
}
|
||||
|
||||
static inline int inode_write_congested(struct inode *inode)
|
||||
{
|
||||
return inode_congested(inode, 1 << WB_async_congested);
|
||||
}
|
||||
|
||||
static inline int inode_rw_congested(struct inode *inode)
|
||||
{
|
||||
return inode_congested(inode, (1 << WB_sync_congested) |
|
||||
(1 << WB_async_congested));
|
||||
}
|
||||
|
||||
static inline int bdi_congested(struct backing_dev_info *bdi, int cong_bits)
|
||||
{
|
||||
return wb_congested(&bdi->wb, cong_bits);
|
||||
}
|
||||
|
||||
static inline int bdi_read_congested(struct backing_dev_info *bdi)
|
||||
{
|
||||
return bdi_congested(bdi, 1 << WB_sync_congested);
|
||||
}
|
||||
|
||||
static inline int bdi_write_congested(struct backing_dev_info *bdi)
|
||||
{
|
||||
return bdi_congested(bdi, 1 << WB_async_congested);
|
||||
}
|
||||
|
||||
static inline int bdi_rw_congested(struct backing_dev_info *bdi)
|
||||
{
|
||||
return bdi_congested(bdi, (1 << WB_sync_congested) |
|
||||
(1 << WB_async_congested));
|
||||
}
|
||||
|
||||
#endif /* _LINUX_BACKING_DEV_H */
|
||||
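The !CONFIG_CGROUP_WRITEBACK variant of bdi_for_each_wb() above is compact enough that its behaviour is easy to misread. The following is a minimal userspace sketch, not kernel code: the three structs are hypothetical stand-ins reduced to the fields the macro touches, count_wbs() is an illustrative helper that does not exist in the tree, and the macro relies on the GNU C statement-expression extension (GCC or Clang). It suggests that the loop visits the embedded root wb exactly once when iteration starts from ID 0 and visits nothing for any other starting ID.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; only the shape matters. */
struct bdi_writeback { int id; };
struct backing_dev_info { struct bdi_writeback wb; };
struct wb_iter { int next_id; };

/* Same shape as the non-cgroup macro in the diff (GNU C statement expression). */
#define bdi_for_each_wb(wb_cur, bdi, iter, start_blkcg_id)		\
	for ((iter)->next_id = (start_blkcg_id);			\
	     ({ (wb_cur) = !(iter)->next_id++ ? &(bdi)->wb : NULL; }); )

/* Illustrative helper: count how many wbs the loop visits for a starting ID. */
static int count_wbs(struct backing_dev_info *bdi, int start_blkcg_id)
{
	struct bdi_writeback *wb;
	struct wb_iter iter;
	int n = 0;

	bdi_for_each_wb(wb, bdi, &iter, start_blkcg_id)
		n++;
	return n;
}
```

The for statement has no increment clause: the condition both advances next_id and yields the cursor, so the loop ends as soon as the cursor becomes NULL.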
|
@ -482,9 +482,12 @@ extern void bvec_free(mempool_t *, struct bio_vec *, unsigned int);
|
||||
extern unsigned int bvec_nr_vecs(unsigned short idx);
|
||||
|
||||
#ifdef CONFIG_BLK_CGROUP
|
||||
int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css);
|
||||
int bio_associate_current(struct bio *bio);
|
||||
void bio_disassociate_task(struct bio *bio);
|
||||
#else /* CONFIG_BLK_CGROUP */
|
||||
static inline int bio_associate_blkcg(struct bio *bio,
|
||||
struct cgroup_subsys_state *blkcg_css) { return 0; }
|
||||
static inline int bio_associate_current(struct bio *bio) { return -ENOENT; }
|
||||
static inline void bio_disassociate_task(struct bio *bio) { }
|
||||
#endif /* CONFIG_BLK_CGROUP */
|
||||
|
@ -46,6 +46,10 @@ struct blkcg {
|
||||
struct hlist_head blkg_list;
|
||||
|
||||
struct blkcg_policy_data *pd[BLKCG_MAX_POLS];
|
||||
|
||||
#ifdef CONFIG_CGROUP_WRITEBACK
|
||||
struct list_head cgwb_list;
|
||||
#endif
|
||||
};
|
||||
|
||||
struct blkg_stat {
|
||||
@ -106,6 +110,12 @@ struct blkcg_gq {
|
||||
struct hlist_node blkcg_node;
|
||||
struct blkcg *blkcg;
|
||||
|
||||
/*
|
||||
* Each blkg gets congested separately and the congestion state is
|
||||
* propagated to the matching bdi_writeback_congested.
|
||||
*/
|
||||
struct bdi_writeback_congested *wb_congested;
|
||||
|
||||
/* all non-root blkcg_gq's are guaranteed to have access to parent */
|
||||
struct blkcg_gq *parent;
|
||||
|
||||
@ -149,6 +159,7 @@ struct blkcg_policy {
|
||||
};
|
||||
|
||||
extern struct blkcg blkcg_root;
|
||||
extern struct cgroup_subsys_state * const blkcg_root_css;
|
||||
|
||||
struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, struct request_queue *q);
|
||||
struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
|
||||
@ -209,6 +220,12 @@ static inline struct blkcg *bio_blkcg(struct bio *bio)
|
||||
return task_blkcg(current);
|
||||
}
|
||||
|
||||
static inline struct cgroup_subsys_state *
|
||||
task_get_blkcg_css(struct task_struct *task)
|
||||
{
|
||||
return task_get_css(task, blkio_cgrp_id);
|
||||
}
|
||||
|
||||
/**
|
||||
* blkcg_parent - get the parent of a blkcg
|
||||
* @blkcg: blkcg of interest
|
||||
@ -579,8 +596,8 @@ static inline void blkg_rwstat_merge(struct blkg_rwstat *to,
|
||||
|
||||
#else /* CONFIG_BLK_CGROUP */
|
||||
|
||||
struct cgroup;
|
||||
struct blkcg;
|
||||
struct blkcg {
|
||||
};
|
||||
|
||||
struct blkg_policy_data {
|
||||
};
|
||||
@ -594,6 +611,16 @@ struct blkcg_gq {
|
||||
struct blkcg_policy {
|
||||
};
|
||||
|
||||
#define blkcg_root_css ((struct cgroup_subsys_state *)ERR_PTR(-EINVAL))
|
||||
|
||||
static inline struct cgroup_subsys_state *
|
||||
task_get_blkcg_css(struct task_struct *task)
|
||||
{
|
||||
return NULL;
|
||||
}
|
||||
|
||||
#ifdef CONFIG_BLOCK
|
||||
|
||||
static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, void *key) { return NULL; }
|
||||
static inline int blkcg_init_queue(struct request_queue *q) { return 0; }
|
||||
static inline void blkcg_drain_queue(struct request_queue *q) { }
|
||||
@ -623,5 +650,6 @@ static inline struct request_list *blk_rq_rl(struct request *rq) { return &rq->q
|
||||
#define blk_queue_for_each_rl(rl, q) \
|
||||
for ((rl) = &(q)->root_rl; (rl); (rl) = NULL)
|
||||
|
||||
#endif /* CONFIG_BLOCK */
|
||||
#endif /* CONFIG_BLK_CGROUP */
|
||||
#endif /* _BLK_CGROUP_H */
|
@ -12,7 +12,7 @@
|
||||
#include <linux/timer.h>
|
||||
#include <linux/workqueue.h>
|
||||
#include <linux/pagemap.h>
|
||||
#include <linux/backing-dev.h>
|
||||
#include <linux/backing-dev-defs.h>
|
||||
#include <linux/wait.h>
|
||||
#include <linux/mempool.h>
|
||||
#include <linux/bio.h>
|
||||
@ -787,25 +787,6 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
|
||||
extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
|
||||
struct scsi_ioctl_command __user *);
|
||||
|
||||
/*
|
||||
* A queue has just exited congestion. Note this in the global counter of
|
||||
* congested queues, and wake up anyone who was waiting for requests to be
|
||||
* put back.
|
||||
*/
|
||||
static inline void blk_clear_queue_congested(struct request_queue *q, int sync)
|
||||
{
|
||||
clear_bdi_congested(&q->backing_dev_info, sync);
|
||||
}
|
||||
|
||||
/*
|
||||
* A queue has just entered congestion. Flag that in the queue's VM-visible
|
||||
* state flags and increment the global counter of congested queues.
|
||||
*/
|
||||
static inline void blk_set_queue_congested(struct request_queue *q, int sync)
|
||||
{
|
||||
set_bdi_congested(&q->backing_dev_info, sync);
|
||||
}
|
||||
|
||||
extern void blk_start_queue(struct request_queue *q);
|
||||
extern void blk_stop_queue(struct request_queue *q);
|
||||
extern void blk_sync_queue(struct request_queue *q);
|
||||
|
@ -773,6 +773,31 @@ static inline struct cgroup_subsys_state *task_css(struct task_struct *task,
|
||||
return task_css_check(task, subsys_id, false);
|
||||
}
|
||||
|
||||
/**
|
||||
* task_get_css - find and get the css for (task, subsys)
|
||||
* @task: the target task
|
||||
* @subsys_id: the target subsystem ID
|
||||
*
|
||||
* Find the css for the (@task, @subsys_id) combination, increment a
|
||||
* reference on and return it. This function is guaranteed to return a
|
||||
* valid css.
|
||||
*/
|
||||
static inline struct cgroup_subsys_state *
|
||||
task_get_css(struct task_struct *task, int subsys_id)
|
||||
{
|
||||
struct cgroup_subsys_state *css;
|
||||
|
||||
rcu_read_lock();
|
||||
while (true) {
|
||||
css = task_css(task, subsys_id);
|
||||
if (likely(css_tryget_online(css)))
|
||||
break;
|
||||
cpu_relax();
|
||||
}
|
||||
rcu_read_unlock();
|
||||
return css;
|
||||
}
|
||||
|
||||
/**
|
||||
* task_css_is_root - test whether a task belongs to the root css
|
||||
* @task: the target task
|
||||
|
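task_get_css() above follows a "look up, try to take a reference, retry on failure" pattern. The sketch below is a rough userspace analogy using C11 atomics, offered under clear assumptions: struct obj, obj_tryget() and slot_get() are hypothetical names, the counter stands in for the css reference count, and there is no RCU here, so unlike the kernel function nothing protects the object's memory between the load and the tryget.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted object; a count of 0 means it is being torn down. */
struct obj {
	atomic_int refcnt;
};

/* Take a reference only if the object is still live (count > 0). */
static bool obj_tryget(struct obj *o)
{
	int old = atomic_load(&o->refcnt);

	while (old > 0) {
		/* On failure, old is refreshed with the current count. */
		if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Re-read the published pointer until a reference can be taken. */
static struct obj *slot_get(struct obj *_Atomic *slot)
{
	struct obj *o;

	for (;;) {
		o = atomic_load(slot);
		if (obj_tryget(o))
			break;
		/* Lost a race with teardown; retry with a fresh lookup. */
	}
	return o;
}
```

Re-reading the slot on every iteration mirrors how the kernel loop calls task_css() again after a failed css_tryget_online(), since the task may have been migrated to a different css in the meantime.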
@ -35,6 +35,7 @@
|
||||
#include <uapi/linux/fs.h>
|
||||
|
||||
struct backing_dev_info;
|
||||
struct bdi_writeback;
|
||||
struct export_operations;
|
||||
struct hd_geometry;
|
||||
struct iovec;
|
||||
@ -634,6 +635,14 @@ struct inode {
|
||||
|
||||
struct hlist_node i_hash;
|
||||
struct list_head i_wb_list; /* backing dev IO list */
|
||||
#ifdef CONFIG_CGROUP_WRITEBACK
|
||||
struct bdi_writeback *i_wb; /* the associated cgroup wb */
|
||||
|
||||
/* foreign inode detection, see wbc_detach_inode() */
|
||||
int i_wb_frn_winner;
|
||||
u16 i_wb_frn_avg_time;
|
||||
u16 i_wb_frn_history;
|
||||
#endif
|
||||
struct list_head i_lru; /* inode LRU list */
|
||||
struct list_head i_sb_list;
|
||||
union {
|
||||
@ -1232,6 +1241,8 @@ struct mm_struct;
|
||||
#define UMOUNT_NOFOLLOW 0x00000008 /* Don't follow symlink on umount */
|
||||
#define UMOUNT_UNUSED 0x80000000 /* Flag guaranteed to be unused */
|
||||
|
||||
/* sb->s_iflags */
|
||||
#define SB_I_CGROUPWB 0x00000001 /* cgroup-aware writeback enabled */
|
||||
|
||||
/* Possible states of 'frozen' field */
|
||||
enum {
|
||||
@ -1270,6 +1281,7 @@ struct super_block {
|
||||
const struct quotactl_ops *s_qcop;
|
||||
const struct export_operations *s_export_op;
|
||||
unsigned long s_flags;
|
||||
unsigned long s_iflags; /* internal SB_I_* flags */
|
||||
unsigned long s_magic;
|
||||
struct dentry *s_root;
|
||||
struct rw_semaphore s_umount;
|
||||
@ -1806,6 +1818,11 @@ struct super_operations {
|
||||
*
|
||||
* I_DIO_WAKEUP Never set. Only used as a key for wait_on_bit().
|
||||
*
|
||||
* I_WB_SWITCH Cgroup bdi_writeback switching in progress. Used to
|
||||
* synchronize competing switching instances and to tell
|
||||
* wb stat updates to grab mapping->tree_lock. See
|
||||
* inode_switch_wb_work_fn() for details.
|
||||
*
|
||||
* Q: What is the difference between I_WILL_FREE and I_FREEING?
|
||||
*/
|
||||
#define I_DIRTY_SYNC (1 << 0)
|
||||
@ -1825,6 +1842,7 @@ struct super_operations {
|
||||
#define I_DIRTY_TIME (1 << 11)
|
||||
#define __I_DIRTY_TIME_EXPIRED 12
|
||||
#define I_DIRTY_TIME_EXPIRED (1 << __I_DIRTY_TIME_EXPIRED)
|
||||
#define I_WB_SWITCH (1 << 13)
|
||||
|
||||
#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
|
||||
#define I_DIRTY_ALL (I_DIRTY | I_DIRTY_TIME)
|
||||
@ -2241,7 +2259,13 @@ extern struct super_block *freeze_bdev(struct block_device *);
|
||||
extern void emergency_thaw_all(void);
|
||||
extern int thaw_bdev(struct block_device *bdev, struct super_block *sb);
|
||||
extern int fsync_bdev(struct block_device *);
|
||||
extern int sb_is_blkdev_sb(struct super_block *sb);
|
||||
|
||||
extern struct super_block *blockdev_superblock;
|
||||
|
||||
static inline bool sb_is_blkdev_sb(struct super_block *sb)
|
||||
{
|
||||
return sb == blockdev_superblock;
|
||||
}
|
||||
#else
|
||||
static inline void bd_forget(struct inode *inode) {}
|
||||
static inline int sync_blockdev(struct block_device *bdev) { return 0; }
|
||||
|
@ -41,6 +41,7 @@ enum mem_cgroup_stat_index {
|
||||
MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
|
||||
MEM_CGROUP_STAT_RSS_HUGE, /* # of pages charged as anon huge */
|
||||
MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
|
||||
MEM_CGROUP_STAT_DIRTY, /* # of dirty pages in page cache */
|
||||
MEM_CGROUP_STAT_WRITEBACK, /* # of pages under writeback */
|
||||
MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
|
||||
MEM_CGROUP_STAT_NSTATS,
|
||||
@ -67,6 +68,8 @@ enum mem_cgroup_events_index {
|
||||
};
|
||||
|
||||
#ifdef CONFIG_MEMCG
|
||||
extern struct cgroup_subsys_state *mem_cgroup_root_css;
|
||||
|
||||
void mem_cgroup_events(struct mem_cgroup *memcg,
|
||||
enum mem_cgroup_events_index idx,
|
||||
unsigned int nr);
|
||||
@ -112,6 +115,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
|
||||
}
|
||||
|
||||
extern struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg);
|
||||
extern struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page);
|
||||
|
||||
struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
|
||||
struct mem_cgroup *,
|
||||
@ -195,6 +199,8 @@ void mem_cgroup_split_huge_fixup(struct page *head);
|
||||
#else /* CONFIG_MEMCG */
|
||||
struct mem_cgroup;
|
||||
|
||||
#define mem_cgroup_root_css ((struct cgroup_subsys_state *)ERR_PTR(-EINVAL))
|
||||
|
||||
static inline void mem_cgroup_events(struct mem_cgroup *memcg,
|
||||
enum mem_cgroup_events_index idx,
|
||||
unsigned int nr)
|
||||
@ -382,6 +388,29 @@ enum {
|
||||
OVER_LIMIT,
|
||||
};
|
||||
|
||||
#ifdef CONFIG_CGROUP_WRITEBACK
|
||||
|
||||
struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg);
|
||||
struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
|
||||
void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pavail,
|
||||
unsigned long *pdirty, unsigned long *pwriteback);
|
||||
|
||||
#else /* CONFIG_CGROUP_WRITEBACK */
|
||||
|
||||
static inline struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
|
||||
{
|
||||
return NULL;
|
||||
}
|
||||
|
||||
static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb,
|
||||
unsigned long *pavail,
|
||||
unsigned long *pdirty,
|
||||
unsigned long *pwriteback)
|
||||
{
|
||||
}
|
||||
|
||||
#endif /* CONFIG_CGROUP_WRITEBACK */
|
||||
|
||||
struct sock;
|
||||
#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
|
||||
void sock_update_memcg(struct sock *sk);
|
||||
|
@ -27,6 +27,7 @@ struct anon_vma_chain;
|
||||
struct file_ra_state;
|
||||
struct user_struct;
|
||||
struct writeback_control;
|
||||
struct bdi_writeback;
|
||||
|
||||
#ifndef CONFIG_NEED_MULTIPLE_NODES /* Don't use mapnrs, do it properly */
|
||||
extern unsigned long max_mapnr;
|
||||
@ -1211,10 +1212,13 @@ int __set_page_dirty_nobuffers(struct page *page);
|
||||
int __set_page_dirty_no_writeback(struct page *page);
|
||||
int redirty_page_for_writepage(struct writeback_control *wbc,
|
||||
struct page *page);
|
||||
void account_page_dirtied(struct page *page, struct address_space *mapping);
|
||||
void account_page_cleaned(struct page *page, struct address_space *mapping);
|
||||
void account_page_dirtied(struct page *page, struct address_space *mapping,
|
||||
struct mem_cgroup *memcg);
|
||||
void account_page_cleaned(struct page *page, struct address_space *mapping,
|
||||
struct mem_cgroup *memcg, struct bdi_writeback *wb);
|
||||
int set_page_dirty(struct page *page);
|
||||
int set_page_dirty_lock(struct page *page);
|
||||
void cancel_dirty_page(struct page *page);
|
||||
int clear_page_dirty_for_io(struct page *page);
|
||||
|
||||
int get_cmdline(struct task_struct *task, char *buffer, int buflen);
|
||||
|
@ -651,7 +651,8 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
|
||||
int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
|
||||
pgoff_t index, gfp_t gfp_mask);
|
||||
extern void delete_from_page_cache(struct page *page);
|
||||
extern void __delete_from_page_cache(struct page *page, void *shadow);
|
||||
extern void __delete_from_page_cache(struct page *page, void *shadow,
|
||||
struct mem_cgroup *memcg);
|
||||
int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
|
||||
|
||||
/*
|
||||
|
@ -7,6 +7,8 @@
|
||||
#include <linux/sched.h>
|
||||
#include <linux/workqueue.h>
|
||||
#include <linux/fs.h>
|
||||
#include <linux/flex_proportions.h>
|
||||
#include <linux/backing-dev-defs.h>
|
||||
|
||||
DECLARE_PER_CPU(int, dirty_throttle_leaks);
|
||||
|
||||
@ -84,8 +86,85 @@ struct writeback_control {
|
||||
unsigned for_reclaim:1; /* Invoked from the page allocator */
|
||||
unsigned range_cyclic:1; /* range_start is cyclic */
|
||||
unsigned for_sync:1; /* sync(2) WB_SYNC_ALL writeback */
|
||||
#ifdef CONFIG_CGROUP_WRITEBACK
|
||||
struct bdi_writeback *wb; /* wb this writeback is issued under */
|
||||
struct inode *inode; /* inode being written out */
|
||||
|
||||
/* foreign inode detection, see wbc_detach_inode() */
|
||||
int wb_id; /* current wb id */
|
||||
int wb_lcand_id; /* last foreign candidate wb id */
|
||||
int wb_tcand_id; /* this foreign candidate wb id */
|
||||
size_t wb_bytes; /* bytes written by current wb */
|
||||
size_t wb_lcand_bytes; /* bytes written by last candidate */
|
||||
size_t wb_tcand_bytes; /* bytes written by this candidate */
|
||||
#endif
|
||||
};
|
||||
|
||||
/*
 * A wb_domain represents a domain that wb's (bdi_writeback's) belong to
 * and are measured against each other in. There always is one global
 * domain, global_wb_domain, that every wb in the system is a member of.
 * This allows measuring the relative bandwidth of each wb to distribute
 * dirtyable memory accordingly.
 */
struct wb_domain {
	spinlock_t lock;

	/*
	 * Scale the writeback cache size proportional to the relative
	 * writeout speed.
	 *
	 * We do this by keeping a floating proportion between BDIs, based
	 * on page writeback completions [end_page_writeback()]. Those
	 * devices that write out pages fastest will get the larger share,
	 * while the slower will get a smaller share.
	 *
	 * We use page writeout completions because we are interested in
	 * getting rid of dirty pages. Having them written out is the
	 * primary goal.
	 *
	 * We introduce a concept of time, a period over which we measure
	 * these events, because demand can/will vary over time. The length
	 * of this period itself is measured in page writeback completions.
	 */
	struct fprop_global completions;
	struct timer_list period_timer;	/* timer for aging of completions */
	unsigned long period_time;

	/*
	 * The dirtyable memory and dirty threshold could be suddenly
	 * knocked down by a large amount (eg. on the startup of KVM in a
	 * swapless system). This may throw the system into deep dirty
	 * exceeded state and throttle heavy/light dirtiers alike. To
	 * retain good responsiveness, maintain global_dirty_limit for
	 * tracking slowly down to the knocked down dirty threshold.
	 *
	 * Both fields are protected by ->lock.
	 */
	unsigned long dirty_limit_tstamp;
	unsigned long dirty_limit;
};
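The comment in struct wb_domain describes a floating proportion: each device earns a share according to its recent writeback completions, and old activity fades over time. The following is a simplified userspace sketch of that idea only. It is not the kernel's flex_proportions implementation, and every name in it (share_domain, share_complete, share_age, share_of, NR_DEVS) is hypothetical.

```c
#include <stddef.h>

#define NR_DEVS 3	/* hypothetical number of devices being compared */

/* Hypothetical per-domain state: one completion counter per device. */
struct share_domain {
	unsigned long completions[NR_DEVS];
};

/* Record one page writeback completion for device @dev. */
void share_complete(struct share_domain *d, size_t dev)
{
	d->completions[dev]++;
}

/* End of a period: halve every counter so that old activity fades out. */
void share_age(struct share_domain *d)
{
	for (size_t i = 0; i < NR_DEVS; i++)
		d->completions[i] /= 2;
}

/* Portion of @total (e.g. a dirty threshold in pages) given to @dev. */
unsigned long share_of(const struct share_domain *d, size_t dev,
		       unsigned long total)
{
	unsigned long sum = 0;

	for (size_t i = 0; i < NR_DEVS; i++)
		sum += d->completions[i];
	if (!sum)
		return total / NR_DEVS;	/* no history yet: split evenly */
	return (unsigned long)((unsigned long long)total *
			       d->completions[dev] / sum);
}
```

Halving all counters keeps the ratios between devices intact while letting newer completions outweigh older ones, which is the effect the comment asks for.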
|
||||
|
||||
/**
|
||||
* wb_domain_size_changed - memory available to a wb_domain has changed
|
||||
* @dom: wb_domain of interest
|
||||
*
|
||||
* This function should be called when the amount of memory available to
|
||||
* @dom has changed. It resets @dom's dirty limit parameters to prevent
|
||||
* the past values which don't match the current configuration from skewing
|
||||
* dirty throttling. Without this, when memory size of a wb_domain is
|
||||
* greatly reduced, the dirty throttling logic may allow too many pages to
|
||||
* be dirtied leading to consecutive unnecessary OOMs and may get stuck in
|
||||
* that situation.
|
||||
*/
|
||||
static inline void wb_domain_size_changed(struct wb_domain *dom)
|
||||
{
|
||||
spin_lock(&dom->lock);
|
||||
dom->dirty_limit_tstamp = jiffies;
|
||||
dom->dirty_limit = 0;
|
||||
spin_unlock(&dom->lock);
|
||||
}
|
||||
|
||||
/*
|
||||
* fs/fs-writeback.c
|
||||
*/
|
||||
@ -93,9 +172,9 @@ struct bdi_writeback;
|
||||
void writeback_inodes_sb(struct super_block *, enum wb_reason reason);
|
||||
void writeback_inodes_sb_nr(struct super_block *, unsigned long nr,
|
||||
enum wb_reason reason);
|
||||
int try_to_writeback_inodes_sb(struct super_block *, enum wb_reason reason);
|
||||
int try_to_writeback_inodes_sb_nr(struct super_block *, unsigned long nr,
|
||||
enum wb_reason reason);
|
||||
bool try_to_writeback_inodes_sb(struct super_block *, enum wb_reason reason);
|
||||
bool try_to_writeback_inodes_sb_nr(struct super_block *, unsigned long nr,
|
||||
enum wb_reason reason);
|
||||
void sync_inodes_sb(struct super_block *);
|
||||
void wakeup_flusher_threads(long nr_pages, enum wb_reason reason);
|
||||
void inode_wait_for_writeback(struct inode *inode);
|
||||
@ -107,6 +186,123 @@ static inline void wait_on_inode(struct inode *inode)
|
||||
wait_on_bit(&inode->i_state, __I_NEW, TASK_UNINTERRUPTIBLE);
|
||||
}
|
||||
|
||||
#ifdef CONFIG_CGROUP_WRITEBACK
|
||||
|
||||
#include <linux/cgroup.h>
|
||||
#include <linux/bio.h>
|
||||
|
||||
void __inode_attach_wb(struct inode *inode, struct page *page);
|
||||
void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
|
||||
struct inode *inode)
|
||||
__releases(&inode->i_lock);
|
||||
void wbc_detach_inode(struct writeback_control *wbc);
|
||||
void wbc_account_io(struct writeback_control *wbc, struct page *page,
|
||||
size_t bytes);
|
||||
|
||||
/**
|
||||
* inode_attach_wb - associate an inode with its wb
|
||||
* @inode: inode of interest
|
||||
* @page: page being dirtied (may be NULL)
|
||||
*
|
||||
* If @inode doesn't have its wb, associate it with the wb matching the
|
||||
* memcg of @page or, if @page is NULL, %current. May be called w/ or w/o
|
||||
* @inode->i_lock.
|
||||
*/
|
||||
static inline void inode_attach_wb(struct inode *inode, struct page *page)
|
||||
{
|
||||
if (!inode->i_wb)
|
||||
__inode_attach_wb(inode, page);
|
||||
}
|
||||
|
||||
/**
|
||||
* inode_detach_wb - disassociate an inode from its wb
|
||||
* @inode: inode of interest
|
||||
*
|
||||
* @inode is being freed. Detach from its wb.
|
||||
*/
|
||||
static inline void inode_detach_wb(struct inode *inode)
|
||||
{
|
||||
if (inode->i_wb) {
|
||||
wb_put(inode->i_wb);
|
||||
inode->i_wb = NULL;
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* wbc_attach_fdatawrite_inode - associate wbc and inode for fdatawrite
|
||||
* @wbc: writeback_control of interest
|
||||
* @inode: target inode
|
||||
*
|
||||
* This function is to be used by __filemap_fdatawrite_range(), which is an
|
||||
* alternative entry point into writeback code, and first ensures @inode is
|
||||
* associated with a bdi_writeback and attaches it to @wbc.
|
||||
*/
|
||||
static inline void wbc_attach_fdatawrite_inode(struct writeback_control *wbc,
|
||||
struct inode *inode)
|
||||
{
|
||||
spin_lock(&inode->i_lock);
|
||||
inode_attach_wb(inode, NULL);
|
||||
wbc_attach_and_unlock_inode(wbc, inode);
|
||||
}
|
||||
|
||||
/**
 * wbc_init_bio - writeback specific initialization of bio
 * @wbc: writeback_control for the writeback in progress
 * @bio: bio to be initialized
 *
 * @bio is a part of the writeback in progress controlled by @wbc. Perform
 * writeback specific initialization. This is used to apply the cgroup
 * writeback context.
 */
static inline void wbc_init_bio(struct writeback_control *wbc, struct bio *bio)
{
	/*
	 * pageout() path doesn't attach @wbc to the inode being written
	 * out. This is intentional as we don't want the function to block
	 * behind a slow cgroup. Ultimately, we want pageout() to kick off
	 * regular writeback instead of writing things out itself.
	 */
	if (wbc->wb)
		bio_associate_blkcg(bio, wbc->wb->blkcg_css);
}
|
||||
|
||||
#else /* CONFIG_CGROUP_WRITEBACK */
|
||||
|
||||
static inline void inode_attach_wb(struct inode *inode, struct page *page)
|
||||
{
|
||||
}
|
||||
|
||||
static inline void inode_detach_wb(struct inode *inode)
|
||||
{
|
||||
}
|
||||
|
||||
static inline void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
|
||||
struct inode *inode)
|
||||
__releases(&inode->i_lock)
|
||||
{
|
||||
spin_unlock(&inode->i_lock);
|
||||
}
|
||||
|
||||
static inline void wbc_attach_fdatawrite_inode(struct writeback_control *wbc,
|
||||
struct inode *inode)
|
||||
{
|
||||
}
|
||||
|
||||
static inline void wbc_detach_inode(struct writeback_control *wbc)
|
||||
{
|
||||
}
|
||||
|
||||
static inline void wbc_init_bio(struct writeback_control *wbc, struct bio *bio)
|
||||
{
|
||||
}
|
||||
|
||||
static inline void wbc_account_io(struct writeback_control *wbc,
|
||||
struct page *page, size_t bytes)
|
||||
{
|
||||
}
|
||||
|
||||
#endif /* CONFIG_CGROUP_WRITEBACK */
|
||||
|
||||
/*
|
||||
* mm/page-writeback.c
|
||||
*/
|
||||
@ -120,8 +316,12 @@ static inline void laptop_sync_completion(void) { }
|
||||
#endif
|
||||
void throttle_vm_writeout(gfp_t gfp_mask);
|
||||
bool zone_dirty_ok(struct zone *zone);
|
||||
int wb_domain_init(struct wb_domain *dom, gfp_t gfp);
|
||||
#ifdef CONFIG_CGROUP_WRITEBACK
|
||||
void wb_domain_exit(struct wb_domain *dom);
|
||||
#endif
|
||||
|
||||
extern unsigned long global_dirty_limit;
|
||||
extern struct wb_domain global_wb_domain;
|
||||
|
||||
/* These are exported to sysctl. */
|
||||
extern int dirty_background_ratio;
|
||||
@ -155,19 +355,12 @@ int dirty_writeback_centisecs_handler(struct ctl_table *, int,
|
||||
void __user *, size_t *, loff_t *);
|
||||
|
||||
void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
|
||||
unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
|
||||
unsigned long dirty);
|
||||
|
||||
void __bdi_update_bandwidth(struct backing_dev_info *bdi,
|
||||
unsigned long thresh,
|
||||
unsigned long bg_thresh,
|
||||
unsigned long dirty,
|
||||
unsigned long bdi_thresh,
|
||||
unsigned long bdi_dirty,
|
||||
unsigned long start_time);
|
||||
unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh);
|
||||
|
||||
void wb_update_bandwidth(struct bdi_writeback *wb, unsigned long start_time);
|
||||
void page_writeback_init(void);
|
||||
void balance_dirty_pages_ratelimited(struct address_space *mapping);
|
||||
bool wb_over_bg_thresh(struct bdi_writeback *wb);
|
||||
|
||||
typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
|
||||
void *data);
|
||||
|
@ -360,7 +360,7 @@ TRACE_EVENT(global_dirty_state,
|
||||
__entry->nr_written = global_page_state(NR_WRITTEN);
|
||||
__entry->background_thresh = background_thresh;
|
||||
__entry->dirty_thresh = dirty_thresh;
|
||||
__entry->dirty_limit = global_dirty_limit;
|
||||
__entry->dirty_limit = global_wb_domain.dirty_limit;
|
||||
),
|
||||
|
||||
TP_printk("dirty=%lu writeback=%lu unstable=%lu "
|
||||
@ -399,13 +399,13 @@ TRACE_EVENT(bdi_dirty_ratelimit,
|
||||
|
||||
TP_fast_assign(
|
||||
strlcpy(__entry->bdi, dev_name(bdi->dev), 32);
|
||||
__entry->write_bw = KBps(bdi->write_bandwidth);
|
||||
__entry->avg_write_bw = KBps(bdi->avg_write_bandwidth);
|
||||
__entry->write_bw = KBps(bdi->wb.write_bandwidth);
|
||||
__entry->avg_write_bw = KBps(bdi->wb.avg_write_bandwidth);
|
||||
__entry->dirty_rate = KBps(dirty_rate);
|
||||
__entry->dirty_ratelimit = KBps(bdi->dirty_ratelimit);
|
||||
__entry->dirty_ratelimit = KBps(bdi->wb.dirty_ratelimit);
|
||||
__entry->task_ratelimit = KBps(task_ratelimit);
|
||||
__entry->balanced_dirty_ratelimit =
|
||||
KBps(bdi->balanced_dirty_ratelimit);
|
||||
KBps(bdi->wb.balanced_dirty_ratelimit);
|
||||
),
|
||||
|
||||
TP_printk("bdi %s: "
|
||||
@ -462,8 +462,9 @@ TRACE_EVENT(balance_dirty_pages,
|
||||
unsigned long freerun = (thresh + bg_thresh) / 2;
|
||||
strlcpy(__entry->bdi, dev_name(bdi->dev), 32);
|
||||
|
||||
__entry->limit = global_dirty_limit;
|
||||
__entry->setpoint = (global_dirty_limit + freerun) / 2;
|
||||
__entry->limit = global_wb_domain.dirty_limit;
|
||||
__entry->setpoint = (global_wb_domain.dirty_limit +
|
||||
freerun) / 2;
|
||||
__entry->dirty = dirty;
|
||||
__entry->bdi_setpoint = __entry->setpoint *
|
||||
bdi_thresh / (thresh + 1);
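The balance_dirty_pages trace event derives its setpoints with plain integer arithmetic. A small userspace sketch of the same three formulas may make the numbers easier to follow. The struct and function names are hypothetical, and the `limit` argument stands in for global_wb_domain.dirty_limit.

```c
/* Hypothetical container for the three derived values. */
struct sketch_setpoints {
	unsigned long freerun;
	unsigned long setpoint;
	unsigned long bdi_setpoint;
};

/* Mirror the integer arithmetic used in the trace event above. */
struct sketch_setpoints sketch_calc_setpoints(unsigned long limit,
					      unsigned long thresh,
					      unsigned long bg_thresh,
					      unsigned long bdi_thresh)
{
	struct sketch_setpoints s;

	s.freerun = (thresh + bg_thresh) / 2;
	s.setpoint = (limit + s.freerun) / 2;
	/* thresh + 1 avoids a division by zero when thresh is 0 */
	s.bdi_setpoint = s.setpoint * bdi_thresh / (thresh + 1);
	return s;
}
```

With limit = 1000, thresh = 800, bg_thresh = 400 and bdi_thresh = 400 (all in pages), this gives freerun = 600, setpoint = 800 and bdi_setpoint = 399.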
|
||||
|
@ -1127,6 +1127,11 @@ config DEBUG_BLK_CGROUP
|
||||
Enable some debugging help. Currently it exports additional stat
|
||||
files in a cgroup which can be useful for debugging.
|
||||
|
||||
config CGROUP_WRITEBACK
|
||||
bool
|
||||
depends on MEMCG && BLK_CGROUP
|
||||
default y
|
||||
|
||||
endif # CGROUPS
|
||||
|
||||
config CHECKPOINT_RESTORE
|
||||
|
652
mm/backing-dev.c
@ -18,6 +18,7 @@ struct backing_dev_info noop_backing_dev_info = {
|
||||
.name = "noop",
|
||||
.capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK,
|
||||
};
|
||||
EXPORT_SYMBOL_GPL(noop_backing_dev_info);
|
||||
|
||||
static struct class *bdi_class;
|
||||
|
||||
@ -48,7 +49,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
|
||||
struct bdi_writeback *wb = &bdi->wb;
|
||||
unsigned long background_thresh;
|
||||
unsigned long dirty_thresh;
|
||||
unsigned long bdi_thresh;
|
||||
unsigned long wb_thresh;
|
||||
unsigned long nr_dirty, nr_io, nr_more_io, nr_dirty_time;
|
||||
struct inode *inode;
|
||||
|
||||
@ -66,7 +67,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
|
||||
spin_unlock(&wb->list_lock);
|
||||
|
||||
global_dirty_limits(&background_thresh, &dirty_thresh);
|
||||
bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
|
||||
wb_thresh = wb_calc_thresh(wb, dirty_thresh);
|
||||
|
||||
#define K(x) ((x) << (PAGE_SHIFT - 10))
|
||||
seq_printf(m,
|
||||
@ -84,19 +85,19 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
|
||||
"b_dirty_time: %10lu\n"
|
||||
"bdi_list: %10u\n"
|
||||
"state: %10lx\n",
|
||||
(unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
|
||||
(unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
|
||||
K(bdi_thresh),
|
||||
(unsigned long) K(wb_stat(wb, WB_WRITEBACK)),
|
||||
(unsigned long) K(wb_stat(wb, WB_RECLAIMABLE)),
|
||||
K(wb_thresh),
|
||||
K(dirty_thresh),
|
||||
K(background_thresh),
|
||||
(unsigned long) K(bdi_stat(bdi, BDI_DIRTIED)),
|
||||
(unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
|
||||
(unsigned long) K(bdi->write_bandwidth),
|
||||
(unsigned long) K(wb_stat(wb, WB_DIRTIED)),
|
||||
(unsigned long) K(wb_stat(wb, WB_WRITTEN)),
|
||||
(unsigned long) K(wb->write_bandwidth),
|
||||
nr_dirty,
|
||||
nr_io,
|
||||
nr_more_io,
|
||||
nr_dirty_time,
|
||||
!list_empty(&bdi->bdi_list), bdi->state);
|
||||
!list_empty(&bdi->bdi_list), bdi->wb.state);
|
||||
#undef K
|
||||
|
||||
return 0;
|
||||
@ -255,13 +256,8 @@ static int __init default_bdi_init(void)
|
||||
}
|
||||
subsys_initcall(default_bdi_init);
|
||||
|
||||
int bdi_has_dirty_io(struct backing_dev_info *bdi)
|
||||
{
|
||||
return wb_has_dirty_io(&bdi->wb);
|
||||
}
|
||||
|
||||
/*
|
||||
* This function is used when the first inode for this bdi is marked dirty. It
|
||||
* This function is used when the first inode for this wb is marked dirty. It
|
||||
 * wakes up the corresponding bdi thread which should then take care of the
|
||||
* periodic background write-out of dirty inodes. Since the write-out would
|
||||
 * start only 'dirty_writeback_interval' centisecs from now anyway, we just
|
||||
@ -274,29 +270,497 @@ int bdi_has_dirty_io(struct backing_dev_info *bdi)
|
||||
* We have to be careful not to postpone flush work if it is scheduled for
|
||||
* earlier. Thus we use queue_delayed_work().
|
||||
*/
|
||||
void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi)
|
||||
void wb_wakeup_delayed(struct bdi_writeback *wb)
|
||||
{
|
||||
unsigned long timeout;
|
||||
|
||||
timeout = msecs_to_jiffies(dirty_writeback_interval * 10);
|
||||
spin_lock_bh(&bdi->wb_lock);
|
||||
if (test_bit(BDI_registered, &bdi->state))
|
||||
queue_delayed_work(bdi_wq, &bdi->wb.dwork, timeout);
|
||||
spin_unlock_bh(&bdi->wb_lock);
|
||||
spin_lock_bh(&wb->work_lock);
|
||||
if (test_bit(WB_registered, &wb->state))
|
||||
queue_delayed_work(bdi_wq, &wb->dwork, timeout);
|
||||
spin_unlock_bh(&wb->work_lock);
|
||||
}
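wb_wakeup_delayed() converts dirty_writeback_interval, which is kept in centiseconds, to milliseconds by multiplying by 10 and then to jiffies. A rough userspace sketch of that conversion follows. SKETCH_HZ is an assumed tick rate chosen for illustration (the kernel's HZ is a build-time setting), and the helper names are hypothetical.

```c
/* Assumed tick rate for this sketch only. */
#define SKETCH_HZ 250UL

/* Milliseconds to ticks, rounding up so a non-zero delay never becomes 0. */
unsigned long sketch_msecs_to_ticks(unsigned long ms)
{
	return (ms * SKETCH_HZ + 999) / 1000;
}

/* Centiseconds to ticks, following the "* 10" step in wb_wakeup_delayed(). */
unsigned long sketch_wakeup_timeout(unsigned int interval_centisecs)
{
	return sketch_msecs_to_ticks((unsigned long)interval_centisecs * 10);
}
```

For an interval of 500 centiseconds (5 seconds) and the assumed 250 ticks per second, the timeout is 1250 ticks.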
|
||||
|
||||
/*
|
||||
* Remove bdi from bdi_list, and ensure that it is no longer visible
|
||||
* Initial write bandwidth: 100 MB/s
|
||||
*/
|
||||
static void bdi_remove_from_list(struct backing_dev_info *bdi)
|
||||
{
|
||||
spin_lock_bh(&bdi_lock);
|
||||
list_del_rcu(&bdi->bdi_list);
|
||||
spin_unlock_bh(&bdi_lock);
|
||||
#define INIT_BW (100 << (20 - PAGE_SHIFT))
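INIT_BW expresses 100 MB/s in pages per second by shifting with PAGE_SHIFT, and the K() helper in bdi_debug_stats_show() goes the other way, from pages to kilobytes. A worked sketch of both shifts, assuming 4 KiB pages (PAGE_SHIFT of 12, which depends on the architecture and configuration). The names are hypothetical.

```c
/* Assumption for this sketch: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12

/* Same shape as INIT_BW: megabytes per second to pages per second. */
unsigned long sketch_mb_to_pages(unsigned long mb)
{
	return mb << (20 - SKETCH_PAGE_SHIFT);
}

/* Same shape as K(): a page count to kilobytes. */
unsigned long sketch_pages_to_kb(unsigned long pages)
{
	return pages << (SKETCH_PAGE_SHIFT - 10);
}
```

Under that assumption, 100 MB/s is 25600 pages per second, and 25600 pages are 102400 KB.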
|
||||
|
||||
synchronize_rcu_expedited();
|
||||
static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
|
||||
gfp_t gfp)
|
||||
{
|
||||
int i, err;
|
||||
|
||||
memset(wb, 0, sizeof(*wb));
|
||||
|
||||
wb->bdi = bdi;
|
||||
wb->last_old_flush = jiffies;
|
||||
INIT_LIST_HEAD(&wb->b_dirty);
|
||||
INIT_LIST_HEAD(&wb->b_io);
|
||||
INIT_LIST_HEAD(&wb->b_more_io);
|
||||
INIT_LIST_HEAD(&wb->b_dirty_time);
|
||||
spin_lock_init(&wb->list_lock);
|
||||
|
||||
wb->bw_time_stamp = jiffies;
|
||||
wb->balanced_dirty_ratelimit = INIT_BW;
|
||||
wb->dirty_ratelimit = INIT_BW;
|
||||
wb->write_bandwidth = INIT_BW;
|
||||
wb->avg_write_bandwidth = INIT_BW;
|
||||
|
||||
spin_lock_init(&wb->work_lock);
|
||||
INIT_LIST_HEAD(&wb->work_list);
|
||||
INIT_DELAYED_WORK(&wb->dwork, wb_workfn);
|
||||
|
||||
err = fprop_local_init_percpu(&wb->completions, gfp);
|
||||
if (err)
|
||||
return err;
|
||||
|
||||
for (i = 0; i < NR_WB_STAT_ITEMS; i++) {
|
||||
err = percpu_counter_init(&wb->stat[i], 0, gfp);
|
||||
if (err) {
|
||||
			while (i--)
|
||||
percpu_counter_destroy(&wb->stat[i]);
|
||||
fprop_local_destroy_percpu(&wb->completions);
|
||||
return err;
|
||||
}
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
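The error path of wb_init() has to release the per-CPU counters that were set up before the failing one. A minimal userspace sketch of that unwind idiom follows, with malloc() standing in for percpu_counter_init(). The names (NR_ITEMS, init_items, fail_at) are hypothetical. The sketch uses `while (i--)`, which visits indices i-1 down to 0 and stops cleanly when the very first item fails. A pre-decrement `while (--i)` would skip index 0 and would run past the start of the array when i is 0.

```c
#include <stdlib.h>

#define NR_ITEMS 4	/* hypothetical stand-in for NR_WB_STAT_ITEMS */

/*
 * Allocate NR_ITEMS buffers. @fail_at names the index whose allocation is
 * forced to fail (pass -1 for no forced failure). Returns 0 on success, or
 * -1 on failure with every earlier allocation released again.
 */
int init_items(void *items[NR_ITEMS], int fail_at)
{
	int i;

	for (i = 0; i < NR_ITEMS; i++) {
		items[i] = (i == fail_at) ? NULL : malloc(16);
		if (!items[i]) {
			/* i-- tests before decrementing: visits i-1 .. 0 */
			while (i--) {
				free(items[i]);
				items[i] = NULL;
			}
			return -1;
		}
	}
	return 0;
}
```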
|
||||
|
||||
/*
|
||||
* Remove bdi from the global list and shutdown any threads we have running
|
||||
*/
|
||||
static void wb_shutdown(struct bdi_writeback *wb)
|
||||
{
|
||||
/* Make sure nobody queues further work */
|
||||
spin_lock_bh(&wb->work_lock);
|
||||
if (!test_and_clear_bit(WB_registered, &wb->state)) {
|
||||
spin_unlock_bh(&wb->work_lock);
|
||||
return;
|
||||
}
|
||||
spin_unlock_bh(&wb->work_lock);
|
||||
|
||||
/*
|
||||
* Drain work list and shutdown the delayed_work. !WB_registered
|
||||
* tells wb_workfn() that @wb is dying and its work_list needs to
|
||||
* be drained no matter what.
|
||||
*/
|
||||
mod_delayed_work(bdi_wq, &wb->dwork, 0);
|
||||
flush_delayed_work(&wb->dwork);
|
||||
WARN_ON(!list_empty(&wb->work_list));
|
||||
}
|
||||
|
||||
static void wb_exit(struct bdi_writeback *wb)
|
||||
{
|
||||
int i;
|
||||
|
||||
WARN_ON(delayed_work_pending(&wb->dwork));
|
||||
|
||||
for (i = 0; i < NR_WB_STAT_ITEMS; i++)
|
||||
percpu_counter_destroy(&wb->stat[i]);
|
||||
|
||||
fprop_local_destroy_percpu(&wb->completions);
|
||||
}
|
||||
|
||||
#ifdef CONFIG_CGROUP_WRITEBACK
|
||||
|
||||
#include <linux/memcontrol.h>
|
||||
|
||||
/*
|
||||
* cgwb_lock protects bdi->cgwb_tree, bdi->cgwb_congested_tree,
|
||||
* blkcg->cgwb_list, and memcg->cgwb_list. bdi->cgwb_tree is also RCU
|
||||
* protected. cgwb_release_wait is used to wait for the completion of cgwb
|
||||
* releases from bdi destruction path.
|
||||
*/
|
||||
static DEFINE_SPINLOCK(cgwb_lock);
|
||||
static DECLARE_WAIT_QUEUE_HEAD(cgwb_release_wait);
|
||||
|
||||
/**
|
||||
* wb_congested_get_create - get or create a wb_congested
|
||||
* @bdi: associated bdi
|
||||
* @blkcg_id: ID of the associated blkcg
|
||||
* @gfp: allocation mask
|
||||
*
|
||||
* Look up the wb_congested for @blkcg_id on @bdi. If missing, create one.
|
||||
* The returned wb_congested has its reference count incremented. Returns
|
||||
* NULL on failure.
|
||||
*/
|
||||
struct bdi_writeback_congested *
|
||||
wb_congested_get_create(struct backing_dev_info *bdi, int blkcg_id, gfp_t gfp)
|
||||
{
|
||||
struct bdi_writeback_congested *new_congested = NULL, *congested;
|
||||
struct rb_node **node, *parent;
|
||||
unsigned long flags;
|
||||
|
||||
if (blkcg_id == 1)
|
||||
return &bdi->wb_congested;
|
||||
retry:
|
||||
spin_lock_irqsave(&cgwb_lock, flags);
|
||||
|
||||
node = &bdi->cgwb_congested_tree.rb_node;
|
||||
parent = NULL;
|
||||
|
||||
while (*node != NULL) {
|
||||
parent = *node;
|
||||
congested = container_of(parent, struct bdi_writeback_congested,
|
||||
rb_node);
|
||||
if (congested->blkcg_id < blkcg_id)
|
||||
node = &parent->rb_left;
|
||||
else if (congested->blkcg_id > blkcg_id)
|
||||
node = &parent->rb_right;
|
||||
else
|
||||
goto found;
|
||||
}
|
||||
|
||||
if (new_congested) {
|
||||
/* !found and storage for new one already allocated, insert */
|
||||
congested = new_congested;
|
||||
new_congested = NULL;
|
||||
rb_link_node(&congested->rb_node, parent, node);
|
||||
rb_insert_color(&congested->rb_node, &bdi->cgwb_congested_tree);
|
||||
atomic_inc(&bdi->usage_cnt);
|
||||
goto found;
|
||||
}
|
||||
|
||||
spin_unlock_irqrestore(&cgwb_lock, flags);
|
||||
|
||||
/* allocate storage for new one and retry */
|
||||
new_congested = kzalloc(sizeof(*new_congested), gfp);
|
||||
if (!new_congested)
|
||||
return NULL;
|
||||
|
||||
atomic_set(&new_congested->refcnt, 0);
|
||||
new_congested->bdi = bdi;
|
||||
new_congested->blkcg_id = blkcg_id;
|
||||
goto retry;
|
||||
|
||||
found:
|
||||
atomic_inc(&congested->refcnt);
|
||||
spin_unlock_irqrestore(&cgwb_lock, flags);
|
||||
kfree(new_congested);
|
||||
return congested;
|
||||
}
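wb_congested_get_create() follows a common pattern: look up under the lock, and if the entry is missing, drop the lock, allocate, and retry so that the allocation never happens with the lock held. A userspace sketch of that control flow is below. It is an illustration under simplifying assumptions: a linked list replaces the rbtree, a C11 atomic_flag spinlock replaces cgwb_lock, and all names are hypothetical.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical refcounted entry keyed by an integer id. */
struct sketch_entry {
	int id;
	int refcnt;			/* protected by sketch_lock */
	struct sketch_entry *next;
};

static atomic_flag sketch_lock = ATOMIC_FLAG_INIT;
static struct sketch_entry *sketch_head;

static void sketch_lock_acquire(void)
{
	while (atomic_flag_test_and_set_explicit(&sketch_lock,
						 memory_order_acquire))
		;	/* spin */
}

static void sketch_lock_release(void)
{
	atomic_flag_clear_explicit(&sketch_lock, memory_order_release);
}

/* Return the entry for @id with one more reference, creating it if missing. */
struct sketch_entry *sketch_get_create(int id)
{
	struct sketch_entry *new_entry = NULL, *e;

retry:
	sketch_lock_acquire();
	for (e = sketch_head; e; e = e->next)
		if (e->id == id)
			goto found;

	if (new_entry) {
		/* not found and storage already allocated: insert it */
		e = new_entry;
		new_entry = NULL;
		e->next = sketch_head;
		sketch_head = e;
		goto found;
	}
	sketch_lock_release();

	/* allocate outside the lock, then look again */
	new_entry = calloc(1, sizeof(*new_entry));
	if (!new_entry)
		return NULL;
	new_entry->id = id;
	goto retry;

found:
	e->refcnt++;
	sketch_lock_release();
	free(new_entry);	/* non-NULL only if another caller won the race */
	return e;
}
```

The second lookup after allocating is what makes the race harmless: if another caller inserted the same id in the meantime, the spare allocation is simply freed.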
|
||||
|
||||
/**
|
||||
* wb_congested_put - put a wb_congested
|
||||
* @congested: wb_congested to put
|
||||
*
|
||||
* Put @congested and destroy it if the refcnt reaches zero.
|
||||
*/
|
||||
void wb_congested_put(struct bdi_writeback_congested *congested)
|
||||
{
|
||||
struct backing_dev_info *bdi = congested->bdi;
|
||||
unsigned long flags;
|
||||
|
||||
if (congested->blkcg_id == 1)
|
||||
return;
|
||||
|
||||
local_irq_save(flags);
|
||||
if (!atomic_dec_and_lock(&congested->refcnt, &cgwb_lock)) {
|
||||
local_irq_restore(flags);
|
||||
return;
|
||||
}
|
||||
|
||||
rb_erase(&congested->rb_node, &congested->bdi->cgwb_congested_tree);
|
||||
spin_unlock_irqrestore(&cgwb_lock, flags);
|
||||
kfree(congested);
|
||||
|
||||
if (atomic_dec_and_test(&bdi->usage_cnt))
|
||||
wake_up_all(&cgwb_release_wait);
|
||||
}
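wb_congested_put() relies on atomic_dec_and_lock(): a reference that cannot be the last one is dropped without touching the lock, and only a possibly final put takes the lock so that the object can be unlinked safely. A sketch of that behaviour with C11 atomics follows. The function name and the atomic_flag spinlock are assumptions of this sketch, not the kernel implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Drop one reference. Returns true with @lock held if the count reached
 * zero, otherwise returns false with @lock not held.
 */
bool sketch_dec_and_lock(atomic_int *cnt, atomic_flag *lock)
{
	int old = atomic_load(cnt);

	/* fast path: drop a reference that cannot be the last one */
	while (old > 1)
		if (atomic_compare_exchange_weak(cnt, &old, old - 1))
			return false;

	/* slow path: this may be the last reference, decide under the lock */
	while (atomic_flag_test_and_set(lock))
		;	/* spin */
	if (atomic_fetch_sub(cnt, 1) == 1)
		return true;	/* reached zero; caller must release @lock */
	atomic_flag_clear(lock);
	return false;
}
```

A caller that gets true would unlink the object, release the lock, and then free the object, which is the order wb_congested_put() uses.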
|
||||
|
||||
static void cgwb_release_workfn(struct work_struct *work)
|
||||
{
|
||||
struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
|
||||
release_work);
|
||||
struct backing_dev_info *bdi = wb->bdi;
|
||||
|
||||
wb_shutdown(wb);
|
||||
|
||||
css_put(wb->memcg_css);
|
||||
css_put(wb->blkcg_css);
|
||||
wb_congested_put(wb->congested);
|
||||
|
||||
fprop_local_destroy_percpu(&wb->memcg_completions);
|
||||
percpu_ref_exit(&wb->refcnt);
|
||||
wb_exit(wb);
|
||||
kfree_rcu(wb, rcu);
|
||||
|
||||
if (atomic_dec_and_test(&bdi->usage_cnt))
|
||||
wake_up_all(&cgwb_release_wait);
|
||||
}
|
||||
|
||||
static void cgwb_release(struct percpu_ref *refcnt)
|
||||
{
|
||||
struct bdi_writeback *wb = container_of(refcnt, struct bdi_writeback,
|
||||
refcnt);
|
||||
schedule_work(&wb->release_work);
|
||||
}
|
||||
|
||||
static void cgwb_kill(struct bdi_writeback *wb)
|
||||
{
|
||||
lockdep_assert_held(&cgwb_lock);
|
||||
|
||||
WARN_ON(!radix_tree_delete(&wb->bdi->cgwb_tree, wb->memcg_css->id));
|
||||
list_del(&wb->memcg_node);
|
||||
list_del(&wb->blkcg_node);
|
||||
percpu_ref_kill(&wb->refcnt);
|
||||
}
|
||||
|
||||
static int cgwb_create(struct backing_dev_info *bdi,
|
||||
struct cgroup_subsys_state *memcg_css, gfp_t gfp)
|
||||
{
|
||||
struct mem_cgroup *memcg;
|
||||
struct cgroup_subsys_state *blkcg_css;
|
||||
struct blkcg *blkcg;
|
||||
struct list_head *memcg_cgwb_list, *blkcg_cgwb_list;
|
||||
struct bdi_writeback *wb;
|
||||
unsigned long flags;
|
||||
int ret = 0;
|
||||
|
||||
memcg = mem_cgroup_from_css(memcg_css);
|
||||
blkcg_css = cgroup_get_e_css(memcg_css->cgroup, &blkio_cgrp_subsys);
|
||||
blkcg = css_to_blkcg(blkcg_css);
|
||||
memcg_cgwb_list = mem_cgroup_cgwb_list(memcg);
|
||||
blkcg_cgwb_list = &blkcg->cgwb_list;
|
||||
|
||||
/* look up again under lock and discard on blkcg mismatch */
|
||||
spin_lock_irqsave(&cgwb_lock, flags);
|
||||
wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id);
|
||||
if (wb && wb->blkcg_css != blkcg_css) {
|
||||
cgwb_kill(wb);
|
||||
wb = NULL;
|
||||
}
|
||||
spin_unlock_irqrestore(&cgwb_lock, flags);
|
||||
if (wb)
|
||||
goto out_put;
|
||||
|
||||
/* need to create a new one */
|
||||
wb = kmalloc(sizeof(*wb), gfp);
|
||||
if (!wb)
|
||||
return -ENOMEM;
|
||||
|
||||
ret = wb_init(wb, bdi, gfp);
|
||||
if (ret)
|
||||
goto err_free;
|
||||
|
||||
ret = percpu_ref_init(&wb->refcnt, cgwb_release, 0, gfp);
|
||||
if (ret)
|
||||
goto err_wb_exit;
|
||||
|
||||
ret = fprop_local_init_percpu(&wb->memcg_completions, gfp);
|
||||
if (ret)
|
||||
goto err_ref_exit;
|
||||
|
||||
wb->congested = wb_congested_get_create(bdi, blkcg_css->id, gfp);
|
||||
if (!wb->congested) {
|
||||
ret = -ENOMEM;
|
||||
goto err_fprop_exit;
|
||||
}
|
||||
|
||||
wb->memcg_css = memcg_css;
|
||||
wb->blkcg_css = blkcg_css;
|
||||
INIT_WORK(&wb->release_work, cgwb_release_workfn);
|
||||
set_bit(WB_registered, &wb->state);
|
||||
|
||||
/*
|
||||
* The root wb determines the registered state of the whole bdi and
|
||||
* memcg_cgwb_list and blkcg_cgwb_list's next pointers indicate
|
||||
* whether they're still online. Don't link @wb if any is dead.
|
||||
* See wb_memcg_offline() and wb_blkcg_offline().
|
||||
*/
|
||||
ret = -ENODEV;
|
||||
spin_lock_irqsave(&cgwb_lock, flags);
|
||||
if (test_bit(WB_registered, &bdi->wb.state) &&
|
||||
blkcg_cgwb_list->next && memcg_cgwb_list->next) {
|
||||
/* we might have raced another instance of this function */
|
||||
ret = radix_tree_insert(&bdi->cgwb_tree, memcg_css->id, wb);
|
||||
if (!ret) {
|
||||
atomic_inc(&bdi->usage_cnt);
|
||||
list_add(&wb->memcg_node, memcg_cgwb_list);
|
||||
list_add(&wb->blkcg_node, blkcg_cgwb_list);
|
||||
css_get(memcg_css);
|
||||
css_get(blkcg_css);
|
||||
}
|
||||
}
|
||||
spin_unlock_irqrestore(&cgwb_lock, flags);
|
||||
if (ret) {
|
||||
if (ret == -EEXIST)
|
||||
ret = 0;
|
||||
goto err_put_congested;
|
||||
}
|
||||
goto out_put;
|
||||
|
||||
err_put_congested:
|
||||
wb_congested_put(wb->congested);
|
||||
err_fprop_exit:
|
||||
fprop_local_destroy_percpu(&wb->memcg_completions);
|
||||
err_ref_exit:
|
||||
percpu_ref_exit(&wb->refcnt);
|
||||
err_wb_exit:
|
||||
wb_exit(wb);
|
||||
err_free:
|
||||
kfree(wb);
|
||||
out_put:
|
||||
css_put(blkcg_css);
|
||||
return ret;
|
||||
}
|
||||
|
||||
/**
|
||||
* wb_get_create - get wb for a given memcg, create if necessary
|
||||
* @bdi: target bdi
|
||||
* @memcg_css: cgroup_subsys_state of the target memcg (must have positive ref)
|
||||
* @gfp: allocation mask to use
|
||||
*
|
||||
* Try to get the wb for @memcg_css on @bdi. If it doesn't exist, try to
|
||||
* create one. The returned wb has its refcount incremented.
|
||||
*
|
||||
* This function uses css_get() on @memcg_css and thus expects its refcnt
|
||||
* to be positive on invocation. IOW, rcu_read_lock() protection on
|
||||
* @memcg_css isn't enough. try_get it before calling this function.
|
||||
*
|
||||
* A wb is keyed by its associated memcg. As blkcg implicitly enables
|
||||
* memcg on the default hierarchy, memcg association is guaranteed to be
|
||||
* more specific (equal or descendant to the associated blkcg) and thus can
|
||||
* identify both the memcg and blkcg associations.
|
||||
*
|
||||
* Because the blkcg associated with a memcg may change as blkcg is enabled
|
||||
* and disabled closer to root in the hierarchy, each wb keeps track of
|
||||
* both the memcg and blkcg associated with it and verifies the blkcg on
|
||||
* each lookup. On mismatch, the existing wb is discarded and a new one is
|
||||
* created.
|
||||
*/
|
||||
struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
|
||||
struct cgroup_subsys_state *memcg_css,
|
||||
gfp_t gfp)
|
||||
{
|
||||
struct bdi_writeback *wb;
|
||||
|
||||
might_sleep_if(gfp & __GFP_WAIT);
|
||||
|
||||
if (!memcg_css->parent)
|
||||
return &bdi->wb;
|
||||
|
||||
do {
|
||||
rcu_read_lock();
|
||||
wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id);
|
||||
if (wb) {
|
||||
struct cgroup_subsys_state *blkcg_css;
|
||||
|
||||
/* see whether the blkcg association has changed */
|
||||
blkcg_css = cgroup_get_e_css(memcg_css->cgroup,
|
||||
&blkio_cgrp_subsys);
|
||||
if (unlikely(wb->blkcg_css != blkcg_css ||
|
||||
!wb_tryget(wb)))
|
||||
wb = NULL;
|
||||
css_put(blkcg_css);
|
||||
}
|
||||
rcu_read_unlock();
|
||||
} while (!wb && !cgwb_create(bdi, memcg_css, gfp));
|
||||
|
||||
return wb;
|
||||
}
|
||||
|
||||
static void cgwb_bdi_init(struct backing_dev_info *bdi)
|
||||
{
|
||||
bdi->wb.memcg_css = mem_cgroup_root_css;
|
||||
bdi->wb.blkcg_css = blkcg_root_css;
|
||||
bdi->wb_congested.blkcg_id = 1;
|
||||
INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC);
|
||||
bdi->cgwb_congested_tree = RB_ROOT;
|
||||
atomic_set(&bdi->usage_cnt, 1);
|
||||
}
|
||||
|
||||
static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
|
||||
{
|
||||
struct radix_tree_iter iter;
|
||||
void **slot;
|
||||
|
||||
WARN_ON(test_bit(WB_registered, &bdi->wb.state));
|
||||
|
||||
spin_lock_irq(&cgwb_lock);
|
||||
radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
|
||||
cgwb_kill(*slot);
|
||||
spin_unlock_irq(&cgwb_lock);
|
||||
|
||||
/*
|
||||
* All cgwb's and their congested states must be shutdown and
|
||||
* released before returning. Drain the usage counter to wait for
|
||||
* all cgwb's and cgwb_congested's ever created on @bdi.
|
||||
*/
|
||||
atomic_dec(&bdi->usage_cnt);
|
||||
wait_event(cgwb_release_wait, !atomic_read(&bdi->usage_cnt));
|
||||
}
|
||||
|
||||
/**
|
||||
* wb_memcg_offline - kill all wb's associated with a memcg being offlined
|
||||
* @memcg: memcg being offlined
|
||||
*
|
||||
* Also prevents creation of any new wb's associated with @memcg.
|
||||
*/
|
||||
void wb_memcg_offline(struct mem_cgroup *memcg)
|
||||
{
|
||||
LIST_HEAD(to_destroy);
|
||||
struct list_head *memcg_cgwb_list = mem_cgroup_cgwb_list(memcg);
|
||||
struct bdi_writeback *wb, *next;
|
||||
|
||||
spin_lock_irq(&cgwb_lock);
|
||||
list_for_each_entry_safe(wb, next, memcg_cgwb_list, memcg_node)
|
||||
cgwb_kill(wb);
|
||||
memcg_cgwb_list->next = NULL; /* prevent new wb's */
|
||||
spin_unlock_irq(&cgwb_lock);
|
||||
}
|
||||
|
||||
/**
|
||||
* wb_blkcg_offline - kill all wb's associated with a blkcg being offlined
|
||||
* @blkcg: blkcg being offlined
|
||||
*
|
||||
* Also prevents creation of any new wb's associated with @blkcg.
|
||||
*/
|
||||
void wb_blkcg_offline(struct blkcg *blkcg)
|
||||
{
|
||||
LIST_HEAD(to_destroy);
|
||||
struct bdi_writeback *wb, *next;
|
||||
|
||||
spin_lock_irq(&cgwb_lock);
|
||||
list_for_each_entry_safe(wb, next, &blkcg->cgwb_list, blkcg_node)
|
||||
cgwb_kill(wb);
|
||||
blkcg->cgwb_list.next = NULL; /* prevent new wb's */
|
||||
spin_unlock_irq(&cgwb_lock);
|
||||
}
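Both offline helpers mark a list as dead by setting the head's next pointer to NULL, and cgwb_create() refuses to link a new wb when it sees that marker. A userspace sketch of the convention with a minimal circular doubly linked list is below. All names are hypothetical, and locking is left out for brevity.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list node. */
struct sketch_node {
	struct sketch_node *next, *prev;
};

void sketch_list_init(struct sketch_node *head)
{
	head->next = head->prev = head;
}

/* A head whose next pointer is NULL has been taken offline. */
bool sketch_list_online(const struct sketch_node *head)
{
	return head->next != NULL;
}

/* Link @item right after @head unless the list is offline. */
bool sketch_list_add_if_online(struct sketch_node *item,
			       struct sketch_node *head)
{
	if (!sketch_list_online(head))
		return false;
	item->next = head->next;
	item->prev = head;
	head->next->prev = item;
	head->next = item;
	return true;
}

/* Detach every member, then poison the head to refuse new members. */
void sketch_list_offline(struct sketch_node *head)
{
	struct sketch_node *n = head->next;

	while (n != head) {
		struct sketch_node *next = n->next;

		n->next = n->prev = NULL;
		n = next;
	}
	head->next = NULL;
}
```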
|
||||
|
||||
#else /* CONFIG_CGROUP_WRITEBACK */
|
||||
|
||||
static void cgwb_bdi_init(struct backing_dev_info *bdi) { }
|
||||
static void cgwb_bdi_destroy(struct backing_dev_info *bdi) { }
|
||||
|
||||
#endif /* CONFIG_CGROUP_WRITEBACK */
|
||||
|
||||
int bdi_init(struct backing_dev_info *bdi)
|
||||
{
|
||||
int err;
|
||||
|
||||
bdi->dev = NULL;
|
||||
|
||||
bdi->min_ratio = 0;
|
||||
bdi->max_ratio = 100;
|
||||
bdi->max_prop_frac = FPROP_FRAC_BASE;
|
||||
INIT_LIST_HEAD(&bdi->bdi_list);
|
||||
init_waitqueue_head(&bdi->wb_waitq);
|
||||
|
||||
err = wb_init(&bdi->wb, bdi, GFP_KERNEL);
|
||||
if (err)
|
||||
return err;
|
||||
|
||||
bdi->wb_congested.state = 0;
|
||||
bdi->wb.congested = &bdi->wb_congested;
|
||||
|
||||
cgwb_bdi_init(bdi);
|
||||
return 0;
|
||||
}
|
||||
EXPORT_SYMBOL(bdi_init);
|
||||
|
||||
int bdi_register(struct backing_dev_info *bdi, struct device *parent,
|
||||
const char *fmt, ...)
|
||||
{
|
||||
@ -315,7 +779,7 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
|
||||
bdi->dev = dev;
|
||||
|
||||
bdi_debug_register(bdi, dev_name(dev));
|
||||
set_bit(BDI_registered, &bdi->state);
|
||||
set_bit(WB_registered, &bdi->wb.state);
|
||||
|
||||
spin_lock_bh(&bdi_lock);
|
||||
list_add_tail_rcu(&bdi->bdi_list, &bdi_list);
|
||||
@ -333,103 +797,23 @@ int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev)
|
||||
EXPORT_SYMBOL(bdi_register_dev);
|
||||
|
||||
/*
|
||||
* Remove bdi from the global list and shutdown any threads we have running
|
||||
* Remove bdi from bdi_list, and ensure that it is no longer visible
|
||||
*/
|
||||
static void bdi_wb_shutdown(struct backing_dev_info *bdi)
|
||||
static void bdi_remove_from_list(struct backing_dev_info *bdi)
|
||||
{
|
||||
/* Make sure nobody queues further work */
|
||||
spin_lock_bh(&bdi->wb_lock);
|
||||
if (!test_and_clear_bit(BDI_registered, &bdi->state)) {
|
||||
spin_unlock_bh(&bdi->wb_lock);
|
||||
return;
|
||||
}
|
||||
spin_unlock_bh(&bdi->wb_lock);
|
||||
spin_lock_bh(&bdi_lock);
|
||||
list_del_rcu(&bdi->bdi_list);
|
||||
spin_unlock_bh(&bdi_lock);
|
||||
|
||||
/*
|
||||
* Make sure nobody finds us on the bdi_list anymore
|
||||
*/
|
||||
bdi_remove_from_list(bdi);
|
||||
|
||||
/*
|
||||
* Drain work list and shutdown the delayed_work. At this point,
|
||||
* @bdi->bdi_list is empty telling bdi_Writeback_workfn() that @bdi
|
||||
* is dying and its work_list needs to be drained no matter what.
|
||||
*/
|
||||
mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
|
||||
flush_delayed_work(&bdi->wb.dwork);
|
||||
synchronize_rcu_expedited();
|
||||
}
|
||||
|
||||
static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
|
||||
{
|
||||
memset(wb, 0, sizeof(*wb));
|
||||
|
||||
wb->bdi = bdi;
|
||||
wb->last_old_flush = jiffies;
|
||||
INIT_LIST_HEAD(&wb->b_dirty);
|
||||
INIT_LIST_HEAD(&wb->b_io);
|
||||
INIT_LIST_HEAD(&wb->b_more_io);
|
||||
INIT_LIST_HEAD(&wb->b_dirty_time);
|
||||
spin_lock_init(&wb->list_lock);
|
||||
INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
|
||||
}
|
||||
|
||||
/*
|
||||
* Initial write bandwidth: 100 MB/s
|
||||
*/
|
||||
#define INIT_BW (100 << (20 - PAGE_SHIFT))
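The INIT_BW expression above converts 100 MB/s into pages per second: 100 << 20 is the byte rate, and subtracting PAGE_SHIFT from the shift count divides by the page size. A minimal userspace sketch of that arithmetic follows. The helper names are hypothetical and the page shift is passed in explicitly, since the kernel's PAGE_SHIFT depends on the architecture.

```c
#include <assert.h>

/*
 * Same expression as INIT_BW, with the page shift as a parameter.
 * Hypothetical helper for illustration; not part of the kernel.
 */
static unsigned long init_bw_pages(unsigned int page_shift)
{
	/* A shift count below zero would be undefined behavior. */
	assert(page_shift <= 20);
	return 100UL << (20 - page_shift);
}

/* Convert a pages-per-second figure back to bytes per second. */
static unsigned long pages_to_bytes(unsigned long pages, unsigned int page_shift)
{
	return pages << page_shift;
}
```

With 4 KiB pages (a page shift of 12) this gives 25600 pages per second, which converts back to exactly 100 << 20 bytes per second.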
|
||||
|
||||
int bdi_init(struct backing_dev_info *bdi)
|
||||
{
|
||||
int i, err;
|
||||
|
||||
bdi->dev = NULL;
|
||||
|
||||
bdi->min_ratio = 0;
|
||||
bdi->max_ratio = 100;
|
||||
bdi->max_prop_frac = FPROP_FRAC_BASE;
|
||||
spin_lock_init(&bdi->wb_lock);
|
||||
INIT_LIST_HEAD(&bdi->bdi_list);
|
||||
INIT_LIST_HEAD(&bdi->work_list);
|
||||
|
||||
bdi_wb_init(&bdi->wb, bdi);
|
||||
|
||||
for (i = 0; i < NR_BDI_STAT_ITEMS; i++) {
|
||||
err = percpu_counter_init(&bdi->bdi_stat[i], 0, GFP_KERNEL);
|
||||
if (err)
|
||||
goto err;
|
||||
}
|
||||
|
||||
bdi->dirty_exceeded = 0;
|
||||
|
||||
bdi->bw_time_stamp = jiffies;
|
||||
bdi->written_stamp = 0;
|
||||
|
||||
bdi->balanced_dirty_ratelimit = INIT_BW;
|
||||
bdi->dirty_ratelimit = INIT_BW;
|
||||
bdi->write_bandwidth = INIT_BW;
|
||||
bdi->avg_write_bandwidth = INIT_BW;
|
||||
|
||||
err = fprop_local_init_percpu(&bdi->completions, GFP_KERNEL);
|
||||
|
||||
if (err) {
|
||||
err:
|
||||
while (i--)
|
||||
percpu_counter_destroy(&bdi->bdi_stat[i]);
|
||||
}
|
||||
|
||||
return err;
|
||||
}
|
||||
EXPORT_SYMBOL(bdi_init);
|
||||
|
||||
void bdi_destroy(struct backing_dev_info *bdi)
|
||||
{
|
||||
int i;
|
||||
|
||||
bdi_wb_shutdown(bdi);
|
||||
bdi_set_min_ratio(bdi, 0);
|
||||
|
||||
WARN_ON(!list_empty(&bdi->work_list));
|
||||
WARN_ON(delayed_work_pending(&bdi->wb.dwork));
|
||||
/* make sure nobody finds us on the bdi_list anymore */
|
||||
bdi_remove_from_list(bdi);
|
||||
wb_shutdown(&bdi->wb);
|
||||
cgwb_bdi_destroy(bdi);
|
||||
|
||||
if (bdi->dev) {
|
||||
bdi_debug_unregister(bdi);
|
||||
@ -437,9 +821,7 @@ void bdi_destroy(struct backing_dev_info *bdi)
|
||||
bdi->dev = NULL;
|
||||
}
|
||||
|
||||
for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
|
||||
percpu_counter_destroy(&bdi->bdi_stat[i]);
|
||||
fprop_local_destroy_percpu(&bdi->completions);
|
||||
wb_exit(&bdi->wb);
|
||||
}
|
||||
EXPORT_SYMBOL(bdi_destroy);
|
||||
|
||||
@ -472,31 +854,31 @@ static wait_queue_head_t congestion_wqh[2] = {
|
||||
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
|
||||
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
|
||||
};
|
||||
static atomic_t nr_bdi_congested[2];
|
||||
static atomic_t nr_wb_congested[2];
|
||||
|
||||
void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
|
||||
void clear_wb_congested(struct bdi_writeback_congested *congested, int sync)
|
||||
{
|
||||
enum bdi_state bit;
|
||||
wait_queue_head_t *wqh = &congestion_wqh[sync];
|
||||
enum wb_state bit;
|
||||
|
||||
bit = sync ? BDI_sync_congested : BDI_async_congested;
|
||||
if (test_and_clear_bit(bit, &bdi->state))
|
||||
atomic_dec(&nr_bdi_congested[sync]);
|
||||
bit = sync ? WB_sync_congested : WB_async_congested;
|
||||
if (test_and_clear_bit(bit, &congested->state))
|
||||
atomic_dec(&nr_wb_congested[sync]);
|
||||
smp_mb__after_atomic();
|
||||
if (waitqueue_active(wqh))
|
||||
wake_up(wqh);
|
||||
}
|
||||
EXPORT_SYMBOL(clear_bdi_congested);
|
||||
EXPORT_SYMBOL(clear_wb_congested);
|
||||
|
||||
void set_bdi_congested(struct backing_dev_info *bdi, int sync)
|
||||
void set_wb_congested(struct bdi_writeback_congested *congested, int sync)
|
||||
{
|
||||
enum bdi_state bit;
|
||||
enum wb_state bit;
|
||||
|
||||
bit = sync ? BDI_sync_congested : BDI_async_congested;
|
||||
if (!test_and_set_bit(bit, &bdi->state))
|
||||
atomic_inc(&nr_bdi_congested[sync]);
|
||||
bit = sync ? WB_sync_congested : WB_async_congested;
|
||||
if (!test_and_set_bit(bit, &congested->state))
|
||||
atomic_inc(&nr_wb_congested[sync]);
|
||||
}
|
||||
EXPORT_SYMBOL(set_bdi_congested);
|
||||
EXPORT_SYMBOL(set_wb_congested);
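The set/clear pair above keeps a global count of congested writeback contexts in step with a per-context state bit: the counter changes only when the bit actually flips. A userspace sketch of that pattern with C11 atomics follows. All `sketch_` names are hypothetical stand-ins, and the wait-queue wakeup and memory barrier from clear_wb_congested() are left out.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for WB_async_congested / WB_sync_congested. */
enum sketch_wb_state { SKETCH_async_congested, SKETCH_sync_congested };

struct sketch_congested {
	atomic_ulong state;
};

/* One counter per direction, like nr_wb_congested[2]. */
static atomic_int sketch_nr_congested[2];

static unsigned long sketch_mask(int sync)
{
	assert(sync == 0 || sync == 1);
	return 1UL << (sync ? SKETCH_sync_congested : SKETCH_async_congested);
}

static void sketch_set_congested(struct sketch_congested *c, int sync)
{
	unsigned long mask = sketch_mask(sync);

	/* fetch_or returns the old value: count only a 0 -> 1 transition. */
	if (!(atomic_fetch_or(&c->state, mask) & mask))
		atomic_fetch_add(&sketch_nr_congested[sync], 1);
}

static void sketch_clear_congested(struct sketch_congested *c, int sync)
{
	unsigned long mask = sketch_mask(sync);

	/* fetch_and returns the old value: count only a 1 -> 0 transition. */
	if (atomic_fetch_and(&c->state, ~mask) & mask)
		atomic_fetch_sub(&sketch_nr_congested[sync], 1);
}
```

Because the counter follows bit transitions rather than calls, setting or clearing the same state twice leaves the count unchanged.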
|
||||
|
||||
/**
|
||||
* congestion_wait - wait for a backing_dev to become uncongested
|
||||
@ -555,7 +937,7 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
|
||||
* encountered in the current zone, yield if necessary instead
|
||||
* of sleeping on the congestion queue
|
||||
*/
|
||||
if (atomic_read(&nr_bdi_congested[sync]) == 0 ||
|
||||
if (atomic_read(&nr_wb_congested[sync]) == 0 ||
|
||||
!test_bit(ZONE_CONGESTED, &zone->flags)) {
|
||||
cond_resched();
|
||||
|
||||
|
@ -115,7 +115,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
|
||||
case POSIX_FADV_NOREUSE:
|
||||
break;
|
||||
case POSIX_FADV_DONTNEED:
|
||||
if (!bdi_write_congested(bdi))
|
||||
if (!inode_write_congested(mapping->host))
|
||||
__filemap_fdatawrite_range(mapping, offset, endbyte,
|
||||
WB_SYNC_NONE);
|
||||
|
||||
|
34
mm/filemap.c
@ -100,6 +100,7 @@
|
||||
* ->tree_lock (page_remove_rmap->set_page_dirty)
|
||||
* bdi.wb->list_lock (page_remove_rmap->set_page_dirty)
|
||||
* ->inode->i_lock (page_remove_rmap->set_page_dirty)
|
||||
* ->memcg->move_lock (page_remove_rmap->mem_cgroup_begin_page_stat)
|
||||
* bdi.wb->list_lock (zap_pte_range->set_page_dirty)
|
||||
* ->inode->i_lock (zap_pte_range->set_page_dirty)
|
||||
* ->private_lock (zap_pte_range->__set_page_dirty_buffers)
|
||||
@ -174,9 +175,11 @@ static void page_cache_tree_delete(struct address_space *mapping,
|
||||
/*
|
||||
* Delete a page from the page cache and free it. Caller has to make
|
||||
* sure the page is locked and that nobody else uses it - or that usage
|
||||
* is safe. The caller must hold the mapping's tree_lock.
|
||||
* is safe. The caller must hold the mapping's tree_lock and
|
||||
* mem_cgroup_begin_page_stat().
|
||||
*/
|
||||
void __delete_from_page_cache(struct page *page, void *shadow)
|
||||
void __delete_from_page_cache(struct page *page, void *shadow,
|
||||
struct mem_cgroup *memcg)
|
||||
{
|
||||
struct address_space *mapping = page->mapping;
|
||||
|
||||
@ -212,7 +215,8 @@ void __delete_from_page_cache(struct page *page, void *shadow)
|
||||
* anyway will be cleared before returning page into buddy allocator.
|
||||
*/
|
||||
if (WARN_ON_ONCE(PageDirty(page)))
|
||||
account_page_cleaned(page, mapping);
|
||||
account_page_cleaned(page, mapping, memcg,
|
||||
inode_to_wb(mapping->host));
|
||||
}
|
||||
|
||||
/**
|
||||
@ -226,14 +230,20 @@ void __delete_from_page_cache(struct page *page, void *shadow)
|
||||
void delete_from_page_cache(struct page *page)
|
||||
{
|
||||
struct address_space *mapping = page->mapping;
|
||||
struct mem_cgroup *memcg;
|
||||
unsigned long flags;
|
||||
|
||||
void (*freepage)(struct page *);
|
||||
|
||||
BUG_ON(!PageLocked(page));
|
||||
|
||||
freepage = mapping->a_ops->freepage;
|
||||
spin_lock_irq(&mapping->tree_lock);
|
||||
__delete_from_page_cache(page, NULL);
|
||||
spin_unlock_irq(&mapping->tree_lock);
|
||||
|
||||
memcg = mem_cgroup_begin_page_stat(page);
|
||||
spin_lock_irqsave(&mapping->tree_lock, flags);
|
||||
__delete_from_page_cache(page, NULL, memcg);
|
||||
spin_unlock_irqrestore(&mapping->tree_lock, flags);
|
||||
mem_cgroup_end_page_stat(memcg);
|
||||
|
||||
if (freepage)
|
||||
freepage(page);
|
||||
@ -283,7 +293,9 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
|
||||
if (!mapping_cap_writeback_dirty(mapping))
|
||||
return 0;
|
||||
|
||||
wbc_attach_fdatawrite_inode(&wbc, mapping->host);
|
||||
ret = do_writepages(mapping, &wbc);
|
||||
wbc_detach_inode(&wbc);
|
||||
return ret;
|
||||
}
|
||||
|
||||
@ -472,6 +484,8 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
|
||||
if (!error) {
|
||||
struct address_space *mapping = old->mapping;
|
||||
void (*freepage)(struct page *);
|
||||
struct mem_cgroup *memcg;
|
||||
unsigned long flags;
|
||||
|
||||
pgoff_t offset = old->index;
|
||||
freepage = mapping->a_ops->freepage;
|
||||
@ -480,8 +494,9 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
|
||||
new->mapping = mapping;
|
||||
new->index = offset;
|
||||
|
||||
spin_lock_irq(&mapping->tree_lock);
|
||||
__delete_from_page_cache(old, NULL);
|
||||
memcg = mem_cgroup_begin_page_stat(old);
|
||||
spin_lock_irqsave(&mapping->tree_lock, flags);
|
||||
__delete_from_page_cache(old, NULL, memcg);
|
||||
error = radix_tree_insert(&mapping->page_tree, offset, new);
|
||||
BUG_ON(error);
|
||||
mapping->nrpages++;
|
||||
@ -493,7 +508,8 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
|
||||
__inc_zone_page_state(new, NR_FILE_PAGES);
|
||||
if (PageSwapBacked(new))
|
||||
__inc_zone_page_state(new, NR_SHMEM);
|
||||
spin_unlock_irq(&mapping->tree_lock);
|
||||
spin_unlock_irqrestore(&mapping->tree_lock, flags);
|
||||
mem_cgroup_end_page_stat(memcg);
|
||||
mem_cgroup_migrate(old, new, true);
|
||||
radix_tree_preload_end();
|
||||
if (freepage)
|
||||
|
@ -17,6 +17,7 @@
|
||||
#include <linux/fs.h>
|
||||
#include <linux/file.h>
|
||||
#include <linux/blkdev.h>
|
||||
#include <linux/backing-dev.h>
|
||||
#include <linux/swap.h>
|
||||
#include <linux/swapops.h>
|
||||
|
||||
|
223
mm/memcontrol.c
@ -77,6 +77,7 @@ EXPORT_SYMBOL(memory_cgrp_subsys);
|
||||
|
||||
#define MEM_CGROUP_RECLAIM_RETRIES 5
|
||||
static struct mem_cgroup *root_mem_cgroup __read_mostly;
|
||||
struct cgroup_subsys_state *mem_cgroup_root_css __read_mostly;
|
||||
|
||||
/* Whether the swap controller is active */
|
||||
#ifdef CONFIG_MEMCG_SWAP
|
||||
@ -90,6 +91,7 @@ static const char * const mem_cgroup_stat_names[] = {
|
||||
"rss",
|
||||
"rss_huge",
|
||||
"mapped_file",
|
||||
"dirty",
|
||||
"writeback",
|
||||
"swap",
|
||||
};
|
||||
@ -322,11 +324,6 @@ struct mem_cgroup {
|
||||
* percpu counter.
|
||||
*/
|
||||
struct mem_cgroup_stat_cpu __percpu *stat;
|
||||
/*
|
||||
* used when a cpu is offlined or other synchronizations
|
||||
* See mem_cgroup_read_stat().
|
||||
*/
|
||||
struct mem_cgroup_stat_cpu nocpu_base;
|
||||
spinlock_t pcp_counter_lock;
|
||||
|
||||
#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
|
||||
@ -346,6 +343,11 @@ struct mem_cgroup {
|
||||
atomic_t numainfo_updating;
|
||||
#endif
|
||||
|
||||
#ifdef CONFIG_CGROUP_WRITEBACK
|
||||
struct list_head cgwb_list;
|
||||
struct wb_domain cgwb_domain;
|
||||
#endif
|
||||
|
||||
/* List of events which userspace want to receive */
|
||||
struct list_head event_list;
|
||||
spinlock_t event_list_lock;
|
||||
@ -596,6 +598,39 @@ struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg)
|
||||
return &memcg->css;
|
||||
}
|
||||
|
||||
/**
|
||||
* mem_cgroup_css_from_page - css of the memcg associated with a page
|
||||
* @page: page of interest
|
||||
*
|
||||
* If memcg is bound to the default hierarchy, css of the memcg associated
|
||||
* with @page is returned. The returned css remains associated with @page
|
||||
* until it is released.
|
||||
*
|
||||
* If memcg is bound to a traditional hierarchy, the css of root_mem_cgroup
|
||||
* is returned.
|
||||
*
|
||||
* XXX: The above description of behavior on the default hierarchy isn't
|
||||
* strictly true yet as replace_page_cache_page() can modify the
|
||||
* association before @page is released even on the default hierarchy;
|
||||
* however, the current and planned usages don't mix the two functions
|
||||
* and replace_page_cache_page() will soon be updated to make the invariant
|
||||
* actually true.
|
||||
*/
|
||||
struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page)
|
||||
{
|
||||
struct mem_cgroup *memcg;
|
||||
|
||||
rcu_read_lock();
|
||||
|
||||
memcg = page->mem_cgroup;
|
||||
|
||||
if (!memcg || !cgroup_on_dfl(memcg->css.cgroup))
|
||||
memcg = root_mem_cgroup;
|
||||
|
||||
rcu_read_unlock();
|
||||
return &memcg->css;
|
||||
}
|
||||
|
||||
static struct mem_cgroup_per_zone *
|
||||
mem_cgroup_page_zoneinfo(struct mem_cgroup *memcg, struct page *page)
|
||||
{
|
||||
@ -795,15 +830,8 @@ static long mem_cgroup_read_stat(struct mem_cgroup *memcg,
|
||||
long val = 0;
|
||||
int cpu;
|
||||
|
||||
get_online_cpus();
|
||||
for_each_online_cpu(cpu)
|
||||
for_each_possible_cpu(cpu)
|
||||
val += per_cpu(memcg->stat->count[idx], cpu);
|
||||
#ifdef CONFIG_HOTPLUG_CPU
|
||||
spin_lock(&memcg->pcp_counter_lock);
|
||||
val += memcg->nocpu_base.count[idx];
|
||||
spin_unlock(&memcg->pcp_counter_lock);
|
||||
#endif
|
||||
put_online_cpus();
|
||||
return val;
|
||||
}
|
||||
|
||||
@ -813,15 +841,8 @@ static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
|
||||
unsigned long val = 0;
|
||||
int cpu;
|
||||
|
||||
get_online_cpus();
|
||||
for_each_online_cpu(cpu)
|
||||
for_each_possible_cpu(cpu)
|
||||
val += per_cpu(memcg->stat->events[idx], cpu);
|
||||
#ifdef CONFIG_HOTPLUG_CPU
|
||||
spin_lock(&memcg->pcp_counter_lock);
|
||||
val += memcg->nocpu_base.events[idx];
|
||||
spin_unlock(&memcg->pcp_counter_lock);
|
||||
#endif
|
||||
put_online_cpus();
|
||||
return val;
|
||||
}
|
||||
|
||||
@ -2020,6 +2041,7 @@ again:
|
||||
|
||||
return memcg;
|
||||
}
|
||||
EXPORT_SYMBOL(mem_cgroup_begin_page_stat);
|
||||
|
||||
/**
|
||||
* mem_cgroup_end_page_stat - finish a page state statistics transaction
|
||||
@ -2038,6 +2060,7 @@ void mem_cgroup_end_page_stat(struct mem_cgroup *memcg)
|
||||
|
||||
rcu_read_unlock();
|
||||
}
|
||||
EXPORT_SYMBOL(mem_cgroup_end_page_stat);
|
||||
|
||||
/**
|
||||
* mem_cgroup_update_page_stat - update page state statistics
|
||||
@ -2178,37 +2201,12 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
|
||||
mutex_unlock(&percpu_charge_mutex);
|
||||
}
|
||||
|
||||
/*
|
||||
* This function drains percpu counter value from DEAD cpu and
|
||||
* move it to local cpu. Note that this function can be preempted.
|
||||
*/
|
||||
static void mem_cgroup_drain_pcp_counter(struct mem_cgroup *memcg, int cpu)
|
||||
{
|
||||
int i;
|
||||
|
||||
spin_lock(&memcg->pcp_counter_lock);
|
||||
for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
|
||||
long x = per_cpu(memcg->stat->count[i], cpu);
|
||||
|
||||
per_cpu(memcg->stat->count[i], cpu) = 0;
|
||||
memcg->nocpu_base.count[i] += x;
|
||||
}
|
||||
for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) {
|
||||
unsigned long x = per_cpu(memcg->stat->events[i], cpu);
|
||||
|
||||
per_cpu(memcg->stat->events[i], cpu) = 0;
|
||||
memcg->nocpu_base.events[i] += x;
|
||||
}
|
||||
spin_unlock(&memcg->pcp_counter_lock);
|
||||
}
|
||||
|
||||
static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
|
||||
unsigned long action,
|
||||
void *hcpu)
|
||||
{
|
||||
int cpu = (unsigned long)hcpu;
|
||||
struct memcg_stock_pcp *stock;
|
||||
struct mem_cgroup *iter;
|
||||
|
||||
if (action == CPU_ONLINE)
|
||||
return NOTIFY_OK;
|
||||
@ -2216,9 +2214,6 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
|
||||
if (action != CPU_DEAD && action != CPU_DEAD_FROZEN)
|
||||
return NOTIFY_OK;
|
||||
|
||||
for_each_mem_cgroup(iter)
|
||||
mem_cgroup_drain_pcp_counter(iter, cpu);
|
||||
|
||||
stock = &per_cpu(memcg_stock, cpu);
|
||||
drain_stock(stock);
|
||||
return NOTIFY_OK;
|
||||
@ -4004,6 +3999,98 @@ static void memcg_destroy_kmem(struct mem_cgroup *memcg)
|
||||
}
|
||||
#endif
|
||||
|
||||
#ifdef CONFIG_CGROUP_WRITEBACK
|
||||
|
||||
struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg)
|
||||
{
|
||||
return &memcg->cgwb_list;
|
||||
}
|
||||
|
||||
static int memcg_wb_domain_init(struct mem_cgroup *memcg, gfp_t gfp)
|
||||
{
|
||||
return wb_domain_init(&memcg->cgwb_domain, gfp);
|
||||
}
|
||||
|
||||
static void memcg_wb_domain_exit(struct mem_cgroup *memcg)
|
||||
{
|
||||
wb_domain_exit(&memcg->cgwb_domain);
|
||||
}
|
||||
|
||||
static void memcg_wb_domain_size_changed(struct mem_cgroup *memcg)
|
||||
{
|
||||
wb_domain_size_changed(&memcg->cgwb_domain);
|
||||
}
|
||||
|
||||
struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
|
||||
{
|
||||
struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
|
||||
|
||||
if (!memcg->css.parent)
|
||||
return NULL;
|
||||
|
||||
return &memcg->cgwb_domain;
|
||||
}
|
||||
|
||||
/**
|
||||
* mem_cgroup_wb_stats - retrieve writeback related stats from its memcg
|
||||
* @wb: bdi_writeback in question
|
||||
* @pavail: out parameter for number of available pages
|
||||
* @pdirty: out parameter for number of dirty pages
|
||||
* @pwriteback: out parameter for number of pages under writeback
|
||||
*
|
||||
* Determine the numbers of available, dirty, and writeback pages in @wb's
|
||||
* memcg. Dirty and writeback are self-explanatory. Available is a bit
|
||||
* more involved.
|
||||
*
|
||||
* A memcg's headroom is "min(max, high) - used". The available memory is
|
||||
* calculated as the lowest headroom of itself and the ancestors plus the
|
||||
* number of pages already being used for file pages. Note that this
|
||||
* doesn't consider the actual amount of available memory in the system.
|
||||
* The caller should further cap *@pavail accordingly.
|
||||
*/
|
||||
void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pavail,
|
||||
unsigned long *pdirty, unsigned long *pwriteback)
|
||||
{
|
||||
struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
|
||||
struct mem_cgroup *parent;
|
||||
unsigned long head_room = PAGE_COUNTER_MAX;
|
||||
unsigned long file_pages;
|
||||
|
||||
*pdirty = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_DIRTY);
|
||||
|
||||
/* this should eventually include NR_UNSTABLE_NFS */
|
||||
*pwriteback = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
|
||||
|
||||
file_pages = mem_cgroup_nr_lru_pages(memcg, (1 << LRU_INACTIVE_FILE) |
|
||||
(1 << LRU_ACTIVE_FILE));
|
||||
while ((parent = parent_mem_cgroup(memcg))) {
|
||||
unsigned long ceiling = min(memcg->memory.limit, memcg->high);
|
||||
unsigned long used = page_counter_read(&memcg->memory);
|
||||
|
||||
head_room = min(head_room, ceiling - min(ceiling, used));
|
||||
memcg = parent;
|
||||
}
|
||||
|
||||
*pavail = file_pages + head_room;
|
||||
}
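The headroom rule described in the comment for mem_cgroup_wb_stats() can be modeled in userspace as below. This is a sketch under stated assumptions: `struct sketch_memcg`, `sketch_wb_avail()` and `SKETCH_COUNTER_MAX` are hypothetical stand-ins for the kernel's memcg, page_counter and PAGE_COUNTER_MAX, and the file page count is passed in rather than read from the LRU lists.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Stand-in for PAGE_COUNTER_MAX; small enough that adding file pages cannot wrap. */
#define SKETCH_COUNTER_MAX (ULONG_MAX / 2 / 4096)

/* Hypothetical, flattened view of the fields the calculation reads. */
struct sketch_memcg {
	struct sketch_memcg *parent;	/* NULL for the root */
	unsigned long max;		/* hard limit, in pages */
	unsigned long high;		/* high boundary, in pages */
	unsigned long used;		/* current usage, in pages */
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Lowest headroom of @memcg and its non-root ancestors, plus file pages. */
static unsigned long sketch_wb_avail(const struct sketch_memcg *memcg,
				     unsigned long file_pages)
{
	unsigned long head_room = SKETCH_COUNTER_MAX;

	assert(memcg != NULL);

	/* Like the loop above, stop before the root, which has no parent. */
	for (; memcg->parent; memcg = memcg->parent) {
		unsigned long ceiling = min_ul(memcg->max, memcg->high);

		/* Clamp usage to the ceiling so an overshoot gives 0, not wraparound. */
		head_room = min_ul(head_room,
				   ceiling - min_ul(ceiling, memcg->used));
	}

	return file_pages + head_room;
}
```

The result is only an upper bound derived from cgroup limits. As the comment above notes, the caller still has to cap it against the memory actually available in the system.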
|
||||
|
||||
#else /* CONFIG_CGROUP_WRITEBACK */
|
||||
|
||||
static int memcg_wb_domain_init(struct mem_cgroup *memcg, gfp_t gfp)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
|
||||
static void memcg_wb_domain_exit(struct mem_cgroup *memcg)
|
||||
{
|
||||
}
|
||||
|
||||
static void memcg_wb_domain_size_changed(struct mem_cgroup *memcg)
|
||||
{
|
||||
}
|
||||
|
||||
#endif /* CONFIG_CGROUP_WRITEBACK */
|
||||
|
||||
/*
|
||||
* DO NOT USE IN NEW FILES.
|
||||
*
|
||||
@ -4388,9 +4475,15 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
|
||||
memcg->stat = alloc_percpu(struct mem_cgroup_stat_cpu);
|
||||
if (!memcg->stat)
|
||||
goto out_free;
|
||||
|
||||
if (memcg_wb_domain_init(memcg, GFP_KERNEL))
|
||||
goto out_free_stat;
|
||||
|
||||
spin_lock_init(&memcg->pcp_counter_lock);
|
||||
return memcg;
|
||||
|
||||
out_free_stat:
|
||||
free_percpu(memcg->stat);
|
||||
out_free:
|
||||
kfree(memcg);
|
||||
return NULL;
|
||||
@ -4417,6 +4510,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
|
||||
free_mem_cgroup_per_zone_info(memcg, node);
|
||||
|
||||
free_percpu(memcg->stat);
|
||||
memcg_wb_domain_exit(memcg);
|
||||
kfree(memcg);
|
||||
}
|
||||
|
||||
@ -4449,6 +4543,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
|
||||
/* root ? */
|
||||
if (parent_css == NULL) {
|
||||
root_mem_cgroup = memcg;
|
||||
mem_cgroup_root_css = &memcg->css;
|
||||
page_counter_init(&memcg->memory, NULL);
|
||||
memcg->high = PAGE_COUNTER_MAX;
|
||||
memcg->soft_limit = PAGE_COUNTER_MAX;
|
||||
@ -4467,7 +4562,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
|
||||
#ifdef CONFIG_MEMCG_KMEM
|
||||
memcg->kmemcg_id = -1;
|
||||
#endif
|
||||
|
||||
#ifdef CONFIG_CGROUP_WRITEBACK
|
||||
INIT_LIST_HEAD(&memcg->cgwb_list);
|
||||
#endif
|
||||
return &memcg->css;
|
||||
|
||||
free_out:
|
||||
@ -4555,6 +4652,8 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
|
||||
vmpressure_cleanup(&memcg->vmpressure);
|
||||
|
||||
memcg_deactivate_kmem(memcg);
|
||||
|
||||
wb_memcg_offline(memcg);
|
||||
}
|
||||
|
||||
static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
|
||||
@ -4588,6 +4687,7 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
|
||||
memcg->low = 0;
|
||||
memcg->high = PAGE_COUNTER_MAX;
|
||||
memcg->soft_limit = PAGE_COUNTER_MAX;
|
||||
memcg_wb_domain_size_changed(memcg);
|
||||
}
|
||||
|
||||
#ifdef CONFIG_MMU
|
||||
@ -4757,6 +4857,7 @@ static int mem_cgroup_move_account(struct page *page,
|
||||
{
|
||||
unsigned long flags;
|
||||
int ret;
|
||||
bool anon;
|
||||
|
||||
VM_BUG_ON(from == to);
|
||||
VM_BUG_ON_PAGE(PageLRU(page), page);
|
||||
@ -4782,15 +4883,33 @@ static int mem_cgroup_move_account(struct page *page,
|
||||
if (page->mem_cgroup != from)
|
||||
goto out_unlock;
|
||||
|
||||
anon = PageAnon(page);
|
||||
|
||||
spin_lock_irqsave(&from->move_lock, flags);
|
||||
|
||||
if (!PageAnon(page) && page_mapped(page)) {
|
||||
if (!anon && page_mapped(page)) {
|
||||
__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
|
||||
nr_pages);
|
||||
__this_cpu_add(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
|
||||
nr_pages);
|
||||
}
|
||||
|
||||
/*
|
||||
* move_lock grabbed above and caller set from->moving_account, so
|
||||
* mem_cgroup_update_page_stat() will serialize updates to PageDirty.
|
||||
* So mapping should be stable for dirty pages.
|
||||
*/
|
||||
if (!anon && PageDirty(page)) {
|
||||
struct address_space *mapping = page_mapping(page);
|
||||
|
||||
if (mapping_cap_account_dirty(mapping)) {
|
||||
__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_DIRTY],
|
||||
nr_pages);
|
||||
__this_cpu_add(to->stat->count[MEM_CGROUP_STAT_DIRTY],
|
||||
nr_pages);
|
||||
}
|
||||
}
|
||||
|
||||
if (PageWriteback(page)) {
|
||||
__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_WRITEBACK],
|
||||
nr_pages);
|
||||
@ -5306,6 +5425,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
|
||||
|
||||
memcg->high = high;
|
||||
|
||||
memcg_wb_domain_size_changed(memcg);
|
||||
return nbytes;
|
||||
}
|
||||
|
||||
@ -5338,6 +5458,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
|
||||
if (err)
|
||||
return err;
|
||||
|
||||
memcg_wb_domain_size_changed(memcg);
|
||||
return nbytes;
|
||||
}
|
||||
|
||||
|
1233
mm/page-writeback.c
File diff suppressed because it is too large
@ -541,7 +541,7 @@ page_cache_async_readahead(struct address_space *mapping,
|
||||
/*
|
||||
* Defer asynchronous read-ahead on IO congestion.
|
||||
*/
|
||||
if (bdi_read_congested(inode_to_bdi(mapping->host)))
|
||||
if (inode_read_congested(mapping->host))
|
||||
return;
|
||||
|
||||
/* do read-ahead */
|
||||
|
@ -30,6 +30,8 @@
|
||||
* swap_lock (in swap_duplicate, swap_info_get)
|
||||
* mmlist_lock (in mmput, drain_mmlist and others)
|
||||
* mapping->private_lock (in __set_page_dirty_buffers)
|
||||
* mem_cgroup_{begin,end}_page_stat (memcg->move_lock)
|
||||
* mapping->tree_lock (widely used)
|
||||
* inode->i_lock (in set_page_dirty's __mark_inode_dirty)
|
||||
* bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
|
||||
* sb_lock (within inode_lock in fs/fs-writeback.c)
|
||||
|
@ -116,9 +116,7 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
|
||||
* the VM has canceled the dirty bit (eg ext3 journaling).
|
||||
* Hence dirty accounting check is placed after invalidation.
|
||||
*/
|
||||
if (TestClearPageDirty(page))
|
||||
account_page_cleaned(page, mapping);
|
||||
|
||||
cancel_dirty_page(page);
|
||||
ClearPageMappedToDisk(page);
|
||||
delete_from_page_cache(page);
|
||||
return 0;
|
||||
@ -512,19 +510,24 @@ EXPORT_SYMBOL(invalidate_mapping_pages);
|
||||
static int
|
||||
invalidate_complete_page2(struct address_space *mapping, struct page *page)
|
||||
{
|
||||
struct mem_cgroup *memcg;
|
||||
unsigned long flags;
|
||||
|
||||
if (page->mapping != mapping)
|
||||
return 0;
|
||||
|
||||
if (page_has_private(page) && !try_to_release_page(page, GFP_KERNEL))
|
||||
return 0;
|
||||
|
||||
spin_lock_irq(&mapping->tree_lock);
|
||||
memcg = mem_cgroup_begin_page_stat(page);
|
||||
spin_lock_irqsave(&mapping->tree_lock, flags);
|
||||
if (PageDirty(page))
|
||||
goto failed;
|
||||
|
||||
BUG_ON(page_has_private(page));
|
||||
__delete_from_page_cache(page, NULL);
|
||||
spin_unlock_irq(&mapping->tree_lock);
|
||||
__delete_from_page_cache(page, NULL, memcg);
|
||||
spin_unlock_irqrestore(&mapping->tree_lock, flags);
|
||||
mem_cgroup_end_page_stat(memcg);
|
||||
|
||||
if (mapping->a_ops->freepage)
|
||||
mapping->a_ops->freepage(page);
|
||||
@ -532,7 +535,8 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
|
||||
page_cache_release(page); /* pagecache ref */
|
||||
return 1;
|
||||
failed:
|
||||
spin_unlock_irq(&mapping->tree_lock);
|
||||
spin_unlock_irqrestore(&mapping->tree_lock, flags);
|
||||
mem_cgroup_end_page_stat(memcg);
|
||||
return 0;
|
||||
}
|
||||
|
||||
|
79
mm/vmscan.c
@ -154,11 +154,42 @@ static bool global_reclaim(struct scan_control *sc)
|
||||
{
|
||||
return !sc->target_mem_cgroup;
|
||||
}
|
||||
|
||||
/**
|
||||
* sane_reclaim - is the usual dirty throttling mechanism operational?
|
||||
* @sc: scan_control in question
|
||||
*
|
||||
* The normal page dirty throttling mechanism in balance_dirty_pages() is
|
||||
* completely broken with the legacy memcg and direct stalling in
|
||||
* shrink_page_list() is used for throttling instead, which lacks all the
|
||||
* niceties such as fairness, adaptive pausing, bandwidth proportional
|
||||
* allocation and configurability.
|
||||
*
|
||||
* This function tests whether the vmscan currently in progress can assume
|
||||
* that the normal dirty throttling mechanism is operational.
|
||||
*/
|
||||
static bool sane_reclaim(struct scan_control *sc)
|
||||
{
|
||||
struct mem_cgroup *memcg = sc->target_mem_cgroup;
|
||||
|
||||
if (!memcg)
|
||||
return true;
|
||||
#ifdef CONFIG_CGROUP_WRITEBACK
|
||||
if (cgroup_on_dfl(mem_cgroup_css(memcg)->cgroup))
|
||||
return true;
|
||||
#endif
|
||||
return false;
|
||||
}
|
||||
#else
|
||||
static bool global_reclaim(struct scan_control *sc)
|
||||
{
|
||||
return true;
|
||||
}
|
||||
|
||||
static bool sane_reclaim(struct scan_control *sc)
|
||||
{
|
||||
return true;
|
||||
}
|
||||
#endif
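The decision made by sane_reclaim() can be restated as a small truth table: global reclaim is always "sane", and memcg reclaim is sane only when cgroup writeback is compiled in and the memcg sits on the default hierarchy. The sketch below models that in userspace. The struct layouts and the `cgroup_writeback_enabled` flag are hypothetical; in the kernel the latter is the compile-time CONFIG_CGROUP_WRITEBACK option, not a runtime argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reductions of mem_cgroup and scan_control. */
struct sketch_memcg {
	bool on_default_hierarchy;	/* models cgroup_on_dfl() */
};

struct sketch_scan_control {
	const struct sketch_memcg *target_mem_cgroup;	/* NULL for global reclaim */
};

static bool sketch_sane_reclaim(const struct sketch_scan_control *sc,
				bool cgroup_writeback_enabled)
{
	const struct sketch_memcg *memcg;

	assert(sc != NULL);
	memcg = sc->target_mem_cgroup;

	/* Global reclaim: normal dirty throttling applies. */
	if (!memcg)
		return true;

	/* memcg reclaim with cgroup writeback on the default hierarchy. */
	if (cgroup_writeback_enabled && memcg->on_default_hierarchy)
		return true;

	/* Legacy memcg: fall back to stalling in shrink_page_list(). */
	return false;
}
```

Only the last case, legacy memcg reclaim, still relies on the direct stalling path that the callers above guard with !sane_reclaim(sc).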
|
||||
|
||||
static unsigned long zone_reclaimable_pages(struct zone *zone)
|
||||
@ -452,14 +483,13 @@ static inline int is_page_cache_freeable(struct page *page)
|
||||
return page_count(page) - page_has_private(page) == 2;
|
||||
}
|
||||
|
||||
static int may_write_to_queue(struct backing_dev_info *bdi,
|
||||
struct scan_control *sc)
|
||||
static int may_write_to_inode(struct inode *inode, struct scan_control *sc)
|
||||
{
|
||||
if (current->flags & PF_SWAPWRITE)
|
||||
return 1;
|
||||
if (!bdi_write_congested(bdi))
|
||||
if (!inode_write_congested(inode))
|
||||
return 1;
|
||||
if (bdi == current->backing_dev_info)
|
||||
if (inode_to_bdi(inode) == current->backing_dev_info)
|
||||
return 1;
|
||||
return 0;
|
||||
}
|
||||
@ -538,7 +568,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
|
||||
}
|
||||
if (mapping->a_ops->writepage == NULL)
|
||||
return PAGE_ACTIVATE;
|
||||
if (!may_write_to_queue(inode_to_bdi(mapping->host), sc))
|
||||
if (!may_write_to_inode(mapping->host, sc))
|
||||
return PAGE_KEEP;
|
||||
|
||||
if (clear_page_dirty_for_io(page)) {
|
||||
@ -579,10 +609,14 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
|
||||
static int __remove_mapping(struct address_space *mapping, struct page *page,
|
||||
bool reclaimed)
|
||||
{
|
||||
unsigned long flags;
|
||||
struct mem_cgroup *memcg;
|
||||
|
||||
BUG_ON(!PageLocked(page));
|
||||
BUG_ON(mapping != page_mapping(page));
|
||||
|
||||
spin_lock_irq(&mapping->tree_lock);
|
||||
memcg = mem_cgroup_begin_page_stat(page);
|
||||
spin_lock_irqsave(&mapping->tree_lock, flags);
|
||||
/*
|
||||
* The non racy check for a busy page.
|
||||
*
|
||||
@ -620,7 +654,8 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
|
||||
swp_entry_t swap = { .val = page_private(page) };
|
||||
mem_cgroup_swapout(page, swap);
|
||||
__delete_from_swap_cache(page);
|
||||
spin_unlock_irq(&mapping->tree_lock);
|
||||
spin_unlock_irqrestore(&mapping->tree_lock, flags);
|
||||
mem_cgroup_end_page_stat(memcg);
|
||||
swapcache_free(swap);
|
||||
} else {
|
||||
void (*freepage)(struct page *);
|
||||
@ -640,8 +675,9 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
|
||||
if (reclaimed && page_is_file_cache(page) &&
|
||||
!mapping_exiting(mapping))
|
||||
shadow = workingset_eviction(mapping, page);
|
||||
__delete_from_page_cache(page, shadow);
|
||||
spin_unlock_irq(&mapping->tree_lock);
|
||||
__delete_from_page_cache(page, shadow, memcg);
|
||||
spin_unlock_irqrestore(&mapping->tree_lock, flags);
|
||||
mem_cgroup_end_page_stat(memcg);
|
||||
|
||||
if (freepage != NULL)
|
||||
freepage(page);
|
||||
@ -650,7 +686,8 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
|
||||
return 1;
|
||||
|
||||
cannot_free:
|
||||
spin_unlock_irq(&mapping->tree_lock);
|
||||
spin_unlock_irqrestore(&mapping->tree_lock, flags);
|
||||
mem_cgroup_end_page_stat(memcg);
|
||||
return 0;
|
||||
}
|
||||
|
||||
@ -917,7 +954,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
|
||||
*/
|
||||
mapping = page_mapping(page);
|
||||
if (((dirty || writeback) && mapping &&
|
||||
bdi_write_congested(inode_to_bdi(mapping->host))) ||
|
||||
inode_write_congested(mapping->host)) ||
|
||||
(writeback && PageReclaim(page)))
|
||||
nr_congested++;
|
||||
|
||||
@ -935,10 +972,10 @@ static unsigned long shrink_page_list(struct list_head *page_list,
|
||||
* note that the LRU is being scanned too quickly and the
|
||||
* caller can stall after page list has been processed.
|
||||
*
|
||||
* 2) Global reclaim encounters a page, memcg encounters a
|
||||
* page that is not marked for immediate reclaim or
|
||||
* the caller does not have __GFP_IO. In this case mark
|
||||
* the page for immediate reclaim and continue scanning.
|
||||
* 2) Global or new memcg reclaim encounters a page that is
|
||||
* not marked for immediate reclaim or the caller does not
|
||||
* have __GFP_IO. In this case mark the page for immediate
|
||||
* reclaim and continue scanning.
|
||||
*
|
||||
* __GFP_IO is checked because a loop driver thread might
|
||||
* enter reclaim, and deadlock if it waits on a page for
|
||||
@ -952,7 +989,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
|
||||
* grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so testing
|
||||
* may_enter_fs here is liable to OOM on them.
|
||||
*
|
||||
* 3) memcg encounters a page that is not already marked
|
||||
* 3) Legacy memcg encounters a page that is not already marked
|
||||
* PageReclaim. memcg does not have any dirty pages
|
||||
* throttling so we could easily OOM just because too many
|
||||
* pages are in writeback and there is nothing else to
|
||||
@ -967,7 +1004,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
|
||||
goto keep_locked;
|
||||
|
||||
/* Case 2 above */
|
||||
} else if (global_reclaim(sc) ||
|
||||
} else if (sane_reclaim(sc) ||
|
||||
!PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) {
|
||||
/*
|
||||
* This is slightly racy - end_page_writeback()
|
||||
@ -1416,7 +1453,7 @@ static int too_many_isolated(struct zone *zone, int file,
|
||||
if (current_is_kswapd())
|
||||
return 0;
|
||||
|
||||
if (!global_reclaim(sc))
|
||||
if (!sane_reclaim(sc))
|
||||
return 0;
|
||||
|
||||
if (file) {
|
||||
@ -1608,10 +1645,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
|
||||
set_bit(ZONE_WRITEBACK, &zone->flags);
|
||||
|
||||
/*
|
||||
* memcg will stall in page writeback so only consider forcibly
|
||||
* stalling for global reclaim
|
||||
* Legacy memcg will stall in page writeback so avoid forcibly
|
||||
* stalling here.
|
||||
*/
|
||||
if (global_reclaim(sc)) {
|
||||
if (sane_reclaim(sc)) {
|
||||
/*
|
||||
* Tag a zone as congested if all the dirty pages scanned were
|
||||
* backed by a congested BDI and wait_iff_congested will stall.
|
||||
|