Merge branch 'for-4.2/writeback' of git://git.kernel.dk/linux-block

Pull cgroup writeback support from Jens Axboe:
 "This is the big pull request for adding cgroup writeback support.

  This code has been in development for a long time, and it has been
  simmering in for-next for a good chunk of this cycle too.  This is one
  of those problems that has been talked about for at least half a
  decade, finally there's a solution and code to go with it.

  Also see last weeks writeup on LWN:

        http://lwn.net/Articles/648292/"

* 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
  writeback, blkio: add documentation for cgroup writeback support
  vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
  writeback: do foreign inode detection iff cgroup writeback is enabled
  v9fs: fix error handling in v9fs_session_init()
  bdi: fix wrong error return value in cgwb_create()
  buffer: remove unusued 'ret' variable
  writeback: disassociate inodes from dying bdi_writebacks
  writeback: implement foreign cgroup inode bdi_writeback switching
  writeback: add lockdep annotation to inode_to_wb()
  writeback: use unlocked_inode_to_wb transaction in inode_congested()
  writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
  writeback: implement [locked_]inode_to_wb_and_lock_list()
  writeback: implement foreign cgroup inode detection
  writeback: make writeback_control track the inode being written back
  writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
  mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
  writeback: implement memcg writeback domain based throttling
  writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
  writeback: implement memcg wb_domain
  writeback: update wb_over_bg_thresh() to use wb_domain aware operations
  ...
This commit is contained in:
Linus Torvalds 2015-06-25 16:00:17 -07:00
commit e4bc13adfd
74 changed files with 3907 additions and 1259 deletions

View File

@ -387,8 +387,81 @@ groups and put applications in that group which are not driving enough
IO to keep disk busy. In that case set group_idle=0, and CFQ will not idle
on individual groups and throughput should improve.
What works
==========
- Currently only sync IO queues are support. All the buffered writes are
still system wide and not per group. Hence we will not see service
differentiation between buffered writes between groups.
Writeback
=========
Page cache is dirtied through buffered writes and shared mmaps and
written asynchronously to the backing filesystem by the writeback
mechanism. Writeback sits between the memory and IO domains and
regulates the proportion of dirty memory by balancing dirtying and
write IOs.
On traditional cgroup hierarchies, relationships between different
controllers cannot be established making it impossible for writeback
to operate accounting for cgroup resource restrictions and all
writeback IOs are attributed to the root cgroup.
If both the blkio and memory controllers are used on the v2 hierarchy
and the filesystem supports cgroup writeback, writeback operations
correctly follow the resource restrictions imposed by both memory and
blkio controllers.
Writeback examines both system-wide and per-cgroup dirty memory status
and enforces the more restrictive of the two. Also, writeback control
parameters which are absolute values - vm.dirty_bytes and
vm.dirty_background_bytes - are distributed across cgroups according
to their current writeback bandwidth.
There's a peculiarity stemming from the discrepancy in ownership
granularity between memory controller and writeback. While memory
controller tracks ownership per page, writeback operates on inode
basis. cgroup writeback bridges the gap by tracking ownership by
inode but migrating ownership if too many foreign pages, pages which
don't match the current inode ownership, have been encountered while
writing back the inode.
This is a conscious design choice as writeback operations are
inherently tied to inodes making strictly following page ownership
complicated and inefficient. The only use case which suffers from
this compromise is multiple cgroups concurrently dirtying disjoint
regions of the same inode, which is an unlikely use case and decided
to be unsupported. Note that as memory controller assigns page
ownership on the first use and doesn't update it until the page is
released, even if cgroup writeback strictly follows page ownership,
multiple cgroups dirtying overlapping areas wouldn't work as expected.
In general, write-sharing an inode across multiple cgroups is not well
supported.
Filesystem support for cgroup writeback
---------------------------------------
A filesystem can make writeback IOs cgroup-aware by updating
address_space_operations->writepage[s]() to annotate bio's using the
following two functions.
* wbc_init_bio(@wbc, @bio)
Should be called for each bio carrying writeback data and associates
the bio with the inode's owner cgroup. Can be called anytime
between bio allocation and submission.
* wbc_account_io(@wbc, @page, @bytes)
Should be called for each data segment being written out. While
this function doesn't care exactly when it's called during the
writeback session, it's the easiest and most natural to call it as
data segments are added to a bio.
With writeback bio's annotated, cgroup support can be enabled per
super_block by setting MS_CGROUPWB in ->s_flags. This allows for
selective disabling of cgroup writeback support which is helpful when
certain filesystem features, e.g. journaled data mode, are
incompatible.
wbc_init_bio() binds the specified bio to its cgroup. Depending on
the configuration, the bio may be executed at a lower priority and if
the writeback session is holding shared resources, e.g. a journal
entry, may lead to priority inversion. There is no one easy solution
for the problem. Filesystems can try to work around specific problem
cases by skipping wbc_init_bio() or using bio_associate_blkcg()
directly.

View File

@ -493,6 +493,7 @@ pgpgin - # of charging events to the memory cgroup. The charging
pgpgout - # of uncharging events to the memory cgroup. The uncharging
event happens each time a page is unaccounted from the cgroup.
swap - # of bytes of swap usage
dirty - # of bytes that are waiting to get written back to the disk.
writeback - # of bytes of file/anon cache that are queued for syncing to
disk.
inactive_anon - # of bytes of anonymous and swap cache memory on inactive

View File

@ -1988,6 +1988,28 @@ struct bio_set *bioset_create_nobvec(unsigned int pool_size, unsigned int front_
EXPORT_SYMBOL(bioset_create_nobvec);
#ifdef CONFIG_BLK_CGROUP
/**
* bio_associate_blkcg - associate a bio with the specified blkcg
* @bio: target bio
* @blkcg_css: css of the blkcg to associate
*
* Associate @bio with the blkcg specified by @blkcg_css. Block layer will
* treat @bio as if it were issued by a task which belongs to the blkcg.
*
* This function takes an extra reference of @blkcg_css which will be put
* when @bio is released. The caller must own @bio and is responsible for
* synchronizing calls to this function.
*/
int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css)
{
if (unlikely(bio->bi_css))
return -EBUSY;
css_get(blkcg_css);
bio->bi_css = blkcg_css;
return 0;
}
/**
* bio_associate_current - associate a bio with %current
* @bio: target bio
@ -2004,26 +2026,17 @@ EXPORT_SYMBOL(bioset_create_nobvec);
int bio_associate_current(struct bio *bio)
{
struct io_context *ioc;
struct cgroup_subsys_state *css;
if (bio->bi_ioc)
if (bio->bi_css)
return -EBUSY;
ioc = current->io_context;
if (!ioc)
return -ENOENT;
/* acquire active ref on @ioc and associate */
get_io_context_active(ioc);
bio->bi_ioc = ioc;
/* associate blkcg if exists */
rcu_read_lock();
css = task_css(current, blkio_cgrp_id);
if (css && css_tryget_online(css))
bio->bi_css = css;
rcu_read_unlock();
bio->bi_css = task_get_css(current, blkio_cgrp_id);
return 0;
}

View File

@ -19,11 +19,12 @@
#include <linux/module.h>
#include <linux/err.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/slab.h>
#include <linux/genhd.h>
#include <linux/delay.h>
#include <linux/atomic.h>
#include "blk-cgroup.h"
#include <linux/blk-cgroup.h>
#include "blk.h"
#define MAX_KEY_LEN 100
@ -33,6 +34,8 @@ static DEFINE_MUTEX(blkcg_pol_mutex);
struct blkcg blkcg_root;
EXPORT_SYMBOL_GPL(blkcg_root);
struct cgroup_subsys_state * const blkcg_root_css = &blkcg_root.css;
static struct blkcg_policy *blkcg_policy[BLKCG_MAX_POLS];
static bool blkcg_policy_enabled(struct request_queue *q,
@ -182,6 +185,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
struct blkcg_gq *new_blkg)
{
struct blkcg_gq *blkg;
struct bdi_writeback_congested *wb_congested;
int i, ret;
WARN_ON_ONCE(!rcu_read_lock_held());
@ -193,22 +197,30 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
goto err_free_blkg;
}
wb_congested = wb_congested_get_create(&q->backing_dev_info,
blkcg->css.id, GFP_ATOMIC);
if (!wb_congested) {
ret = -ENOMEM;
goto err_put_css;
}
/* allocate */
if (!new_blkg) {
new_blkg = blkg_alloc(blkcg, q, GFP_ATOMIC);
if (unlikely(!new_blkg)) {
ret = -ENOMEM;
goto err_put_css;
goto err_put_congested;
}
}
blkg = new_blkg;
blkg->wb_congested = wb_congested;
/* link parent */
if (blkcg_parent(blkcg)) {
blkg->parent = __blkg_lookup(blkcg_parent(blkcg), q, false);
if (WARN_ON_ONCE(!blkg->parent)) {
ret = -EINVAL;
goto err_put_css;
goto err_put_congested;
}
blkg_get(blkg->parent);
}
@ -238,18 +250,15 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
blkg->online = true;
spin_unlock(&blkcg->lock);
if (!ret) {
if (blkcg == &blkcg_root) {
q->root_blkg = blkg;
q->root_rl.blkg = blkg;
}
if (!ret)
return blkg;
}
/* @blkg failed fully initialized, use the usual release path */
blkg_put(blkg);
return ERR_PTR(ret);
err_put_congested:
wb_congested_put(wb_congested);
err_put_css:
css_put(&blkcg->css);
err_free_blkg:
@ -342,15 +351,6 @@ static void blkg_destroy(struct blkcg_gq *blkg)
if (rcu_access_pointer(blkcg->blkg_hint) == blkg)
rcu_assign_pointer(blkcg->blkg_hint, NULL);
/*
* If root blkg is destroyed. Just clear the pointer since root_rl
* does not take reference on root blkg.
*/
if (blkcg == &blkcg_root) {
blkg->q->root_blkg = NULL;
blkg->q->root_rl.blkg = NULL;
}
/*
* Put the reference taken at the time of creation so that when all
* queues are gone, group can be destroyed.
@ -405,6 +405,8 @@ void __blkg_release_rcu(struct rcu_head *rcu_head)
if (blkg->parent)
blkg_put(blkg->parent);
wb_congested_put(blkg->wb_congested);
blkg_free(blkg);
}
EXPORT_SYMBOL_GPL(__blkg_release_rcu);
@ -812,6 +814,8 @@ static void blkcg_css_offline(struct cgroup_subsys_state *css)
}
spin_unlock_irq(&blkcg->lock);
wb_blkcg_offline(blkcg);
}
static void blkcg_css_free(struct cgroup_subsys_state *css)
@ -868,7 +872,9 @@ done:
spin_lock_init(&blkcg->lock);
INIT_RADIX_TREE(&blkcg->blkg_tree, GFP_ATOMIC);
INIT_HLIST_HEAD(&blkcg->blkg_list);
#ifdef CONFIG_CGROUP_WRITEBACK
INIT_LIST_HEAD(&blkcg->cgwb_list);
#endif
return &blkcg->css;
free_pd_blkcg:
@ -892,9 +898,45 @@ free_blkcg:
*/
int blkcg_init_queue(struct request_queue *q)
{
might_sleep();
struct blkcg_gq *new_blkg, *blkg;
bool preloaded;
int ret;
return blk_throtl_init(q);
new_blkg = blkg_alloc(&blkcg_root, q, GFP_KERNEL);
if (!new_blkg)
return -ENOMEM;
preloaded = !radix_tree_preload(GFP_KERNEL);
/*
* Make sure the root blkg exists and count the existing blkgs. As
* @q is bypassing at this point, blkg_lookup_create() can't be
* used. Open code insertion.
*/
rcu_read_lock();
spin_lock_irq(q->queue_lock);
blkg = blkg_create(&blkcg_root, q, new_blkg);
spin_unlock_irq(q->queue_lock);
rcu_read_unlock();
if (preloaded)
radix_tree_preload_end();
if (IS_ERR(blkg)) {
kfree(new_blkg);
return PTR_ERR(blkg);
}
q->root_blkg = blkg;
q->root_rl.blkg = blkg;
ret = blk_throtl_init(q);
if (ret) {
spin_lock_irq(q->queue_lock);
blkg_destroy_all(q);
spin_unlock_irq(q->queue_lock);
}
return ret;
}
/**
@ -996,50 +1038,19 @@ int blkcg_activate_policy(struct request_queue *q,
{
LIST_HEAD(pds);
LIST_HEAD(cpds);
struct blkcg_gq *blkg, *new_blkg;
struct blkcg_gq *blkg;
struct blkg_policy_data *pd, *nd;
struct blkcg_policy_data *cpd, *cnd;
int cnt = 0, ret;
bool preloaded;
if (blkcg_policy_enabled(q, pol))
return 0;
/* preallocations for root blkg */
new_blkg = blkg_alloc(&blkcg_root, q, GFP_KERNEL);
if (!new_blkg)
return -ENOMEM;
/* count and allocate policy_data for all existing blkgs */
blk_queue_bypass_start(q);
preloaded = !radix_tree_preload(GFP_KERNEL);
/*
* Make sure the root blkg exists and count the existing blkgs. As
* @q is bypassing at this point, blkg_lookup_create() can't be
* used. Open code it.
*/
spin_lock_irq(q->queue_lock);
rcu_read_lock();
blkg = __blkg_lookup(&blkcg_root, q, false);
if (blkg)
blkg_free(new_blkg);
else
blkg = blkg_create(&blkcg_root, q, new_blkg);
rcu_read_unlock();
if (preloaded)
radix_tree_preload_end();
if (IS_ERR(blkg)) {
ret = PTR_ERR(blkg);
goto out_unlock;
}
list_for_each_entry(blkg, &q->blkg_list, q_node)
cnt++;
spin_unlock_irq(q->queue_lock);
/*
@ -1140,10 +1151,6 @@ void blkcg_deactivate_policy(struct request_queue *q,
__clear_bit(pol->plid, q->blkcg_pols);
/* if no policy is left, no need for blkgs - shoot them down */
if (bitmap_empty(q->blkcg_pols, BLKCG_MAX_POLS))
blkg_destroy_all(q);
list_for_each_entry(blkg, &q->blkg_list, q_node) {
/* grab blkcg lock too while removing @pd from @blkg */
spin_lock(&blkg->blkcg->lock);

View File

@ -32,12 +32,12 @@
#include <linux/delay.h>
#include <linux/ratelimit.h>
#include <linux/pm_runtime.h>
#include <linux/blk-cgroup.h>
#define CREATE_TRACE_POINTS
#include <trace/events/block.h>
#include "blk.h"
#include "blk-cgroup.h"
#include "blk-mq.h"
EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
@ -63,6 +63,31 @@ struct kmem_cache *blk_requestq_cachep;
*/
static struct workqueue_struct *kblockd_workqueue;
static void blk_clear_congested(struct request_list *rl, int sync)
{
#ifdef CONFIG_CGROUP_WRITEBACK
clear_wb_congested(rl->blkg->wb_congested, sync);
#else
/*
* If !CGROUP_WRITEBACK, all blkg's map to bdi->wb and we shouldn't
* flip its congestion state for events on other blkcgs.
*/
if (rl == &rl->q->root_rl)
clear_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
#endif
}
static void blk_set_congested(struct request_list *rl, int sync)
{
#ifdef CONFIG_CGROUP_WRITEBACK
set_wb_congested(rl->blkg->wb_congested, sync);
#else
/* see blk_clear_congested() */
if (rl == &rl->q->root_rl)
set_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
#endif
}
void blk_queue_congestion_threshold(struct request_queue *q)
{
int nr;
@ -623,8 +648,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
q->backing_dev_info.ra_pages =
(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
q->backing_dev_info.state = 0;
q->backing_dev_info.capabilities = 0;
q->backing_dev_info.capabilities = BDI_CAP_CGROUP_WRITEBACK;
q->backing_dev_info.name = "block";
q->node = node_id;
@ -847,13 +871,8 @@ static void __freed_request(struct request_list *rl, int sync)
{
struct request_queue *q = rl->q;
/*
* bdi isn't aware of blkcg yet. As all async IOs end up root
* blkcg anyway, just use root blkcg state.
*/
if (rl == &q->root_rl &&
rl->count[sync] < queue_congestion_off_threshold(q))
blk_clear_queue_congested(q, sync);
if (rl->count[sync] < queue_congestion_off_threshold(q))
blk_clear_congested(rl, sync);
if (rl->count[sync] + 1 <= q->nr_requests) {
if (waitqueue_active(&rl->wait[sync]))
@ -886,25 +905,25 @@ static void freed_request(struct request_list *rl, unsigned int flags)
int blk_update_nr_requests(struct request_queue *q, unsigned int nr)
{
struct request_list *rl;
int on_thresh, off_thresh;
spin_lock_irq(q->queue_lock);
q->nr_requests = nr;
blk_queue_congestion_threshold(q);
/* congestion isn't cgroup aware and follows root blkcg for now */
rl = &q->root_rl;
if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
blk_set_queue_congested(q, BLK_RW_SYNC);
else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
blk_clear_queue_congested(q, BLK_RW_SYNC);
if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
blk_set_queue_congested(q, BLK_RW_ASYNC);
else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
blk_clear_queue_congested(q, BLK_RW_ASYNC);
on_thresh = queue_congestion_on_threshold(q);
off_thresh = queue_congestion_off_threshold(q);
blk_queue_for_each_rl(rl, q) {
if (rl->count[BLK_RW_SYNC] >= on_thresh)
blk_set_congested(rl, BLK_RW_SYNC);
else if (rl->count[BLK_RW_SYNC] < off_thresh)
blk_clear_congested(rl, BLK_RW_SYNC);
if (rl->count[BLK_RW_ASYNC] >= on_thresh)
blk_set_congested(rl, BLK_RW_ASYNC);
else if (rl->count[BLK_RW_ASYNC] < off_thresh)
blk_clear_congested(rl, BLK_RW_ASYNC);
if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
blk_set_rl_full(rl, BLK_RW_SYNC);
} else {
@ -1014,12 +1033,7 @@ static struct request *__get_request(struct request_list *rl, int rw_flags,
}
}
}
/*
* bdi isn't aware of blkcg yet. As all async IOs end up
* root blkcg anyway, just use root blkcg state.
*/
if (rl == &q->root_rl)
blk_set_queue_congested(q, is_sync);
blk_set_congested(rl, is_sync);
}
/*

View File

@ -21,6 +21,7 @@
*/
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/mempool.h>
#include <linux/bio.h>
#include <linux/scatterlist.h>

View File

@ -6,11 +6,12 @@
#include <linux/module.h>
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/blktrace_api.h>
#include <linux/blk-mq.h>
#include <linux/blk-cgroup.h>
#include "blk.h"
#include "blk-cgroup.h"
#include "blk-mq.h"
struct queue_sysfs_entry {

View File

@ -9,7 +9,7 @@
#include <linux/blkdev.h>
#include <linux/bio.h>
#include <linux/blktrace_api.h>
#include "blk-cgroup.h"
#include <linux/blk-cgroup.h>
#include "blk.h"
/* Max dispatch from a group in 1 round */

View File

@ -13,6 +13,7 @@
#include <linux/pagemap.h>
#include <linux/mempool.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/init.h>
#include <linux/hash.h>
#include <linux/highmem.h>

View File

@ -14,8 +14,8 @@
#include <linux/rbtree.h>
#include <linux/ioprio.h>
#include <linux/blktrace_api.h>
#include <linux/blk-cgroup.h>
#include "blk.h"
#include "blk-cgroup.h"
/*
* tunables

View File

@ -35,11 +35,11 @@
#include <linux/hash.h>
#include <linux/uaccess.h>
#include <linux/pm_runtime.h>
#include <linux/blk-cgroup.h>
#include <trace/events/block.h>
#include "blk.h"
#include "blk-cgroup.h"
static DEFINE_SPINLOCK(elv_list_lock);
static LIST_HEAD(elv_list);

View File

@ -8,6 +8,7 @@
#include <linux/kdev_t.h>
#include <linux/kernel.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/init.h>
#include <linux/spinlock.h>
#include <linux/proc_fs.h>

View File

@ -38,6 +38,7 @@
#include <linux/mutex.h>
#include <linux/major.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/genhd.h>
#include <linux/idr.h>
#include <net/tcp.h>

View File

@ -2359,7 +2359,7 @@ static void drbd_cleanup(void)
* @congested_data: User data
* @bdi_bits: Bits the BDI flusher thread is currently interested in
*
* Returns 1<<BDI_async_congested and/or 1<<BDI_sync_congested if we are congested.
* Returns 1<<WB_async_congested and/or 1<<WB_sync_congested if we are congested.
*/
static int drbd_congested(void *congested_data, int bdi_bits)
{
@ -2376,14 +2376,14 @@ static int drbd_congested(void *congested_data, int bdi_bits)
}
if (test_bit(CALLBACK_PENDING, &first_peer_device(device)->connection->flags)) {
r |= (1 << BDI_async_congested);
r |= (1 << WB_async_congested);
/* Without good local data, we would need to read from remote,
* and that would need the worker thread as well, which is
* currently blocked waiting for that usermode helper to
* finish.
*/
if (!get_ldev_if_state(device, D_UP_TO_DATE))
r |= (1 << BDI_sync_congested);
r |= (1 << WB_sync_congested);
else
put_ldev(device);
r &= bdi_bits;
@ -2399,9 +2399,9 @@ static int drbd_congested(void *congested_data, int bdi_bits)
reason = 'b';
}
if (bdi_bits & (1 << BDI_async_congested) &&
if (bdi_bits & (1 << WB_async_congested) &&
test_bit(NET_CONGESTED, &first_peer_device(device)->connection->flags)) {
r |= (1 << BDI_async_congested);
r |= (1 << WB_async_congested);
reason = reason == 'b' ? 'a' : 'n';
}

View File

@ -61,6 +61,7 @@
#include <linux/freezer.h>
#include <linux/mutex.h>
#include <linux/slab.h>
#include <linux/backing-dev.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_ioctl.h>
#include <scsi/scsi.h>

View File

@ -12,6 +12,7 @@
#include <linux/fs.h>
#include <linux/major.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/module.h>
#include <linux/raw.h>
#include <linux/capability.h>

View File

@ -15,6 +15,7 @@
#include <linux/module.h>
#include <linux/hash.h>
#include <linux/random.h>
#include <linux/backing-dev.h>
#include <trace/events/bcache.h>

View File

@ -2080,7 +2080,7 @@ static int dm_any_congested(void *congested_data, int bdi_bits)
* the query about congestion status of request_queue
*/
if (dm_request_based(md))
r = md->queue->backing_dev_info.state &
r = md->queue->backing_dev_info.wb.state &
bdi_bits;
else
r = dm_table_any_congested(map, bdi_bits);

View File

@ -14,6 +14,7 @@
#include <linux/device-mapper.h>
#include <linux/list.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/hdreg.h>
#include <linux/completion.h>
#include <linux/kobject.h>

View File

@ -16,6 +16,7 @@
#define _MD_MD_H
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/kobject.h>
#include <linux/list.h>
#include <linux/mm.h>

View File

@ -745,7 +745,7 @@ static int raid1_congested(struct mddev *mddev, int bits)
struct r1conf *conf = mddev->private;
int i, ret = 0;
if ((bits & (1 << BDI_async_congested)) &&
if ((bits & (1 << WB_async_congested)) &&
conf->pending_count >= max_queued_requests)
return 1;
@ -760,7 +760,7 @@ static int raid1_congested(struct mddev *mddev, int bits)
/* Note the '|| 1' - when read_balance prefers
* non-congested targets, it can be removed
*/
if ((bits & (1<<BDI_async_congested)) || 1)
if ((bits & (1 << WB_async_congested)) || 1)
ret |= bdi_congested(&q->backing_dev_info, bits);
else
ret &= bdi_congested(&q->backing_dev_info, bits);

View File

@ -914,7 +914,7 @@ static int raid10_congested(struct mddev *mddev, int bits)
struct r10conf *conf = mddev->private;
int i, ret = 0;
if ((bits & (1 << BDI_async_congested)) &&
if ((bits & (1 << WB_async_congested)) &&
conf->pending_count >= max_queued_requests)
return 1;

View File

@ -20,6 +20,7 @@
#include <linux/delay.h>
#include <linux/fs.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/bio.h>
#include <linux/pagemap.h>
#include <linux/list.h>

View File

@ -55,9 +55,7 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
if (PagePrivate(page))
page->mapping->a_ops->invalidatepage(page, 0, PAGE_CACHE_SIZE);
if (TestClearPageDirty(page))
account_page_cleaned(page, mapping);
cancel_dirty_page(page);
ClearPageMappedToDisk(page);
ll_delete_from_page_cache(page);
}

View File

@ -320,31 +320,21 @@ fail_option_alloc:
struct p9_fid *v9fs_session_init(struct v9fs_session_info *v9ses,
const char *dev_name, char *data)
{
int retval = -EINVAL;
struct p9_fid *fid;
int rc;
int rc = -ENOMEM;
v9ses->uname = kstrdup(V9FS_DEFUSER, GFP_KERNEL);
if (!v9ses->uname)
return ERR_PTR(-ENOMEM);
goto err_names;
v9ses->aname = kstrdup(V9FS_DEFANAME, GFP_KERNEL);
if (!v9ses->aname) {
kfree(v9ses->uname);
return ERR_PTR(-ENOMEM);
}
if (!v9ses->aname)
goto err_names;
init_rwsem(&v9ses->rename_sem);
rc = bdi_setup_and_register(&v9ses->bdi, "9p");
if (rc) {
kfree(v9ses->aname);
kfree(v9ses->uname);
return ERR_PTR(rc);
}
spin_lock(&v9fs_sessionlist_lock);
list_add(&v9ses->slist, &v9fs_sessionlist);
spin_unlock(&v9fs_sessionlist_lock);
if (rc)
goto err_names;
v9ses->uid = INVALID_UID;
v9ses->dfltuid = V9FS_DEFUID;
@ -352,10 +342,9 @@ struct p9_fid *v9fs_session_init(struct v9fs_session_info *v9ses,
v9ses->clnt = p9_client_create(dev_name, data);
if (IS_ERR(v9ses->clnt)) {
retval = PTR_ERR(v9ses->clnt);
v9ses->clnt = NULL;
rc = PTR_ERR(v9ses->clnt);
p9_debug(P9_DEBUG_ERROR, "problem initializing 9p client\n");
goto error;
goto err_bdi;
}
v9ses->flags = V9FS_ACCESS_USER;
@ -368,10 +357,8 @@ struct p9_fid *v9fs_session_init(struct v9fs_session_info *v9ses,
}
rc = v9fs_parse_options(v9ses, data);
if (rc < 0) {
retval = rc;
goto error;
}
if (rc < 0)
goto err_clnt;
v9ses->maxdata = v9ses->clnt->msize - P9_IOHDRSZ;
@ -405,10 +392,9 @@ struct p9_fid *v9fs_session_init(struct v9fs_session_info *v9ses,
fid = p9_client_attach(v9ses->clnt, NULL, v9ses->uname, INVALID_UID,
v9ses->aname);
if (IS_ERR(fid)) {
retval = PTR_ERR(fid);
fid = NULL;
rc = PTR_ERR(fid);
p9_debug(P9_DEBUG_ERROR, "cannot attach\n");
goto error;
goto err_clnt;
}
if ((v9ses->flags & V9FS_ACCESS_MASK) == V9FS_ACCESS_SINGLE)
@ -420,12 +406,20 @@ struct p9_fid *v9fs_session_init(struct v9fs_session_info *v9ses,
/* register the session for caching */
v9fs_cache_session_get_cookie(v9ses);
#endif
spin_lock(&v9fs_sessionlist_lock);
list_add(&v9ses->slist, &v9fs_sessionlist);
spin_unlock(&v9fs_sessionlist_lock);
return fid;
error:
err_clnt:
p9_client_destroy(v9ses->clnt);
err_bdi:
bdi_destroy(&v9ses->bdi);
return ERR_PTR(retval);
err_names:
kfree(v9ses->uname);
kfree(v9ses->aname);
return ERR_PTR(rc);
}
/**

View File

@ -130,11 +130,7 @@ static struct dentry *v9fs_mount(struct file_system_type *fs_type, int flags,
fid = v9fs_session_init(v9ses, dev_name, data);
if (IS_ERR(fid)) {
retval = PTR_ERR(fid);
/*
* we need to call session_close to tear down some
* of the data structure setup by session_init
*/
goto close_session;
goto free_session;
}
sb = sget(fs_type, NULL, v9fs_set_super, flags, v9ses);
@ -195,8 +191,8 @@ static struct dentry *v9fs_mount(struct file_system_type *fs_type, int flags,
clunk_fid:
p9_client_clunk(fid);
close_session:
v9fs_session_close(v9ses);
free_session:
kfree(v9ses);
return ERR_PTR(retval);

View File

@ -14,6 +14,7 @@
#include <linux/device_cgroup.h>
#include <linux/highmem.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/module.h>
#include <linux/blkpg.h>
#include <linux/magic.h>
@ -546,7 +547,8 @@ static struct file_system_type bd_type = {
.kill_sb = kill_anon_super,
};
static struct super_block *blockdev_superblock __read_mostly;
struct super_block *blockdev_superblock __read_mostly;
EXPORT_SYMBOL_GPL(blockdev_superblock);
void __init bdev_cache_init(void)
{
@ -687,11 +689,6 @@ static struct block_device *bd_acquire(struct inode *inode)
return bdev;
}
int sb_is_blkdev_sb(struct super_block *sb)
{
return sb == blockdev_superblock;
}
/* Call when you free inode */
void bd_forget(struct inode *inode)

View File

@ -30,6 +30,7 @@
#include <linux/quotaops.h>
#include <linux/highmem.h>
#include <linux/export.h>
#include <linux/backing-dev.h>
#include <linux/writeback.h>
#include <linux/hash.h>
#include <linux/suspend.h>
@ -44,6 +45,9 @@
#include <trace/events/block.h>
static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
static int submit_bh_wbc(int rw, struct buffer_head *bh,
unsigned long bio_flags,
struct writeback_control *wbc);
#define BH_ENTRY(list) list_entry((list), struct buffer_head, b_assoc_buffers)
@ -623,21 +627,22 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
*
* If warn is true, then emit a warning if the page is not uptodate and has
* not been truncated.
*
* The caller must hold mem_cgroup_begin_page_stat() lock.
*/
static void __set_page_dirty(struct page *page,
struct address_space *mapping, int warn)
static void __set_page_dirty(struct page *page, struct address_space *mapping,
struct mem_cgroup *memcg, int warn)
{
unsigned long flags;
spin_lock_irqsave(&mapping->tree_lock, flags);
if (page->mapping) { /* Race with truncate? */
WARN_ON_ONCE(warn && !PageUptodate(page));
account_page_dirtied(page, mapping);
account_page_dirtied(page, mapping, memcg);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
spin_unlock_irqrestore(&mapping->tree_lock, flags);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
}
/*
@ -668,6 +673,7 @@ static void __set_page_dirty(struct page *page,
int __set_page_dirty_buffers(struct page *page)
{
int newly_dirty;
struct mem_cgroup *memcg;
struct address_space *mapping = page_mapping(page);
if (unlikely(!mapping))
@ -683,11 +689,22 @@ int __set_page_dirty_buffers(struct page *page)
bh = bh->b_this_page;
} while (bh != head);
}
/*
* Use mem_group_begin_page_stat() to keep PageDirty synchronized with
* per-memcg dirty page counters.
*/
memcg = mem_cgroup_begin_page_stat(page);
newly_dirty = !TestSetPageDirty(page);
spin_unlock(&mapping->private_lock);
if (newly_dirty)
__set_page_dirty(page, mapping, 1);
__set_page_dirty(page, mapping, memcg, 1);
mem_cgroup_end_page_stat(memcg);
if (newly_dirty)
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
return newly_dirty;
}
EXPORT_SYMBOL(__set_page_dirty_buffers);
@ -1158,11 +1175,18 @@ void mark_buffer_dirty(struct buffer_head *bh)
if (!test_set_buffer_dirty(bh)) {
struct page *page = bh->b_page;
struct address_space *mapping = NULL;
struct mem_cgroup *memcg;
memcg = mem_cgroup_begin_page_stat(page);
if (!TestSetPageDirty(page)) {
struct address_space *mapping = page_mapping(page);
mapping = page_mapping(page);
if (mapping)
__set_page_dirty(page, mapping, 0);
__set_page_dirty(page, mapping, memcg, 0);
}
mem_cgroup_end_page_stat(memcg);
if (mapping)
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
}
}
EXPORT_SYMBOL(mark_buffer_dirty);
@ -1684,8 +1708,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
struct buffer_head *bh, *head;
unsigned int blocksize, bbits;
int nr_underway = 0;
int write_op = (wbc->sync_mode == WB_SYNC_ALL ?
WRITE_SYNC : WRITE);
int write_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
head = create_page_buffers(page, inode,
(1 << BH_Dirty)|(1 << BH_Uptodate));
@ -1774,7 +1797,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
do {
struct buffer_head *next = bh->b_this_page;
if (buffer_async_write(bh)) {
submit_bh(write_op, bh);
submit_bh_wbc(write_op, bh, 0, wbc);
nr_underway++;
}
bh = next;
@ -1828,7 +1851,7 @@ recover:
struct buffer_head *next = bh->b_this_page;
if (buffer_async_write(bh)) {
clear_buffer_dirty(bh);
submit_bh(write_op, bh);
submit_bh_wbc(write_op, bh, 0, wbc);
nr_underway++;
}
bh = next;
@ -2993,7 +3016,8 @@ void guard_bio_eod(int rw, struct bio *bio)
}
}
int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
static int submit_bh_wbc(int rw, struct buffer_head *bh,
unsigned long bio_flags, struct writeback_control *wbc)
{
struct bio *bio;
@ -3015,6 +3039,11 @@ int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
*/
bio = bio_alloc(GFP_NOIO, 1);
if (wbc) {
wbc_init_bio(wbc, bio);
wbc_account_io(wbc, bh->b_page, bh->b_size);
}
bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
bio->bi_bdev = bh->b_bdev;
bio->bi_io_vec[0].bv_page = bh->b_page;
@ -3039,11 +3068,16 @@ int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
submit_bio(rw, bio);
return 0;
}
int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
{
return submit_bh_wbc(rw, bh, bio_flags, NULL);
}
EXPORT_SYMBOL_GPL(_submit_bh);
int submit_bh(int rw, struct buffer_head *bh)
{
return _submit_bh(rw, bh, 0);
return submit_bh_wbc(rw, bh, 0, NULL);
}
EXPORT_SYMBOL(submit_bh);
@ -3232,8 +3266,8 @@ int try_to_free_buffers(struct page *page)
* to synchronise against __set_page_dirty_buffers and prevent the
* dirty bit from being lost.
*/
if (ret && TestClearPageDirty(page))
account_page_cleaned(page, mapping);
if (ret)
cancel_dirty_page(page);
spin_unlock(&mapping->private_lock);
out:
if (buffers_to_free) {

View File

@ -882,6 +882,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
((EXT2_SB(sb)->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ?
MS_POSIXACL : 0);
sb->s_iflags |= SB_I_CGROUPWB;
if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV &&
(EXT2_HAS_COMPAT_FEATURE(sb, ~0U) ||

View File

@ -39,6 +39,7 @@
#include <linux/slab.h>
#include <asm/uaccess.h>
#include <linux/fiemap.h>
#include <linux/backing-dev.h>
#include "ext4_jbd2.h"
#include "ext4_extents.h"
#include "xattr.h"

View File

@ -26,6 +26,7 @@
#include <linux/log2.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/backing-dev.h>
#include <trace/events/ext4.h>
#ifdef CONFIG_EXT4_DEBUG

View File

@ -24,6 +24,7 @@
#include <linux/slab.h>
#include <linux/init.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/parser.h>
#include <linux/buffer_head.h>
#include <linux/exportfs.h>

View File

@ -53,7 +53,7 @@ bool available_free_memory(struct f2fs_sb_info *sbi, int type)
PAGE_CACHE_SHIFT;
res = mem_size < ((avail_ram * nm_i->ram_thresh / 100) >> 2);
} else if (type == DIRTY_DENTS) {
if (sbi->sb->s_bdi->dirty_exceeded)
if (sbi->sb->s_bdi->wb.dirty_exceeded)
return false;
mem_size = get_pages(sbi, F2FS_DIRTY_DENTS);
res = mem_size < ((avail_ram * nm_i->ram_thresh / 100) >> 1);
@ -70,7 +70,7 @@ bool available_free_memory(struct f2fs_sb_info *sbi, int type)
sizeof(struct extent_node)) >> PAGE_CACHE_SHIFT;
res = mem_size < ((avail_ram * nm_i->ram_thresh / 100) >> 1);
} else {
if (sbi->sb->s_bdi->dirty_exceeded)
if (sbi->sb->s_bdi->wb.dirty_exceeded)
return false;
}
return res;

View File

@ -9,6 +9,7 @@
* published by the Free Software Foundation.
*/
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
/* constant macro */
#define NULL_SEGNO ((unsigned int)(~0))
@ -714,7 +715,7 @@ static inline unsigned int max_hw_blocks(struct f2fs_sb_info *sbi)
*/
static inline int nr_pages_to_skip(struct f2fs_sb_info *sbi, int type)
{
if (sbi->sb->s_bdi->dirty_exceeded)
if (sbi->sb->s_bdi->wb.dirty_exceeded)
return 0;
if (type == DATA)

View File

@ -11,6 +11,7 @@
#include <linux/compat.h>
#include <linux/mount.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/fsnotify.h>
#include <linux/security.h>
#include "fat.h"

View File

@ -18,6 +18,7 @@
#include <linux/parser.h>
#include <linux/uio.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <asm/unaligned.h>
#include "fat.h"

File diff suppressed because it is too large Load Diff

View File

@ -1445,9 +1445,9 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
list_del(&req->writepages_entry);
for (i = 0; i < req->num_pages; i++) {
dec_bdi_stat(bdi, BDI_WRITEBACK);
dec_wb_stat(&bdi->wb, WB_WRITEBACK);
dec_zone_page_state(req->pages[i], NR_WRITEBACK_TEMP);
bdi_writeout_inc(bdi);
wb_writeout_inc(&bdi->wb);
}
wake_up(&fi->page_waitq);
}
@ -1634,7 +1634,7 @@ static int fuse_writepage_locked(struct page *page)
req->end = fuse_writepage_end;
req->inode = inode;
inc_bdi_stat(inode_to_bdi(inode), BDI_WRITEBACK);
inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
spin_lock(&fc->lock);
@ -1749,9 +1749,9 @@ static bool fuse_writepage_in_flight(struct fuse_req *new_req,
copy_highpage(old_req->pages[0], page);
spin_unlock(&fc->lock);
dec_bdi_stat(bdi, BDI_WRITEBACK);
dec_wb_stat(&bdi->wb, WB_WRITEBACK);
dec_zone_page_state(page, NR_WRITEBACK_TEMP);
bdi_writeout_inc(bdi);
wb_writeout_inc(&bdi->wb);
fuse_writepage_free(fc, new_req);
fuse_request_free(new_req);
goto out;
@ -1848,7 +1848,7 @@ static int fuse_writepages_fill(struct page *page,
req->page_descs[req->num_pages].offset = 0;
req->page_descs[req->num_pages].length = PAGE_SIZE;
inc_bdi_stat(inode_to_bdi(inode), BDI_WRITEBACK);
inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
err = 0;

View File

@ -748,7 +748,7 @@ static int gfs2_write_inode(struct inode *inode, struct writeback_control *wbc)
if (wbc->sync_mode == WB_SYNC_ALL)
gfs2_log_flush(GFS2_SB(inode), ip->i_gl, NORMAL_FLUSH);
if (bdi->dirty_exceeded)
if (bdi->wb.dirty_exceeded)
gfs2_ail1_flush(sdp, wbc);
else
filemap_fdatawrite(metamapping);

View File

@ -14,6 +14,7 @@
#include <linux/module.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/mount.h>
#include <linux/init.h>
#include <linux/nls.h>

View File

@ -11,6 +11,7 @@
#include <linux/init.h>
#include <linux/pagemap.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/vfs.h>

View File

@ -224,6 +224,7 @@ EXPORT_SYMBOL(free_inode_nonrcu);
void __destroy_inode(struct inode *inode)
{
BUG_ON(inode_has_buffers(inode));
inode_detach_wb(inode);
security_inode_free(inode);
fsnotify_inode_delete(inode);
locks_free_lock_context(inode->i_flctx);

View File

@ -605,6 +605,8 @@ alloc_new:
bio_get_nr_vecs(bdev), GFP_NOFS|__GFP_HIGH);
if (bio == NULL)
goto confused;
wbc_init_bio(wbc, bio);
}
/*
@ -612,6 +614,7 @@ alloc_new:
* the confused fail path above (OOM) will be very confused when
* it finds all bh marked clean (i.e. it will not write anything)
*/
wbc_account_io(wbc, page, PAGE_SIZE);
length = first_unmapped << blkbits;
if (bio_add_page(bio, page, length, 0) < length) {
bio = mpage_bio_submit(WRITE, bio);

View File

@ -32,6 +32,7 @@
#include <linux/nfs_fs.h>
#include <linux/nfs_page.h>
#include <linux/module.h>
#include <linux/backing-dev.h>
#include <linux/sunrpc/metrics.h>

View File

@ -607,7 +607,7 @@ void nfs_mark_page_unstable(struct page *page)
struct inode *inode = page_file_mapping(page)->host;
inc_zone_page_state(page, NR_UNSTABLE_NFS);
inc_bdi_stat(inode_to_bdi(inode), BDI_RECLAIMABLE);
inc_wb_stat(&inode_to_bdi(inode)->wb, WB_RECLAIMABLE);
__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
}

View File

@ -853,7 +853,8 @@ static void
nfs_clear_page_commit(struct page *page)
{
dec_zone_page_state(page, NR_UNSTABLE_NFS);
dec_bdi_stat(inode_to_bdi(page_file_mapping(page)->host), BDI_RECLAIMABLE);
dec_wb_stat(&inode_to_bdi(page_file_mapping(page)->host)->wb,
WB_RECLAIMABLE);
}
/* Called holding inode (/cinfo) lock */

View File

@ -37,6 +37,7 @@
#include <linux/falloc.h>
#include <linux/quotaops.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <cluster/masklog.h>

View File

@ -21,6 +21,7 @@
#include "xattr.h"
#include <linux/init.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/buffer_head.h>
#include <linux/exportfs.h>
#include <linux/quotaops.h>

View File

@ -80,6 +80,7 @@
#include <linux/stat.h>
#include <linux/string.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/init.h>
#include <linux/parser.h>
#include <linux/buffer_head.h>

View File

@ -1873,6 +1873,7 @@ xfs_vm_set_page_dirty(
loff_t end_offset;
loff_t offset;
int newly_dirty;
struct mem_cgroup *memcg;
if (unlikely(!mapping))
return !TestSetPageDirty(page);
@ -1892,6 +1893,11 @@ xfs_vm_set_page_dirty(
offset += 1 << inode->i_blkbits;
} while (bh != head);
}
/*
* Use mem_group_begin_page_stat() to keep PageDirty synchronized with
* per-memcg dirty page counters.
*/
memcg = mem_cgroup_begin_page_stat(page);
newly_dirty = !TestSetPageDirty(page);
spin_unlock(&mapping->private_lock);
@ -1902,13 +1908,15 @@ xfs_vm_set_page_dirty(
spin_lock_irqsave(&mapping->tree_lock, flags);
if (page->mapping) { /* Race with truncate? */
WARN_ON_ONCE(!PageUptodate(page));
account_page_dirtied(page, mapping);
account_page_dirtied(page, mapping, memcg);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
spin_unlock_irqrestore(&mapping->tree_lock, flags);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
}
mem_cgroup_end_page_stat(memcg);
if (newly_dirty)
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
return newly_dirty;
}

View File

@ -41,6 +41,7 @@
#include <linux/dcache.h>
#include <linux/falloc.h>
#include <linux/pagevec.h>
#include <linux/backing-dev.h>
static const struct vm_operations_struct xfs_file_vm_ops;

View File

@ -0,0 +1,255 @@
#ifndef __LINUX_BACKING_DEV_DEFS_H
#define __LINUX_BACKING_DEV_DEFS_H
#include <linux/list.h>
#include <linux/radix-tree.h>
#include <linux/rbtree.h>
#include <linux/spinlock.h>
#include <linux/percpu_counter.h>
#include <linux/percpu-refcount.h>
#include <linux/flex_proportions.h>
#include <linux/timer.h>
#include <linux/workqueue.h>
struct page;
struct device;
struct dentry;
/*
* Bits in bdi_writeback.state
*/
enum wb_state {
WB_registered, /* bdi_register() was done */
WB_writeback_running, /* Writeback is in progress */
WB_has_dirty_io, /* Dirty inodes on ->b_{dirty|io|more_io} */
};
enum wb_congested_state {
WB_async_congested, /* The async (write) queue is getting full */
WB_sync_congested, /* The sync queue is getting full */
};
typedef int (congested_fn)(void *, int);
enum wb_stat_item {
WB_RECLAIMABLE,
WB_WRITEBACK,
WB_DIRTIED,
WB_WRITTEN,
NR_WB_STAT_ITEMS
};
#define WB_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
/*
* For cgroup writeback, multiple wb's may map to the same blkcg. Those
* wb's can operate mostly independently but should share the congested
* state. To facilitate such sharing, the congested state is tracked using
* the following struct which is created on demand, indexed by blkcg ID on
* its bdi, and refcounted.
*/
struct bdi_writeback_congested {
unsigned long state; /* WB_[a]sync_congested flags */
#ifdef CONFIG_CGROUP_WRITEBACK
struct backing_dev_info *bdi; /* the associated bdi */
atomic_t refcnt; /* nr of attached wb's and blkg */
int blkcg_id; /* ID of the associated blkcg */
struct rb_node rb_node; /* on bdi->cgwb_congestion_tree */
#endif
};
/*
* Each wb (bdi_writeback) can perform writeback operations, is measured
* and throttled, independently. Without cgroup writeback, each bdi
* (bdi_writeback) is served by its embedded bdi->wb.
*
* On the default hierarchy, blkcg implicitly enables memcg. This allows
* using memcg's page ownership for attributing writeback IOs, and every
* memcg - blkcg combination can be served by its own wb by assigning a
* dedicated wb to each memcg, which enables isolation across different
* cgroups and propagation of IO back pressure down from the IO layer upto
* the tasks which are generating the dirty pages to be written back.
*
* A cgroup wb is indexed on its bdi by the ID of the associated memcg,
* refcounted with the number of inodes attached to it, and pins the memcg
* and the corresponding blkcg. As the corresponding blkcg for a memcg may
* change as blkcg is disabled and enabled higher up in the hierarchy, a wb
* is tested for blkcg after lookup and removed from index on mismatch so
* that a new wb for the combination can be created.
*/
struct bdi_writeback {
struct backing_dev_info *bdi; /* our parent bdi */
unsigned long state; /* Always use atomic bitops on this */
unsigned long last_old_flush; /* last old data flush */
struct list_head b_dirty; /* dirty inodes */
struct list_head b_io; /* parked for writeback */
struct list_head b_more_io; /* parked for more writeback */
struct list_head b_dirty_time; /* time stamps are dirty */
spinlock_t list_lock; /* protects the b_* lists */
struct percpu_counter stat[NR_WB_STAT_ITEMS];
struct bdi_writeback_congested *congested;
unsigned long bw_time_stamp; /* last time write bw is updated */
unsigned long dirtied_stamp;
unsigned long written_stamp; /* pages written at bw_time_stamp */
unsigned long write_bandwidth; /* the estimated write bandwidth */
unsigned long avg_write_bandwidth; /* further smoothed write bw, > 0 */
/*
* The base dirty throttle rate, re-calculated on every 200ms.
* All the bdi tasks' dirty rate will be curbed under it.
* @dirty_ratelimit tracks the estimated @balanced_dirty_ratelimit
* in small steps and is much more smooth/stable than the latter.
*/
unsigned long dirty_ratelimit;
unsigned long balanced_dirty_ratelimit;
struct fprop_local_percpu completions;
int dirty_exceeded;
spinlock_t work_lock; /* protects work_list & dwork scheduling */
struct list_head work_list;
struct delayed_work dwork; /* work item used for writeback */
#ifdef CONFIG_CGROUP_WRITEBACK
struct percpu_ref refcnt; /* used only for !root wb's */
struct fprop_local_percpu memcg_completions;
struct cgroup_subsys_state *memcg_css; /* the associated memcg */
struct cgroup_subsys_state *blkcg_css; /* and blkcg */
struct list_head memcg_node; /* anchored at memcg->cgwb_list */
struct list_head blkcg_node; /* anchored at blkcg->cgwb_list */
union {
struct work_struct release_work;
struct rcu_head rcu;
};
#endif
};
struct backing_dev_info {
struct list_head bdi_list;
unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */
unsigned int capabilities; /* Device capabilities */
congested_fn *congested_fn; /* Function pointer if device is md/dm */
void *congested_data; /* Pointer to aux data for congested func */
char *name;
unsigned int min_ratio;
unsigned int max_ratio, max_prop_frac;
/*
* Sum of avg_write_bw of wbs with dirty inodes. > 0 if there are
* any dirty wbs, which is depended upon by bdi_has_dirty().
*/
atomic_long_t tot_write_bandwidth;
struct bdi_writeback wb; /* the root writeback info for this bdi */
struct bdi_writeback_congested wb_congested; /* its congested state */
#ifdef CONFIG_CGROUP_WRITEBACK
struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */
struct rb_root cgwb_congested_tree; /* their congested states */
atomic_t usage_cnt; /* counts both cgwbs and cgwb_contested's */
#endif
wait_queue_head_t wb_waitq;
struct device *dev;
struct timer_list laptop_mode_wb_timer;
#ifdef CONFIG_DEBUG_FS
struct dentry *debug_dir;
struct dentry *debug_stats;
#endif
};
enum {
BLK_RW_ASYNC = 0,
BLK_RW_SYNC = 1,
};
void clear_wb_congested(struct bdi_writeback_congested *congested, int sync);
void set_wb_congested(struct bdi_writeback_congested *congested, int sync);
static inline void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
{
clear_wb_congested(bdi->wb.congested, sync);
}
static inline void set_bdi_congested(struct backing_dev_info *bdi, int sync)
{
set_wb_congested(bdi->wb.congested, sync);
}
#ifdef CONFIG_CGROUP_WRITEBACK
/**
* wb_tryget - try to increment a wb's refcount
* @wb: bdi_writeback to get
*/
static inline bool wb_tryget(struct bdi_writeback *wb)
{
if (wb != &wb->bdi->wb)
return percpu_ref_tryget(&wb->refcnt);
return true;
}
/**
* wb_get - increment a wb's refcount
* @wb: bdi_writeback to get
*/
static inline void wb_get(struct bdi_writeback *wb)
{
if (wb != &wb->bdi->wb)
percpu_ref_get(&wb->refcnt);
}
/**
* wb_put - decrement a wb's refcount
* @wb: bdi_writeback to put
*/
static inline void wb_put(struct bdi_writeback *wb)
{
if (wb != &wb->bdi->wb)
percpu_ref_put(&wb->refcnt);
}
/**
* wb_dying - is a wb dying?
* @wb: bdi_writeback of interest
*
* Returns whether @wb is unlinked and being drained.
*/
static inline bool wb_dying(struct bdi_writeback *wb)
{
return percpu_ref_is_dying(&wb->refcnt);
}
#else /* CONFIG_CGROUP_WRITEBACK */
static inline bool wb_tryget(struct bdi_writeback *wb)
{
return true;
}
static inline void wb_get(struct bdi_writeback *wb)
{
}
static inline void wb_put(struct bdi_writeback *wb)
{
}
static inline bool wb_dying(struct bdi_writeback *wb)
{
return false;
}
#endif /* CONFIG_CGROUP_WRITEBACK */
#endif /* __LINUX_BACKING_DEV_DEFS_H */

View File

@ -8,106 +8,13 @@
#ifndef _LINUX_BACKING_DEV_H
#define _LINUX_BACKING_DEV_H
#include <linux/percpu_counter.h>
#include <linux/log2.h>
#include <linux/flex_proportions.h>
#include <linux/kernel.h>
#include <linux/fs.h>
#include <linux/sched.h>
#include <linux/timer.h>
#include <linux/blkdev.h>
#include <linux/writeback.h>
#include <linux/atomic.h>
#include <linux/sysctl.h>
#include <linux/workqueue.h>
struct page;
struct device;
struct dentry;
/*
* Bits in backing_dev_info.state
*/
enum bdi_state {
BDI_async_congested, /* The async (write) queue is getting full */
BDI_sync_congested, /* The sync queue is getting full */
BDI_registered, /* bdi_register() was done */
BDI_writeback_running, /* Writeback is in progress */
};
typedef int (congested_fn)(void *, int);
enum bdi_stat_item {
BDI_RECLAIMABLE,
BDI_WRITEBACK,
BDI_DIRTIED,
BDI_WRITTEN,
NR_BDI_STAT_ITEMS
};
#define BDI_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
struct bdi_writeback {
struct backing_dev_info *bdi; /* our parent bdi */
unsigned long last_old_flush; /* last old data flush */
struct delayed_work dwork; /* work item used for writeback */
struct list_head b_dirty; /* dirty inodes */
struct list_head b_io; /* parked for writeback */
struct list_head b_more_io; /* parked for more writeback */
struct list_head b_dirty_time; /* time stamps are dirty */
spinlock_t list_lock; /* protects the b_* lists */
};
struct backing_dev_info {
struct list_head bdi_list;
unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */
unsigned long state; /* Always use atomic bitops on this */
unsigned int capabilities; /* Device capabilities */
congested_fn *congested_fn; /* Function pointer if device is md/dm */
void *congested_data; /* Pointer to aux data for congested func */
char *name;
struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
unsigned long bw_time_stamp; /* last time write bw is updated */
unsigned long dirtied_stamp;
unsigned long written_stamp; /* pages written at bw_time_stamp */
unsigned long write_bandwidth; /* the estimated write bandwidth */
unsigned long avg_write_bandwidth; /* further smoothed write bw */
/*
* The base dirty throttle rate, re-calculated on every 200ms.
* All the bdi tasks' dirty rate will be curbed under it.
* @dirty_ratelimit tracks the estimated @balanced_dirty_ratelimit
* in small steps and is much more smooth/stable than the latter.
*/
unsigned long dirty_ratelimit;
unsigned long balanced_dirty_ratelimit;
struct fprop_local_percpu completions;
int dirty_exceeded;
unsigned int min_ratio;
unsigned int max_ratio, max_prop_frac;
struct bdi_writeback wb; /* default writeback info for this bdi */
spinlock_t wb_lock; /* protects work_list & wb.dwork scheduling */
struct list_head work_list;
struct device *dev;
struct timer_list laptop_mode_wb_timer;
#ifdef CONFIG_DEBUG_FS
struct dentry *debug_dir;
struct dentry *debug_stats;
#endif
};
struct backing_dev_info *inode_to_bdi(struct inode *inode);
#include <linux/blk-cgroup.h>
#include <linux/backing-dev-defs.h>
int __must_check bdi_init(struct backing_dev_info *bdi);
void bdi_destroy(struct backing_dev_info *bdi);
@ -117,97 +24,99 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
const char *fmt, ...);
int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev);
int __must_check bdi_setup_and_register(struct backing_dev_info *, char *);
void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
enum wb_reason reason);
void bdi_start_background_writeback(struct backing_dev_info *bdi);
void bdi_writeback_workfn(struct work_struct *work);
int bdi_has_dirty_io(struct backing_dev_info *bdi);
void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
bool range_cyclic, enum wb_reason reason);
void wb_start_background_writeback(struct bdi_writeback *wb);
void wb_workfn(struct work_struct *work);
void wb_wakeup_delayed(struct bdi_writeback *wb);
extern spinlock_t bdi_lock;
extern struct list_head bdi_list;
extern struct workqueue_struct *bdi_wq;
static inline int wb_has_dirty_io(struct bdi_writeback *wb)
static inline bool wb_has_dirty_io(struct bdi_writeback *wb)
{
return !list_empty(&wb->b_dirty) ||
!list_empty(&wb->b_io) ||
!list_empty(&wb->b_more_io);
return test_bit(WB_has_dirty_io, &wb->state);
}
static inline void __add_bdi_stat(struct backing_dev_info *bdi,
enum bdi_stat_item item, s64 amount)
static inline bool bdi_has_dirty_io(struct backing_dev_info *bdi)
{
__percpu_counter_add(&bdi->bdi_stat[item], amount, BDI_STAT_BATCH);
/*
* @bdi->tot_write_bandwidth is guaranteed to be > 0 if there are
* any dirty wbs. See wb_update_write_bandwidth().
*/
return atomic_long_read(&bdi->tot_write_bandwidth);
}
static inline void __inc_bdi_stat(struct backing_dev_info *bdi,
enum bdi_stat_item item)
static inline void __add_wb_stat(struct bdi_writeback *wb,
enum wb_stat_item item, s64 amount)
{
__add_bdi_stat(bdi, item, 1);
__percpu_counter_add(&wb->stat[item], amount, WB_STAT_BATCH);
}
static inline void inc_bdi_stat(struct backing_dev_info *bdi,
enum bdi_stat_item item)
static inline void __inc_wb_stat(struct bdi_writeback *wb,
enum wb_stat_item item)
{
__add_wb_stat(wb, item, 1);
}
static inline void inc_wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
{
unsigned long flags;
local_irq_save(flags);
__inc_bdi_stat(bdi, item);
__inc_wb_stat(wb, item);
local_irq_restore(flags);
}
static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
enum bdi_stat_item item)
static inline void __dec_wb_stat(struct bdi_writeback *wb,
enum wb_stat_item item)
{
__add_bdi_stat(bdi, item, -1);
__add_wb_stat(wb, item, -1);
}
static inline void dec_bdi_stat(struct backing_dev_info *bdi,
enum bdi_stat_item item)
static inline void dec_wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
{
unsigned long flags;
local_irq_save(flags);
__dec_bdi_stat(bdi, item);
__dec_wb_stat(wb, item);
local_irq_restore(flags);
}
static inline s64 bdi_stat(struct backing_dev_info *bdi,
enum bdi_stat_item item)
static inline s64 wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
{
return percpu_counter_read_positive(&bdi->bdi_stat[item]);
return percpu_counter_read_positive(&wb->stat[item]);
}
static inline s64 __bdi_stat_sum(struct backing_dev_info *bdi,
enum bdi_stat_item item)
static inline s64 __wb_stat_sum(struct bdi_writeback *wb,
enum wb_stat_item item)
{
return percpu_counter_sum_positive(&bdi->bdi_stat[item]);
return percpu_counter_sum_positive(&wb->stat[item]);
}
static inline s64 bdi_stat_sum(struct backing_dev_info *bdi,
enum bdi_stat_item item)
static inline s64 wb_stat_sum(struct bdi_writeback *wb, enum wb_stat_item item)
{
s64 sum;
unsigned long flags;
local_irq_save(flags);
sum = __bdi_stat_sum(bdi, item);
sum = __wb_stat_sum(wb, item);
local_irq_restore(flags);
return sum;
}
extern void bdi_writeout_inc(struct backing_dev_info *bdi);
extern void wb_writeout_inc(struct bdi_writeback *wb);
/*
* maximal error of a stat counter.
*/
static inline unsigned long bdi_stat_error(struct backing_dev_info *bdi)
static inline unsigned long wb_stat_error(struct bdi_writeback *wb)
{
#ifdef CONFIG_SMP
return nr_cpu_ids * BDI_STAT_BATCH;
return nr_cpu_ids * WB_STAT_BATCH;
#else
return 1;
#endif
@ -231,50 +140,57 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
* BDI_CAP_NO_WRITEBACK: Don't write pages back
* BDI_CAP_NO_ACCT_WB: Don't automatically account writeback pages
* BDI_CAP_STRICTLIMIT: Keep number of dirty pages below bdi threshold.
*
* BDI_CAP_CGROUP_WRITEBACK: Supports cgroup-aware writeback.
*/
#define BDI_CAP_NO_ACCT_DIRTY 0x00000001
#define BDI_CAP_NO_WRITEBACK 0x00000002
#define BDI_CAP_NO_ACCT_WB 0x00000004
#define BDI_CAP_STABLE_WRITES 0x00000008
#define BDI_CAP_STRICTLIMIT 0x00000010
#define BDI_CAP_CGROUP_WRITEBACK 0x00000020
#define BDI_CAP_NO_ACCT_AND_WRITEBACK \
(BDI_CAP_NO_WRITEBACK | BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_ACCT_WB)
extern struct backing_dev_info noop_backing_dev_info;
int writeback_in_progress(struct backing_dev_info *bdi);
static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
/**
* writeback_in_progress - determine whether there is writeback in progress
* @wb: bdi_writeback of interest
*
* Determine whether there is writeback waiting to be handled against a
* bdi_writeback.
*/
static inline bool writeback_in_progress(struct bdi_writeback *wb)
{
return test_bit(WB_writeback_running, &wb->state);
}
static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
{
struct super_block *sb;
if (!inode)
return &noop_backing_dev_info;
sb = inode->i_sb;
#ifdef CONFIG_BLOCK
if (sb_is_blkdev_sb(sb))
return blk_get_backing_dev_info(I_BDEV(inode));
#endif
return sb->s_bdi;
}
static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
{
struct backing_dev_info *bdi = wb->bdi;
if (bdi->congested_fn)
return bdi->congested_fn(bdi->congested_data, bdi_bits);
return (bdi->state & bdi_bits);
return bdi->congested_fn(bdi->congested_data, cong_bits);
return wb->congested->state & cong_bits;
}
static inline int bdi_read_congested(struct backing_dev_info *bdi)
{
return bdi_congested(bdi, 1 << BDI_sync_congested);
}
static inline int bdi_write_congested(struct backing_dev_info *bdi)
{
return bdi_congested(bdi, 1 << BDI_async_congested);
}
static inline int bdi_rw_congested(struct backing_dev_info *bdi)
{
return bdi_congested(bdi, (1 << BDI_sync_congested) |
(1 << BDI_async_congested));
}
enum {
BLK_RW_ASYNC = 0,
BLK_RW_SYNC = 1,
};
void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
void set_bdi_congested(struct backing_dev_info *bdi, int sync);
long congestion_wait(int sync, long timeout);
long wait_iff_congested(struct zone *zone, int sync, long timeout);
int pdflush_proc_obsolete(struct ctl_table *table, int write,
@ -318,4 +234,333 @@ static inline int bdi_sched_wait(void *word)
return 0;
}
#endif /* _LINUX_BACKING_DEV_H */
#ifdef CONFIG_CGROUP_WRITEBACK
struct bdi_writeback_congested *
wb_congested_get_create(struct backing_dev_info *bdi, int blkcg_id, gfp_t gfp);
void wb_congested_put(struct bdi_writeback_congested *congested);
struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
struct cgroup_subsys_state *memcg_css,
gfp_t gfp);
void wb_memcg_offline(struct mem_cgroup *memcg);
void wb_blkcg_offline(struct blkcg *blkcg);
int inode_congested(struct inode *inode, int cong_bits);
/**
* inode_cgwb_enabled - test whether cgroup writeback is enabled on an inode
* @inode: inode of interest
*
* cgroup writeback requires support from both the bdi and filesystem.
* Test whether @inode has both.
*/
static inline bool inode_cgwb_enabled(struct inode *inode)
{
struct backing_dev_info *bdi = inode_to_bdi(inode);
return bdi_cap_account_dirty(bdi) &&
(bdi->capabilities & BDI_CAP_CGROUP_WRITEBACK) &&
(inode->i_sb->s_iflags & SB_I_CGROUPWB);
}
/**
* wb_find_current - find wb for %current on a bdi
* @bdi: bdi of interest
*
* Find the wb of @bdi which matches both the memcg and blkcg of %current.
* Must be called under rcu_read_lock() which protects the returend wb.
* NULL if not found.
*/
static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi)
{
struct cgroup_subsys_state *memcg_css;
struct bdi_writeback *wb;
memcg_css = task_css(current, memory_cgrp_id);
if (!memcg_css->parent)
return &bdi->wb;
wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id);
/*
* %current's blkcg equals the effective blkcg of its memcg. No
* need to use the relatively expensive cgroup_get_e_css().
*/
if (likely(wb && wb->blkcg_css == task_css(current, blkio_cgrp_id)))
return wb;
return NULL;
}
/**
* wb_get_create_current - get or create wb for %current on a bdi
* @bdi: bdi of interest
* @gfp: allocation mask
*
* Equivalent to wb_get_create() on %current's memcg. This function is
* called from a relatively hot path and optimizes the common cases using
* wb_find_current().
*/
static inline struct bdi_writeback *
wb_get_create_current(struct backing_dev_info *bdi, gfp_t gfp)
{
struct bdi_writeback *wb;
rcu_read_lock();
wb = wb_find_current(bdi);
if (wb && unlikely(!wb_tryget(wb)))
wb = NULL;
rcu_read_unlock();
if (unlikely(!wb)) {
struct cgroup_subsys_state *memcg_css;
memcg_css = task_get_css(current, memory_cgrp_id);
wb = wb_get_create(bdi, memcg_css, gfp);
css_put(memcg_css);
}
return wb;
}
/**
* inode_to_wb_is_valid - test whether an inode has a wb associated
* @inode: inode of interest
*
* Returns %true if @inode has a wb associated. May be called without any
* locking.
*/
static inline bool inode_to_wb_is_valid(struct inode *inode)
{
return inode->i_wb;
}
/**
* inode_to_wb - determine the wb of an inode
* @inode: inode of interest
*
* Returns the wb @inode is currently associated with. The caller must be
* holding either @inode->i_lock, @inode->i_mapping->tree_lock, or the
* associated wb's list_lock.
*/
static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
{
#ifdef CONFIG_LOCKDEP
WARN_ON_ONCE(debug_locks &&
(!lockdep_is_held(&inode->i_lock) &&
!lockdep_is_held(&inode->i_mapping->tree_lock) &&
!lockdep_is_held(&inode->i_wb->list_lock)));
#endif
return inode->i_wb;
}
/**
* unlocked_inode_to_wb_begin - begin unlocked inode wb access transaction
* @inode: target inode
* @lockedp: temp bool output param, to be passed to the end function
*
* The caller wants to access the wb associated with @inode but isn't
* holding inode->i_lock, mapping->tree_lock or wb->list_lock. This
* function determines the wb associated with @inode and ensures that the
* association doesn't change until the transaction is finished with
* unlocked_inode_to_wb_end().
*
* The caller must call unlocked_inode_to_wb_end() with *@lockdep
* afterwards and can't sleep during transaction. IRQ may or may not be
* disabled on return.
*/
static inline struct bdi_writeback *
unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
{
rcu_read_lock();
/*
* Paired with store_release in inode_switch_wb_work_fn() and
* ensures that we see the new wb if we see cleared I_WB_SWITCH.
*/
*lockedp = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;
if (unlikely(*lockedp))
spin_lock_irq(&inode->i_mapping->tree_lock);
/*
* Protected by either !I_WB_SWITCH + rcu_read_lock() or tree_lock.
* inode_to_wb() will bark. Deref directly.
*/
return inode->i_wb;
}
/**
* unlocked_inode_to_wb_end - end inode wb access transaction
* @inode: target inode
* @locked: *@lockedp from unlocked_inode_to_wb_begin()
*/
static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
{
if (unlikely(locked))
spin_unlock_irq(&inode->i_mapping->tree_lock);
rcu_read_unlock();
}
struct wb_iter {
int start_blkcg_id;
struct radix_tree_iter tree_iter;
void **slot;
};
static inline struct bdi_writeback *__wb_iter_next(struct wb_iter *iter,
struct backing_dev_info *bdi)
{
struct radix_tree_iter *titer = &iter->tree_iter;
WARN_ON_ONCE(!rcu_read_lock_held());
if (iter->start_blkcg_id >= 0) {
iter->slot = radix_tree_iter_init(titer, iter->start_blkcg_id);
iter->start_blkcg_id = -1;
} else {
iter->slot = radix_tree_next_slot(iter->slot, titer, 0);
}
if (!iter->slot)
iter->slot = radix_tree_next_chunk(&bdi->cgwb_tree, titer, 0);
if (iter->slot)
return *iter->slot;
return NULL;
}
static inline struct bdi_writeback *__wb_iter_init(struct wb_iter *iter,
struct backing_dev_info *bdi,
int start_blkcg_id)
{
iter->start_blkcg_id = start_blkcg_id;
if (start_blkcg_id)
return __wb_iter_next(iter, bdi);
else
return &bdi->wb;
}
/**
* bdi_for_each_wb - walk all wb's of a bdi in ascending blkcg ID order
* @wb_cur: cursor struct bdi_writeback pointer
* @bdi: bdi to walk wb's of
* @iter: pointer to struct wb_iter to be used as iteration buffer
* @start_blkcg_id: blkcg ID to start iteration from
*
* Iterate @wb_cur through the wb's (bdi_writeback's) of @bdi in ascending
* blkcg ID order starting from @start_blkcg_id. @iter is struct wb_iter
* to be used as temp storage during iteration. rcu_read_lock() must be
* held throughout iteration.
*/
#define bdi_for_each_wb(wb_cur, bdi, iter, start_blkcg_id) \
for ((wb_cur) = __wb_iter_init(iter, bdi, start_blkcg_id); \
(wb_cur); (wb_cur) = __wb_iter_next(iter, bdi))
#else /* CONFIG_CGROUP_WRITEBACK */
static inline bool inode_cgwb_enabled(struct inode *inode)
{
return false;
}
static inline struct bdi_writeback_congested *
wb_congested_get_create(struct backing_dev_info *bdi, int blkcg_id, gfp_t gfp)
{
return bdi->wb.congested;
}
static inline void wb_congested_put(struct bdi_writeback_congested *congested)
{
}
static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi)
{
return &bdi->wb;
}
static inline struct bdi_writeback *
wb_get_create_current(struct backing_dev_info *bdi, gfp_t gfp)
{
return &bdi->wb;
}
static inline bool inode_to_wb_is_valid(struct inode *inode)
{
return true;
}
static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
{
return &inode_to_bdi(inode)->wb;
}
static inline struct bdi_writeback *
unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
{
return inode_to_wb(inode);
}
static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
{
}
static inline void wb_memcg_offline(struct mem_cgroup *memcg)
{
}
static inline void wb_blkcg_offline(struct blkcg *blkcg)
{
}
struct wb_iter {
int next_id;
};
#define bdi_for_each_wb(wb_cur, bdi, iter, start_blkcg_id) \
for ((iter)->next_id = (start_blkcg_id); \
({ (wb_cur) = !(iter)->next_id++ ? &(bdi)->wb : NULL; }); )
static inline int inode_congested(struct inode *inode, int cong_bits)
{
return wb_congested(&inode_to_bdi(inode)->wb, cong_bits);
}
#endif /* CONFIG_CGROUP_WRITEBACK */
static inline int inode_read_congested(struct inode *inode)
{
return inode_congested(inode, 1 << WB_sync_congested);
}
static inline int inode_write_congested(struct inode *inode)
{
return inode_congested(inode, 1 << WB_async_congested);
}
static inline int inode_rw_congested(struct inode *inode)
{
return inode_congested(inode, (1 << WB_sync_congested) |
(1 << WB_async_congested));
}
static inline int bdi_congested(struct backing_dev_info *bdi, int cong_bits)
{
return wb_congested(&bdi->wb, cong_bits);
}
static inline int bdi_read_congested(struct backing_dev_info *bdi)
{
return bdi_congested(bdi, 1 << WB_sync_congested);
}
static inline int bdi_write_congested(struct backing_dev_info *bdi)
{
return bdi_congested(bdi, 1 << WB_async_congested);
}
static inline int bdi_rw_congested(struct backing_dev_info *bdi)
{
return bdi_congested(bdi, (1 << WB_sync_congested) |
(1 << WB_async_congested));
}
#endif /* _LINUX_BACKING_DEV_H */

View File

@ -482,9 +482,12 @@ extern void bvec_free(mempool_t *, struct bio_vec *, unsigned int);
extern unsigned int bvec_nr_vecs(unsigned short idx);
#ifdef CONFIG_BLK_CGROUP
int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css);
int bio_associate_current(struct bio *bio);
void bio_disassociate_task(struct bio *bio);
#else /* CONFIG_BLK_CGROUP */
static inline int bio_associate_blkcg(struct bio *bio,
struct cgroup_subsys_state *blkcg_css) { return 0; }
static inline int bio_associate_current(struct bio *bio) { return -ENOENT; }
static inline void bio_disassociate_task(struct bio *bio) { }
#endif /* CONFIG_BLK_CGROUP */

View File

@ -46,6 +46,10 @@ struct blkcg {
struct hlist_head blkg_list;
struct blkcg_policy_data *pd[BLKCG_MAX_POLS];
#ifdef CONFIG_CGROUP_WRITEBACK
struct list_head cgwb_list;
#endif
};
struct blkg_stat {
@ -106,6 +110,12 @@ struct blkcg_gq {
struct hlist_node blkcg_node;
struct blkcg *blkcg;
/*
* Each blkg gets congested separately and the congestion state is
* propagated to the matching bdi_writeback_congested.
*/
struct bdi_writeback_congested *wb_congested;
/* all non-root blkcg_gq's are guaranteed to have access to parent */
struct blkcg_gq *parent;
@ -149,6 +159,7 @@ struct blkcg_policy {
};
extern struct blkcg blkcg_root;
extern struct cgroup_subsys_state * const blkcg_root_css;
struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, struct request_queue *q);
struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
@ -209,6 +220,12 @@ static inline struct blkcg *bio_blkcg(struct bio *bio)
return task_blkcg(current);
}
static inline struct cgroup_subsys_state *
task_get_blkcg_css(struct task_struct *task)
{
return task_get_css(task, blkio_cgrp_id);
}
/**
* blkcg_parent - get the parent of a blkcg
* @blkcg: blkcg of interest
@ -579,8 +596,8 @@ static inline void blkg_rwstat_merge(struct blkg_rwstat *to,
#else /* CONFIG_BLK_CGROUP */
struct cgroup;
struct blkcg;
struct blkcg {
};
struct blkg_policy_data {
};
@ -594,6 +611,16 @@ struct blkcg_gq {
struct blkcg_policy {
};
#define blkcg_root_css ((struct cgroup_subsys_state *)ERR_PTR(-EINVAL))
static inline struct cgroup_subsys_state *
task_get_blkcg_css(struct task_struct *task)
{
return NULL;
}
#ifdef CONFIG_BLOCK
static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, void *key) { return NULL; }
static inline int blkcg_init_queue(struct request_queue *q) { return 0; }
static inline void blkcg_drain_queue(struct request_queue *q) { }
@ -623,5 +650,6 @@ static inline struct request_list *blk_rq_rl(struct request *rq) { return &rq->q
#define blk_queue_for_each_rl(rl, q) \
for ((rl) = &(q)->root_rl; (rl); (rl) = NULL)
#endif /* CONFIG_BLOCK */
#endif /* CONFIG_BLK_CGROUP */
#endif /* _BLK_CGROUP_H */

View File

@ -12,7 +12,7 @@
#include <linux/timer.h>
#include <linux/workqueue.h>
#include <linux/pagemap.h>
#include <linux/backing-dev.h>
#include <linux/backing-dev-defs.h>
#include <linux/wait.h>
#include <linux/mempool.h>
#include <linux/bio.h>
@ -787,25 +787,6 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
struct scsi_ioctl_command __user *);
/*
* A queue has just exitted congestion. Note this in the global counter of
* congested queues, and wake up anyone who was waiting for requests to be
* put back.
*/
static inline void blk_clear_queue_congested(struct request_queue *q, int sync)
{
clear_bdi_congested(&q->backing_dev_info, sync);
}
/*
* A queue has just entered congestion. Flag that in the queue's VM-visible
* state flags and increment the global gounter of congested queues.
*/
static inline void blk_set_queue_congested(struct request_queue *q, int sync)
{
set_bdi_congested(&q->backing_dev_info, sync);
}
extern void blk_start_queue(struct request_queue *q);
extern void blk_stop_queue(struct request_queue *q);
extern void blk_sync_queue(struct request_queue *q);

View File

@ -773,6 +773,31 @@ static inline struct cgroup_subsys_state *task_css(struct task_struct *task,
return task_css_check(task, subsys_id, false);
}
/**
* task_get_css - find and get the css for (task, subsys)
* @task: the target task
* @subsys_id: the target subsystem ID
*
* Find the css for the (@task, @subsys_id) combination, increment a
* reference on and return it. This function is guaranteed to return a
* valid css.
*/
static inline struct cgroup_subsys_state *
task_get_css(struct task_struct *task, int subsys_id)
{
struct cgroup_subsys_state *css;
rcu_read_lock();
while (true) {
css = task_css(task, subsys_id);
if (likely(css_tryget_online(css)))
break;
cpu_relax();
}
rcu_read_unlock();
return css;
}
/**
* task_css_is_root - test whether a task belongs to the root css
* @task: the target task

View File

@ -35,6 +35,7 @@
#include <uapi/linux/fs.h>
struct backing_dev_info;
struct bdi_writeback;
struct export_operations;
struct hd_geometry;
struct iovec;
@ -634,6 +635,14 @@ struct inode {
struct hlist_node i_hash;
struct list_head i_wb_list; /* backing dev IO list */
#ifdef CONFIG_CGROUP_WRITEBACK
struct bdi_writeback *i_wb; /* the associated cgroup wb */
/* foreign inode detection, see wbc_detach_inode() */
int i_wb_frn_winner;
u16 i_wb_frn_avg_time;
u16 i_wb_frn_history;
#endif
struct list_head i_lru; /* inode LRU list */
struct list_head i_sb_list;
union {
@ -1232,6 +1241,8 @@ struct mm_struct;
#define UMOUNT_NOFOLLOW 0x00000008 /* Don't follow symlink on umount */
#define UMOUNT_UNUSED 0x80000000 /* Flag guaranteed to be unused */
/* sb->s_iflags */
#define SB_I_CGROUPWB 0x00000001 /* cgroup-aware writeback enabled */
/* Possible states of 'frozen' field */
enum {
@ -1270,6 +1281,7 @@ struct super_block {
const struct quotactl_ops *s_qcop;
const struct export_operations *s_export_op;
unsigned long s_flags;
unsigned long s_iflags; /* internal SB_I_* flags */
unsigned long s_magic;
struct dentry *s_root;
struct rw_semaphore s_umount;
@ -1806,6 +1818,11 @@ struct super_operations {
*
* I_DIO_WAKEUP Never set. Only used as a key for wait_on_bit().
*
* I_WB_SWITCH Cgroup bdi_writeback switching in progress. Used to
* synchronize competing switching instances and to tell
* wb stat updates to grab mapping->tree_lock. See
* inode_switch_wb_work_fn() for details.
*
* Q: What is the difference between I_WILL_FREE and I_FREEING?
*/
#define I_DIRTY_SYNC (1 << 0)
@ -1825,6 +1842,7 @@ struct super_operations {
#define I_DIRTY_TIME (1 << 11)
#define __I_DIRTY_TIME_EXPIRED 12
#define I_DIRTY_TIME_EXPIRED (1 << __I_DIRTY_TIME_EXPIRED)
#define I_WB_SWITCH (1 << 13)
#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
#define I_DIRTY_ALL (I_DIRTY | I_DIRTY_TIME)
@ -2241,7 +2259,13 @@ extern struct super_block *freeze_bdev(struct block_device *);
extern void emergency_thaw_all(void);
extern int thaw_bdev(struct block_device *bdev, struct super_block *sb);
extern int fsync_bdev(struct block_device *);
extern int sb_is_blkdev_sb(struct super_block *sb);
extern struct super_block *blockdev_superblock;
static inline bool sb_is_blkdev_sb(struct super_block *sb)
{
return sb == blockdev_superblock;
}
#else
static inline void bd_forget(struct inode *inode) {}
static inline int sync_blockdev(struct block_device *bdev) { return 0; }

View File

@ -41,6 +41,7 @@ enum mem_cgroup_stat_index {
MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
MEM_CGROUP_STAT_RSS_HUGE, /* # of pages charged as anon huge */
MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
MEM_CGROUP_STAT_DIRTY, /* # of dirty pages in page cache */
MEM_CGROUP_STAT_WRITEBACK, /* # of pages under writeback */
MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
MEM_CGROUP_STAT_NSTATS,
@ -67,6 +68,8 @@ enum mem_cgroup_events_index {
};
#ifdef CONFIG_MEMCG
extern struct cgroup_subsys_state *mem_cgroup_root_css;
void mem_cgroup_events(struct mem_cgroup *memcg,
enum mem_cgroup_events_index idx,
unsigned int nr);
@ -112,6 +115,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
}
extern struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg);
extern struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page);
struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
struct mem_cgroup *,
@ -195,6 +199,8 @@ void mem_cgroup_split_huge_fixup(struct page *head);
#else /* CONFIG_MEMCG */
struct mem_cgroup;
#define mem_cgroup_root_css ((struct cgroup_subsys_state *)ERR_PTR(-EINVAL))
static inline void mem_cgroup_events(struct mem_cgroup *memcg,
enum mem_cgroup_events_index idx,
unsigned int nr)
@ -382,6 +388,29 @@ enum {
OVER_LIMIT,
};
#ifdef CONFIG_CGROUP_WRITEBACK
struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg);
struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pavail,
unsigned long *pdirty, unsigned long *pwriteback);
#else /* CONFIG_CGROUP_WRITEBACK */
static inline struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
{
return NULL;
}
static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb,
unsigned long *pavail,
unsigned long *pdirty,
unsigned long *pwriteback)
{
}
#endif /* CONFIG_CGROUP_WRITEBACK */
struct sock;
#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
void sock_update_memcg(struct sock *sk);

View File

@ -27,6 +27,7 @@ struct anon_vma_chain;
struct file_ra_state;
struct user_struct;
struct writeback_control;
struct bdi_writeback;
#ifndef CONFIG_NEED_MULTIPLE_NODES /* Don't use mapnrs, do it properly */
extern unsigned long max_mapnr;
@ -1211,10 +1212,13 @@ int __set_page_dirty_nobuffers(struct page *page);
int __set_page_dirty_no_writeback(struct page *page);
int redirty_page_for_writepage(struct writeback_control *wbc,
struct page *page);
void account_page_dirtied(struct page *page, struct address_space *mapping);
void account_page_cleaned(struct page *page, struct address_space *mapping);
void account_page_dirtied(struct page *page, struct address_space *mapping,
struct mem_cgroup *memcg);
void account_page_cleaned(struct page *page, struct address_space *mapping,
struct mem_cgroup *memcg, struct bdi_writeback *wb);
int set_page_dirty(struct page *page);
int set_page_dirty_lock(struct page *page);
void cancel_dirty_page(struct page *page);
int clear_page_dirty_for_io(struct page *page);
int get_cmdline(struct task_struct *task, char *buffer, int buflen);

View File

@ -651,7 +651,8 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
pgoff_t index, gfp_t gfp_mask);
extern void delete_from_page_cache(struct page *page);
extern void __delete_from_page_cache(struct page *page, void *shadow);
extern void __delete_from_page_cache(struct page *page, void *shadow,
struct mem_cgroup *memcg);
int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
/*

View File

@ -7,6 +7,8 @@
#include <linux/sched.h>
#include <linux/workqueue.h>
#include <linux/fs.h>
#include <linux/flex_proportions.h>
#include <linux/backing-dev-defs.h>
DECLARE_PER_CPU(int, dirty_throttle_leaks);
@ -84,8 +86,85 @@ struct writeback_control {
unsigned for_reclaim:1; /* Invoked from the page allocator */
unsigned range_cyclic:1; /* range_start is cyclic */
unsigned for_sync:1; /* sync(2) WB_SYNC_ALL writeback */
#ifdef CONFIG_CGROUP_WRITEBACK
struct bdi_writeback *wb; /* wb this writeback is issued under */
struct inode *inode; /* inode being written out */
/* foreign inode detection, see wbc_detach_inode() */
int wb_id; /* current wb id */
int wb_lcand_id; /* last foreign candidate wb id */
int wb_tcand_id; /* this foreign candidate wb id */
size_t wb_bytes; /* bytes written by current wb */
size_t wb_lcand_bytes; /* bytes written by last candidate */
size_t wb_tcand_bytes; /* bytes written by this candidate */
#endif
};
/*
* A wb_domain represents a domain that wb's (bdi_writeback's) belong to
* and are measured against each other in. There always is one global
* domain, global_wb_domain, that every wb in the system is a member of.
* This allows measuring the relative bandwidth of each wb to distribute
* dirtyable memory accordingly.
*/
struct wb_domain {
spinlock_t lock;
/*
* Scale the writeback cache size proportional to the relative
* writeout speed.
*
* We do this by keeping a floating proportion between BDIs, based
* on page writeback completions [end_page_writeback()]. Those
* devices that write out pages fastest will get the larger share,
* while the slower will get a smaller share.
*
* We use page writeout completions because we are interested in
* getting rid of dirty pages. Having them written out is the
* primary goal.
*
* We introduce a concept of time, a period over which we measure
* these events, because demand can/will vary over time. The length
* of this period itself is measured in page writeback completions.
*/
struct fprop_global completions;
struct timer_list period_timer; /* timer for aging of completions */
unsigned long period_time;
/*
* The dirtyable memory and dirty threshold could be suddenly
* knocked down by a large amount (eg. on the startup of KVM in a
* swapless system). This may throw the system into deep dirty
* exceeded state and throttle heavy/light dirtiers alike. To
* retain good responsiveness, maintain global_dirty_limit for
* tracking slowly down to the knocked down dirty threshold.
*
* Both fields are protected by ->lock.
*/
unsigned long dirty_limit_tstamp;
unsigned long dirty_limit;
};
/**
* wb_domain_size_changed - memory available to a wb_domain has changed
* @dom: wb_domain of interest
*
* This function should be called when the amount of memory available to
* @dom has changed. It resets @dom's dirty limit parameters to prevent
* the past values which don't match the current configuration from skewing
* dirty throttling. Without this, when memory size of a wb_domain is
* greatly reduced, the dirty throttling logic may allow too many pages to
* be dirtied leading to consecutive unnecessary OOMs and may get stuck in
* that situation.
*/
static inline void wb_domain_size_changed(struct wb_domain *dom)
{
spin_lock(&dom->lock);
dom->dirty_limit_tstamp = jiffies;
dom->dirty_limit = 0;
spin_unlock(&dom->lock);
}
/*
* fs/fs-writeback.c
*/
@ -93,9 +172,9 @@ struct bdi_writeback;
void writeback_inodes_sb(struct super_block *, enum wb_reason reason);
void writeback_inodes_sb_nr(struct super_block *, unsigned long nr,
enum wb_reason reason);
int try_to_writeback_inodes_sb(struct super_block *, enum wb_reason reason);
int try_to_writeback_inodes_sb_nr(struct super_block *, unsigned long nr,
enum wb_reason reason);
bool try_to_writeback_inodes_sb(struct super_block *, enum wb_reason reason);
bool try_to_writeback_inodes_sb_nr(struct super_block *, unsigned long nr,
enum wb_reason reason);
void sync_inodes_sb(struct super_block *);
void wakeup_flusher_threads(long nr_pages, enum wb_reason reason);
void inode_wait_for_writeback(struct inode *inode);
@ -107,6 +186,123 @@ static inline void wait_on_inode(struct inode *inode)
wait_on_bit(&inode->i_state, __I_NEW, TASK_UNINTERRUPTIBLE);
}
#ifdef CONFIG_CGROUP_WRITEBACK
#include <linux/cgroup.h>
#include <linux/bio.h>
void __inode_attach_wb(struct inode *inode, struct page *page);
void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
struct inode *inode)
__releases(&inode->i_lock);
void wbc_detach_inode(struct writeback_control *wbc);
void wbc_account_io(struct writeback_control *wbc, struct page *page,
size_t bytes);
/**
* inode_attach_wb - associate an inode with its wb
* @inode: inode of interest
* @page: page being dirtied (may be NULL)
*
* If @inode doesn't have its wb, associate it with the wb matching the
* memcg of @page or, if @page is NULL, %current. May be called w/ or w/o
* @inode->i_lock.
*/
static inline void inode_attach_wb(struct inode *inode, struct page *page)
{
if (!inode->i_wb)
__inode_attach_wb(inode, page);
}
/**
* inode_detach_wb - disassociate an inode from its wb
* @inode: inode of interest
*
* @inode is being freed. Detach from its wb.
*/
static inline void inode_detach_wb(struct inode *inode)
{
if (inode->i_wb) {
wb_put(inode->i_wb);
inode->i_wb = NULL;
}
}
/**
* wbc_attach_fdatawrite_inode - associate wbc and inode for fdatawrite
* @wbc: writeback_control of interest
* @inode: target inode
*
* This function is to be used by __filemap_fdatawrite_range(), which is an
* alternative entry point into writeback code, and first ensures @inode is
* associated with a bdi_writeback and attaches it to @wbc.
*/
static inline void wbc_attach_fdatawrite_inode(struct writeback_control *wbc,
struct inode *inode)
{
spin_lock(&inode->i_lock);
inode_attach_wb(inode, NULL);
wbc_attach_and_unlock_inode(wbc, inode);
}
/**
* wbc_init_bio - writeback specific initializtion of bio
* @wbc: writeback_control for the writeback in progress
* @bio: bio to be initialized
*
* @bio is a part of the writeback in progress controlled by @wbc. Perform
* writeback specific initialization. This is used to apply the cgroup
* writeback context.
*/
static inline void wbc_init_bio(struct writeback_control *wbc, struct bio *bio)
{
/*
* pageout() path doesn't attach @wbc to the inode being written
* out. This is intentional as we don't want the function to block
* behind a slow cgroup. Ultimately, we want pageout() to kick off
* regular writeback instead of writing things out itself.
*/
if (wbc->wb)
bio_associate_blkcg(bio, wbc->wb->blkcg_css);
}
#else /* CONFIG_CGROUP_WRITEBACK */
static inline void inode_attach_wb(struct inode *inode, struct page *page)
{
}
static inline void inode_detach_wb(struct inode *inode)
{
}
static inline void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
struct inode *inode)
__releases(&inode->i_lock)
{
spin_unlock(&inode->i_lock);
}
static inline void wbc_attach_fdatawrite_inode(struct writeback_control *wbc,
struct inode *inode)
{
}
static inline void wbc_detach_inode(struct writeback_control *wbc)
{
}
static inline void wbc_init_bio(struct writeback_control *wbc, struct bio *bio)
{
}
static inline void wbc_account_io(struct writeback_control *wbc,
struct page *page, size_t bytes)
{
}
#endif /* CONFIG_CGROUP_WRITEBACK */
/*
* mm/page-writeback.c
*/
@ -120,8 +316,12 @@ static inline void laptop_sync_completion(void) { }
#endif
void throttle_vm_writeout(gfp_t gfp_mask);
bool zone_dirty_ok(struct zone *zone);
int wb_domain_init(struct wb_domain *dom, gfp_t gfp);
#ifdef CONFIG_CGROUP_WRITEBACK
void wb_domain_exit(struct wb_domain *dom);
#endif
extern unsigned long global_dirty_limit;
extern struct wb_domain global_wb_domain;
/* These are exported to sysctl. */
extern int dirty_background_ratio;
@ -155,19 +355,12 @@ int dirty_writeback_centisecs_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
unsigned long dirty);
void __bdi_update_bandwidth(struct backing_dev_info *bdi,
unsigned long thresh,
unsigned long bg_thresh,
unsigned long dirty,
unsigned long bdi_thresh,
unsigned long bdi_dirty,
unsigned long start_time);
unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh);
void wb_update_bandwidth(struct bdi_writeback *wb, unsigned long start_time);
void page_writeback_init(void);
void balance_dirty_pages_ratelimited(struct address_space *mapping);
bool wb_over_bg_thresh(struct bdi_writeback *wb);
typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
void *data);

View File

@ -360,7 +360,7 @@ TRACE_EVENT(global_dirty_state,
__entry->nr_written = global_page_state(NR_WRITTEN);
__entry->background_thresh = background_thresh;
__entry->dirty_thresh = dirty_thresh;
__entry->dirty_limit = global_dirty_limit;
__entry->dirty_limit = global_wb_domain.dirty_limit;
),
TP_printk("dirty=%lu writeback=%lu unstable=%lu "
@ -399,13 +399,13 @@ TRACE_EVENT(bdi_dirty_ratelimit,
TP_fast_assign(
strlcpy(__entry->bdi, dev_name(bdi->dev), 32);
__entry->write_bw = KBps(bdi->write_bandwidth);
__entry->avg_write_bw = KBps(bdi->avg_write_bandwidth);
__entry->write_bw = KBps(bdi->wb.write_bandwidth);
__entry->avg_write_bw = KBps(bdi->wb.avg_write_bandwidth);
__entry->dirty_rate = KBps(dirty_rate);
__entry->dirty_ratelimit = KBps(bdi->dirty_ratelimit);
__entry->dirty_ratelimit = KBps(bdi->wb.dirty_ratelimit);
__entry->task_ratelimit = KBps(task_ratelimit);
__entry->balanced_dirty_ratelimit =
KBps(bdi->balanced_dirty_ratelimit);
KBps(bdi->wb.balanced_dirty_ratelimit);
),
TP_printk("bdi %s: "
@ -462,8 +462,9 @@ TRACE_EVENT(balance_dirty_pages,
unsigned long freerun = (thresh + bg_thresh) / 2;
strlcpy(__entry->bdi, dev_name(bdi->dev), 32);
__entry->limit = global_dirty_limit;
__entry->setpoint = (global_dirty_limit + freerun) / 2;
__entry->limit = global_wb_domain.dirty_limit;
__entry->setpoint = (global_wb_domain.dirty_limit +
freerun) / 2;
__entry->dirty = dirty;
__entry->bdi_setpoint = __entry->setpoint *
bdi_thresh / (thresh + 1);

View File

@ -1127,6 +1127,11 @@ config DEBUG_BLK_CGROUP
Enable some debugging help. Currently it exports additional stat
files in a cgroup which can be useful for debugging.
config CGROUP_WRITEBACK
bool
depends on MEMCG && BLK_CGROUP
default y
endif # CGROUPS
config CHECKPOINT_RESTORE

View File

@ -18,6 +18,7 @@ struct backing_dev_info noop_backing_dev_info = {
.name = "noop",
.capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK,
};
EXPORT_SYMBOL_GPL(noop_backing_dev_info);
static struct class *bdi_class;
@ -48,7 +49,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
struct bdi_writeback *wb = &bdi->wb;
unsigned long background_thresh;
unsigned long dirty_thresh;
unsigned long bdi_thresh;
unsigned long wb_thresh;
unsigned long nr_dirty, nr_io, nr_more_io, nr_dirty_time;
struct inode *inode;
@ -66,7 +67,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
spin_unlock(&wb->list_lock);
global_dirty_limits(&background_thresh, &dirty_thresh);
bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
wb_thresh = wb_calc_thresh(wb, dirty_thresh);
#define K(x) ((x) << (PAGE_SHIFT - 10))
seq_printf(m,
@ -84,19 +85,19 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
"b_dirty_time: %10lu\n"
"bdi_list: %10u\n"
"state: %10lx\n",
(unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
(unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
K(bdi_thresh),
(unsigned long) K(wb_stat(wb, WB_WRITEBACK)),
(unsigned long) K(wb_stat(wb, WB_RECLAIMABLE)),
K(wb_thresh),
K(dirty_thresh),
K(background_thresh),
(unsigned long) K(bdi_stat(bdi, BDI_DIRTIED)),
(unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
(unsigned long) K(bdi->write_bandwidth),
(unsigned long) K(wb_stat(wb, WB_DIRTIED)),
(unsigned long) K(wb_stat(wb, WB_WRITTEN)),
(unsigned long) K(wb->write_bandwidth),
nr_dirty,
nr_io,
nr_more_io,
nr_dirty_time,
!list_empty(&bdi->bdi_list), bdi->state);
!list_empty(&bdi->bdi_list), bdi->wb.state);
#undef K
return 0;
@ -255,13 +256,8 @@ static int __init default_bdi_init(void)
}
subsys_initcall(default_bdi_init);
int bdi_has_dirty_io(struct backing_dev_info *bdi)
{
return wb_has_dirty_io(&bdi->wb);
}
/*
* This function is used when the first inode for this bdi is marked dirty. It
* This function is used when the first inode for this wb is marked dirty. It
* wakes-up the corresponding bdi thread which should then take care of the
* periodic background write-out of dirty inodes. Since the write-out would
* starts only 'dirty_writeback_interval' centisecs from now anyway, we just
@ -274,29 +270,497 @@ int bdi_has_dirty_io(struct backing_dev_info *bdi)
* We have to be careful not to postpone flush work if it is scheduled for
* earlier. Thus we use queue_delayed_work().
*/
void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi)
void wb_wakeup_delayed(struct bdi_writeback *wb)
{
unsigned long timeout;
timeout = msecs_to_jiffies(dirty_writeback_interval * 10);
spin_lock_bh(&bdi->wb_lock);
if (test_bit(BDI_registered, &bdi->state))
queue_delayed_work(bdi_wq, &bdi->wb.dwork, timeout);
spin_unlock_bh(&bdi->wb_lock);
spin_lock_bh(&wb->work_lock);
if (test_bit(WB_registered, &wb->state))
queue_delayed_work(bdi_wq, &wb->dwork, timeout);
spin_unlock_bh(&wb->work_lock);
}
/*
* Remove bdi from bdi_list, and ensure that it is no longer visible
* Initial write bandwidth: 100 MB/s
*/
static void bdi_remove_from_list(struct backing_dev_info *bdi)
{
spin_lock_bh(&bdi_lock);
list_del_rcu(&bdi->bdi_list);
spin_unlock_bh(&bdi_lock);
#define INIT_BW (100 << (20 - PAGE_SHIFT))
synchronize_rcu_expedited();
static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
gfp_t gfp)
{
int i, err;
memset(wb, 0, sizeof(*wb));
wb->bdi = bdi;
wb->last_old_flush = jiffies;
INIT_LIST_HEAD(&wb->b_dirty);
INIT_LIST_HEAD(&wb->b_io);
INIT_LIST_HEAD(&wb->b_more_io);
INIT_LIST_HEAD(&wb->b_dirty_time);
spin_lock_init(&wb->list_lock);
wb->bw_time_stamp = jiffies;
wb->balanced_dirty_ratelimit = INIT_BW;
wb->dirty_ratelimit = INIT_BW;
wb->write_bandwidth = INIT_BW;
wb->avg_write_bandwidth = INIT_BW;
spin_lock_init(&wb->work_lock);
INIT_LIST_HEAD(&wb->work_list);
INIT_DELAYED_WORK(&wb->dwork, wb_workfn);
err = fprop_local_init_percpu(&wb->completions, gfp);
if (err)
return err;
for (i = 0; i < NR_WB_STAT_ITEMS; i++) {
err = percpu_counter_init(&wb->stat[i], 0, gfp);
if (err) {
while (--i)
percpu_counter_destroy(&wb->stat[i]);
fprop_local_destroy_percpu(&wb->completions);
return err;
}
}
return 0;
}
/*
* Remove bdi from the global list and shutdown any threads we have running
*/
static void wb_shutdown(struct bdi_writeback *wb)
{
/* Make sure nobody queues further work */
spin_lock_bh(&wb->work_lock);
if (!test_and_clear_bit(WB_registered, &wb->state)) {
spin_unlock_bh(&wb->work_lock);
return;
}
spin_unlock_bh(&wb->work_lock);
/*
* Drain work list and shutdown the delayed_work. !WB_registered
* tells wb_workfn() that @wb is dying and its work_list needs to
* be drained no matter what.
*/
mod_delayed_work(bdi_wq, &wb->dwork, 0);
flush_delayed_work(&wb->dwork);
WARN_ON(!list_empty(&wb->work_list));
}
static void wb_exit(struct bdi_writeback *wb)
{
int i;
WARN_ON(delayed_work_pending(&wb->dwork));
for (i = 0; i < NR_WB_STAT_ITEMS; i++)
percpu_counter_destroy(&wb->stat[i]);
fprop_local_destroy_percpu(&wb->completions);
}
#ifdef CONFIG_CGROUP_WRITEBACK
#include <linux/memcontrol.h>
/*
* cgwb_lock protects bdi->cgwb_tree, bdi->cgwb_congested_tree,
* blkcg->cgwb_list, and memcg->cgwb_list. bdi->cgwb_tree is also RCU
* protected. cgwb_release_wait is used to wait for the completion of cgwb
* releases from bdi destruction path.
*/
static DEFINE_SPINLOCK(cgwb_lock);
static DECLARE_WAIT_QUEUE_HEAD(cgwb_release_wait);
/**
* wb_congested_get_create - get or create a wb_congested
* @bdi: associated bdi
* @blkcg_id: ID of the associated blkcg
* @gfp: allocation mask
*
* Look up the wb_congested for @blkcg_id on @bdi. If missing, create one.
* The returned wb_congested has its reference count incremented. Returns
* NULL on failure.
*/
struct bdi_writeback_congested *
wb_congested_get_create(struct backing_dev_info *bdi, int blkcg_id, gfp_t gfp)
{
struct bdi_writeback_congested *new_congested = NULL, *congested;
struct rb_node **node, *parent;
unsigned long flags;
if (blkcg_id == 1)
return &bdi->wb_congested;
retry:
spin_lock_irqsave(&cgwb_lock, flags);
node = &bdi->cgwb_congested_tree.rb_node;
parent = NULL;
while (*node != NULL) {
parent = *node;
congested = container_of(parent, struct bdi_writeback_congested,
rb_node);
if (congested->blkcg_id < blkcg_id)
node = &parent->rb_left;
else if (congested->blkcg_id > blkcg_id)
node = &parent->rb_right;
else
goto found;
}
if (new_congested) {
/* !found and storage for new one already allocated, insert */
congested = new_congested;
new_congested = NULL;
rb_link_node(&congested->rb_node, parent, node);
rb_insert_color(&congested->rb_node, &bdi->cgwb_congested_tree);
atomic_inc(&bdi->usage_cnt);
goto found;
}
spin_unlock_irqrestore(&cgwb_lock, flags);
/* allocate storage for new one and retry */
new_congested = kzalloc(sizeof(*new_congested), gfp);
if (!new_congested)
return NULL;
atomic_set(&new_congested->refcnt, 0);
new_congested->bdi = bdi;
new_congested->blkcg_id = blkcg_id;
goto retry;
found:
atomic_inc(&congested->refcnt);
spin_unlock_irqrestore(&cgwb_lock, flags);
kfree(new_congested);
return congested;
}
/**
* wb_congested_put - put a wb_congested
* @congested: wb_congested to put
*
* Put @congested and destroy it if the refcnt reaches zero.
*/
void wb_congested_put(struct bdi_writeback_congested *congested)
{
struct backing_dev_info *bdi = congested->bdi;
unsigned long flags;
if (congested->blkcg_id == 1)
return;
local_irq_save(flags);
if (!atomic_dec_and_lock(&congested->refcnt, &cgwb_lock)) {
local_irq_restore(flags);
return;
}
rb_erase(&congested->rb_node, &congested->bdi->cgwb_congested_tree);
spin_unlock_irqrestore(&cgwb_lock, flags);
kfree(congested);
if (atomic_dec_and_test(&bdi->usage_cnt))
wake_up_all(&cgwb_release_wait);
}
static void cgwb_release_workfn(struct work_struct *work)
{
struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
release_work);
struct backing_dev_info *bdi = wb->bdi;
wb_shutdown(wb);
css_put(wb->memcg_css);
css_put(wb->blkcg_css);
wb_congested_put(wb->congested);
fprop_local_destroy_percpu(&wb->memcg_completions);
percpu_ref_exit(&wb->refcnt);
wb_exit(wb);
kfree_rcu(wb, rcu);
if (atomic_dec_and_test(&bdi->usage_cnt))
wake_up_all(&cgwb_release_wait);
}
static void cgwb_release(struct percpu_ref *refcnt)
{
struct bdi_writeback *wb = container_of(refcnt, struct bdi_writeback,
refcnt);
schedule_work(&wb->release_work);
}
static void cgwb_kill(struct bdi_writeback *wb)
{
lockdep_assert_held(&cgwb_lock);
WARN_ON(!radix_tree_delete(&wb->bdi->cgwb_tree, wb->memcg_css->id));
list_del(&wb->memcg_node);
list_del(&wb->blkcg_node);
percpu_ref_kill(&wb->refcnt);
}
static int cgwb_create(struct backing_dev_info *bdi,
struct cgroup_subsys_state *memcg_css, gfp_t gfp)
{
struct mem_cgroup *memcg;
struct cgroup_subsys_state *blkcg_css;
struct blkcg *blkcg;
struct list_head *memcg_cgwb_list, *blkcg_cgwb_list;
struct bdi_writeback *wb;
unsigned long flags;
int ret = 0;
memcg = mem_cgroup_from_css(memcg_css);
blkcg_css = cgroup_get_e_css(memcg_css->cgroup, &blkio_cgrp_subsys);
blkcg = css_to_blkcg(blkcg_css);
memcg_cgwb_list = mem_cgroup_cgwb_list(memcg);
blkcg_cgwb_list = &blkcg->cgwb_list;
/* look up again under lock and discard on blkcg mismatch */
spin_lock_irqsave(&cgwb_lock, flags);
wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id);
if (wb && wb->blkcg_css != blkcg_css) {
cgwb_kill(wb);
wb = NULL;
}
spin_unlock_irqrestore(&cgwb_lock, flags);
if (wb)
goto out_put;
/* need to create a new one */
wb = kmalloc(sizeof(*wb), gfp);
if (!wb)
return -ENOMEM;
ret = wb_init(wb, bdi, gfp);
if (ret)
goto err_free;
ret = percpu_ref_init(&wb->refcnt, cgwb_release, 0, gfp);
if (ret)
goto err_wb_exit;
ret = fprop_local_init_percpu(&wb->memcg_completions, gfp);
if (ret)
goto err_ref_exit;
wb->congested = wb_congested_get_create(bdi, blkcg_css->id, gfp);
if (!wb->congested) {
ret = -ENOMEM;
goto err_fprop_exit;
}
wb->memcg_css = memcg_css;
wb->blkcg_css = blkcg_css;
INIT_WORK(&wb->release_work, cgwb_release_workfn);
set_bit(WB_registered, &wb->state);
/*
* The root wb determines the registered state of the whole bdi and
* memcg_cgwb_list and blkcg_cgwb_list's next pointers indicate
* whether they're still online. Don't link @wb if any is dead.
* See wb_memcg_offline() and wb_blkcg_offline().
*/
ret = -ENODEV;
spin_lock_irqsave(&cgwb_lock, flags);
if (test_bit(WB_registered, &bdi->wb.state) &&
blkcg_cgwb_list->next && memcg_cgwb_list->next) {
/* we might have raced another instance of this function */
ret = radix_tree_insert(&bdi->cgwb_tree, memcg_css->id, wb);
if (!ret) {
atomic_inc(&bdi->usage_cnt);
list_add(&wb->memcg_node, memcg_cgwb_list);
list_add(&wb->blkcg_node, blkcg_cgwb_list);
css_get(memcg_css);
css_get(blkcg_css);
}
}
spin_unlock_irqrestore(&cgwb_lock, flags);
if (ret) {
if (ret == -EEXIST)
ret = 0;
goto err_put_congested;
}
goto out_put;
err_put_congested:
wb_congested_put(wb->congested);
err_fprop_exit:
fprop_local_destroy_percpu(&wb->memcg_completions);
err_ref_exit:
percpu_ref_exit(&wb->refcnt);
err_wb_exit:
wb_exit(wb);
err_free:
kfree(wb);
out_put:
css_put(blkcg_css);
return ret;
}
/**
* wb_get_create - get wb for a given memcg, create if necessary
* @bdi: target bdi
* @memcg_css: cgroup_subsys_state of the target memcg (must have positive ref)
* @gfp: allocation mask to use
*
* Try to get the wb for @memcg_css on @bdi. If it doesn't exist, try to
* create one. The returned wb has its refcount incremented.
*
* This function uses css_get() on @memcg_css and thus expects its refcnt
* to be positive on invocation. IOW, rcu_read_lock() protection on
* @memcg_css isn't enough. try_get it before calling this function.
*
* A wb is keyed by its associated memcg. As blkcg implicitly enables
* memcg on the default hierarchy, memcg association is guaranteed to be
* more specific (equal or descendant to the associated blkcg) and thus can
* identify both the memcg and blkcg associations.
*
* Because the blkcg associated with a memcg may change as blkcg is enabled
* and disabled closer to root in the hierarchy, each wb keeps track of
* both the memcg and blkcg associated with it and verifies the blkcg on
* each lookup. On mismatch, the existing wb is discarded and a new one is
* created.
*/
struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
struct cgroup_subsys_state *memcg_css,
gfp_t gfp)
{
struct bdi_writeback *wb;
might_sleep_if(gfp & __GFP_WAIT);
if (!memcg_css->parent)
return &bdi->wb;
do {
rcu_read_lock();
wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id);
if (wb) {
struct cgroup_subsys_state *blkcg_css;
/* see whether the blkcg association has changed */
blkcg_css = cgroup_get_e_css(memcg_css->cgroup,
&blkio_cgrp_subsys);
if (unlikely(wb->blkcg_css != blkcg_css ||
!wb_tryget(wb)))
wb = NULL;
css_put(blkcg_css);
}
rcu_read_unlock();
} while (!wb && !cgwb_create(bdi, memcg_css, gfp));
return wb;
}
static void cgwb_bdi_init(struct backing_dev_info *bdi)
{
bdi->wb.memcg_css = mem_cgroup_root_css;
bdi->wb.blkcg_css = blkcg_root_css;
bdi->wb_congested.blkcg_id = 1;
INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC);
bdi->cgwb_congested_tree = RB_ROOT;
atomic_set(&bdi->usage_cnt, 1);
}
static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
{
struct radix_tree_iter iter;
void **slot;
WARN_ON(test_bit(WB_registered, &bdi->wb.state));
spin_lock_irq(&cgwb_lock);
radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
cgwb_kill(*slot);
spin_unlock_irq(&cgwb_lock);
/*
* All cgwb's and their congested states must be shutdown and
* released before returning. Drain the usage counter to wait for
* all cgwb's and cgwb_congested's ever created on @bdi.
*/
atomic_dec(&bdi->usage_cnt);
wait_event(cgwb_release_wait, !atomic_read(&bdi->usage_cnt));
}
/**
* wb_memcg_offline - kill all wb's associated with a memcg being offlined
* @memcg: memcg being offlined
*
* Also prevents creation of any new wb's associated with @memcg.
*/
void wb_memcg_offline(struct mem_cgroup *memcg)
{
LIST_HEAD(to_destroy);
struct list_head *memcg_cgwb_list = mem_cgroup_cgwb_list(memcg);
struct bdi_writeback *wb, *next;
spin_lock_irq(&cgwb_lock);
list_for_each_entry_safe(wb, next, memcg_cgwb_list, memcg_node)
cgwb_kill(wb);
memcg_cgwb_list->next = NULL; /* prevent new wb's */
spin_unlock_irq(&cgwb_lock);
}
/**
* wb_blkcg_offline - kill all wb's associated with a blkcg being offlined
* @blkcg: blkcg being offlined
*
* Also prevents creation of any new wb's associated with @blkcg.
*/
void wb_blkcg_offline(struct blkcg *blkcg)
{
LIST_HEAD(to_destroy);
struct bdi_writeback *wb, *next;
spin_lock_irq(&cgwb_lock);
list_for_each_entry_safe(wb, next, &blkcg->cgwb_list, blkcg_node)
cgwb_kill(wb);
blkcg->cgwb_list.next = NULL; /* prevent new wb's */
spin_unlock_irq(&cgwb_lock);
}
#else /* CONFIG_CGROUP_WRITEBACK */
static void cgwb_bdi_init(struct backing_dev_info *bdi) { }
static void cgwb_bdi_destroy(struct backing_dev_info *bdi) { }
#endif /* CONFIG_CGROUP_WRITEBACK */
int bdi_init(struct backing_dev_info *bdi)
{
int err;
bdi->dev = NULL;
bdi->min_ratio = 0;
bdi->max_ratio = 100;
bdi->max_prop_frac = FPROP_FRAC_BASE;
INIT_LIST_HEAD(&bdi->bdi_list);
init_waitqueue_head(&bdi->wb_waitq);
err = wb_init(&bdi->wb, bdi, GFP_KERNEL);
if (err)
return err;
bdi->wb_congested.state = 0;
bdi->wb.congested = &bdi->wb_congested;
cgwb_bdi_init(bdi);
return 0;
}
EXPORT_SYMBOL(bdi_init);
int bdi_register(struct backing_dev_info *bdi, struct device *parent,
const char *fmt, ...)
{
@ -315,7 +779,7 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
bdi->dev = dev;
bdi_debug_register(bdi, dev_name(dev));
set_bit(BDI_registered, &bdi->state);
set_bit(WB_registered, &bdi->wb.state);
spin_lock_bh(&bdi_lock);
list_add_tail_rcu(&bdi->bdi_list, &bdi_list);
@ -333,103 +797,23 @@ int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev)
EXPORT_SYMBOL(bdi_register_dev);
/*
* Remove bdi from the global list and shutdown any threads we have running
* Remove bdi from bdi_list, and ensure that it is no longer visible
*/
static void bdi_wb_shutdown(struct backing_dev_info *bdi)
static void bdi_remove_from_list(struct backing_dev_info *bdi)
{
/* Make sure nobody queues further work */
spin_lock_bh(&bdi->wb_lock);
if (!test_and_clear_bit(BDI_registered, &bdi->state)) {
spin_unlock_bh(&bdi->wb_lock);
return;
}
spin_unlock_bh(&bdi->wb_lock);
spin_lock_bh(&bdi_lock);
list_del_rcu(&bdi->bdi_list);
spin_unlock_bh(&bdi_lock);
/*
* Make sure nobody finds us on the bdi_list anymore
*/
bdi_remove_from_list(bdi);
/*
* Drain work list and shutdown the delayed_work. At this point,
* @bdi->bdi_list is empty telling bdi_Writeback_workfn() that @bdi
* is dying and its work_list needs to be drained no matter what.
*/
mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
flush_delayed_work(&bdi->wb.dwork);
synchronize_rcu_expedited();
}
static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
{
memset(wb, 0, sizeof(*wb));
wb->bdi = bdi;
wb->last_old_flush = jiffies;
INIT_LIST_HEAD(&wb->b_dirty);
INIT_LIST_HEAD(&wb->b_io);
INIT_LIST_HEAD(&wb->b_more_io);
INIT_LIST_HEAD(&wb->b_dirty_time);
spin_lock_init(&wb->list_lock);
INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
}
/*
* Initial write bandwidth: 100 MB/s
*/
#define INIT_BW (100 << (20 - PAGE_SHIFT))
int bdi_init(struct backing_dev_info *bdi)
{
int i, err;
bdi->dev = NULL;
bdi->min_ratio = 0;
bdi->max_ratio = 100;
bdi->max_prop_frac = FPROP_FRAC_BASE;
spin_lock_init(&bdi->wb_lock);
INIT_LIST_HEAD(&bdi->bdi_list);
INIT_LIST_HEAD(&bdi->work_list);
bdi_wb_init(&bdi->wb, bdi);
for (i = 0; i < NR_BDI_STAT_ITEMS; i++) {
err = percpu_counter_init(&bdi->bdi_stat[i], 0, GFP_KERNEL);
if (err)
goto err;
}
bdi->dirty_exceeded = 0;
bdi->bw_time_stamp = jiffies;
bdi->written_stamp = 0;
bdi->balanced_dirty_ratelimit = INIT_BW;
bdi->dirty_ratelimit = INIT_BW;
bdi->write_bandwidth = INIT_BW;
bdi->avg_write_bandwidth = INIT_BW;
err = fprop_local_init_percpu(&bdi->completions, GFP_KERNEL);
if (err) {
err:
while (i--)
percpu_counter_destroy(&bdi->bdi_stat[i]);
}
return err;
}
EXPORT_SYMBOL(bdi_init);
void bdi_destroy(struct backing_dev_info *bdi)
{
int i;
bdi_wb_shutdown(bdi);
bdi_set_min_ratio(bdi, 0);
WARN_ON(!list_empty(&bdi->work_list));
WARN_ON(delayed_work_pending(&bdi->wb.dwork));
/* make sure nobody finds us on the bdi_list anymore */
bdi_remove_from_list(bdi);
wb_shutdown(&bdi->wb);
cgwb_bdi_destroy(bdi);
if (bdi->dev) {
bdi_debug_unregister(bdi);
@ -437,9 +821,7 @@ void bdi_destroy(struct backing_dev_info *bdi)
bdi->dev = NULL;
}
for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
percpu_counter_destroy(&bdi->bdi_stat[i]);
fprop_local_destroy_percpu(&bdi->completions);
wb_exit(&bdi->wb);
}
EXPORT_SYMBOL(bdi_destroy);
@ -472,31 +854,31 @@ static wait_queue_head_t congestion_wqh[2] = {
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
};
static atomic_t nr_bdi_congested[2];
static atomic_t nr_wb_congested[2];
void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
void clear_wb_congested(struct bdi_writeback_congested *congested, int sync)
{
enum bdi_state bit;
wait_queue_head_t *wqh = &congestion_wqh[sync];
enum wb_state bit;
bit = sync ? BDI_sync_congested : BDI_async_congested;
if (test_and_clear_bit(bit, &bdi->state))
atomic_dec(&nr_bdi_congested[sync]);
bit = sync ? WB_sync_congested : WB_async_congested;
if (test_and_clear_bit(bit, &congested->state))
atomic_dec(&nr_wb_congested[sync]);
smp_mb__after_atomic();
if (waitqueue_active(wqh))
wake_up(wqh);
}
EXPORT_SYMBOL(clear_bdi_congested);
EXPORT_SYMBOL(clear_wb_congested);
void set_bdi_congested(struct backing_dev_info *bdi, int sync)
void set_wb_congested(struct bdi_writeback_congested *congested, int sync)
{
enum bdi_state bit;
enum wb_state bit;
bit = sync ? BDI_sync_congested : BDI_async_congested;
if (!test_and_set_bit(bit, &bdi->state))
atomic_inc(&nr_bdi_congested[sync]);
bit = sync ? WB_sync_congested : WB_async_congested;
if (!test_and_set_bit(bit, &congested->state))
atomic_inc(&nr_wb_congested[sync]);
}
EXPORT_SYMBOL(set_bdi_congested);
EXPORT_SYMBOL(set_wb_congested);
/**
* congestion_wait - wait for a backing_dev to become uncongested
@ -555,7 +937,7 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
* encountered in the current zone, yield if necessary instead
* of sleeping on the congestion queue
*/
if (atomic_read(&nr_bdi_congested[sync]) == 0 ||
if (atomic_read(&nr_wb_congested[sync]) == 0 ||
!test_bit(ZONE_CONGESTED, &zone->flags)) {
cond_resched();

View File

@ -115,7 +115,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
case POSIX_FADV_NOREUSE:
break;
case POSIX_FADV_DONTNEED:
if (!bdi_write_congested(bdi))
if (!inode_write_congested(mapping->host))
__filemap_fdatawrite_range(mapping, offset, endbyte,
WB_SYNC_NONE);

View File

@ -100,6 +100,7 @@
* ->tree_lock (page_remove_rmap->set_page_dirty)
* bdi.wb->list_lock (page_remove_rmap->set_page_dirty)
* ->inode->i_lock (page_remove_rmap->set_page_dirty)
* ->memcg->move_lock (page_remove_rmap->mem_cgroup_begin_page_stat)
* bdi.wb->list_lock (zap_pte_range->set_page_dirty)
* ->inode->i_lock (zap_pte_range->set_page_dirty)
* ->private_lock (zap_pte_range->__set_page_dirty_buffers)
@ -174,9 +175,11 @@ static void page_cache_tree_delete(struct address_space *mapping,
/*
* Delete a page from the page cache and free it. Caller has to make
* sure the page is locked and that nobody else uses it - or that usage
* is safe. The caller must hold the mapping's tree_lock.
* is safe. The caller must hold the mapping's tree_lock and
* mem_cgroup_begin_page_stat().
*/
void __delete_from_page_cache(struct page *page, void *shadow)
void __delete_from_page_cache(struct page *page, void *shadow,
struct mem_cgroup *memcg)
{
struct address_space *mapping = page->mapping;
@ -212,7 +215,8 @@ void __delete_from_page_cache(struct page *page, void *shadow)
* anyway will be cleared before returning page into buddy allocator.
*/
if (WARN_ON_ONCE(PageDirty(page)))
account_page_cleaned(page, mapping);
account_page_cleaned(page, mapping, memcg,
inode_to_wb(mapping->host));
}
/**
@ -226,14 +230,20 @@ void __delete_from_page_cache(struct page *page, void *shadow)
void delete_from_page_cache(struct page *page)
{
struct address_space *mapping = page->mapping;
struct mem_cgroup *memcg;
unsigned long flags;
void (*freepage)(struct page *);
BUG_ON(!PageLocked(page));
freepage = mapping->a_ops->freepage;
spin_lock_irq(&mapping->tree_lock);
__delete_from_page_cache(page, NULL);
spin_unlock_irq(&mapping->tree_lock);
memcg = mem_cgroup_begin_page_stat(page);
spin_lock_irqsave(&mapping->tree_lock, flags);
__delete_from_page_cache(page, NULL, memcg);
spin_unlock_irqrestore(&mapping->tree_lock, flags);
mem_cgroup_end_page_stat(memcg);
if (freepage)
freepage(page);
@ -283,7 +293,9 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
if (!mapping_cap_writeback_dirty(mapping))
return 0;
wbc_attach_fdatawrite_inode(&wbc, mapping->host);
ret = do_writepages(mapping, &wbc);
wbc_detach_inode(&wbc);
return ret;
}
@ -472,6 +484,8 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
if (!error) {
struct address_space *mapping = old->mapping;
void (*freepage)(struct page *);
struct mem_cgroup *memcg;
unsigned long flags;
pgoff_t offset = old->index;
freepage = mapping->a_ops->freepage;
@ -480,8 +494,9 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
new->mapping = mapping;
new->index = offset;
spin_lock_irq(&mapping->tree_lock);
__delete_from_page_cache(old, NULL);
memcg = mem_cgroup_begin_page_stat(old);
spin_lock_irqsave(&mapping->tree_lock, flags);
__delete_from_page_cache(old, NULL, memcg);
error = radix_tree_insert(&mapping->page_tree, offset, new);
BUG_ON(error);
mapping->nrpages++;
@ -493,7 +508,8 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
__inc_zone_page_state(new, NR_FILE_PAGES);
if (PageSwapBacked(new))
__inc_zone_page_state(new, NR_SHMEM);
spin_unlock_irq(&mapping->tree_lock);
spin_unlock_irqrestore(&mapping->tree_lock, flags);
mem_cgroup_end_page_stat(memcg);
mem_cgroup_migrate(old, new, true);
radix_tree_preload_end();
if (freepage)

View File

@ -17,6 +17,7 @@
#include <linux/fs.h>
#include <linux/file.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/swap.h>
#include <linux/swapops.h>

View File

@ -77,6 +77,7 @@ EXPORT_SYMBOL(memory_cgrp_subsys);
#define MEM_CGROUP_RECLAIM_RETRIES 5
static struct mem_cgroup *root_mem_cgroup __read_mostly;
struct cgroup_subsys_state *mem_cgroup_root_css __read_mostly;
/* Whether the swap controller is active */
#ifdef CONFIG_MEMCG_SWAP
@ -90,6 +91,7 @@ static const char * const mem_cgroup_stat_names[] = {
"rss",
"rss_huge",
"mapped_file",
"dirty",
"writeback",
"swap",
};
@ -322,11 +324,6 @@ struct mem_cgroup {
* percpu counter.
*/
struct mem_cgroup_stat_cpu __percpu *stat;
/*
* used when a cpu is offlined or other synchronizations
* See mem_cgroup_read_stat().
*/
struct mem_cgroup_stat_cpu nocpu_base;
spinlock_t pcp_counter_lock;
#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
@ -346,6 +343,11 @@ struct mem_cgroup {
atomic_t numainfo_updating;
#endif
#ifdef CONFIG_CGROUP_WRITEBACK
struct list_head cgwb_list;
struct wb_domain cgwb_domain;
#endif
/* List of events which userspace want to receive */
struct list_head event_list;
spinlock_t event_list_lock;
@ -596,6 +598,39 @@ struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg)
return &memcg->css;
}
/**
* mem_cgroup_css_from_page - css of the memcg associated with a page
* @page: page of interest
*
* If memcg is bound to the default hierarchy, css of the memcg associated
* with @page is returned. The returned css remains associated with @page
* until it is released.
*
* If memcg is bound to a traditional hierarchy, the css of root_mem_cgroup
* is returned.
*
* XXX: The above description of behavior on the default hierarchy isn't
* strictly true yet as replace_page_cache_page() can modify the
* association before @page is released even on the default hierarchy;
* however, the current and planned usages don't mix the the two functions
* and replace_page_cache_page() will soon be updated to make the invariant
* actually true.
*/
struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page)
{
struct mem_cgroup *memcg;
rcu_read_lock();
memcg = page->mem_cgroup;
if (!memcg || !cgroup_on_dfl(memcg->css.cgroup))
memcg = root_mem_cgroup;
rcu_read_unlock();
return &memcg->css;
}
static struct mem_cgroup_per_zone *
mem_cgroup_page_zoneinfo(struct mem_cgroup *memcg, struct page *page)
{
@ -795,15 +830,8 @@ static long mem_cgroup_read_stat(struct mem_cgroup *memcg,
long val = 0;
int cpu;
get_online_cpus();
for_each_online_cpu(cpu)
for_each_possible_cpu(cpu)
val += per_cpu(memcg->stat->count[idx], cpu);
#ifdef CONFIG_HOTPLUG_CPU
spin_lock(&memcg->pcp_counter_lock);
val += memcg->nocpu_base.count[idx];
spin_unlock(&memcg->pcp_counter_lock);
#endif
put_online_cpus();
return val;
}
@ -813,15 +841,8 @@ static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
unsigned long val = 0;
int cpu;
get_online_cpus();
for_each_online_cpu(cpu)
for_each_possible_cpu(cpu)
val += per_cpu(memcg->stat->events[idx], cpu);
#ifdef CONFIG_HOTPLUG_CPU
spin_lock(&memcg->pcp_counter_lock);
val += memcg->nocpu_base.events[idx];
spin_unlock(&memcg->pcp_counter_lock);
#endif
put_online_cpus();
return val;
}
@ -2020,6 +2041,7 @@ again:
return memcg;
}
EXPORT_SYMBOL(mem_cgroup_begin_page_stat);
/**
* mem_cgroup_end_page_stat - finish a page state statistics transaction
@ -2038,6 +2060,7 @@ void mem_cgroup_end_page_stat(struct mem_cgroup *memcg)
rcu_read_unlock();
}
EXPORT_SYMBOL(mem_cgroup_end_page_stat);
/**
* mem_cgroup_update_page_stat - update page state statistics
@ -2178,37 +2201,12 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
mutex_unlock(&percpu_charge_mutex);
}
/*
* This function drains percpu counter value from DEAD cpu and
* move it to local cpu. Note that this function can be preempted.
*/
static void mem_cgroup_drain_pcp_counter(struct mem_cgroup *memcg, int cpu)
{
int i;
spin_lock(&memcg->pcp_counter_lock);
for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
long x = per_cpu(memcg->stat->count[i], cpu);
per_cpu(memcg->stat->count[i], cpu) = 0;
memcg->nocpu_base.count[i] += x;
}
for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) {
unsigned long x = per_cpu(memcg->stat->events[i], cpu);
per_cpu(memcg->stat->events[i], cpu) = 0;
memcg->nocpu_base.events[i] += x;
}
spin_unlock(&memcg->pcp_counter_lock);
}
static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
unsigned long action,
void *hcpu)
{
int cpu = (unsigned long)hcpu;
struct memcg_stock_pcp *stock;
struct mem_cgroup *iter;
if (action == CPU_ONLINE)
return NOTIFY_OK;
@ -2216,9 +2214,6 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
if (action != CPU_DEAD && action != CPU_DEAD_FROZEN)
return NOTIFY_OK;
for_each_mem_cgroup(iter)
mem_cgroup_drain_pcp_counter(iter, cpu);
stock = &per_cpu(memcg_stock, cpu);
drain_stock(stock);
return NOTIFY_OK;
@ -4004,6 +3999,98 @@ static void memcg_destroy_kmem(struct mem_cgroup *memcg)
}
#endif
#ifdef CONFIG_CGROUP_WRITEBACK
struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg)
{
return &memcg->cgwb_list;
}
static int memcg_wb_domain_init(struct mem_cgroup *memcg, gfp_t gfp)
{
return wb_domain_init(&memcg->cgwb_domain, gfp);
}
static void memcg_wb_domain_exit(struct mem_cgroup *memcg)
{
wb_domain_exit(&memcg->cgwb_domain);
}
static void memcg_wb_domain_size_changed(struct mem_cgroup *memcg)
{
wb_domain_size_changed(&memcg->cgwb_domain);
}
struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
if (!memcg->css.parent)
return NULL;
return &memcg->cgwb_domain;
}
/**
* mem_cgroup_wb_stats - retrieve writeback related stats from its memcg
* @wb: bdi_writeback in question
* @pavail: out parameter for number of available pages
* @pdirty: out parameter for number of dirty pages
* @pwriteback: out parameter for number of pages under writeback
*
* Determine the numbers of available, dirty, and writeback pages in @wb's
* memcg. Dirty and writeback are self-explanatory. Available is a bit
* more involved.
*
* A memcg's headroom is "min(max, high) - used". The available memory is
* calculated as the lowest headroom of itself and the ancestors plus the
* number of pages already being used for file pages. Note that this
* doesn't consider the actual amount of available memory in the system.
* The caller should further cap *@pavail accordingly.
*/
void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pavail,
unsigned long *pdirty, unsigned long *pwriteback)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
struct mem_cgroup *parent;
unsigned long head_room = PAGE_COUNTER_MAX;
unsigned long file_pages;
*pdirty = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_DIRTY);
/* this should eventually include NR_UNSTABLE_NFS */
*pwriteback = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
file_pages = mem_cgroup_nr_lru_pages(memcg, (1 << LRU_INACTIVE_FILE) |
(1 << LRU_ACTIVE_FILE));
while ((parent = parent_mem_cgroup(memcg))) {
unsigned long ceiling = min(memcg->memory.limit, memcg->high);
unsigned long used = page_counter_read(&memcg->memory);
head_room = min(head_room, ceiling - min(ceiling, used));
memcg = parent;
}
*pavail = file_pages + head_room;
}
#else /* CONFIG_CGROUP_WRITEBACK */
static int memcg_wb_domain_init(struct mem_cgroup *memcg, gfp_t gfp)
{
return 0;
}
static void memcg_wb_domain_exit(struct mem_cgroup *memcg)
{
}
static void memcg_wb_domain_size_changed(struct mem_cgroup *memcg)
{
}
#endif /* CONFIG_CGROUP_WRITEBACK */
/*
* DO NOT USE IN NEW FILES.
*
@ -4388,9 +4475,15 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
memcg->stat = alloc_percpu(struct mem_cgroup_stat_cpu);
if (!memcg->stat)
goto out_free;
if (memcg_wb_domain_init(memcg, GFP_KERNEL))
goto out_free_stat;
spin_lock_init(&memcg->pcp_counter_lock);
return memcg;
out_free_stat:
free_percpu(memcg->stat);
out_free:
kfree(memcg);
return NULL;
@ -4417,6 +4510,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
free_mem_cgroup_per_zone_info(memcg, node);
free_percpu(memcg->stat);
memcg_wb_domain_exit(memcg);
kfree(memcg);
}
@ -4449,6 +4543,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
/* root ? */
if (parent_css == NULL) {
root_mem_cgroup = memcg;
mem_cgroup_root_css = &memcg->css;
page_counter_init(&memcg->memory, NULL);
memcg->high = PAGE_COUNTER_MAX;
memcg->soft_limit = PAGE_COUNTER_MAX;
@ -4467,7 +4562,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
#ifdef CONFIG_MEMCG_KMEM
memcg->kmemcg_id = -1;
#endif
#ifdef CONFIG_CGROUP_WRITEBACK
INIT_LIST_HEAD(&memcg->cgwb_list);
#endif
return &memcg->css;
free_out:
@ -4555,6 +4652,8 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
vmpressure_cleanup(&memcg->vmpressure);
memcg_deactivate_kmem(memcg);
wb_memcg_offline(memcg);
}
static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
@ -4588,6 +4687,7 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
memcg->low = 0;
memcg->high = PAGE_COUNTER_MAX;
memcg->soft_limit = PAGE_COUNTER_MAX;
memcg_wb_domain_size_changed(memcg);
}
#ifdef CONFIG_MMU
@ -4757,6 +4857,7 @@ static int mem_cgroup_move_account(struct page *page,
{
unsigned long flags;
int ret;
bool anon;
VM_BUG_ON(from == to);
VM_BUG_ON_PAGE(PageLRU(page), page);
@ -4782,15 +4883,33 @@ static int mem_cgroup_move_account(struct page *page,
if (page->mem_cgroup != from)
goto out_unlock;
anon = PageAnon(page);
spin_lock_irqsave(&from->move_lock, flags);
if (!PageAnon(page) && page_mapped(page)) {
if (!anon && page_mapped(page)) {
__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
nr_pages);
__this_cpu_add(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
nr_pages);
}
/*
* move_lock grabbed above and caller set from->moving_account, so
* mem_cgroup_update_page_stat() will serialize updates to PageDirty.
* So mapping should be stable for dirty pages.
*/
if (!anon && PageDirty(page)) {
struct address_space *mapping = page_mapping(page);
if (mapping_cap_account_dirty(mapping)) {
__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_DIRTY],
nr_pages);
__this_cpu_add(to->stat->count[MEM_CGROUP_STAT_DIRTY],
nr_pages);
}
}
if (PageWriteback(page)) {
__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_WRITEBACK],
nr_pages);
@ -5306,6 +5425,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
memcg->high = high;
memcg_wb_domain_size_changed(memcg);
return nbytes;
}
@ -5338,6 +5458,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
if (err)
return err;
memcg_wb_domain_size_changed(memcg);
return nbytes;
}

File diff suppressed because it is too large Load Diff

View File

@ -541,7 +541,7 @@ page_cache_async_readahead(struct address_space *mapping,
/*
* Defer asynchronous read-ahead on IO congestion.
*/
if (bdi_read_congested(inode_to_bdi(mapping->host)))
if (inode_read_congested(mapping->host))
return;
/* do read-ahead */

View File

@ -30,6 +30,8 @@
* swap_lock (in swap_duplicate, swap_info_get)
* mmlist_lock (in mmput, drain_mmlist and others)
* mapping->private_lock (in __set_page_dirty_buffers)
* mem_cgroup_{begin,end}_page_stat (memcg->move_lock)
* mapping->tree_lock (widely used)
* inode->i_lock (in set_page_dirty's __mark_inode_dirty)
* bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
* sb_lock (within inode_lock in fs/fs-writeback.c)

View File

@ -116,9 +116,7 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
* the VM has canceled the dirty bit (eg ext3 journaling).
* Hence dirty accounting check is placed after invalidation.
*/
if (TestClearPageDirty(page))
account_page_cleaned(page, mapping);
cancel_dirty_page(page);
ClearPageMappedToDisk(page);
delete_from_page_cache(page);
return 0;
@ -512,19 +510,24 @@ EXPORT_SYMBOL(invalidate_mapping_pages);
static int
invalidate_complete_page2(struct address_space *mapping, struct page *page)
{
struct mem_cgroup *memcg;
unsigned long flags;
if (page->mapping != mapping)
return 0;
if (page_has_private(page) && !try_to_release_page(page, GFP_KERNEL))
return 0;
spin_lock_irq(&mapping->tree_lock);
memcg = mem_cgroup_begin_page_stat(page);
spin_lock_irqsave(&mapping->tree_lock, flags);
if (PageDirty(page))
goto failed;
BUG_ON(page_has_private(page));
__delete_from_page_cache(page, NULL);
spin_unlock_irq(&mapping->tree_lock);
__delete_from_page_cache(page, NULL, memcg);
spin_unlock_irqrestore(&mapping->tree_lock, flags);
mem_cgroup_end_page_stat(memcg);
if (mapping->a_ops->freepage)
mapping->a_ops->freepage(page);
@ -532,7 +535,8 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
page_cache_release(page); /* pagecache ref */
return 1;
failed:
spin_unlock_irq(&mapping->tree_lock);
spin_unlock_irqrestore(&mapping->tree_lock, flags);
mem_cgroup_end_page_stat(memcg);
return 0;
}

View File

@ -154,11 +154,42 @@ static bool global_reclaim(struct scan_control *sc)
{
return !sc->target_mem_cgroup;
}
/**
* sane_reclaim - is the usual dirty throttling mechanism operational?
* @sc: scan_control in question
*
* The normal page dirty throttling mechanism in balance_dirty_pages() is
* completely broken with the legacy memcg and direct stalling in
* shrink_page_list() is used for throttling instead, which lacks all the
* niceties such as fairness, adaptive pausing, bandwidth proportional
* allocation and configurability.
*
* This function tests whether the vmscan currently in progress can assume
* that the normal dirty throttling mechanism is operational.
*/
static bool sane_reclaim(struct scan_control *sc)
{
struct mem_cgroup *memcg = sc->target_mem_cgroup;
if (!memcg)
return true;
#ifdef CONFIG_CGROUP_WRITEBACK
if (cgroup_on_dfl(mem_cgroup_css(memcg)->cgroup))
return true;
#endif
return false;
}
#else
static bool global_reclaim(struct scan_control *sc)
{
return true;
}
static bool sane_reclaim(struct scan_control *sc)
{
return true;
}
#endif
static unsigned long zone_reclaimable_pages(struct zone *zone)
@ -452,14 +483,13 @@ static inline int is_page_cache_freeable(struct page *page)
return page_count(page) - page_has_private(page) == 2;
}
static int may_write_to_queue(struct backing_dev_info *bdi,
struct scan_control *sc)
static int may_write_to_inode(struct inode *inode, struct scan_control *sc)
{
if (current->flags & PF_SWAPWRITE)
return 1;
if (!bdi_write_congested(bdi))
if (!inode_write_congested(inode))
return 1;
if (bdi == current->backing_dev_info)
if (inode_to_bdi(inode) == current->backing_dev_info)
return 1;
return 0;
}
@ -538,7 +568,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
}
if (mapping->a_ops->writepage == NULL)
return PAGE_ACTIVATE;
if (!may_write_to_queue(inode_to_bdi(mapping->host), sc))
if (!may_write_to_inode(mapping->host, sc))
return PAGE_KEEP;
if (clear_page_dirty_for_io(page)) {
@ -579,10 +609,14 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
static int __remove_mapping(struct address_space *mapping, struct page *page,
bool reclaimed)
{
unsigned long flags;
struct mem_cgroup *memcg;
BUG_ON(!PageLocked(page));
BUG_ON(mapping != page_mapping(page));
spin_lock_irq(&mapping->tree_lock);
memcg = mem_cgroup_begin_page_stat(page);
spin_lock_irqsave(&mapping->tree_lock, flags);
/*
* The non racy check for a busy page.
*
@ -620,7 +654,8 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
swp_entry_t swap = { .val = page_private(page) };
mem_cgroup_swapout(page, swap);
__delete_from_swap_cache(page);
spin_unlock_irq(&mapping->tree_lock);
spin_unlock_irqrestore(&mapping->tree_lock, flags);
mem_cgroup_end_page_stat(memcg);
swapcache_free(swap);
} else {
void (*freepage)(struct page *);
@ -640,8 +675,9 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
if (reclaimed && page_is_file_cache(page) &&
!mapping_exiting(mapping))
shadow = workingset_eviction(mapping, page);
__delete_from_page_cache(page, shadow);
spin_unlock_irq(&mapping->tree_lock);
__delete_from_page_cache(page, shadow, memcg);
spin_unlock_irqrestore(&mapping->tree_lock, flags);
mem_cgroup_end_page_stat(memcg);
if (freepage != NULL)
freepage(page);
@ -650,7 +686,8 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
return 1;
cannot_free:
spin_unlock_irq(&mapping->tree_lock);
spin_unlock_irqrestore(&mapping->tree_lock, flags);
mem_cgroup_end_page_stat(memcg);
return 0;
}
@ -917,7 +954,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
*/
mapping = page_mapping(page);
if (((dirty || writeback) && mapping &&
bdi_write_congested(inode_to_bdi(mapping->host))) ||
inode_write_congested(mapping->host)) ||
(writeback && PageReclaim(page)))
nr_congested++;
@ -935,10 +972,10 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* note that the LRU is being scanned too quickly and the
* caller can stall after page list has been processed.
*
* 2) Global reclaim encounters a page, memcg encounters a
* page that is not marked for immediate reclaim or
* the caller does not have __GFP_IO. In this case mark
* the page for immediate reclaim and continue scanning.
* 2) Global or new memcg reclaim encounters a page that is
* not marked for immediate reclaim or the caller does not
* have __GFP_IO. In this case mark the page for immediate
* reclaim and continue scanning.
*
* __GFP_IO is checked because a loop driver thread might
* enter reclaim, and deadlock if it waits on a page for
@ -952,7 +989,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so testing
* may_enter_fs here is liable to OOM on them.
*
* 3) memcg encounters a page that is not already marked
* 3) Legacy memcg encounters a page that is not already marked
* PageReclaim. memcg does not have any dirty pages
* throttling so we could easily OOM just because too many
* pages are in writeback and there is nothing else to
@ -967,7 +1004,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
goto keep_locked;
/* Case 2 above */
} else if (global_reclaim(sc) ||
} else if (sane_reclaim(sc) ||
!PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) {
/*
* This is slightly racy - end_page_writeback()
@ -1416,7 +1453,7 @@ static int too_many_isolated(struct zone *zone, int file,
if (current_is_kswapd())
return 0;
if (!global_reclaim(sc))
if (!sane_reclaim(sc))
return 0;
if (file) {
@ -1608,10 +1645,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
set_bit(ZONE_WRITEBACK, &zone->flags);
/*
* memcg will stall in page writeback so only consider forcibly
* stalling for global reclaim
* Legacy memcg will stall in page writeback so avoid forcibly
* stalling here.
*/
if (global_reclaim(sc)) {
if (sane_reclaim(sc)) {
/*
* Tag a zone as congested if all the dirty pages scanned were
* backed by a congested BDI and wait_iff_congested will stall.