mirror of
https://github.com/torvalds/linux.git
synced 2024-11-15 16:41:58 +00:00
writeback, blkio: add documentation for cgroup writeback support
Update Documentation/cgroups/blkio-controller.txt to reflect the recently added cgroup writeback support. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: cgroups@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
This commit is contained in:
parent
46b15caa7c
commit
3e1534cf4a
@ -387,8 +387,81 @@ groups and put applications in that group which are not driving enough
|
||||
IO to keep disk busy. In that case set group_idle=0, and CFQ will not idle
|
||||
on individual groups and throughput should improve.
|
||||
|
||||
What works
|
||||
==========
|
||||
- Currently only sync IO queues are support. All the buffered writes are
|
||||
still system wide and not per group. Hence we will not see service
|
||||
differentiation between buffered writes between groups.
|
||||
Writeback
|
||||
=========
|
||||
|
||||
Page cache is dirtied through buffered writes and shared mmaps and
|
||||
written asynchronously to the backing filesystem by the writeback
|
||||
mechanism. Writeback sits between the memory and IO domains and
|
||||
regulates the proportion of dirty memory by balancing dirtying and
|
||||
write IOs.
|
||||
|
||||
On traditional cgroup hierarchies, relationships between different
|
||||
controllers cannot be established making it impossible for writeback
|
||||
to operate accounting for cgroup resource restrictions and all
|
||||
writeback IOs are attributed to the root cgroup.
|
||||
|
||||
If both the blkio and memory controllers are used on the v2 hierarchy
|
||||
and the filesystem supports cgroup writeback, writeback operations
|
||||
correctly follow the resource restrictions imposed by both memory and
|
||||
blkio controllers.
|
||||
|
||||
Writeback examines both system-wide and per-cgroup dirty memory status
|
||||
and enforces the more restrictive of the two. Also, writeback control
|
||||
parameters which are absolute values - vm.dirty_bytes and
|
||||
vm.dirty_background_bytes - are distributed across cgroups according
|
||||
to their current writeback bandwidth.
|
||||
|
||||
There's a peculiarity stemming from the discrepancy in ownership
|
||||
granularity between memory controller and writeback. While memory
|
||||
controller tracks ownership per page, writeback operates on inode
|
||||
basis. cgroup writeback bridges the gap by tracking ownership by
|
||||
inode but migrating ownership if too many foreign pages, pages which
|
||||
don't match the current inode ownership, have been encountered while
|
||||
writing back the inode.
|
||||
|
||||
This is a conscious design choice as writeback operations are
|
||||
inherently tied to inodes making strictly following page ownership
|
||||
complicated and inefficient. The only use case which suffers from
|
||||
this compromise is multiple cgroups concurrently dirtying disjoint
|
||||
regions of the same inode, which is an unlikely use case and decided
|
||||
to be unsupported. Note that as memory controller assigns page
|
||||
ownership on the first use and doesn't update it until the page is
|
||||
released, even if cgroup writeback strictly follows page ownership,
|
||||
multiple cgroups dirtying overlapping areas wouldn't work as expected.
|
||||
In general, write-sharing an inode across multiple cgroups is not well
|
||||
supported.
|
||||
|
||||
Filesystem support for cgroup writeback
|
||||
---------------------------------------
|
||||
|
||||
A filesystem can make writeback IOs cgroup-aware by updating
|
||||
address_space_operations->writepage[s]() to annotate bio's using the
|
||||
following two functions.
|
||||
|
||||
* wbc_init_bio(@wbc, @bio)
|
||||
|
||||
Should be called for each bio carrying writeback data and associates
|
||||
the bio with the inode's owner cgroup. Can be called anytime
|
||||
between bio allocation and submission.
|
||||
|
||||
* wbc_account_io(@wbc, @page, @bytes)
|
||||
|
||||
Should be called for each data segment being written out. While
|
||||
this function doesn't care exactly when it's called during the
|
||||
writeback session, it's the easiest and most natural to call it as
|
||||
data segments are added to a bio.
|
||||
|
||||
With writeback bio's annotated, cgroup support can be enabled per
|
||||
super_block by setting MS_CGROUPWB in ->s_flags. This allows for
|
||||
selective disabling of cgroup writeback support which is helpful when
|
||||
certain filesystem features, e.g. journaled data mode, are
|
||||
incompatible.
|
||||
|
||||
wbc_init_bio() binds the specified bio to its cgroup. Depending on
|
||||
the configuration, the bio may be executed at a lower priority and if
|
||||
the writeback session is holding shared resources, e.g. a journal
|
||||
entry, may lead to priority inversion. There is no one easy solution
|
||||
for the problem. Filesystems can try to work around specific problem
|
||||
cases by skipping wbc_init_bio() or using bio_associate_blkcg()
|
||||
directly.
|
||||
|
Loading…
Reference in New Issue
Block a user