linux/block
Shaohua Li 7394e31fa4 blk-throttle: make bandwidth change smooth
When all cgroups reach their low limit, they are allowed to dispatch
more IO. Some cgroups may then dispatch more IO while others cannot, and
some cgroups may even dispatch less IO than their low limit. For
example, cg1 has a low limit of 10MB/s, cg2 has a low limit of 80MB/s,
and the disk's maximum bandwidth is 120MB/s for the workload. Their bps
could look something like this:

cg1/cg2 bps: T1: 10/80 -> T2: 60/60 -> T3: 10/80

At T1, all cgroups reach their low limit, so they can dispatch more IO
later. Then cg1 dispatches more IO and cg2 has no room to dispatch
enough IO. At T2, cg2 only dispatches 60MB/s. Since we detect that cg2
dispatches less IO than its low limit of 80MB/s, we downgrade the queue
from LIMIT_MAX to LIMIT_LOW, and all cgroups are throttled to their low
limit (T3). cg2 will have bandwidth below its low limit most of the
time.

The big problem here is that we don't know the maximum bandwidth of the
workload, so we can't make a smart decision to avoid the situation. This
patch makes cgroup bandwidth changes smooth. After the disk upgrades
from LIMIT_LOW to LIMIT_MAX, we don't allow cgroups to use all bandwidth
up to their max limit immediately. Their bandwidth limit is increased
gradually to avoid the situation above. So the example above will become
something like:

cg1/cg2 bps: 10/80 -> 15/105 -> 20/100 -> 25/95 -> 30/90 -> 35/85 -> 40/80
-> 45/75 -> 22/98

In this way cgroups' bandwidth will be above their low limit the
majority of the time. This still doesn't fully utilize disk bandwidth,
but that's the price we pay for sharing.

Scale up is linear. The limit scales up by 1/2 of the .low limit every
throtl_slice after upgrade. The scale up stops once the adjusted limit
hits the .max limit. Scale down is exponential. We cut the scale value
in half if a cgroup doesn't hit its .low limit. If the scale becomes 0,
we then fully downgrade the queue to the LIMIT_LOW state.
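
The scaling rule described above can be sketched in user-space C. This
is a minimal model under stated assumptions, not the code in
blk-throttle.c: the function names adjusted_limit() and scale_down() are
hypothetical, "scale" is taken to be the number of throtl_slice periods
counted since the upgrade, and overflow of the intermediate product is
ignored.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Linear scale up (hypothetical helper): start at the .low limit and
 * add half of it for every throtl_slice counted in `scale`, never
 * exceeding the .max limit.
 */
static uint64_t adjusted_limit(uint64_t low, uint64_t max, uint64_t scale)
{
    assert(low <= max);
    uint64_t adjusted = low + (low / 2) * scale;
    return adjusted < max ? adjusted : max;
}

/*
 * Exponential scale down (hypothetical helper): halve the scale value.
 * Returns 1 when the scale has reached 0, meaning the queue should be
 * fully downgraded to the LIMIT_LOW state, and 0 otherwise.
 */
static int scale_down(uint64_t *scale)
{
    *scale /= 2;
    return *scale == 0;
}
```

With a .low limit of 10 and a .max limit of 100, the adjusted limit in
this model grows 10, 15, 20, 25, ... per slice, and a scale of 5 needs
three halvings (5 -> 2 -> 1 -> 0) before the full downgrade happens.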

Note this doesn't completely prevent a cgroup from running under its low
limit. The best way to guarantee a cgroup doesn't run under its limit is
to set a max limit. For example, if we set cg1's max limit to 40, cg2
will never run under its low limit.
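
The cg1/cg2 numbers in this message can be checked with a small toy
model in C. It rests on assumptions that are not part of the patch: cg1
always consumes its whole adjusted limit, cg2 takes whatever the disk
has left, the disk delivers exactly 120MB/s, and the names bps_at() and
struct split are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DISK_BPS 120u /* disk maximum from the example, in MB/s */

struct split {
    uint64_t cg1;
    uint64_t cg2;
};

static uint64_t min_u64(uint64_t a, uint64_t b)
{
    return a < b ? a : b;
}

/*
 * Toy model (an assumption, not kernel behaviour): each cgroup's limit
 * is low + (low / 2) * scale capped at its max limit; cg1 uses its full
 * limit and cg2 gets the remaining disk bandwidth up to its own limit.
 */
static struct split bps_at(uint64_t scale, uint64_t cg1_max, uint64_t cg2_max)
{
    const uint64_t cg1_low = 10, cg2_low = 80;
    struct split s;

    assert(cg1_low <= cg1_max && cg2_low <= cg2_max);
    uint64_t lim1 = min_u64(cg1_low + (cg1_low / 2) * scale, cg1_max);
    uint64_t lim2 = min_u64(cg2_low + (cg2_low / 2) * scale, cg2_max);

    s.cg1 = min_u64(lim1, DISK_BPS);
    s.cg2 = min_u64(lim2, DISK_BPS - s.cg1);
    return s;
}
```

With both max limits set high (for instance 1000), scale values 0
through 7 give 10/80, 15/105, 20/100, and so on down to 45/75, where cg2
first falls under its low limit. With cg1's max limit set to 40, cg1
stops at 40 and cg2 stays at 80 or above for every scale value in this
model.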

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-28 08:02:20 -06:00
partitions partitions/efi: Fix integer overflow in GPT size calculation 2017-01-17 09:02:31 -07:00
badblocks.c badblocks: badblocks_set/clear update unacked_exist 2016-10-21 15:45:47 -06:00
bio-integrity.c block: remove bio_is_rw 2016-10-28 08:45:17 -06:00
bio.c block: make nr_iovecs unsigned in bio_alloc_bioset() 2017-03-23 08:16:11 -06:00
blk-cgroup.c sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h> 2017-03-02 08:42:32 +01:00
blk-core.c block: fix stacked driver stats init and free 2017-03-21 17:20:01 -06:00
blk-exec.c block: introduce blk_rq_is_passthrough 2017-01-31 14:00:34 -07:00
blk-flush.c block: remove outdated part of blkdev_issue_flush() comment 2017-03-24 15:41:30 -06:00
blk-integrity.c block: constify struct blk_integrity_profile 2017-03-24 20:34:39 -06:00
blk-ioc.c Merge branch 'for-linus' of git://git.kernel.dk/linux-block 2017-03-03 10:53:35 -08:00
blk-lib.c block: correct documentation for blkdev_issue_discard() flags 2017-03-24 15:41:28 -06:00
blk-map.c sched/headers: Prepare for new header dependencies before moving code to <linux/sched/task_stack.h> 2017-03-02 08:42:36 +01:00
blk-merge.c block: optionally merge discontiguous discard bios into a single request 2017-02-08 13:43:08 -07:00
blk-mq-cpumap.c blk-mq: export blk_mq_map_queues 2016-11-08 17:30:00 -05:00
blk-mq-debugfs.c blk-stat: convert to callback-based statistics reporting 2017-03-21 10:03:11 -06:00
blk-mq-pci.c blk_mq: linux/blk-mq.h does not include all the headers it depends on 2016-09-19 08:21:51 -06:00
blk-mq-sched.c blk-mq: move update of tags->rqs to __blk_mq_alloc_request() 2017-03-02 08:56:04 -07:00
blk-mq-sched.h blk-mq-sched: separate mark hctx and queue restart operations 2017-02-23 11:55:47 -07:00
blk-mq-sysfs.c blk-mq: free hctx->cpumask in release handler of hctx's kobject 2017-03-08 09:56:12 -07:00
blk-mq-tag.c blk-mq: Fix tagset reinit in the presence of cpu hot-unplug 2017-03-13 08:14:23 -06:00
blk-mq-tag.h blk-mq-sched: Allocate sched reserved tags as specified in the original queue tagset 2017-03-02 08:56:04 -07:00
blk-mq-virtio.c blk-mq: provide a default queue mapping for virtio device 2017-02-27 20:54:05 +02:00
blk-mq.c blk-mq: streamline blk_mq_make_request 2017-03-22 20:17:03 -06:00
blk-mq.h blk-stat: convert to callback-based statistics reporting 2017-03-21 10:03:11 -06:00
blk-settings.c block: optionally merge discontiguous discard bios into a single request 2017-02-08 13:43:08 -07:00
blk-softirq.c sched/headers: Prepare for new header dependencies before moving code to <linux/sched/topology.h> 2017-03-02 08:42:26 +01:00
blk-stat.c block: fix stacked driver stats init and free 2017-03-21 17:20:01 -06:00
blk-stat.h blk-stat: convert to callback-based statistics reporting 2017-03-21 10:03:11 -06:00
blk-sysfs.c blk-throttle: choose a small throtl_slice for SSD 2017-03-28 08:02:20 -06:00
blk-tag.c blk-mq-sched: add framework for MQ capable IO schedulers 2017-01-17 10:04:20 -07:00
blk-throttle.c blk-throttle: make bandwidth change smooth 2017-03-28 08:02:20 -06:00
blk-timeout.c block: remove REQ_NO_TIMEOUT flag 2015-12-22 09:38:34 -07:00
blk-wbt.c blk-stat: convert to callback-based statistics reporting 2017-03-21 10:03:11 -06:00
blk-wbt.h blk-stat: convert to callback-based statistics reporting 2017-03-21 10:03:11 -06:00
blk-zoned.c block: Rename blk_queue_zone_size and bdev_zone_size 2017-01-12 07:58:32 -07:00
blk.h blk-throttle: choose a small throtl_slice for SSD 2017-03-28 08:02:20 -06:00
bounce.c Merge branch 'for-linus' of git://git.kernel.dk/linux-block 2015-09-19 18:57:09 -07:00
bsg-lib.c block: split scsi_request out of struct request 2017-01-27 15:08:35 -07:00
bsg.c lib/vsprintf.c: remove %Z support 2017-02-27 18:43:47 -08:00
cfq-iosched.c sched/headers: Prepare for new header dependencies before moving code to <linux/sched/clock.h> 2017-03-02 08:42:27 +01:00
cmdline-parser.c block: remove unrelated header files and export symbol 2014-01-21 20:18:26 -08:00
compat_ioctl.c block: Get rid of blk_get_backing_dev_info() 2017-02-02 08:21:32 -07:00
deadline-iosched.c block: enumify ELEVATOR_*_MERGE 2017-02-08 13:43:06 -07:00
elevator.c block: don't call ioc_exit_icq() with the queue lock held for blk-mq 2017-03-02 13:59:08 -07:00
genhd.c block: Fix oops scsi_disk_get() 2017-03-22 20:11:37 -06:00
ioctl.c block: Get rid of blk_get_backing_dev_info() 2017-02-02 08:21:32 -07:00
ioprio.c sched/headers: Prepare to move the task_lock()/unlock() APIs to <linux/sched/task.h> 2017-03-02 08:42:38 +01:00
Kconfig blk-throttle: add configure option for new .low interface 2017-03-28 08:02:20 -06:00
Kconfig.iosched block: get rid of blk-mq default scheduler choice Kconfig entries 2017-02-22 13:19:45 -07:00
Makefile virtio, vhost: optimizations, fixes 2017-03-02 13:53:13 -08:00
mq-deadline.c block: enumify ELEVATOR_*_MERGE 2017-02-08 13:43:06 -07:00
noop-iosched.c block: move existing elevator ops to union 2017-01-17 10:03:33 -07:00
opal_proto.h block/sed-opal: allocate struct opal_dev dynamically 2017-02-17 12:41:47 -07:00
partition-generic.c block: Rename blk_queue_zone_size and bdev_zone_size 2017-01-12 07:58:32 -07:00
scsi_ioctl.c block: fold cmd_type into the REQ_OP_ space 2017-01-31 14:00:44 -07:00
sed-opal.c block/sed: Fix opal user range check and unused variables 2017-03-08 09:56:12 -07:00
t10-pi.c block: constify struct blk_integrity_profile 2017-03-24 20:34:39 -06:00