The struct persistent_gnt flags member is meant to be a bitfield of
different flags. There is only PERSISTENT_GNT_ACTIVE flag left, so
convert it to a bool named "active".
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
In case we don't want pv block devices we should not test parameters
for sanity and eventually print out error messages. So test precluding
conditions before checking parameters.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Add a periodic cleanup function to remove old persistent grants which
are no longer in use on the backend side. This avoids starvation in
case there are lots of persistent grants for a device which no longer
is involved in I/O business.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Persistent grants are allocated until a threshold per ring is being
reached. Those grants won't be freed until the ring is being destroyed
meaning there will be resources kept busy which might no longer be
used.
Instead of freeing only persistent grants until the threshold is
reached add a timestamp and remove all persistent grants not having
been in use for a minute.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
sparse complains:
drivers/block/null_blk_main.c:816:24: sparse: context imbalance in 'null_insert_page' - unexpected unlock
Fix it by adding the necessary annotations to the function.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Addresses-Coverity-ID: 1056543 ("Missing break in switch")
Addresses-Coverity-ID: 1056544 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
kmem_cache_destroy has taken null pointer into account. So it is
safe to drop the null check before calling the function.
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
mempool_destroy has taken the null pointer into account. So it is safe
to remove the null check.
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
debugfs_remove_recursive has taken null pointer into account. So it is
safe to drop the null check before calling the function.
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Return statements in functions returning bool should use true or false
instead of an integer value.
This code was detected with the help of Coccinelle.
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This removes more casts of struct request_sense and uses the standard
struct scsi_sense_hdr instead. This also fixes any possible stale values
since the prior code did not check the sense length.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is a lot of needless struct request_sense usage in the CDROM
code. These can all be struct scsi_sense_hdr instead, to avoid any
confusion over their respective structure sizes. This patch is a lot
of noise changing "sense" to "sshdr", but the final code is more
readable to distinguish between "sense" meaning "struct request_sense"
and "sshdr" meaning "struct scsi_sense_hdr".
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Addresses-Coverity-ID: 114722 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
User controls @dev_minor which to be used as index of pkt_devs.
So, It can be exploited via Spectre-like attack. (speculative execution)
This kind of attack leaks address of pkt_devs, [1]
It leads an attacker to bypass security mechanism such as KASLR.
So sanitize @dev_minor before using it to prevent attack.
[1] https://github.com/jinb-park/linux-exploit/
tree/master/exploit-remaining-spectre-gadget/leak_pkt_devs.c
Signed-off-by: Jinbum Park <jinb.park7@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove some macros not used anywhere.
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit ca4b2a0119 ("null_blk: add zone support") breaks several
blktests scripts because it renamed the null_blk kernel module into
null_blk_mod. Hence rename null_blk_mod back into null_blk.
Fixes: ca4b2a0119 ("null_blk: add zone support")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Matias Bjorling <matias.bjorling@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add and use a new op_stat_group() function for indexing partition stat
fields rather than indexing them by rq_data_dir() or bio_data_dir().
This function works similarly to op_is_sync() in that it takes the
request::cmd_flags or bio::bi_opf flags and determines which stats
should et updated.
In addition, the second parameter to generic_start_io_acct() and
generic_end_io_acct() is now a REQ_OP rather than simply a read or
write bit and it uses op_stat_group() on the parameter to determine
the stat group.
Note that the partition in_flight counts are not part of the per-cpu
statistics and as such are not indexed via this function. It's now
indexed by op_is_write().
tj: Refreshed on top of v4.17. Updated to pass around REQ_OP.
Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Matias Bjorling <mb@lightnvm.io>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Alasdair Kergon <agk@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a part_stat_read_accum macro to genhd.h to read and sum across
field entries. For example to sum up the number read and write
sectors completed. In addition to being ar reasonable cleanup by
itself this will make it easier to add new stat fields in the future.
tj: Refreshed on top of v4.17.
Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
c11f0c0b5b ("block/mm: make bdev_ops->rw_page() take a bool for
read/write") replaced @op with boolean @is_write, which limited the
amount of information going into ->rw_page() and more importantly
page_endio(), which removed the need to expose block internals to mm.
Unfortunately, we want to track discards separately and @is_write
isn't enough information. This patch updates bdev_ops->rw_page() to
take REQ_OP instead but leaves page_endio() to take bool @is_write.
This allows the block part of operations to have enough information
while not leaking it to mm.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Remove checkpatch errors caused due to assignment operation in if
condition
Signed-off-by: RAGHU Halharvi <raghuhack78@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the existing %pad printk format to print dma_addr_t values.
This avoids the following warnings when compiling on the parisc64 platform:
drivers/block/skd_main.c: In function 'skd_preop_sg_list':
drivers/block/skd_main.c:660:4: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 6 has type 'dma_addr_t {aka unsigned int}' [-Wformat=]
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix build warnings in DAC960.c when CONFIG_PROC_FS is not enabled
by marking the unused functions as __maybe_unused.
../drivers/block/DAC960.c:6429:12: warning: 'dac960_proc_show' defined but not used [-Wunused-function]
../drivers/block/DAC960.c:6449:12: warning: 'dac960_initial_status_proc_show' defined but not used [-Wunused-function]
../drivers/block/DAC960.c:6456:12: warning: 'dac960_current_status_proc_show' defined but not used [-Wunused-function]
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Adds support for exposing a null_blk device through the zone device
interface.
The interface is managed with the parameters zoned and zone_size.
If zoned is set, the null_blk instance registers as a zoned block
device. The zone_size parameter defines how big each zone will be.
Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Split the null_blk device driver, such that it can prepare for
zoned block interface support.
Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently mbps knob could only be set once before switching power knob to
on, after power knob has been set at least once, there is no way to set
mbps knob again due to -EBUSY.
As nullb is mainly used for testing, in order to make it flexible, this
removes the flag NULLB_DEV_FL_CONFIGURED so that mbps knob can be reset
when power knob is off, e.g.
echo 0 > /config/nullb/a/power
echo 40 > /config/nullb/a/mbps
echo 1 > /config/nullb/a/power
So does other knobs under /config/nullb/a.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Warning level 2 was used in this case: -Wimplicit-fallthrough=2
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Variable n is being assigned but is never used hence it is redundant
and can be removed. Also put spacing between variables in declaration
to clean up checkpatch warnings.
Cleans up clang warning:
warning: variable 'n' set but not used [-Wunused-but-set-variable]
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pointer inode is being assigned but is never used hence it is redundant
and can be removed.
Cleans up clang warning:
warning: variable 'inode' set but not used [-Wunused-but-set-variable]
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Variable dflags is being assigned but is never used hence it is redundant
and can be removed.
Cleans up clang warning:
warning: variable 'dflags' set but not used [-Wunused-but-set-variable]
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since __must_hold() is defined in <linux/compiler_types.h>, do not
redefine it in DRBD. Compile-tested only.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This change adds LOOP_SET_BLOCK_SIZE as one of the supported ioctls
in lo_compat_ioctl. It only takes an unsigned long argument, and
in practice a 32-bit value works fine.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Evan Green <evgreen@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have
struct drbd_requests { ... struct bio *private_bio; ... }
to hold a bio clone for local submission.
On local IO completion, we put that bio, and in case we want to use the
result later, we overload that member to hold the ERR_PTR() of the
completion result,
Which, before v4.3, used to be the passed in "int error",
so we could first bio_put(), then assign.
v4.3-rc1~100^2~21 4246a0b63b block: add a bi_error field to struct bio
changed that:
bio_put(req->private_bio);
- req->private_bio = ERR_PTR(error);
+ req->private_bio = ERR_PTR(bio->bi_error);
Which introduces an access after free,
because it was non obvious that req->private_bio == bio.
Impact of that was mostly unnoticable, because we only use that value
in a multiple-failure case, and even then map any "unexpected" error
code to EIO, so worst case we could potentially mask a more specific
error with EIO in a multiple failure case.
Unless the pointed to memory region was unmapped, as is the case with
CONFIG_DEBUG_PAGEALLOC, in which case this results in
BUG: unable to handle kernel paging request
v4.13-rc1~70^2~75 4e4cbee93d block: switch bios to blk_status_t
changes it further to
bio_put(req->private_bio);
req->private_bio = ERR_PTR(blk_status_to_errno(bio->bi_status));
And blk_status_to_errno() now contains a WARN_ON_ONCE() for unexpected
values, which catches this "sometimes", if the memory has been reused
quickly enough for other things.
Should also go into stable since 4.3, with the trivial change around 4.13.
Cc: stable@vger.kernel.org
Fixes: 4246a0b63b block: add a bi_error field to struct bio
Reported-by: Sarah Newman <srn@prgmr.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAls2oVcQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpr89D/9PaJCtLpCEU1HdaohIDXDyJKBdtCllIsqJ
YAJTlUkFRkMvsbdEA3U43rVRVlu7zxP3aRg79VRI+kdZ8y8XSiisAF/QUbG/Bwd3
NuFqKj4Hm5xFTtLMKIUlumR2/exD4ZpPHKnWmD4idp1PZRoTUwxeIjqVv77L6Nfy
c4/GVlz2t9buWxH/5PSa//rsT49xM6CLpyIwO2lFsNEk2HYE/uAWya5KZVJ1YKVw
mSjUdyFtJ//lFK9lVU4YkdNF0d5w2PcEoCfNyPe0n9F43iITACo2om7rkSkTgciS
izPAs1DkqNdWPta2twULt656WlUoGlwnWyFchU34N/I/pDiww4kdCQFfNIOi6qBW
7rPQ4xpywiw/U93C2GLEeE09++T8+yO4KuELhIcN5+D3D2VGRIuaGcf4xeY0CX3A
a335EZIxBGOfxyejknBDg3BsoUcLvejbANKD8ltis3zciGf/QyLt+FFqx25WCzkN
N028ppml8WPSiqhSlFDMyfscO1dx/WSYh1q7ANrBiPPaEjFYmakzEDT0cZW4JsUh
av4jqYt3cqrFIXyhRHJM2wjFymQv6aVnmLa4zOKCFbwm9KzMWUUFjc4R962T/sQK
0g+eCSFGidii4JZ4ghOAQk2dqCTIZqV72pmLlpJBGu+SAIQxkh+29h0LOi5U3v1Y
FnTtJ1JhCg==
=0uqV
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20180629' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Small set of fixes for this series. Mostly just minor fixes, the only
oddball in here is the sg change.
The sg change came out of the stall fix for NVMe, where we added a
mempool and limited us to a single page allocation. CONFIG_SG_DEBUG
sort-of ruins that, since we'd need to account for that. That's
actually a generic problem, since lots of drivers need to allocate SG
lists. So this just removes support for CONFIG_SG_DEBUG, which I added
back in 2007 and to my knowledge it was never useful.
Anyway, outside of that, this pull contains:
- clone of request with special payload fix (Bart)
- drbd discard handling fix (Bart)
- SATA blk-mq stall fix (me)
- chunk size fix (Keith)
- double free nvme rdma fix (Sagi)"
* tag 'for-linus-20180629' of git://git.kernel.dk/linux-block:
sg: remove ->sg_magic member
drbd: Fix drbd_request_prepare() discard handling
blk-mq: don't queue more if we get a busy return
block: Fix cloning of requests with a special payload
nvme-rdma: fix possible double free of controller async event buffer
block: Fix transfer when chunk sectors exceeds max
Fix the test that verifies whether bio_op(bio) represents a discard
or write zeroes operation. Compile-tested only.
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
Fixes: 7435e9018f ("drbd: zero-out partial unaligned discards on local backend")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlsudKcQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgphYXEAC0G+fJcLgue64LwLNjXksDUsoldqNWjTDZ
e3szwseN3Woe7t+Q9DwEUO48T421WhDYe3UY50EZF3Fz5yYLi0ggTeq4qsKgpaXx
80dUY9/STxFZUV/tKrYMz5xPjM3k5x+m8FTeGlFgyR77fbOZE6/8w9XPkpmQN/B2
ayyEPMgAR9aZKhtOU4TJY45O278b2Qdfd0s6XUsxDTwBoR1mTQqKhNnY38PrIA5s
wkhtKz5YRdq5tNhKt2IwYABpJ4HJ1Hz/FE6Fo1RNx+6vfxfEhsk2/DZBn0rzzGDj
kvSaVlPeQEu9SZdJGtb2sx26elVWdt5sqMORQ0zb1kVCochZiT1IO3Z81XCy2Y3u
5WVvOlgsGni9BWr/K+iiN/dc+cKoRPy0g/ib77XUiJHgIC+urUIr3XgcIiuP9GZc
cgCfF+6+GvRKLatSsDZSbns5cpgueJLNogZUyhs7oUouRIECo0DD9NnDAlHmaZYB
957bRd/ykE6QOvtxEsG0TWc3tgyHjJFQQ9ukb45VmzYVB8mITKOGLpceJaqzju9c
AnGyQDddK4fqOct1pgFsSA5C3V6OOVXDpUiFn3DTNoQZcu5HN2hMwEA/g/u5TnGD
N2sEwcu2WCoUjzsMEWFLmX8R4w8laZUDwkNSRC5tYzQEZpVDfmAsRtRcUHJdNLl/
zPj8QMAidA==
=hCgr
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20180623' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- Further timeout fixes. We aren't quite there yet, so expect another
round of fixes for that to completely close some of the IRQ vs
completion races. (Christoph/Bart)
- Set of NVMe fixes from the usual suspects, mostly error handling
- Two off-by-one fixes (Dan)
- Another bdi race fix (Jan)
- Fix nbd reconfigure with NBD_DISCONNECT_ON_CLOSE (Doron)
* tag 'for-linus-20180623' of git://git.kernel.dk/linux-block:
blk-mq: Fix timeout handling in case the timeout handler returns BLK_EH_DONE
bdi: Fix another oops in wb_workfn()
lightnvm: Remove depends on HAS_DMA in case of platform dependency
nvme-pci: limit max IO size and segments to avoid high order allocations
nvme-pci: move nvme_kill_queues to nvme_remove_dead_ctrl
nvme-fc: release io queues to allow fast fail
nbd: Add the nbd NBD_DISCONNECT_ON_CLOSE config flag.
block: sed-opal: Fix a couple off by one bugs
blk-mq-debugfs: Off by one in blk_mq_rq_state_name()
nvmet: reset keep alive timer in controller enable
nvme-rdma: don't override opts->queue_size
nvme-rdma: Fix command completion race at error recovery
nvme-rdma: fix possible free of a non-allocated async event buffer
nvme-rdma: fix possible double free condition when failing to create a controller
Revert "block: Add warning for bi_next not NULL in bio_endio()"
block: fix timeout changes for legacy request drivers
If NBD_DISCONNECT_ON_CLOSE is set on a device, then the driver will
issue a disconnect from nbd_release if the device has no remaining
bdev->bd_openers.
Fix ret val so reconfigure with only setting the flag succeeds.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Doron Roberts-Kedes <doronrk@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_mq_complete_request can only be called for blk-mq drivers, but when
removing the BLK_EH_HANDLED return value, two legacy request timeout
methods incorrectly got switched to call blk_mq_complete_request.
Call __blk_complete_request instead to reinstance the previous behavior.
For that __blk_complete_request needs to be exported.
Fixes: 1fc2b62e ("scsi_transport_fc: complete requests from ->timeout")
Fixes: 0df0bb08 ("null_blk: complete requests from ->timeout")
Reported-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
requests are aborted, improving CephFS ENOSPC handling and making
"umount -f" actually work (Zheng and myself). The rest is mostly
mount option handling cleanups from Chengguang and assorted fixes
from Zheng, Luis and Dongsheng.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJbIkigAAoJEEp/3jgCEfOL3EUH/1s7Ib3FgFzG/SPPKISxZOGr
ndZGg0rPT9mPIQ4rp6t0z/cDlMrluPmCK3sWrAPe//sZz9iZiuip+mCL0gUFXFNr
1kL2xDKkJzGxtP3UlUvr5CC6bnxLdeBXJRBDLk/swtphuqArKndlbN/iLZnCZivT
uJDk+vZTwNJ3UhQP4QdnOQLV60NYs+q4euTqbZF3+pDiRiONbxRfXC3adFsc8zL9
zlie3CHPbrQHWMsfNvbfM3rBH1WhTwEssDm+IEFlKl19q9SKP2WPZfmBcE1pmZ58
AhIMoNGdQha1FXS6N96kaPaqFgeysPnEPoyHDqLxsUMKqsvJlOEZsK1jujza4rE=
=EfXm
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.18-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"The main piece is a set of libceph changes that revamps how OSD
requests are aborted, improving CephFS ENOSPC handling and making
"umount -f" actually work (Zheng and myself).
The rest is mostly mount option handling cleanups from Chengguang and
assorted fixes from Zheng, Luis and Dongsheng.
* tag 'ceph-for-4.18-rc1' of git://github.com/ceph/ceph-client: (31 commits)
rbd: flush rbd_dev->watch_dwork after watch is unregistered
ceph: update description of some mount options
ceph: show ino32 if the value is different with default
ceph: strengthen rsize/wsize/readdir_max_bytes validation
ceph: fix alignment of rasize
ceph: fix use-after-free in ceph_statfs()
ceph: prevent i_version from going back
ceph: fix wrong check for the case of updating link count
libceph: allocate the locator string with GFP_NOFAIL
libceph: make abort_on_full a per-osdc setting
libceph: don't abort reads in ceph_osdc_abort_on_full()
libceph: avoid a use-after-free during map check
libceph: don't warn if req->r_abort_on_full is set
libceph: use for_each_request() in ceph_osdc_abort_on_full()
libceph: defer __complete_request() to a workqueue
libceph: move more code into __complete_request()
libceph: no need to call flush_workqueue() before destruction
ceph: flush pending works before shutdown super
ceph: abort osd requests on force umount
libceph: introduce ceph_osdc_abort_requests()
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlsa4sQQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpqPNEADbZby01Q6i+dTZYIosz5+/gq8gSkCcpQ/T
krK/f2MlwD7Rdog1BnGNNP5XOqK8pKIGdARL1FQKpViii6xGIoOc2F4VK+vO44yR
LI+BeeOM6rWNOAoBO4CqeZz/Fv5IYi7KURWogYhZMqrxBqT2OeD9MMowm5NulBix
YZ2ttFWiTJScJttJCDPE6cu9EjHDeK63Nr7+UU80k3atU4eUpUp1mRFGmtaYWulq
l3KaENCwm00WCVqM4i/gVWr2AkgTZqAAyeCx7IrPsrQrCMEhxEpMnU52e2kXSxhM
Qx6FLNEOjzARuBDurtfJE74usQcW2xDLzT8fh2UStnPpt6S/JX6f9GMBVk0G7I8B
8COF4DF+bzdbhhz2SiZaTFOmDML5H1iQ8t6lTTms0Bnq29mE3E4QFom8lO+2BxN3
g6PFhvYaOkhTVtV5BPXpXs9xZBLHrv5G/JopXsZh0RF1kpiova+nfA1K2uJPFpJ0
NcHuMZKmIG3uBqY3fj5Ul+zuVhZ/1v8B69zWoSWafLrk+VRdcEAniuY2E6SsQFP5
gV4GNja85S53DnlIVwEUXPYMiY6opiwP53yMNMvkB/FdzaQB5Ehdif2fhZu64QmE
TtqbHtAuV0VZ3z4GrJ3XNbV6Np4wMOhYls4lTkZsnqNNO2sw/eoTYcmwxLDEYOQw
uQ9rhZh4IQ==
=N3BP
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20180608' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A few fixes for this merge window, where some of them should go in
sooner rather than later, hence a new pull this week. This pull
request contains:
- Set of NVMe fixes, mostly follow up cleanups/fixes to the queue
changes, but also teardown/removal and misc changes (Christop/Dan/
Johannes/Sagi/Steve).
- Two lightnvm fixes for issues that showed up in this window
(Colin/Wei).
- Failfast/driver flags inheritance for flush requests (Hannes).
- The md device put sanitization and fix (Kent).
- dm bio_set inheritance fix (me).
- nbd discard granularity fix (Josef).
- nbd consistency in command printing (Kevin).
- Loop recursion validation fix (Ted).
- Partition overlap check (Wang)"
[ .. and now my build is warning-free again thanks to the md fix - Linus ]
* tag 'for-linus-20180608' of git://git.kernel.dk/linux-block: (22 commits)
nvme: cleanup double shift issue
nvme-pci: make CMB SQ mod-param read-only
nvme-pci: unquiesce dead controller queues
nvme-pci: remove HMB teardown on reset
nvme-pci: queue creation fixes
nvme-pci: remove unnecessary completion doorbell check
nvme-pci: remove unnecessary nested locking
nvmet: filter newlines from user input
nvme-rdma: correctly check for target keyed sgl support
nvme: don't hold nvmf_transports_rwsem for more than transport lookups
nvmet: return all zeroed buffer when we can't find an active namespace
md: Unify mddev destruction paths
dm: use bioset_init_from_src() to copy bio_set
block: add bioset_init_from_src() helper
block: always set partition number to '0' in blk_partition_remap()
block: pass failfast and driver-specific flags to flush requests
nbd: set discard_alignment to the granularity
nbd: Consistently use request pointer in debug messages.
block: add verifier for cmdline partition
lightnvm: pblk: fix resource leak of invalid_bitmap
...
Pull aio iopriority support from Al Viro:
"The rest of aio stuff for this cycle - Adam's aio ioprio series"
* 'work.aio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: aio ioprio use ioprio_check_cap ret val
fs: aio ioprio add explicit block layer dependence
fs: iomap dio set bio prio from kiocb prio
fs: blkdev set bio prio from kiocb prio
fs: Add aio iopriority support
fs: Convert kiocb rw_hint from enum to u16
block: add ioprio_check_cap function
zRam as swap is useful for small memory device. However, swap means
those pages on zram are mostly cold pages due to VM's LRU algorithm.
Especially, once init data for application are touched for launching,
they tend to be not accessed any more and finally swapped out. zRAM can
store such cold pages as compressed form but it's pointless to keep in
memory. Better idea is app developers free them directly rather than
remaining them on heap.
This patch tell us last access time of each block of zram via "cat
/sys/kernel/debug/zram/zram0/block_state".
The output is as follows,
300 75.033841 .wh
301 63.806904 s..
302 63.806919 ..h
First column is zram's block index and 3rh one represents symbol (s:
same page w: written page to backing store h: huge page) of the block
state. Second column represents usec time unit of the block was last
accessed. So above example means the 300th block is accessed at
75.033851 second and it was huge so it was written to the backing store.
Admin can leverage this information to catch cold|incompressible pages
of process with *pagemap* once part of heaps are swapped out.
I used the feature a few years ago to find memory hoggers in userspace
to notify them what memory they have wasted without touch for a long
time. With it, they could reduce unnecessary memory space. However, at
that time, I hacked up zram for the feature but now I need the feature
again so I decided it would be better to upstream rather than keeping it
alone. I hope I submit the userspace tool to use the feature soon.
[akpm@linux-foundation.org: fix i386 printk warning]
[minchan@kernel.org: use ktime_get_boottime() instead of sched_clock()]
Link: http://lkml.kernel.org/r/20180420063525.GA253739@rodete-desktop-imager.corp.google.com
[akpm@linux-foundation.org: documentation tweak]
[akpm@linux-foundation.org: fix i386 printk warning]
[minchan@kernel.org: fix compile warning]
Link: http://lkml.kernel.org/r/20180508104849.GA8209@rodete-desktop-imager.corp.google.com
[rdunlap@infradead.org: fix printk formats]
Link: http://lkml.kernel.org/r/3652ccb1-96ef-0b0b-05d1-f661d7733dcc@infradead.org
Link: http://lkml.kernel.org/r/20180416090946.63057-5-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zRam as swap is useful for small memory device. However, swap means
those pages on zram are mostly cold pages due to VM's LRU algorithm.
Especially, once init data for application are touched for launching,
they tend to be not accessed any more and finally swapped out. zRAM can
store such cold pages as compressed form but it's pointless to keep in
memory. Better idea is app developers free them directly rather than
remaining them on heap.
This patch records last access time of each block of zram so that With
upcoming zram memory tracking, it could help userspace developers to
reduce memory footprint.
Link: http://lkml.kernel.org/r/20180416090946.63057-4-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mark incompressible pages so that we could investigate who is the owner
of the incompressible pages once the page is swapped out via using
upcoming zram memory tracker feature.
With it, we could prevent such pages to be swapped out by using mlock.
Otherwise we might remove them.
This patch exposes new stat for huge pages via mm_stat.
Link: http://lkml.kernel.org/r/20180416090946.63057-3-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "zram memory tracking", v5.
zRam as swap is useful for small memory device. However, swap means
those pages on zram are mostly cold pages due to VM's LRU algorithm.
Especially, once init data for application are touched for launching,
they tend to be not accessed any more and finally swapped out. zRAM can
store such cold pages as compressed form but it's pointless to keep in
memory. As well, it's pointless to store incompressible pages to zram
so better idea is app developers manages them directly like free or
mlock rather than remaining them on heap.
This patch provides a debugfs /sys/kernel/debug/zram/zram0/block_state
to represent each block's state so admin can investigate what memory is
cold|incompressible|same page with using pagemap once the pages are
swapped out.
The output is as follows:
300 75.033841 .wh
301 63.806904 s..
302 63.806919 ..h
First column is zram's block index and 3rh one represents symbol (s:
same page w: written page to backing store h: huge page) of the block
state. Second column represents usec time unit of the block was last
accessed. So above example means the 300th block is accessed at
75.033851 second and it was huge so it was written to the backing store.
This patch (of 4):
ZRAM_ACCESS is used for locking a slot of zram so correct the name. It
is also not a common flag to indicate status of the block so move the
declare position on top of the flag. Lastly, let's move the function to
the top of source code to be able to use it easily without forward
declaration.
Link: http://lkml.kernel.org/r/20180416090946.63057-2-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Technically we should be able to get away with 0 as the
discard_alignment, but there's no way currently for the protocol to
indicate different alignments, and in real life most disks have
discard_alignment == discard_granularity. Just set our alignment to our
blocksize to make sure discards will actually work properly with 4k
drives.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>