nbd_clear_sock_ioctl kills the socket and with that the block
device. Instead of just invalidating file system buffers,
mark the device as dead, which will also invalidate the buffers
as part of the proper shutdown sequence. This also includes
invalidating partitions if there are any.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Message-Id: <20230811100828.1897174-8-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
There isn't any reason to not support REQ_OP_ZONE_RESET_ALL given everything
is actually handled in userspace, not mention it is pretty easy to support
RESET_ALL.
So enable REQ_OP_ZONE_RESET_ALL and let userspace handle it.
Verified by 'tools/zbc_reset_zone -all /dev/ublkb0' in libzbc[1] with
libublk-rs based ublk-zoned target prototype[2], follows command line
for creating ublk-zoned:
cargo run --example zoned -- add -1 1024 # add $dev_id $DEV_SIZE
[1] https://github.com/westerndigitalcorporation/libzbc
[2] https://github.com/ming1/libublk-rs/tree/zoned.v2
Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Andreas Hindborg <a.hindborg@samsung.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230810124326.321472-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmTg19oQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpmLjD/wKs0hA4JOQDqZPZK1p1aBU4f0vXwQxFGlL
+gcnO4/MIB/5Ud+T+SXYuMrimLws7xsVbymcGatiRjH8LTfJVXFhuAzLILi0AcHw
nhzjOUEzHokUex+tZLZRZxmavR+9SyGJoFNIbh+mY8JOLdNVzFDSqnLWO+D02Q2R
OOBupA0mLRelYODEm2rI4xlQndwfrOAoAyEv+R7Ug0F6bFSno36QOg64pmZVI0Fl
eudORXnIRYdtUajv+kNATWoqBbq/UCuBJdk0veM07Try6ZGRXRh6dQSA+GRh93pE
Zg3JAHj4MKwlP3/wglw3SzoeECHpZrKQavIQQe9pTWKP4xGI/jdbVBcyFE0ERc66
HijMo6CLeAzpOI1nEv+QhD8ntr4polEiWL4EVLuoXE9fVI1mYzavqmqrsDHeOHeF
IJHadXZwsTG243msDvqedy0RFBwAkpnK0XdQuDtMnSa7UHwWWbxwUOwO5p4COJ3g
vmrCfPQr7TTgkOtAXoMnwOZ1troEGxa/2CdUKaTdVG8RkMeM2qy8tmBBTV9Bx6+i
rwQbB/JJm5SE6DX309TRaR6w+5YiwR6e7ECKx5hdYXia7M3OxlBBvl1NOfiWjWE3
abC38/FReHLmFKHaDaN2AM1vLy+duc4NEc/yMQ4FDcfj/hUHQCoZBPYUsvlC+a4e
Ws4qoMLU8A==
=LnzH
-----END PGP SIGNATURE-----
Merge tag 'block-6.5-2023-08-19' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
"Main thing here is the fix for the regression in flush handling which
caused IO hangs/stalls for a few reporters. Hopefully that should all
be sorted out now. Outside of that, just a few minor fixes for issues
that were introduced in this cycle"
* tag 'block-6.5-2023-08-19' of git://git.kernel.dk/linux:
blk-mq: release scheduler resource when request completes
blk-crypto: dynamically allocate fallback profile
blk-cgroup: hold queue_lock when removing blkg->q_node
drivers/rnbd: restore sysfs interface to rnbd-client
Commit 137380c0ec renamed 'rnbd-client' to 'rnbd_client', this changed
sysfs interface to /sys/devices/virtual/rnbd_client/ctl/map_device
from /sys/devices/virtual/rnbd-client/ctl/map_device.
CC: Ivan Orlov <ivan.orlov0322@gmail.com>
CC: "Md. Haris Iqbal" <haris.iqbal@ionos.com>
CC: Jack Wang <jinpu.wang@ionos.com>
Fixes: 137380c0ec ("block/rnbd: make all 'class' structures const")
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20230816022210.2501228-1-lizhijian@fujitsu.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cross-merge networking fixes after downstream PR.
Conflicts:
drivers/net/ethernet/sfc/tc.c
fa165e1949 ("sfc: don't unregister flow_indr if it was never registered")
3bf969e88a ("sfc: add MAE table machinery for conntrack table")
https://lore.kernel.org/all/20230818112159.7430e9b4@canb.auug.org.au/
No adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 137380c0ec renamed 'rnbd-client' to 'rnbd_client', this changed
sysfs interface to /sys/devices/virtual/rnbd_client/ctl/map_device
from /sys/devices/virtual/rnbd-client/ctl/map_device.
CC: Ivan Orlov <ivan.orlov0322@gmail.com>
CC: "Md. Haris Iqbal" <haris.iqbal@ionos.com>
CC: Jack Wang <jinpu.wang@ionos.com>
Fixes: 137380c0ec ("block/rnbd: make all 'class' structures const")
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20230816022210.2501228-1-lizhijian@fujitsu.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Only three families use info->userhdr today and going forward
we discourage using fixed headers in new families.
So having the pointer to user header in struct genl_info
is an overkill. Compute the header pointer at runtime.
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use memdup_user_nul() helper instead of open-coding
to simplify the code.
Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230815114815.1551171-1-ruanjinjie@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmTWfLQQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpg3nEACROhaeX6cpeDCSTqDVDW/ontbyn15eX7ep
tGPLn/TVtKv2AztIobEinS08MdywqBO/VcB7XkxQV9Ov4JqCHIAKhndWI6/HqD9P
DH3h6tE5JA8RQlNw1aHRrqWWIl1lpDQI6263um1tB2TuaxRa4xuR560jju0VZzAm
9541ceKlJT8Qc7yG0aiiCv6Bxz+b6Htv3DqCf1mY2yznl3BpN52RQHKhiA0sfnlF
WKqNsvSJ9/kz3vJbNpFucO7ch8a7W+MzmBx0vf2ickTBpL/3hbhUOrE7dGeKI9rS
cWh1HaULWqjnKY1uxF9nnapZxm8QoxkT/5T0DgmprKjwuZivfLASAhYpHBc3mT1S
eQQ0AK8hqx7sPnPeO/kxWtxM2nzRLkeVd19ClbIwux/zDbRrpHWk2/wgnSUUd3/H
HBbjbgPWbkgLvTOUKhIA5VPBcgkC1efom1+ePzkH/H4TRRuVJwg6s6utGXdgc1PX
+B4TA8GtXRH/7L0tsblFyJRmd0Y6G7gYE/yy0DYZTMie3oaWrKx3lmz48AQUtEzh
DG46VRA4wnthHRlw3mkLP7C6z4PJvK9WWBiK11eZ9VfJMF643FNpXQ3/bviR9pfF
kXdwYXoi1mlnsQ0VUhu2f+JeV4hHalrjwD/VE2H0E8Ogb4ezmJteLyiZKcw5xwaA
Hmtmbb7Qxw==
=+1Vt
-----END PGP SIGNATURE-----
Merge tag 'block-6.5-2023-08-11' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- NVMe pull request via Keith:
- Fixes for request_queue state (Ming)
- Another uuid quirk (August)
- RCU poll fix for NVMe (Ming)
- Fix for an IO stall with polled IO (me)
- Fix for blk-iocost stats enable/disable accounting (Chengming)
- Regression fix for large pages for zram (Christoph)
* tag 'block-6.5-2023-08-11' of git://git.kernel.dk/linux:
nvme: core: don't hold rcu read lock in nvme_ns_chr_uring_cmd_iopoll
blk-iocost: fix queue stats accounting
block: don't make REQ_POLLED imply REQ_NOWAIT
block: get rid of unused plug->nowait flag
zram: take device and not only bvec offset into account
nvme-pci: add NVME_QUIRK_BOGUS_NID for Samsung PM9B1 256G and 512G
nvme-rdma: fix potential unbalanced freeze & unfreeze
nvme-tcp: fix potential unbalanced freeze & unfreeze
nvme: fix possible hang when removing a controller during error recovery
The added check of 'req_op(req) == REQ_OP_ZONE_APPEND' should have been
done after the request is confirmed as valid.
Actually here, the request should always been true, so add one
WARN_ON_ONCE(!req), meantime move the zone_append check after
checking the request.
Cc: Andreas Hindborg <a.hindborg@samsung.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Fixes: 29802d7ca3 ("ublk: enable zoned storage support")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230811135216.420404-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is the module init function, which by definition is used only
locally, so mark it static to avoid a warning:
drivers/block/swim3.c:1280:5: error: no previous prototype for 'swim3_init' [-Werror=missing-prototypes]
Reviewed-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are two warnings reported by smatch:
drivers/block/ublk_drv.c:445 ublk_setup_iod_zoned() warn:
signedness bug returning '(-95)'
drivers/block/ublk_drv.c:963 ublk_setup_iod() warn:
signedness bug returning '(-5)'
The type of "blk_status_t" is either be a u32 or u8, but this two
functions return a negative value when not supported or failed. Use
the error code of the blk module to fix these warnings.
Fixes: 29802d7ca3 ("ublk: enable zoned storage support")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202308100201.TCRhgdvN-lkp@intel.com/
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230810084836.3535322-1-lizetao1@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add zoned storage support to ublk: report_zones and operations:
- REQ_OP_ZONE_OPEN
- REQ_OP_ZONE_CLOSE
- REQ_OP_ZONE_FINISH
- REQ_OP_ZONE_RESET
- REQ_OP_ZONE_APPEND
The zone append feature uses the `addr` field of `struct ublksrv_io_cmd` to
communicate ALBA back to the kernel. Therefore ublk must be used with the
user copy feature (UBLK_F_USER_COPY) for zoned storage support to be
available. Without this feature, ublk will not allow zoned storage support.
Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230804114610.179530-4-nmi@metaspace.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for zoned storage support, move the check for empty `addr`
field into the command handler case statement. Note that the check makes no
sense for `UBLK_IO_NEED_GET_DATA` because the `addr` field must always be
set for this command.
Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230804114610.179530-3-nmi@metaspace.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This will be used by ublk zoned storage support.
Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230804114610.179530-2-nmi@metaspace.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit af8b04c637 ("zram: simplify bvec iteration in
__zram_make_request") changed the bio iteration in zram to rely on the
implicit capping to page boundaries in bio_for_each_segment. But it
failed to care for the fact zram not only care about the page alignment
of the bio payload, but also the page alignment into the device. For
buffered I/O and swap those are the same, but for direct I/O or kernel
internal I/O like XFS log buffer writes they can differ.
Fix this by open coding bio_for_each_segment and limiting the bvec len
so that it never crosses over a page alignment boundary in the device
in addition to the payload boundary already taken care of by
bio_iter_iovec.
Cc: stable@vger.kernel.org
Fixes: af8b04c637 ("zram: simplify bvec iteration in __zram_make_request")
Reported-by: Dusty Mabe <dusty@dustymabe.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Link: https://lore.kernel.org/r/20230805055537.147835-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Due to rbd_try_acquire_lock() effectively swallowing all but
EBLOCKLISTED error from rbd_try_lock() ("request lock anyway") and
rbd_request_lock() returning ETIMEDOUT error not only for an actual
notify timeout but also when the lock owner doesn't respond, a busy
loop inside of rbd_acquire_lock() between rbd_try_acquire_lock() and
rbd_request_lock() is possible.
Requesting the lock on EBUSY error (returned by get_lock_owner_info()
if an incompatible lock or invalid lock owner is detected) makes very
little sense. The same goes for ETIMEDOUT error (might pop up pretty
much anywhere if osd_request_timeout option is set) and many others.
Just fail I/O requests on rbd_dev->acquiring_list immediately on any
error from rbd_try_lock().
Cc: stable@vger.kernel.org # 588159009d: rbd: retrieve and check lock owner twice before blocklisting
Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
blocklisting (fencing) with a couple of prerequisites and a fixup to
prevent metrics from being sent to the MDS even just once after that
has been disabled by the user. All marked for stable.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmTD7MgTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi3SBB/4nHdfSQwy0z2+PM766sUKxSlmaRw8X
4AJyGAIGj5BnHHhtluwLpEYfrh3wfyCRaNYgS64jdsudbUBxPKIIWn2lEFxtyWbC
w0R2uEc+NGJLJOYfJ+lBP06Q2r6qk7N6OGNy6qLaN+v6xJ8WPw7H3fJBLVhnPgMq
7lkACRN+0P5Xt6ZJ57kbWWiFQ+vjv7bbDa0P9zMl6uCgoYsIpvrskqygx+gHbdsq
IcnpsHu3F0ycYAJT5eJ5GcCcThvwbNjWdbJy1fERah7U/LNcX/S3To9V5LPmydOQ
tYAWMlC/1a99fr+jTYF0Pu5GLUdK0UMKeX04ZN3SKON4pORurpypw3n8
=rw72
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-6.5-rc4' of https://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"A patch to reduce the potential for erroneous RBD exclusive lock
blocklisting (fencing) with a couple of prerequisites and a fixup to
prevent metrics from being sent to the MDS even just once after that
has been disabled by the user. All marked for stable"
* tag 'ceph-for-6.5-rc4' of https://github.com/ceph/ceph-client:
rbd: retrieve and check lock owner twice before blocklisting
rbd: harden get_lock_owner_info() a bit
rbd: make get_lock_owner_info() return a single locker or NULL
ceph: never send metrics if disable_send_metrics is set
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmTD2VcQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpqzgEADT9lc0fr8wnp2fu/7NhkdsaPtuoUKiH7ju
qxWExXIzxxKYM6OdCdYho9SvIjtdblraZOGqrEni8KMcKnR6vDd8BjGzPJ2BJQe2
GhU0VLVqAw2uoaLhWIZ3m+CyKn7rggoCwpTAUFQG7vL6vSW+ExkWs6g4XcrWhq5V
XCPJnNRc2o6OdFnv1xBNy1YRrP2ME+HaghJdDNxVM4R9hOv4PVvYrW7P2fK1gurD
o3JgTAQg6//16ZBAMcZDkwuGGFVdvG16EiE/0z3QdP4WeDqSFqqxYtk8m/LqAx31
FswUI+YDmKJHSfbLCh6pW4j81xH+svVi0LiANdlxCMGg3zouft+wh/KGdQJC/rcb
ocFgUCOL4P4208wXkuUjtcAI0zmEMPAIoRQirOzDExja0bI4LJrp3GKJ78U4MbDo
iBM5qgkH+MqwD/jXMOY5MnEEBzqceW7Q5Al6emE093kRenIYulEL6ZFy05vfqG1+
NuQcvSJ9f0mgTDUGr0bXytshTG4vSHtKZ9UfAgh5FzPBnPYpwRbxlq6i/7+DxU6k
33FTzNLuRwDSEo33Uj75KwK1glizofYj6lrqYxVw/9jkVAEEzpjr6H8Su1+dTt8L
ZbXSQKdxsAxGdbIMeA1lUfUmqQHi1ouTLsBzwkceKCeN3UXmzGOnNBXmvoVXSv0A
v4RNV+WHCA==
=TD54
-----END PGP SIGNATURE-----
Merge tag 'block-6.5-2023-07-28' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
"A few fixes that should go into the current kernel release, mainly:
- Set of fixes for dasd (Stefan)
- Handle interruptible waits returning because of a signal for ublk
(Ming)"
* tag 'block-6.5-2023-07-28' of git://git.kernel.dk/linux:
ublk: return -EINTR if breaking from waiting for existed users in DEL_DEV
ublk: fail to recover device if queue setup is interrupted
ublk: fail to start device if queue setup is interrupted
block: Fix a source code comment in include/uapi/linux/blkzoned.h
s390/dasd: print copy pair message only for the correct error
s390/dasd: fix hanging device after request requeue
s390/dasd: use correct number of retries for ERP requests
s390/dasd: fix hanging device after quiesce/resume
If user interrupts wait_event_interruptible() in ublk_ctrl_del_dev(),
return -EINTR and let user know what happens.
Fixes: 0abe39dec0 ("block: ublk: improve handling device deletion")
Reported-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/r/20230726144502.566785-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In ublk_ctrl_end_recovery(), if wait_for_completion_interruptible() is
interrupted by signal, queues aren't setup successfully yet, so we
have to fail UBLK_CMD_END_USER_RECOVERY, otherwise kernel oops can be
triggered.
Fixes: c732a852b4 ("ublk_drv: add START_USER_RECOVERY and END_USER_RECOVERY support")
Reported-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/r/20230726144502.566785-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In ublk_ctrl_start_dev(), if wait_for_completion_interruptible() is
interrupted by signal, queues aren't setup successfully yet, so we
have to fail UBLK_CMD_START_DEV, otherwise kernel oops can be triggered.
Reported by German when working on qemu-storage-deamon which requires
single thread ublk daemon.
Fixes: 71f28f3136 ("ublk_drv: add io_uring based userspace block driver")
Reported-by: German Maglione <gmaglione@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230726144502.566785-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
An attempt to acquire exclusive lock can race with the current lock
owner closing the image:
1. lock is held by client123, rbd_lock() returns -EBUSY
2. get_lock_owner_info() returns client123 instance details
3. client123 closes the image, lock is released
4. find_watcher() returns 0 as there is no matching watcher anymore
5. client123 instance gets erroneously blocklisted
Particularly impacted is mirror snapshot scheduler in snapshot-based
mirroring since it happens to open and close images a lot (images are
opened only for as long as it takes to take the next mirror snapshot,
the same client instance is used for all images).
To reduce the potential for erroneous blocklisting, retrieve the lock
owner again after find_watcher() returns 0. If it's still there, make
sure it matches the previously detected lock owner.
Cc: stable@vger.kernel.org # f38cb9d9c2: rbd: make get_lock_owner_info() return a single locker or NULL
Cc: stable@vger.kernel.org # 8ff2c64c97: rbd: harden get_lock_owner_info() a bit
Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
- we want the exclusive lock type, so test for it directly
- use sscanf() to actually parse the lock cookie and avoid admitting
invalid handles
- bail if locker has a blank address
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
Make the "num_lockers can be only 0 or 1" assumption explicit and
simplify the API by getting rid of output parameters in preparation
for calling get_lock_owner_info() twice before blocklisting.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmS629wQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpv/7D/99ysE5ZszmjxNOmyy1lGfqtQnaTLuToRsl
wB16umIPAFfye5r4TV8l9GZuUyI7FU8LySglu0Y0qMKmCp+kJKLh90kB281Co4Dn
yp1AbqlTorAlG4ElQJBRaQr4kaqqvI2tzeVmFdUhIE1oX2e9OX/O+YKa8k1JfsKI
oecChQgodlPxX3wusItgiyvZKl2q2+mivg5E6cqiGIgP3uF8fmOQCbio4Vm8ZSxb
TO8JEfBTiXslR+CvJD3Gi96pzexN1qCUed8/7FDiIUufhETmwqSIOo89GxzGAQ6O
7o/83IkqgXPHjKLYs3R4/jhHPXZmXmvDZHWIiSg+KLOFqxxWmRPNJ6V6igIBP8SG
eu5PTA7SDGtvIXePpu38FTPmSiUW7MbGhnjqY8u64Je6MaQ8l28KN7xkFtmxV+n4
hgB0gr6uKBnXMKZHobk0yJeUUI/L/0ESzbVPDHY8JM/rQCsp1eSNQDpZoVjPWZmg
lMGYmOq57oPA20LVch7U3gUFhD4CJ7c3e2/EzJdJVjsTveTYieBCEESQErFbMcEr
VuRZSAGnPyXQ4yF4wG93x4sDye28ZFS/Q9c6Q3DCUxctDkCz4eY1+vmdX+NJXwDA
aYXCyyKzk18udbKvV0QvTuDTb6PrJDPxbFagCveibPTtP4XDMv1LvpdZPUPJ/HGX
4xA1mrsGJA==
=e2OR
-----END PGP SIGNATURE-----
Merge tag 'block-6.5-2023-07-21' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- Fix for loop regressions (Mauricio)
- Fix a potential stall with batched wakeups in sbitmap (David)
- Fix for stall with recursive plug flushes (Ross)
- Skip accounting of empty requests for blk-iocost (Chengming)
- Remove a dead field in struct blk_mq_hw_ctx (Chengming)
* tag 'block-6.5-2023-07-21' of git://git.kernel.dk/linux:
loop: do not enforce max_loop hard limit by (new) default
loop: deprecate autoloading callback loop_probe()
sbitmap: fix batching wakeup
blk-iocost: skip empty flush bio in iocost
blk-mq: delete dead struct blk_mq_hw_ctx->queued field
blk-mq: Fix stall due to recursive flush plug
Problem:
The max_loop parameter is used for 2 different purposes:
1) initial number of loop devices to pre-create on init
2) maximum number of loop devices to add on access/open()
Historically, its default value (zero) caused 1) to create non-zero
number of devices (CONFIG_BLK_DEV_LOOP_MIN_COUNT), and no hard limit on
2) to add devices with autoloading.
However, the default value changed in commit 85c5019771 ("loop: Fix
the max_loop commandline argument treatment when it is set to 0") to
CONFIG_BLK_DEV_LOOP_MIN_COUNT, for max_loop=0 not to pre-create devices.
That does improve 1), but unfortunately it breaks 2), as the default
behavior changed from no-limit to hard-limit.
Example:
For example, this userspace code broke for N >= CONFIG, if the user
relied on the default value 0 for max_loop:
mknod("/dev/loopN");
open("/dev/loopN"); // now fails with ENXIO
Though affected users may "fix" it with (loop.)max_loop=0, this means to
require a kernel parameter change on stable kernel update (that commit
Fixes: an old commit in stable).
Solution:
The original semantics for the default value in 2) can be applied if the
parameter is not set (ie, default behavior).
This still keeps the intended function in 1) and 2) if set, and that
commit's intended improvement in 1) if max_loop=0.
Before 85c5019771:
- default: 1) CONFIG devices 2) no limit
- max_loop=0: 1) CONFIG devices 2) no limit
- max_loop=X: 1) X devices 2) X limit
After 85c5019771:
- default: 1) CONFIG devices 2) CONFIG limit (*)
- max_loop=0: 1) 0 devices (*) 2) no limit
- max_loop=X: 1) X devices 2) X limit
This commit:
- default: 1) CONFIG devices 2) no limit (*)
- max_loop=0: 1) 0 devices 2) no limit
- max_loop=X: 1) X devices 2) X limit
Future:
The issue/regression from that commit only affects code under the
CONFIG_BLOCK_LEGACY_AUTOLOAD deprecation guard, thus the fix too is
contained under it.
Once that deprecated functionality/code is removed, the purpose 2) of
max_loop (hard limit) is no longer in use, so the module parameter
description can be changed then.
Tests:
Linux 6.4-rc7
CONFIG_BLK_DEV_LOOP_MIN_COUNT=8
CONFIG_BLOCK_LEGACY_AUTOLOAD=y
- default (original)
# ls -1 /dev/loop*
/dev/loop-control
/dev/loop0
...
/dev/loop7
# ./test-loop
open: /dev/loop8: No such device or address
- default (patched)
# ls -1 /dev/loop*
/dev/loop-control
/dev/loop0
...
/dev/loop7
# ./test-loop
#
- max_loop=0 (original & patched):
# ls -1 /dev/loop*
/dev/loop-control
# ./test-loop
#
- max_loop=8 (original & patched):
# ls -1 /dev/loop*
/dev/loop-control
/dev/loop0
...
/dev/loop7
# ./test-loop
open: /dev/loop8: No such device or address
- max_loop=0 (patched; CONFIG_BLOCK_LEGACY_AUTOLOAD is not set)
# ls -1 /dev/loop*
/dev/loop-control
# ./test-loop
open: /dev/loop8: No such device or address
Fixes: 85c5019771 ("loop: Fix the max_loop commandline argument treatment when it is set to 0")
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230720143033.841001-3-mfo@canonical.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The 'probe' callback in __register_blkdev() is only used under the
CONFIG_BLOCK_LEGACY_AUTOLOAD deprecation guard.
The loop_probe() function is only used for that callback, so guard it
too, accordingly.
See commit fbdee71bb5 ("block: deprecate autoloading based on dev_t").
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230720143033.841001-2-mfo@canonical.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a module alias to nbd.ko that allows the generic netlink core to
automatically load the module when netlink messages for nbd are
received.
This frees the user from manually having to load the module before using
nbd functionality via netlink.
If the system policy allows it this can even be used to load the nbd
module from containers which would otherwise not have access to the
necessary module files to do a normal "modprobe nbd".
For example this avoids the following error when using nbd-client:
$ nbd-client localhost 10809 /dev/nbd0
...
Error: Couldn't resolve the nbd netlink family, make sure the nbd module is loaded and your nbd driver supports the netlink interface.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Josef Bacik <josef@toxicpadna.com>
Link: https://lore.kernel.org/r/20230713-b4-nbd-genl-v3-1-226cbddba04b@weissschuh.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In virtblk_probe_zoned_device(), execute blk_queue_chunk_sectors() and
blk_queue_max_zone_append_sectors() to respectively set the zoned device
zone size and maximum zone append sector limit before executing
blk_revalidate_disk_zones(). This is to allow the block layer zone
reavlidation to check these device characteristics prior to checking all
zones of the device.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230703024812.76778-5-dlemoal@kernel.org
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In null_register_zoned_dev(), execute blk_queue_chunk_sectors() and
blk_queue_max_zone_append_sectors() to respectively set the zoned device
zone size and maximum zone append sector limit before executing
blk_revalidate_disk_zones(). This is to allow the block layer zone
reavlidation to check these device characteristics prior to checking all
zones of the device.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230703024812.76778-4-dlemoal@kernel.org
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Core
----
- Rework the sendpage & splice implementations. Instead of feeding
data into sockets page by page extend sendmsg handlers to support
taking a reference on the data, controlled by a new flag called
MSG_SPLICE_PAGES. Rework the handling of unexpected-end-of-file
to invoke an additional callback instead of trying to predict what
the right combination of MORE/NOTLAST flags is.
Remove the MSG_SENDPAGE_NOTLAST flag completely.
- Implement SCM_PIDFD, a new type of CMSG type analogous to
SCM_CREDENTIALS, but it contains pidfd instead of plain pid.
- Enable socket busy polling with CONFIG_RT.
- Improve reliability and efficiency of reporting for ref_tracker.
- Auto-generate a user space C library for various Netlink families.
Protocols
---------
- Allow TCP to shrink the advertised window when necessary, prevent
sk_rcvbuf auto-tuning from growing the window all the way up to
tcp_rmem[2].
- Use per-VMA locking for "page-flipping" TCP receive zerocopy.
- Prepare TCP for device-to-device data transfers, by making sure
that payloads are always attached to skbs as page frags.
- Make the backoff time for the first N TCP SYN retransmissions
linear. Exponential backoff is unnecessarily conservative.
- Create a new MPTCP getsockopt to retrieve all info (MPTCP_FULL_INFO).
- Avoid waking up applications using TLS sockets until we have
a full record.
- Allow using kernel memory for protocol ioctl callbacks, paving
the way to issuing ioctls over io_uring.
- Add nolocalbypass option to VxLAN, forcing packets to be fully
encapsulated even if they are destined for a local IP address.
- Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure
in-kernel ECMP implementation (e.g. Open vSwitch) select the same
link for all packets. Support L4 symmetric hashing in Open vSwitch.
- PPPoE: make number of hash bits configurable.
- Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client
(ipconfig).
- Add layer 2 miss indication and filtering, allowing higher layers
(e.g. ACL filters) to make forwarding decisions based on whether
packet matched forwarding state in lower devices (bridge).
- Support matching on Connectivity Fault Management (CFM) packets.
- Hide the "link becomes ready" IPv6 messages by demoting their
printk level to debug.
- HSR: don't enable promiscuous mode if device offloads the proto.
- Support active scanning in IEEE 802.15.4.
- Continue work on Multi-Link Operation for WiFi 7.
BPF
---
- Add precision propagation for subprogs and callbacks. This allows
maintaining verification efficiency when subprograms are used,
or in fact passing the verifier at all for complex programs,
especially those using open-coded iterators.
- Improve BPF's {g,s}setsockopt() length handling. Previously BPF
assumed the length is always equal to the amount of written data.
But some protos allow passing a NULL buffer to discover what
the output buffer *should* be, without writing anything.
- Accept dynptr memory as memory arguments passed to helpers.
- Add routing table ID to bpf_fib_lookup BPF helper.
- Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands.
- Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark
maps as read-only).
- Show target_{obj,btf}_id in tracing link fdinfo.
- Addition of several new kfuncs (most of the names are self-explanatory):
- Add a set of new dynptr kfuncs: bpf_dynptr_adjust(),
bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size()
and bpf_dynptr_clone().
- bpf_task_under_cgroup()
- bpf_sock_destroy() - force closing sockets
- bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs
Netfilter
---------
- Relax set/map validation checks in nf_tables. Allow checking
presence of an entry in a map without using the value.
- Increase ip_vs_conn_tab_bits range for 64BIT builds.
- Allow updating size of a set.
- Improve NAT tuple selection when connection is closing.
Driver API
----------
- Integrate netdev with LED subsystem, to allow configuring HW
"offloaded" blinking of LEDs based on link state and activity
(i.e. packets coming in and out).
- Support configuring rate selection pins of SFP modules.
- Factor Clause 73 auto-negotiation code out of the drivers, provide
common helper routines.
- Add more fool-proof helpers for managing lifetime of MDIO devices
associated with the PCS layer.
- Allow drivers to report advanced statistics related to Time Aware
scheduler offload (taprio).
- Allow opting out of VF statistics in link dump, to allow more VFs
to fit into the message.
- Split devlink instance and devlink port operations.
New hardware / drivers
----------------------
- Ethernet:
- Synopsys EMAC4 IP support (stmmac)
- Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches
- Marvell 88E6250 7 port switches
- Microchip LAN8650/1 Rev.B0 PHYs
- MediaTek MT7981/MT7988 built-in 1GE PHY driver
- WiFi:
- Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps
- Realtek RTL8723DS (SDIO variant)
- Realtek RTL8851BE
- CAN:
- Fintek F81604
Drivers
-------
- Ethernet NICs:
- Intel (100G, ice):
- support dynamic interrupt allocation
- use meta data match instead of VF MAC addr on slow-path
- nVidia/Mellanox:
- extend link aggregation to handle 4, rather than just 2 ports
- spawn sub-functions without any features by default
- OcteonTX2:
- support HTB (Tx scheduling/QoS) offload
- make RSS hash generation configurable
- support selecting Rx queue using TC filters
- Wangxun (ngbe/txgbe):
- add basic Tx/Rx packet offloads
- add phylink support (SFP/PCS control)
- Freescale/NXP (enetc):
- report TAPRIO packet statistics
- Solarflare/AMD:
- support matching on IP ToS and UDP source port of outer header
- VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6
- add devlink dev info support for EF10
- Virtual NICs:
- Microsoft vNIC:
- size the Rx indirection table based on requested configuration
- support VLAN tagging
- Amazon vNIC:
- try to reuse Rx buffers if not fully consumed, useful for ARM
servers running with 16kB pages
- Google vNIC:
- support TCP segmentation of >64kB frames
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- enable USXGMII (88E6191X)
- Microchip:
- lan966x: add support for Egress Stage 0 ACL engine
- lan966x: support mapping packet priority to internal switch
priority (based on PCP or DSCP)
- Ethernet PHYs:
- Broadcom PHYs:
- support for Wake-on-LAN for BCM54210E/B50212E
- report LPI counter
- Microsemi PHYs: support RGMII delay configuration (VSC85xx)
- Micrel PHYs: receive timestamp in the frame (LAN8841)
- Realtek PHYs: support optional external PHY clock
- Altera TSE PCS: merge the driver into Lynx PCS which it is
a variant of
- CAN: Kvaser PCIEcan:
- support packet timestamping
- WiFi:
- Intel (iwlwifi):
- major update for new firmware and Multi-Link Operation (MLO)
- configuration rework to drop test devices and split
the different families
- support for segmented PNVM images and power tables
- new vendor entries for PPAG (platform antenna gain) feature
- Qualcomm 802.11ax (ath11k):
- Multiple Basic Service Set Identifier (MBSSID) and
Enhanced MBSSID Advertisement (EMA) support in AP mode
- support factory test mode
- RealTek (rtw89):
- add RSSI based antenna diversity
- support U-NII-4 channels on 5 GHz band
- RealTek (rtl8xxxu):
- AP mode support for 8188f
- support USB RX aggregation for the newer chips
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmSbJM4ACgkQMUZtbf5S
IrtoDhAAhEim1+LBIKf4lhPcVdZ2p/TkpnwTz5jsTwSeRBAxTwuNJ2fQhFXg13E3
MnRq6QaEp8G4/tA/gynLvQop+FEZEnv+horP0zf/XLcC8euU7UrKdrpt/4xxdP07
IL/fFWsoUGNO+L9LNaHwBo8g7nHvOkPscHEBHc2Xrvzab56TJk6vPySfLqcpKlNZ
CHWDwTpgRqNZzSKiSpoMVd9OVMKUXcPYHpDmfEJ5l+e8vTXmZzOLHrSELHU5nP5f
mHV7gxkDCTshoGcaed7UTiOvgu1p6E5EchDJxiLaSUbgsd8SZ3u4oXwRxgj33RK/
fB2+UaLrRt/DdlHvT/Ph8e8Ygu77yIXMjT49jsfur/zVA0HEA2dFb7V6QlsYRmQp
J25pnrdXmE15llgqsC0/UOW5J1laTjII+T2T70UOAqQl4LWYAQDG4WwsAqTzU0KY
dueydDouTp9XC2WYrRUEQxJUzxaOaazskDUHc5c8oHp/zVBT+djdgtvVR9+gi6+7
yy4elI77FlEEqL0ItdU/lSWINayAlPLsIHkMyhSGKX0XDpKjeycPqkNx4UterXB/
JKIR5RBWllRft+igIngIkKX0tJGMU0whngiw7d1WLw25wgu4sB53hiWWoSba14hv
tXMxwZs5iGaPcT38oRVMZz8I1kJM4Dz3SyI7twVvi4RUut64EG4=
=9i4I
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking changes from Jakub Kicinski:
"WiFi 7 and sendpage changes are the biggest pieces of work for this
release. The latter will definitely require fixes but I think that we
got it to a reasonable point.
Core:
- Rework the sendpage & splice implementations
Instead of feeding data into sockets page by page extend sendmsg
handlers to support taking a reference on the data, controlled by a
new flag called MSG_SPLICE_PAGES
Rework the handling of unexpected-end-of-file to invoke an
additional callback instead of trying to predict what the right
combination of MORE/NOTLAST flags is
Remove the MSG_SENDPAGE_NOTLAST flag completely
- Implement SCM_PIDFD, a new type of CMSG type analogous to
SCM_CREDENTIALS, but it contains pidfd instead of plain pid
- Enable socket busy polling with CONFIG_RT
- Improve reliability and efficiency of reporting for ref_tracker
- Auto-generate a user space C library for various Netlink families
Protocols:
- Allow TCP to shrink the advertised window when necessary, prevent
sk_rcvbuf auto-tuning from growing the window all the way up to
tcp_rmem[2]
- Use per-VMA locking for "page-flipping" TCP receive zerocopy
- Prepare TCP for device-to-device data transfers, by making sure
that payloads are always attached to skbs as page frags
- Make the backoff time for the first N TCP SYN retransmissions
linear. Exponential backoff is unnecessarily conservative
- Create a new MPTCP getsockopt to retrieve all info
(MPTCP_FULL_INFO)
- Avoid waking up applications using TLS sockets until we have a full
record
- Allow using kernel memory for protocol ioctl callbacks, paving the
way to issuing ioctls over io_uring
- Add nolocalbypass option to VxLAN, forcing packets to be fully
encapsulated even if they are destined for a local IP address
- Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure
in-kernel ECMP implementation (e.g. Open vSwitch) select the same
link for all packets. Support L4 symmetric hashing in Open vSwitch
- PPPoE: make number of hash bits configurable
- Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client
(ipconfig)
- Add layer 2 miss indication and filtering, allowing higher layers
(e.g. ACL filters) to make forwarding decisions based on whether
packet matched forwarding state in lower devices (bridge)
- Support matching on Connectivity Fault Management (CFM) packets
- Hide the "link becomes ready" IPv6 messages by demoting their
printk level to debug
- HSR: don't enable promiscuous mode if device offloads the proto
- Support active scanning in IEEE 802.15.4
- Continue work on Multi-Link Operation for WiFi 7
BPF:
- Add precision propagation for subprogs and callbacks. This allows
maintaining verification efficiency when subprograms are used, or
in fact passing the verifier at all for complex programs,
especially those using open-coded iterators
- Improve BPF's {g,s}setsockopt() length handling. Previously BPF
assumed the length is always equal to the amount of written data.
But some protos allow passing a NULL buffer to discover what the
output buffer *should* be, without writing anything
- Accept dynptr memory as memory arguments passed to helpers
- Add routing table ID to bpf_fib_lookup BPF helper
- Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
- Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark
maps as read-only)
- Show target_{obj,btf}_id in tracing link fdinfo
- Addition of several new kfuncs (most of the names are
self-explanatory):
- Add a set of new dynptr kfuncs: bpf_dynptr_adjust(),
bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size()
and bpf_dynptr_clone().
- bpf_task_under_cgroup()
- bpf_sock_destroy() - force closing sockets
- bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs
Netfilter:
- Relax set/map validation checks in nf_tables. Allow checking
presence of an entry in a map without using the value
- Increase ip_vs_conn_tab_bits range for 64BIT builds
- Allow updating size of a set
- Improve NAT tuple selection when connection is closing
Driver API:
- Integrate netdev with LED subsystem, to allow configuring HW
"offloaded" blinking of LEDs based on link state and activity
(i.e. packets coming in and out)
- Support configuring rate selection pins of SFP modules
- Factor Clause 73 auto-negotiation code out of the drivers, provide
common helper routines
- Add more fool-proof helpers for managing lifetime of MDIO devices
associated with the PCS layer
- Allow drivers to report advanced statistics related to Time Aware
scheduler offload (taprio)
- Allow opting out of VF statistics in link dump, to allow more VFs
to fit into the message
- Split devlink instance and devlink port operations
New hardware / drivers:
- Ethernet:
- Synopsys EMAC4 IP support (stmmac)
- Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches
- Marvell 88E6250 7 port switches
- Microchip LAN8650/1 Rev.B0 PHYs
- MediaTek MT7981/MT7988 built-in 1GE PHY driver
- WiFi:
- Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps
- Realtek RTL8723DS (SDIO variant)
- Realtek RTL8851BE
- CAN:
- Fintek F81604
Drivers:
- Ethernet NICs:
- Intel (100G, ice):
- support dynamic interrupt allocation
- use meta data match instead of VF MAC addr on slow-path
- nVidia/Mellanox:
- extend link aggregation to handle 4, rather than just 2 ports
- spawn sub-functions without any features by default
- OcteonTX2:
- support HTB (Tx scheduling/QoS) offload
- make RSS hash generation configurable
- support selecting Rx queue using TC filters
- Wangxun (ngbe/txgbe):
- add basic Tx/Rx packet offloads
- add phylink support (SFP/PCS control)
- Freescale/NXP (enetc):
- report TAPRIO packet statistics
- Solarflare/AMD:
- support matching on IP ToS and UDP source port of outer
header
- VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6
- add devlink dev info support for EF10
- Virtual NICs:
- Microsoft vNIC:
- size the Rx indirection table based on requested
configuration
- support VLAN tagging
- Amazon vNIC:
- try to reuse Rx buffers if not fully consumed, useful for ARM
servers running with 16kB pages
- Google vNIC:
- support TCP segmentation of >64kB frames
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- enable USXGMII (88E6191X)
- Microchip:
- lan966x: add support for Egress Stage 0 ACL engine
- lan966x: support mapping packet priority to internal switch
priority (based on PCP or DSCP)
- Ethernet PHYs:
- Broadcom PHYs:
- support for Wake-on-LAN for BCM54210E/B50212E
- report LPI counter
- Microsemi PHYs: support RGMII delay configuration (VSC85xx)
- Micrel PHYs: receive timestamp in the frame (LAN8841)
- Realtek PHYs: support optional external PHY clock
- Altera TSE PCS: merge the driver into Lynx PCS which it is a
variant of
- CAN: Kvaser PCIEcan:
- support packet timestamping
- WiFi:
- Intel (iwlwifi):
- major update for new firmware and Multi-Link Operation (MLO)
- configuration rework to drop test devices and split the
different families
- support for segmented PNVM images and power tables
- new vendor entries for PPAG (platform antenna gain) feature
- Qualcomm 802.11ax (ath11k):
- Multiple Basic Service Set Identifier (MBSSID) and Enhanced
MBSSID Advertisement (EMA) support in AP mode
- support factory test mode
- RealTek (rtw89):
- add RSSI based antenna diversity
- support U-NII-4 channels on 5 GHz band
- RealTek (rtl8xxxu):
- AP mode support for 8188f
- support USB RX aggregation for the newer chips"
* tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1602 commits)
net: scm: introduce and use scm_recv_unix helper
af_unix: Skip SCM_PIDFD if scm->pid is NULL.
net: lan743x: Simplify comparison
netlink: Add __sock_i_ino() for __netlink_diag_dump().
net: dsa: avoid suspicious RCU usage for synced VLAN-aware MAC addresses
Revert "af_unix: Call scm_recv() only after scm_set_cred()."
phylink: ReST-ify the phylink_pcs_neg_mode() kdoc
libceph: Partially revert changes to support MSG_SPLICE_PAGES
net: phy: mscc: fix packet loss due to RGMII delays
net: mana: use vmalloc_array and vcalloc
net: enetc: use vmalloc_array and vcalloc
ionic: use vmalloc_array and vcalloc
pds_core: use vmalloc_array and vcalloc
gve: use vmalloc_array and vcalloc
octeon_ep: use vmalloc_array and vcalloc
net: usb: qmi_wwan: add u-blox 0x1312 composition
perf trace: fix MSG_SPLICE_PAGES build error
ipvlan: Fix return value of ipvlan_queue_xmit()
netfilter: nf_tables: fix underflow in chain reference counter
netfilter: nf_tables: unbind non-anonymous set if rule construction fails
...
- Yosry has also eliminated cgroup's atomic rstat flushing.
- Nhat Pham adds the new cachestat() syscall. It provides userspace
with the ability to query pagecache status - a similar concept to
mincore() but more powerful and with improved usability.
- Mel Gorman provides more optimizations for compaction, reducing the
prevalence of page rescanning.
- Lorenzo Stoakes has done some maintanance work on the get_user_pages()
interface.
- Liam Howlett continues with cleanups and maintenance work to the maple
tree code. Peng Zhang also does some work on maple tree.
- Johannes Weiner has done some cleanup work on the compaction code.
- David Hildenbrand has contributed additional selftests for
get_user_pages().
- Thomas Gleixner has contributed some maintenance and optimization work
for the vmalloc code.
- Baolin Wang has provided some compaction cleanups,
- SeongJae Park continues maintenance work on the DAMON code.
- Huang Ying has done some maintenance on the swap code's usage of
device refcounting.
- Christoph Hellwig has some cleanups for the filemap/directio code.
- Ryan Roberts provides two patch series which yield some
rationalization of the kernel's access to pte entries - use the provided
APIs rather than open-coding accesses.
- Lorenzo Stoakes has some fixes to the interaction between pagecache
and directio access to file mappings.
- John Hubbard has a series of fixes to the MM selftesting code.
- ZhangPeng continues the folio conversion campaign.
- Hugh Dickins has been working on the pagetable handling code, mainly
with a view to reducing the load on the mmap_lock.
- Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from
128 to 8.
- Domenico Cerasuolo has improved the zswap reclaim mechanism by
reorganizing the LRU management.
- Matthew Wilcox provides some fixups to make gfs2 work better with the
buffer_head code.
- Vishal Moola also has done some folio conversion work.
- Matthew Wilcox has removed the remnants of the pagevec code - their
functionality is migrated over to struct folio_batch.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJejewAKCRDdBJ7gKXxA
joggAPwKMfT9lvDBEUnJagY7dbDPky1cSYZdJKxxM2cApGa42gEA6Cl8HRAWqSOh
J0qXCzqaaN8+BuEyLGDVPaXur9KirwY=
=B7yQ
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull mm updates from Andrew Morton:
- Yosry Ahmed brought back some cgroup v1 stats in OOM logs
- Yosry has also eliminated cgroup's atomic rstat flushing
- Nhat Pham adds the new cachestat() syscall. It provides userspace
with the ability to query pagecache status - a similar concept to
mincore() but more powerful and with improved usability
- Mel Gorman provides more optimizations for compaction, reducing the
prevalence of page rescanning
- Lorenzo Stoakes has done some maintanance work on the
get_user_pages() interface
- Liam Howlett continues with cleanups and maintenance work to the
maple tree code. Peng Zhang also does some work on maple tree
- Johannes Weiner has done some cleanup work on the compaction code
- David Hildenbrand has contributed additional selftests for
get_user_pages()
- Thomas Gleixner has contributed some maintenance and optimization
work for the vmalloc code
- Baolin Wang has provided some compaction cleanups,
- SeongJae Park continues maintenance work on the DAMON code
- Huang Ying has done some maintenance on the swap code's usage of
device refcounting
- Christoph Hellwig has some cleanups for the filemap/directio code
- Ryan Roberts provides two patch series which yield some
rationalization of the kernel's access to pte entries - use the
provided APIs rather than open-coding accesses
- Lorenzo Stoakes has some fixes to the interaction between pagecache
and directio access to file mappings
- John Hubbard has a series of fixes to the MM selftesting code
- ZhangPeng continues the folio conversion campaign
- Hugh Dickins has been working on the pagetable handling code, mainly
with a view to reducing the load on the mmap_lock
- Catalin Marinas has reduced the arm64 kmalloc() minimum alignment
from 128 to 8
- Domenico Cerasuolo has improved the zswap reclaim mechanism by
reorganizing the LRU management
- Matthew Wilcox provides some fixups to make gfs2 work better with the
buffer_head code
- Vishal Moola also has done some folio conversion work
- Matthew Wilcox has removed the remnants of the pagevec code - their
functionality is migrated over to struct folio_batch
* tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits)
mm/hugetlb: remove hugetlb_set_page_subpool()
mm: nommu: correct the range of mmap_sem_read_lock in task_mem()
hugetlb: revert use of page_cache_next_miss()
Revert "page cache: fix page_cache_next/prev_miss off by one"
mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
mm: memcg: rename and document global_reclaim()
mm: kill [add|del]_page_to_lru_list()
mm: compaction: convert to use a folio in isolate_migratepages_block()
mm: zswap: fix double invalidate with exclusive loads
mm: remove unnecessary pagevec includes
mm: remove references to pagevec
mm: rename invalidate_mapping_pagevec to mapping_try_invalidate
mm: remove struct pagevec
net: convert sunrpc from pagevec to folio_batch
i915: convert i915_gpu_error to use a folio_batch
pagevec: rename fbatch_count()
mm: remove check_move_unevictable_pages()
drm: convert drm_gem_put_pages() to use a folio_batch
i915: convert shmem_sg_free_table() to use a folio_batch
scatterlist: add sg_set_folio()
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSV8dwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpilGD/9Yys1oxIXJpRf00fzrylAlBthRxMjFQVWw
zAut106hAQiBHvU8IkmGA3MvEFVHxtzwYhHI7IR8K3aZBIqscweCqmVI9JyogJw9
U9Twnzel47VmuKdM94FeoN+hbj1fP8EWTjzmy67/zEEfFCdmHvNlMi3lSrGYIpFy
39LxTB99Y4UarM5PtWbes37GYYljzMSWKuo4AfBkvq1eQa+sZ0Vq2xAABKq3UM7f
apqhgHtkJooRePDP0eQp+kAyyVMgW2jIK+oIdJDxNF3CKTu2w40RzaYz6fp+jVSU
H4R/xS59GW4/xql+VBJDh/qJg9K62DPPYjlW8BmSR8+IjvfFpsyH3/MacE50CD3P
20fs/Mnj49H79fDrQEHJI53cOOb2EmUitbwLbvOcColNTPpt8loBtdQxjF2RMU8R
Nyort9DJPFclYCxky1LYg1CNEC2Ln4Zy/jD47wPvqRmOQphOoVlV/hPnOEqvjaZC
49Vn70W2DeE9cXvYI7ha+XIg6/oj+Gs3iusEbV08Ci7EAtXgI+ZUUsQ97K8UNiUh
h2lqSJtuI7lBpYP9sf+BeCch5UCC+xGYyTdoM5f58lehWBBPtbs0g7S9RyRyOYxe
n+yxEUo3dAGzJ/xsKAjinbZfeWIpr0b1TkAh4w3Cq/BKzRr9Bp8lBAxYuancbQ+Y
1ADPteUOTA==
=zP4Y
-----END PGP SIGNATURE-----
Merge tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe pull request via Keith:
- Various cleanups all around (Irvin, Chaitanya, Christophe)
- Better struct packing (Christophe JAILLET)
- Reduce controller error logs for optional commands (Keith)
- Support for >=64KiB block sizes (Daniel Gomez)
- Fabrics fixes and code organization (Max, Chaitanya, Daniel
Wagner)
- bcache updates via Coly:
- Fix a race at init time (Mingzhe Zou)
- Misc fixes and cleanups (Andrea, Thomas, Zheng, Ye)
- use page pinning in the block layer for dio (David)
- convert old block dio code to page pinning (David, Christoph)
- cleanups for pktcdvd (Andy)
- cleanups for rnbd (Guoqing)
- use the unchecked __bio_add_page() for the initial single page
additions (Johannes)
- fix overflows in the Amiga partition handling code (Michael)
- improve mq-deadline zoned device support (Bart)
- keep passthrough requests out of the IO schedulers (Christoph, Ming)
- improve support for flush requests, making them less special to deal
with (Christoph)
- add bdev holder ops and shutdown methods (Christoph)
- fix the name_to_dev_t() situation and use cases (Christoph)
- decouple the block open flags from fmode_t (Christoph)
- ublk updates and cleanups, including adding user copy support (Ming)
- BFQ sanity checking (Bart)
- convert brd from radix to xarray (Pankaj)
- constify various structures (Thomas, Ivan)
- more fine grained persistent reservation ioctl capability checks
(Jingbo)
- misc fixes and cleanups (Arnd, Azeem, Demi, Ed, Hengqi, Hou, Jan,
Jordy, Li, Min, Yu, Zhong, Waiman)
* tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux: (266 commits)
scsi/sg: don't grab scsi host module reference
ext4: Fix warning in blkdev_put()
block: don't return -EINVAL for not found names in devt_from_devname
cdrom: Fix spectre-v1 gadget
block: Improve kernel-doc headers
blk-mq: don't insert passthrough request into sw queue
bsg: make bsg_class a static const structure
ublk: make ublk_chr_class a static const structure
aoe: make aoe_class a static const structure
block/rnbd: make all 'class' structures const
block: fix the exclusive open mask in disk_scan_partitions
block: add overflow checks for Amiga partition support
block: change all __u32 annotations to __be32 in affs_hardblocks.h
block: fix signed int overflow in Amiga partition support
block: add capacity validation in bdev_add_partition()
block: fine-granular CAP_SYS_ADMIN for Persistent Reservation
block: disallow Persistent Reservation on partitions
reiserfs: fix blkdev_put() warning from release_journal_dev()
block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions()
block: document the holder argument to blkdev_get_by_path
...
Use sendmsg() conditionally with MSG_SPLICE_PAGES in _drbd_send_page()
rather than calling sendpage() or _drbd_no_send_page().
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Philipp Reisner <philipp.reisner@linbit.com>
cc: Lars Ellenberg <lars.ellenberg@linbit.com>
cc: "Christoph Böhmwalder" <christoph.boehmwalder@linbit.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: drbd-dev@lists.linbit.com
Link: https://lore.kernel.org/r/20230623225513.2732256-11-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that the driver core allows for struct class to be in read-only
memory, move the ublk_chr_class structure to be declared at build time
placing it into read-only memory, instead of having to be dynamically
allocated at boot time.
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20230620180129.645646-7-gregkh@linuxfoundation.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that the driver core allows for struct class to be in read-only
memory, move the aoe_class structure to be declared at build time
placing it into read-only memory, instead of having to be dynamically
allocated at boot time.
Cc: Justin Sanders <justin@coraid.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20230620180129.645646-6-gregkh@linuxfoundation.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that the driver core allows for struct class to be in read-only
memory, making all 'class' structures to be declared at build time
placing them into read-only memory, instead of having to be dynamically
allocated at load time.
Cc: "Md. Haris Iqbal" <haris.iqbal@ionos.com>
Cc: Jack Wang <jinpu.wang@ionos.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20230620180129.645646-5-gregkh@linuxfoundation.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This reverts commit 07b679f70d.
This change appears to have broken things...
We now see applications hanging during disk accesses.
e.g.
multi-port virtio-blk device running in h/w (FPGA)
Host running a simple 'fio' test.
[global]
thread=1
direct=1
ioengine=libaio
norandommap=1
group_reporting=1
bs=4K
rw=read
iodepth=128
runtime=1
numjobs=4
time_based
[job0]
filename=/dev/vda
[job1]
filename=/dev/vdb
[job2]
filename=/dev/vdc
...
[job15]
filename=/dev/vdp
i.e. 16 disks; 4 queues per disk; simple burst of 4KB reads
This is repeatedly run in a loop.
After a few, normally <10 seconds, fio hangs.
With 64 queues (16 disks), failure occurs within a few seconds; with 8 queues (2 disks) it may take ~hour before hanging.
Last message:
fio-3.19
Starting 8 threads
Jobs: 1 (f=1): [_(7),R(1)][68.3%][eta 03h:11m:06s]
I think this means at the end of the run 1 queue was left incomplete.
'diskstats' (run while fio is hung) shows no outstanding transactions.
e.g.
$ cat /proc/diskstats
...
252 0 vda 1843140071 0 14745120568 712568645 0 0 0 0 0 3117947 712568645 0 0 0 0 0 0
252 16 vdb 1816291511 0 14530332088 704905623 0 0 0 0 0 3117711 704905623 0 0 0 0 0 0
...
Other stats (in the h/w, and added to the virtio-blk driver ([a]virtio_queue_rq(), [b]virtblk_handle_req(), [c]virtblk_request_done()) all agree, and show every request had a completion, and that virtblk_request_done() never gets called.
e.g.
PF= 0 vq=0 1 2 3
[a]request_count - 839416590 813148916 105586179 84988123
[b]completion1_count - 839416590 813148916 105586179 84988123
[c]completion2_count - 0 0 0 0
PF= 1 vq=0 1 2 3
[a]request_count - 823335887 812516140 104582672 75856549
[b]completion1_count - 823335887 812516140 104582672 75856549
[c]completion2_count - 0 0 0 0
i.e. the issue is after the virtio-blk driver.
This change was introduced in kernel 6.3.0.
I am seeing this using 6.3.3.
If I run with an earlier kernel (5.15), it does not occur.
If I make a simple patch to the 6.3.3 virtio-blk driver, to skip the blk_mq_add_to_batch()call, it does not fail.
e.g.
kernel 5.15 - this is OK
virtio_blk.c,virtblk_done() [irq handler]
if (likely(!blk_should_fake_timeout(req->q))) {
blk_mq_complete_request(req);
}
kernel 6.3.3 - this fails
virtio_blk.c,virtblk_handle_req() [irq handler]
if (likely(!blk_should_fake_timeout(req->q))) {
if (!blk_mq_complete_request_remote(req)) {
if (!blk_mq_add_to_batch(req, iob, virtblk_vbr_status(vbr), virtblk_complete_batch)) {
virtblk_request_done(req); //this never gets called... so blk_mq_add_to_batch() must always succeed
}
}
}
If I do, kernel 6.3.3 - this is OK
virtio_blk.c,virtblk_handle_req() [irq handler]
if (likely(!blk_should_fake_timeout(req->q))) {
if (!blk_mq_complete_request_remote(req)) {
virtblk_request_done(req); //force this here...
if (!blk_mq_add_to_batch(req, iob, virtblk_vbr_status(vbr), virtblk_complete_batch)) {
virtblk_request_done(req); //this never gets called... so blk_mq_add_to_batch() must always succeed
}
}
}
Perhaps you might like to fix/test/revert this change...
Martin
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202306090826.C1fZmdMe-lkp@intel.com/
Cc: Suwan Kim <suwan.kim027@gmail.com>
Tested-by: edliaw@google.com
Reported-by: "Roberts, Martin" <martin.roberts@intel.com>
Message-Id: <336455b4f630f329380a8f53ee8cad3868764d5c.1686295549.git.mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Fix a missing conversion to the new BLK_OPEN constant in swim.
Fixes: 05bdb99653 ("block: replace fmode_t with a block-specific type for block open flags")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230620043051.707196-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Recompression threshold should be below huge-size-class watermark. Any
object larger than huge-size-class is a "huge object" and occupies a
whole physical page on the zsmalloc side, in other words it's
incompressible, as far as zsmalloc is concerned.
Link: https://lkml.kernel.org/r/20230614141338.3480029-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Suggested-by: Brian Geffon <bgeffon@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The body of the loop is run without RCU lock held. Use the regular
cond_resched() instead of cond_resched_rcu().
Fixes: 786bb02458 ("brd: use XArray instead of radix-tree to index backing pages")
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20230614133538.1279369-1-p.raghav@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The only overlap between the block open flags mapped into the fmode_t and
other uses of fmode_t are FMODE_READ and FMODE_WRITE. Define a new
blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and
->ioctl and stop abusing fmode_t.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd]
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-28-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stop passing the fmode_t around and just use a simple bool to track if
an export is read-only.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20230608110258.189493-24-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The current interface for exclusive opens is rather confusing as it
requires both the FMODE_EXCL flag and a holder. Remove the need to pass
FMODE_EXCL and just key off the exclusive open off a non-NULL holder.
For blkdev_put this requires adding the holder argument, which provides
better debug checking that only the holder actually releases the hold,
but at the same time allows removing the now superfluous mode argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd]
Link: https://lore.kernel.org/r/20230608110258.189493-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Passing a holder to blkdev_get_by_path when FMODE_EXCL isn't set doesn't
make sense, so pass NULL instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20230608110258.189493-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The mode argument to the ->release block_device_operation is never used,
so remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd]
Link: https://lore.kernel.org/r/20230608110258.189493-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
->open is only called on the whole device. Make that explicit by
passing a gendisk instead of the block_device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd]
Link: https://lore.kernel.org/r/20230608110258.189493-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bdev_check_media_change should only ever be called for the whole device.
Pass a gendisk to make that explicit and rename the function to
disk_check_media_change.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Let's always set errno after pr_err which is consistent with
default case.
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20230524070026.2932-7-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Check ret is enough since if sess_dev is NULL which also
implies ret should be 0.
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20230524070026.2932-5-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add one new array (marked with __maybe_unused to prevent gcc warning about
"defined but not used" with W=1), then we can remove rnbd_access_mode_str
and rnbd-common.c accordingly.
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20230524070026.2932-4-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
No need to include it since none of macros in limits.h are
used by rnbd-srv.
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20230524070026.2932-3-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This routine is not called since added. Then the two flags
(RNBD_OP_LAST and RNBD_F_ALL) can be removed too after kill
rnbd_flags_supported.
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20230524070026.2932-2-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSDhmUQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpsC6D/9SZA3bcl3azUAUNPXA0uAXyZufBL/FuR83
SNpaRp00epbF00NpPvrsMCE79RyGNK+JNlpTu8gBWssnKZ3Hyu0Adu7ILdYx4raC
H9YT8HeMUuT12bi397HIe7Bh5Qxpkta/F1aGgAp9O0fCVsYAGxfT+YS+KGcIsomI
L3IZjrwN4F4+zwEl++qHIVvidiz5hTahr2rVaYdX5TsogeP31x70Xp1WDSUhYZ4u
d3k534vwy/xNRVLqJKhRLfcKW6qqrydTflqtOV0z8fuV+Pf4wxxBaODYWxXPKwqn
JdWsFPb7339SsN/PZccdXhzCUJGr6cL+6b6holgSUKt2g+LoBSmiUZPocUx+A3qB
MHgww2bAK7BD5GiLuTHS+8xWNr99m5LkpOGTsWzsrnDe4c/YIL9yTdQIm58pmp25
8oOm/fmAVnwtg0SEmjX747YkFLnX0H+UQ10iIxynju47vEbpOCiuy8vM+g1ccvSq
e/FLpfp1JzbNJi6zGw/EU1ZiuHp+YlkgqxjUITHv7tSA4hl3zN23zlvHJ2lBxDKF
Talf2HLLgmvZX5X5hgQyTXeHGmxa/WfLWEyKSoy7OKgD0e/pM5GuLsx6Nmg4kMgr
ru5Y+UrlMR/8KJUVvXpTigiESpmQufy+S5CjvQtLD1lhbIfbypL4yTdve/6weFNh
aQI+6rpxtw==
=fNEf
-----END PGP SIGNATURE-----
Merge tag 'block-6.4-2023-06-09' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- Fix an issue with the hardware queue nr_active, causing it to become
imbalanced (Tian)
- Fix an issue with null_blk not releasing pages if configured as
memory backed (Nitesh)
- Fix a locking issue in dasd (Jan)
* tag 'block-6.4-2023-06-09' of git://git.kernel.dk/linux:
s390/dasd: Use correct lock while counting channel queue length
null_blk: Fix: memory release when memory_backed=1
blk-mq: fix blk_mq_hw_ctx active request accounting
Sort the headers in alphabetic order in order to ease
the maintenance for this part.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20230310164549.22133-10-andriy.shevchenko@linux.intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In the snippets like the following
if (...)
return / goto / break / continue ...;
else
...
the 'else' is redundant. Get rid of it.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20230310164549.22133-9-andriy.shevchenko@linux.intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use DEFINE_SHOW_ATTRIBUTE() helper macro to simplify the code.
No functional change.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20230310164549.22133-7-andriy.shevchenko@linux.intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since the commit 72deb455b5 ("block: remove CONFIG_LBDAF")
the sector_t is always 64-bit type, no need to cast anymore.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20230310164549.22133-6-andriy.shevchenko@linux.intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The code can be neater without forward declarations.
Get rid of pkt_seq_show() forward declaration. This
will also allow futher cleanups to be cleaner.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20230310164549.22133-5-andriy.shevchenko@linux.intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Follow the advice of the Documentation/filesystems/sysfs.rst and show()
should only use sysfs_emit() or sysfs_emit_at() when formatting the
value to be returned to user space.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20230310164549.22133-4-andriy.shevchenko@linux.intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We may use traditional dev_*() macros instead of custom ones
provided by the driver.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20230310164549.22133-2-andriy.shevchenko@linux.intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the index allocated by idr_alloc greater than MINORMASK >> part_shift,
the device number will overflow, resulting in failure to create a block
device.
Fix it by imiting the size of the max allocation.
Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230605122159.2134384-1-zhongjinghua@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move capturing the snapshot context into the image request state
machine, after exclusive lock is ensured to be held for the duration of
dealing with the image request. This is needed to ensure correctness
of fast-diff states (OBJECT_EXISTS vs OBJECT_EXISTS_CLEAN) and object
deltas computed based off of them. Otherwise the object map that is
forked for the snapshot isn't guaranteed to accurately reflect the
contents of the snapshot when the snapshot is taken under I/O. This
breaks differential backup and snapshot-based mirroring use cases with
fast-diff enabled: since some object deltas may be incomplete, the
destination image may get corrupted.
Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/61472
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
Move RBD_OBJ_FLAG_COPYUP_ENABLED flag setting into the object request
state machine to allow for the snapshot context to be captured in the
image request state machine rather than in rbd_queue_workfn().
Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
Add a new blk_holder_ops structure, which is passed to blkdev_get_by_* and
installed in the block_device for exclusive claims. It will be used to
allow the block layer to call back into the user of the block device for
thing like notification of a removed device or a device resize.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
__KERNEL_SYSCALLS__ hasn't been needed since Linux 2.6.19 so stop
defining it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20230601151646.1386867-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add control command of UBLK_U_CMD_GET_FEATURES for returning driver's
feature set or capability.
This way can simplify userspace for maintaining compatibility because
userspace doesn't need to send command to one device for querying driver
feature set any more. Such as, with the queried feature set, userspace
can choose to use:
- UBLK_CMD_GET_DEV_INFO2 or UBLK_CMD_GET_DEV_INFO,
- UBLK_U_CMD_* or UBLK_CMD_*
Userspace code:
https://github.com/ming1/ubdsrv/commits/features-cmd
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230603040601.775227-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The floppy code uses bio_add_page() to add a page to a newly created bio.
bio_add_page() can fail, but the return value is never checked.
Use __bio_add_page() as adding a single page to a newly created bio is
guaranteed to succeed.
This brings us a step closer to marking bio_add_page() as __must_check.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/33c445a3b431270c72d9be03d5da1b08ae983920.1685532726.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The zram writeback code uses bio_add_page() to add a page to a newly
created bio. bio_add_page() can fail, but the return value is never
checked.
Use __bio_add_page() as adding a single page to a newly created bio is
guaranteed to succeed.
This brings us a step closer to marking bio_add_page() as __must_check.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/cfd141dd7773315879a126f2aa81b7f698bc0e10.1685532726.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The drbd code only adds a single page to a newly created bio. So use
__bio_add_page() to add the page which is guaranteed to succeed in this
case.
This brings us closer to marking bio_add_page() as __must_check.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/435007afac14f3766455559059d21843771fae53.1685532726.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCZHGVRAAKCRCAXGG7T9hj
vqtJAQDizKasLE7tSnfs/FrZ/4xPaDLe3bQifMx2C1dtYCjRcAD/ciZSa1L0WzZP
dpEZnlYRzsR3bwLktQEMQFOvlbh1SwE=
=K860
-----END PGP SIGNATURE-----
Merge tag 'for-linus-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
- a double free fix in the Xen pvcalls backend driver
- a fix for a regression causing the MSI related sysfs entries to not
being created in Xen PV guests
- a fix in the Xen blkfront driver for handling insane input data
better
* tag 'for-linus-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/pci/xen: populate MSI sysfs entries
xen/pvcalls-back: fix double frees with pvcalls_new_active_socket()
xen/blkfront: Only check REQ_FUA for writes
The existing code silently converts read operations with the
REQ_FUA bit set into write-barrier operations. This results in data
loss as the backend scribbles zeroes over the data instead of returning
it.
While the REQ_FUA bit doesn't make sense on a read operation, at least
one well-known out-of-tree kernel module does set it and since it
results in data loss, let's be safe here and only look at REQ_FUA for
writes.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Acked-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20230426164005.2213139-1-ross.lagerwall@citrix.com
Signed-off-by: Juergen Gross <jgross@suse.com>
Return type of iov_iter_get_pages2() is ssize_t instead of size_t, so
fix it.
Fixes: 981f95a571 ("ublk: cleanup ublk_copy_user_pages")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230520151134.459679-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently copy between io request buffer(pages) and userspace buffer is
done inside ublk_map_io() or ublk_unmap_io(). This way performs very
well in case of pre-allocated userspace io buffer.
For dynamically allocated or external userspace backend io buffer,
UBLK_F_NEED_GET_DATA is added for ublk server to provide buffer by one
extra command communication for WRITE request. For READ, userspace
simply provides buffer, but can't know when the buffer is done[1].
Add UBLK_F_USER_COPY by moving io data copy out of kernel by providing
read()/write() on /dev/ublkcN, and simply let ublk server do the io
data copy. This way makes both side cleaner, the cost is that one extra
syscall for copy io data between request and backend buffer.
With UBLK_F_USER_COPY, it actually becomes possible to run per-io zero
copy now, such as, only do zero copy for big size IO, so it can be
thought as one prep patch for supporting zero copy. Meantime zero copy
still needs to expose read()/write() buffer for some corner case, such
as passthrough IO.
[1] READ buffer in UBLK_F_NEED_GET_DATA
https://lore.kernel.org/linux-block/116d8a56-0881-56d3-9bcc-78ff3e1dc4e5@linux.alibaba.com/T/#m23bd4b8634c0a054e6797063167b469949a247bb
ublksrv loop usercopy code:
https://github.com/ming1/ubdsrv/commits/usercopy
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230519065030.351216-8-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Support pread()/pwrite() on ublk char device for reading/writing request
io buffer, so data copy between io request buffer and userspace buffer
can be moved to ublk server from ublk driver. Then UBLK_F_NEED_GET_DATA
becomes not necessary, so ublk server can allocate buffer without one
extra round uring command communication for userspace to provide buffer.
IO buffer can be located by iocb->ki_pos which encodes buffer offset, io
tag and queue id info, and type of iocb->ki_pos is u64, so it is big
enough for holding reasonable queue depth, nr_queues and max io buffer
size.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230519065030.351216-7-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add 'offset' to 'struct ublk_map_data', so that ublk_copy_user_pages()
can be used to copy any sub-buffer(linear mapped) of the request.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230519065030.351216-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add one reference counter into request pdu data, and hold this reference
in the request's lifetime.
Prepare for supporting to move request data copy into userspace, which
needs to copy request data by read()/write() on /dev/ublkcN, so we have
to guarantee that read()/write() is done on one valid/active request,
and that will be enhanced by holding the io request reference in
read()/write().
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230519065030.351216-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Clean up ublk_copy_user_pages() by using iov_iter_get_pages2, and code
gets simplified a lot and becomes much more readable than before.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230519065030.351216-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
task_work_add() is used from early ublk development stage for handling
request in batch. However, since commit 7d4a93176e ("ublk_drv: don't
forward io commands in reserve order"), we can get similar batch
processing with io_uring_cmd_complete_in_task(), and similar performance
data is observed between task_work_add() and
io_uring_cmd_complete_in_task().
Meantime we can kill one fast code path, which is actually seldom used
given it is common to build ublk driver as module.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230519065030.351216-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When handling UBLK_IO_FETCH_REQ, ctx->uring_lock is grabbed first, then
ub->mutex is acquired.
When handling UBLK_CMD_STOP_DEV or UBLK_CMD_DEL_DEV, ub->mutex is
grabbed first, then calling io_uring_cmd_done() for canceling uring
command, in which ctx->uring_lock may be required.
Real deadlock only happens when all the above commands are issued from
same uring context, and in reality different uring contexts are often used
for handing control command and IO command.
Fix the issue by using io_uring_cmd_complete_in_task() to cancel command
in ublk_cancel_dev(ublk_cancel_queue).
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Closes: https://lore.kernel.org/linux-block/becol2g7sawl4rsjq2dztsbc7mqypfqko6wzsyoyazqydoasml@rcxarzwidrhk
Cc: Ziyang Zhang <ZiyangZhang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/20230517133408.210944-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
XArray was introduced to hold large array of pointers with a simple API.
XArray API also provides array semantics which simplifies the way we store
and access the backing pages, and the code becomes significantly easier
to understand.
No performance difference was noticed between the two implementation
using fio with direct=1 [1].
[1] Performance in KIOPS:
| radix-tree | XArray | Diff
| | |
write | 315 | 313 | -0.6%
randwrite | 286 | 290 | +1.3%
read | 330 | 335 | +1.5%
randread | 309 | 312 | +0.9%
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20230511121544.111648-1-p.raghav@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In case of CONFIG_BLKDEV_UBLK_LEGACY_OPCODES, type of cmd opcode could
be 0 or 'u'; and type can only be 'u' if CONFIG_BLKDEV_UBLK_LEGACY_OPCODES
isn't set.
So fix the wrong check.
Fixes: 2d786e66c9 ("block: ublk: switch to ioctl command encoding")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230505153142.1258336-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The debugfs_create_dir function returns ERR_PTR in case of error, and the
only correct way to check if an error occurred is 'IS_ERR' inline function.
This patch will replace the null-comparison with IS_ERR.
Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com>
Link: https://lore.kernel.org/r/20230512130533.98709-1-ivan.orlov0322@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmRXkp8QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpsH3EADZ4ex4vvc2/giYjRYTdyda00GIdYGiSZaZ
OBW59OBYJfDjsS99Z2U4638shyqM3rA6BwzvWD6hPUTShD7PUddIKDaanpCYBs+x
GhcDhzeJxARjsp/5qZPQ4r3Hdp+kWRnOFu1BSniCEdWt43J6AOm0l5duuNxl1db5
Jz+BkSgF1pICAVB5YfSPDShyFiDfc0qcv3rHZo/rVRjwmX3x4x+OVr97zbZInnH4
fm2vcv2e3Dc8uQEB0XFEyBWNeC8hdhTNe60t1y3MlILaPjHrQlzl5lkQFZ9dtLG6
/FDK+bD6Ex8t6SQnnQGrLa9q4vQYA6zlT5u/wdvzAkakmrSxt6phf23U1TbyZSQ8
clSg6MSuyJAW7YhiT3OqAsNMvBZ5g/nKHjLxfkAe86N9YIrMebZW9GDF+fWf0q0K
em7O4XPehjkCXKclCInMv2bkv+/pncrd84D7Jt3OmCPljBL6oXfWF3vJJQ6FaY+t
G3tNxl1hMHqwX8uaaSqyeuMYAEYizZa+ebMmHJwm/151d0EKylAOh3i6O1e4bpk3
Kn5irF6pfTV1g1TcmY3fOwNgMffcHc9Y4E8bXxMJAN9WiHDuopgn2C9+MrJ+KRrx
pLN7VD9UK/5yFzHSMBhyM8lm0/oTIhfTVgb+WyuHmJLcRhAnyNy9RrIBhSD0AoH+
UMyOexZR8w==
=l7Mo
-----END PGP SIGNATURE-----
Merge tag 'for-6.4/io_uring-2023-05-07' of git://git.kernel.dk/linux
Pull more io_uring updates from Jens Axboe:
"Nothing major in here, just two different parts:
- A small series from Breno that enables passing the full SQE down
for ->uring_cmd().
This is a prerequisite for enabling full network socket operations.
Queued up a bit late because of some stylistic concerns that got
resolved, would be nice to have this in 6.4-rc1 so the dependent
work will be easier to handle for 6.5.
- Fix for the huge page coalescing, which was a regression introduced
in the 6.3 kernel release (Tobias)"
* tag 'for-6.4/io_uring-2023-05-07' of git://git.kernel.dk/linux:
io_uring: Remove unnecessary BUILD_BUG_ON
io_uring: Pass whole sqe to commands
io_uring: Create a helper to return the SQE size
io_uring/rsrc: check for nonconsecutive pages
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmRWLQYQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpnwhD/4xRYfAY4O0oGZCITiKEEFxiSfDHCPMEBji
zTEtideDRjcrpFmjmZ411C9prW32MxYQ3wQTf4O7w4t906xTYVr9FQy8g3Et4izI
zglPcsa2jPzYCadQ0Ye4n9dRuYOH9FDJzDDLC+smu8zQKKmAqEAN/1ftpnADdVrY
qX1sHyfz4RQRAgTHg2WqgKOi2O9VwGSMwOfBocmzAnruv9oLUlypcGnFPRCSZH/a
OKpUZlvQhCBZTKScvVxQeJMg2Tl5yokQ0TH+gkQsdav9XcPJktqXiuD+c4h6q5ux
oTlysEqcrwcaAafOcV0w9u80SFNlFACUYsNmEnFJaPXFTqAdNHvo1DJNsmxiHJDU
bGo5ktlo5b/VZ51niOoWvGxursavq16G4yIYlGGHc7f4wGs12oc5ZP/yM3GRUY+C
PdezwEvvQufxP7sFokfpgAS4SuH+tBlrhFXMsYaI4NukZQW4TK1zzbMrzOkdxFhW
BOx17VFUKWtUnRmxinFGIA8Vj+FXN+E+ND+FoDsbrMyJD4maKDdJapPchG0J0Vbs
pDcsB4c0pBC6H2xrobKiA1CuSq2t2qvyvwe1Zl2Xd+RVW9vBB5SI6HXYrC+UtxwY
7LfX8F13cFD1E6iJ9Nta6x8fOunGnOVBdW5O0k4hDWEuZduvHItEDn2c3Ehqp4Jw
P8dFBbk8SQ==
=gAYf
-----END PGP SIGNATURE-----
Merge tag 'for-6.4/block-2023-05-06' of git://git.kernel.dk/linux
Pull more block updates from Jens Axboe:
- MD pull request via Song:
- Improve raid5 sequential IO performance on spinning disks, which
fixes a regression since v6.0 (Jan Kara)
- Fix bitmap offset types, which fixes an issue introduced in this
merge window (Jonathan Derrick)
- Cleanup of hweight type used for cgroup writeback (Maxim)
- Fix a regression with the "has_submit_bio" changes across partitions
(Ming)
- Cleanup of QUEUE_FLAG_ADD_RANDOM clearing.
We used to set this flag on queues non blk-mq queues, and hence some
drivers clear it unconditionally. Since all of these have since been
converted to true blk-mq drivers, drop the useless clear as the bit
is not set (Chaitanya)
- Fix the flags being set in a bio for a flush for drbd (Christoph)
- Cleanup and deduplication of the code handling setting block device
capacity (Damien)
- Fix for ublk handling IO timeouts (Ming)
- Fix for a regression in blk-cgroup teardown (Tao)
- NBD documentation and code fixes (Eric)
- Convert blk-integrity to using device_attributes rather than a second
kobject to manage lifetimes (Thomas)
* tag 'for-6.4/block-2023-05-06' of git://git.kernel.dk/linux:
ublk: add timeout handler
drbd: correctly submit flush bio on barrier
mailmap: add mailmap entries for Jens Axboe
block: Skip destroyed blkg when restart in blkg_destroy_all()
writeback: fix call of incorrect macro
md: Fix bitmap offset type in sb writer
md/raid5: Improve performance for sequential IO
docs nbd: userspace NBD now favors github over sourceforge
block nbd: use req.cookie instead of req.handle
uapi nbd: add cookie alias to handle
uapi nbd: improve doc links to userspace spec
blk-integrity: register sysfs attributes on struct device
blk-integrity: convert to struct device_attribute
blk-integrity: use sysfs_emit
block/drivers: remove dead clear of random flag
block: sync part's ->bd_has_submit_bio with disk's
block: Cleanup set_capacity()/bdev_set_nr_sectors()
Currently uring CMD operation relies on having large SQEs, but future
operations might want to use normal SQE.
The io_uring_cmd currently only saves the payload (cmd) part of the SQE,
but, for commands that use normal SQE size, it might be necessary to
access the initial SQE fields outside of the payload/cmd block. So,
saves the whole SQE other than just the pdu.
This changes slightly how the io_uring_cmd works, since the cmd
structures and callbacks are not opaque to io_uring anymore. I.e, the
callbacks can look at the SQE entries, not only, in the cmd structure.
The main advantage is that we don't need to create custom structures for
simple commands.
Creates io_uring_sqe_cmd() that returns the cmd private data as a null
pointer and avoids casting in the callee side.
Also, make most of ublk_drv's sqe->cmd priv structure into const, and use
io_uring_sqe_cmd() to get the private structure, removing the unwanted
cast. (There is one case where the cast is still needed since the
header->{len,addr} is updated in the private structure)
Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20230504121856.904491-3-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add timeout handler, so that we can provide forward progress guarantee for
unprivileged ublk, which can't be trusted.
One thing is that sync() calls sync_bdevs(wait) for all block devices after
running sync_bdevs(no_wait), and if one device can't move on, the sync() won't
return any more.
Add timeout for unprivileged ublk to avoid such affect for other users which call
sync() syscall.
Meantime clear UBLK_F_USER_RECOVERY_REISSUE for unprivileged ublk since
that feature may cause IO hang too.
Fixes: 4093cb5a06 ("ublk_drv: add mechanism for supporting unprivileged ublk device")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230502024231.888498-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When we receive a flush command (or "barrier" in DRBD), we currently use
a REQ_OP_FLUSH with the REQ_PREFLUSH flag set.
The correct way to submit a flush bio is by using a REQ_OP_WRITE without
any data, and set the REQ_PREFLUSH flag.
Since commit b4a6bb3a67 ("block: add a sanity check for non-write
flush/fua bios"), this triggers a warning in the block layer, but this
has been broken for quite some time before that.
So use the correct set of flags to actually make the flush happen.
Cc: Christoph Hellwig <hch@infradead.org>
Cc: stable@vger.kernel.org
Fixes: f9ff0da564 ("drbd: allow parallel flushes for multi-volume resources")
Reported-by: Thomas Voegtle <tv@lio96.de>
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230503121937.17232-1-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
switching from a user process to a kernel thread.
- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav.
- zsmalloc performance improvements from Sergey Senozhatsky.
- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.
- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page().
- make __filemap_get_folio()'s return value more useful
- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap'.
- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.
- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).
- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
permitting userspace to wr-protect anon memory unpopulated ptes.
- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather
than exclusive, and has fixed a bunch of errors which were caused by its
unintuitive meaning.
- Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.
- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.
- Cleanups to do_fault_around() by Lorenzo Stoakes.
- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.
- Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.
- Lorenzo has also coverted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.
- Lorenzo has also contributed further cleanups of vma_merge().
- Chaitanya Prakash provides some fixes to the mmap selftesting code.
- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_page(), a step towards RCUification of pagefaults.
- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.
- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.
- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.
- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.
- Yosry Ahmed has improved the performance of memcg statistics flushing.
- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.
- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.
- David Hildenbrand has fixed up some issues in the selftest code's
testing of our pte state changing.
- Pankaj Raghav has made page_endio() unneeded and has removed it.
- Peter Xu contributed some rationalizations of the userfaultfd
selftests.
- Yosry Ahmed has fixed an issue around memcg's page recalim accounting.
- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.
- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.
- Peter Xu fixes a few issues with uffd-wp at fork() time.
- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZEr3zQAKCRDdBJ7gKXxA
jlLoAP0fpQBipwFxED0Us4SKQfupV6z4caXNJGPeay7Aj11/kQD/aMRC2uPfgr96
eMG3kwn2pqkB9ST2QpkaRbxA//eMbQY=
=J+Dj
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
switching from a user process to a kernel thread.
- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj
Raghav.
- zsmalloc performance improvements from Sergey Senozhatsky.
- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.
- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page()
- make __filemap_get_folio()'s return value more useful
- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap'.
- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.
- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).
- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
permitting userspace to wr-protect anon memory unpopulated ptes.
- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
rather than exclusive, and has fixed a bunch of errors which were
caused by its unintuitive meaning.
- Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.
- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.
- Cleanups to do_fault_around() by Lorenzo Stoakes.
- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.
- Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.
- Lorenzo has also coverted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.
- Lorenzo has also contributed further cleanups of vma_merge().
- Chaitanya Prakash provides some fixes to the mmap selftesting code.
- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_page(), a step towards RCUification of pagefaults.
- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.
- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.
- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.
- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.
- Yosry Ahmed has improved the performance of memcg statistics
flushing.
- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.
- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.
- David Hildenbrand has fixed up some issues in the selftest code's
testing of our pte state changing.
- Pankaj Raghav has made page_endio() unneeded and has removed it.
- Peter Xu contributed some rationalizations of the userfaultfd
selftests.
- Yosry Ahmed has fixed an issue around memcg's page recalim
accounting.
- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.
- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.
- Peter Xu fixes a few issues with uffd-wp at fork() time.
- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis.
* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
shmem: restrict noswap option to initial user namespace
mm/khugepaged: fix conflicting mods to collapse_file()
sparse: remove unnecessary 0 values from rc
mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
maple_tree: fix allocation in mas_sparse_area()
mm: do not increment pgfault stats when page fault handler retries
zsmalloc: allow only one active pool compaction context
selftests/mm: add new selftests for KSM
mm: add new KSM process and sysfs knobs
mm: add new api to enable ksm per process
mm: shrinkers: fix debugfs file permissions
mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
migrate_pages_batch: fix statistics for longterm pin retry
userfaultfd: use helper function range_in_vma()
lib/show_mem.c: use for_each_populated_zone() simplify code
mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
fs/buffer: convert create_page_buffers to folio_create_buffers
fs/buffer: add folio_create_empty_buffers helper
...
The NBD spec was recently changed [1] to refer to the opaque client
identifier as a 'cookie' rather than a 'handle', but has for a much
longer time listed it as a 64-bit value, and declares that all values
in the NBD protocol are sent in network byte order (big-endian).
Because the value is opaque to the server, it doesn't usually matter
what endianness we send as the client - as long as we are consistent
that either we byte-swap on both write and read, or on neither, then
we can match server replies back to our requests. That said, our
internal use of the cookie is as a 64-bit number (well, as two 32-bit
numbers concatenated together), rather than as 8 individual bytes; so
prior to this commit, we ARE leaking the native endianness of our
internals as a client out to the server. We don't know of any server
that will actually inspect the opaque value and behave differently
depending on whether a little-endian or big-endian client is sending
requests, but since we DO log the cookie value, a wireshark capture of
the network traffic is easier to correlate back to the kernel traffic
of a big-endian host (where the u64 and char[8] representations are
the same) than of a little-endian host (where if wireshark honors the
NBD spec and displays a u64 in network byte order, it is byte-swapped
from what the kernel logged).
The fix in this patch is thus two-part: it now consistently uses
network byte order for the opaque value (no difference to a big-endian
machine, but an extra byteswap on a little-endian machine; probably in
the noise compared to the overhead of network traffic in general), and
now uses a 64-bit integer instead of char[8] as its preferred access
to the opaque value (direct assignment instead of memcpy()).
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20230410180611.1051618-4-eblake@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCZEolJQAKCRCAXGG7T9hj
vuVMAP9B3WLzszen3/XCM2E6sZurtmD+YPkUrbES2AsEE1PH3gEA73ZxM1C+gvKS
5be7Dksgeyqyqrwhb9/VOyHU3pmyrAw=
=zirQ
-----END PGP SIGNATURE-----
Merge tag 'for-linus-6.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from Juergen Gross:
- some cleanups in the Xen blkback driver
- fix potential sleeps under lock in various Xen drivers
* tag 'for-linus-6.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/blkback: move blkif_get_x86_*_req() into blkback.c
xen/blkback: simplify free_persistent_gnts() interface
xen/blkback: remove stale prototype
xen/blkback: fix white space code style issues
xen/pvcalls: don't call bind_evtchn_to_irqhandler() under lock
xen/scsiback: don't call scsiback_free_translation_entry() under lock
xen/pciback: don't call pcistub_device_put() under lock
Here is the large set of driver core changes for 6.4-rc1.
Once again, a busy development cycle, with lots of changes happening in
the driver core in the quest to be able to move "struct bus" and "struct
class" into read-only memory, a task now complete with these changes.
This will make the future rust interactions with the driver core more
"provably correct" as well as providing more obvious lifetime rules for
all busses and classes in the kernel.
The changes required for this did touch many individual classes and
busses as many callbacks were changed to take const * parameters
instead. All of these changes have been submitted to the various
subsystem maintainers, giving them plenty of time to review, and most of
them actually did so.
Other than those changes, included in here are a small set of other
things:
- kobject logging improvements
- cacheinfo improvements and updates
- obligatory fw_devlink updates and fixes
- documentation updates
- device property cleanups and const * changes
- firwmare loader dependency fixes.
All of these have been in linux-next for a while with no reported
problems.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZEp7Sw8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ykitQCfamUHpxGcKOAGuLXMotXNakTEsxgAoIquENm5
LEGadNS38k5fs+73UaxV
=7K4B
-----END PGP SIGNATURE-----
Merge tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the large set of driver core changes for 6.4-rc1.
Once again, a busy development cycle, with lots of changes happening
in the driver core in the quest to be able to move "struct bus" and
"struct class" into read-only memory, a task now complete with these
changes.
This will make the future rust interactions with the driver core more
"provably correct" as well as providing more obvious lifetime rules
for all busses and classes in the kernel.
The changes required for this did touch many individual classes and
busses as many callbacks were changed to take const * parameters
instead. All of these changes have been submitted to the various
subsystem maintainers, giving them plenty of time to review, and most
of them actually did so.
Other than those changes, included in here are a small set of other
things:
- kobject logging improvements
- cacheinfo improvements and updates
- obligatory fw_devlink updates and fixes
- documentation updates
- device property cleanups and const * changes
- firwmare loader dependency fixes.
All of these have been in linux-next for a while with no reported
problems"
* tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (120 commits)
device property: make device_property functions take const device *
driver core: update comments in device_rename()
driver core: Don't require dynamic_debug for initcall_debug probe timing
firmware_loader: rework crypto dependencies
firmware_loader: Strip off \n from customized path
zram: fix up permission for the hot_add sysfs file
cacheinfo: Add use_arch[|_cache]_info field/function
arch_topology: Remove early cacheinfo error message if -ENOENT
cacheinfo: Check cache properties are present in DT
cacheinfo: Check sib_leaf in cache_leaves_are_shared()
cacheinfo: Allow early level detection when DT/ACPI info is missing/broken
cacheinfo: Add arm64 early level initializer implementation
cacheinfo: Add arch specific early level initializer
tty: make tty_class a static const structure
driver core: class: remove struct class_interface * from callbacks
driver core: class: mark the struct class in struct class_interface constant
driver core: class: make class_register() take a const *
driver core: class: mark class_release() as taking a const *
driver core: remove incorrect comment for device_create*
MIPS: vpe-cmp: remove module owner pointer from struct class usage.
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmRCvcIQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpk+JEACj01t7Xen2+Razagu3aTx9tmRGFnTNR3MY
raFG6B1TADk1TgCWWa2C4Dj67SOispPLm8hbIcOxqB1UscDWCCwjmnr/debADFzW
Ap6shv/IRwVGmDp+F7ocYas0ynwooOJg4WJTwkSKz2o4m4p3vzlwAKi4fLiSjbXp
gJTrA7WEvDOVjzajlTFUtjr8rc6PdunbGm25cPIufAxUEhvttYex2VbVqjDmfNsE
8tyyk9RWbe4AY/ZYaGXVn4yQ/CgL/sXFkVc5noRXNfAQ/K3CVLQrFLJ3JlwUHpiA
xXBor21TUWCZEo33Y2G5NConAYqE7etoPTkaTDO3/aZ+dAMFyhC/WAYLz1KZGMh1
+g1fDX1QKEd40H2lfDXvqF1ob7Ut8EzUx+gvBXcc3/AiRpJ5rjfOcj6LPUMUqQJk
nucLLFTiMKecnDMBERbvixqbaTyrjvkFEj2wYJvgj1LKXAd+x/bj8SGajs9r88Nb
9YT9ai/+Yl7Ppfb67rCgXJU7oNZQSAQ2H+X/l2jbiqImOgq1u/45AmINnbanS7HH
Y1I8pbH45AcnCgkJRoQwrNX3BnTOTBJ+D/4Fl4b8jsihq0D3UtwCwPCObHP4LW9S
MUNPhP3tUuYsAgXqX80+Sao6SYvXDwnbWOM+LOaaZXgjb1ndwDUZXpto8Ra8WB1u
8kM6s6ZR7g==
=W1Zb
-----END PGP SIGNATURE-----
Merge tag 'for-6.4/block-2023-04-21' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- drbd patches, bringing us closer to unifying the out-of-tree version
and the in tree one (Andreas, Christoph)
- support for auto-quiesce for the s390 dasd driver (Stefan)
- MD pull request via Song:
- md/bitmap: Optimal last page size (Jon Derrick)
- Various raid10 fixes (Yu Kuai, Li Nan)
- md: add error_handlers for raid0 and linear (Mariusz Tkaczyk)
- NVMe pull request via Christoph:
- Drop redundant pci_enable_pcie_error_reporting (Bjorn Helgaas)
- Validate nvmet module parameters (Chaitanya Kulkarni)
- Fence TCP socket on receive error (Chris Leech)
- Fix async event trace event (Keith Busch)
- Minor cleanups (Chaitanya Kulkarni, zhenwei pi)
- Fix and cleanup nvmet Identify handling (Damien Le Moal,
Christoph Hellwig)
- Fix double blk_mq_complete_request race in the timeout handler
(Lei Yin)
- Fix irq locking in nvme-fcloop (Ming Lei)
- Remove queue mapping helper for rdma devices (Sagi Grimberg)
- use structured request attribute checks for nbd (Jakub)
- fix blk-crypto race conditions between keyslot management (Eric)
- add sed-opal support for reading read locking range attributes
(Ondrej)
- make fault injection configurable for null_blk (Akinobu)
- clean up the request insertion API (Christoph)
- clean up the queue running API (Christoph)
- blkg config helper cleanups (Tejun)
- lazy init support for blk-iolatency (Tejun)
- various fixes and tweaks to ublk (Ming)
- remove hybrid polling. It hasn't really been useful since we got
async polled IO support, and these days we don't support sync polled
IO at all (Keith)
- misc fixes, cleanups, improvements (Zhong, Ondrej, Colin, Chengming,
Chaitanya, me)
* tag 'for-6.4/block-2023-04-21' of git://git.kernel.dk/linux: (118 commits)
nbd: fix incomplete validation of ioctl arg
ublk: don't return 0 in case of any failure
sed-opal: geometry feature reporting command
null_blk: Always check queue mode setting from configfs
block: ublk: switch to ioctl command encoding
blk-mq: fix the blk_mq_add_to_requeue_list call in blk_kick_flush
block, bfq: Fix division by zero error on zero wsum
fault-inject: fix build error when FAULT_INJECTION_CONFIGFS=y and CONFIGFS_FS=m
block: store bdev->bd_disk->fops->submit_bio state in bdev
block: re-arrange the struct block_device fields for better layout
md/raid5: remove unused working_disks variable
md/raid10: don't call bio_start_io_acct twice for bio which experienced read error
md/raid10: fix memleak of md thread
md/raid10: fix memleak for 'conf->bio_split'
md/raid10: fix leak of 'r10bio->remaining' for recovery
md/raid10: don't BUG_ON() in raise_barrier()
md: fix soft lockup in status_resync
md: add error_handlers for raid0 and linear
md: Use optimal I/O size for last bitmap page
md: Fix types in sb writer
...
These are various cleanups, fixing a number of uapi header files to no
longer reference CONFIG_* symbols, and one patch that introduces the
new CONFIG_HAS_IOPORT symbol for architectures that provide working
inb()/outb() macros, as a preparation for adding driver dependencies
on those in the following release.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmRG8IkACgkQYKtH/8kJ
Uid15Q/9E/neIIEqEk6IvtyhUicrJiIZUM0rGoYtWXiz75ggk6Kx9+3I+j8zIQ/E
kf2TzAG7q9Md7nfTDFLr4FSr0IcNDj+VG4nYxUyDHdKGcARO+g9Kpdvscxip3lgU
Rw5w74Gyd30u4iUKGS39OYuxcCgl9LaFjMA9Gh402Oiaoh+OYLmgQS9h/goUD5KN
Nd+AoFvkdbnHl0/SpxthLRyL5rFEATBmAY7apYViPyMvfjS3gfDJwXJR9jkKgi6X
Qs4t8Op8BA3h84dCuo6VcFqgAJs2Wiq3nyTSUnkF8NxJ2RFTpeiVgfsLOzXHeDgz
SKDB4Lp14o3mlyZyj00MWq1uMJRRetUgNiVb6iHOoKQ/E4demBdh+mhIFRybjM5B
XNTWFcg9PWFCMa4W9jnLfZBc881X4+7T+qUF8I0W/1AbRJUmyGj8HO6jLceC4yGD
UYLn5oFPM6OWXHp6DqJrCr9Yw8h6fuviQZFEbl/ARlgVGt+J4KbYweJYk8DzfX6t
PZIj8LskOqyIpRuC2oDA1PHxkaJ1/z+N5oRBHq1uicSh4fxY5HW7HnyzgF08+R3k
cf+fjAhC3TfGusHkBwQKQJvpxrxZjPuvYXDZ0GxTvNKJRB8eMeiTm1n41E5oTVwQ
swSblSCjZj/fMVVPXLcjxEW4SBNWRxa9Lz3tIPXb3RheU10Lfy8=
=H3k4
-----END PGP SIGNATURE-----
Merge tag 'asm-generic-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic updates from Arnd Bergmann:
"These are various cleanups, fixing a number of uapi header files to no
longer reference CONFIG_* symbols, and one patch that introduces the
new CONFIG_HAS_IOPORT symbol for architectures that provide working
inb()/outb() macros, as a preparation for adding driver dependencies
on those in the following release"
* tag 'asm-generic-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
Kconfig: introduce HAS_IOPORT option and select it as necessary
scripts: Update the CONFIG_* ignore list in headers_install.sh
pktcdvd: Remove CONFIG_CDROM_PKTCDVD_WCACHE from uapi header
Move bp_type_idx to include/linux/hw_breakpoint.h
Move ep_take_care_of_epollwakeup() to fs/eventpoll.c
Move COMPAT_ATM_ADDPARTY to net/atm/svc.c
QUEUE_FLAG_ADD_RANDOM is not set before we clear it for "null_blk",
"brd", "nbd", "zram", and "bcache" since by default we don't set
"QUEUE_FLAG_ADD_RANDOM" to MQ ops.
Remove dead clear of QUEUE_FLAG_ADD_RANDOM in above listed drivers.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> #zram
Link: https://lore.kernel.org/r/20230424234628.45544-2-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no need to have the functions blkif_get_x86_32_req() and
blkif_get_x86_64_req() in a header file, as they are used in one place
only.
So move them into the using source file and drop the inline qualifier.
While at it fix some style issues, and simplify the code by variable
reusing and using min() instead of open coding it.
Instead of using barrier() use READ_ONCE() for avoiding multiple reads
of nr_segments.
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
The interface of free_persistent_gnts() can be simplified, as there is
only a single caller of free_persistent_gnts() and the 2nd and 3rd
parameters are easily obtainable via the ring pointer, which is passed
as the first parameter anyway.
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
There is no function xen_blkif_purge_persistent(), so remove its
prototype from common.h.
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
o MAINTAINERS files additions and changes.
o Fix hotplug warning in nohz code.
o Tick dependency changes by Zqiang.
o Lazy-RCU shrinker fixes by Zqiang.
o rcu-tasks stall reporting improvements by Neeraj.
o Initial changes for renaming of k[v]free_rcu() to its new k[v]free_rcu_mightsleep()
name for robustness.
o Documentation Updates:
o Significant changes to srcu_struct size.
o Deadlock detection for srcu_read_lock() vs synchronize_srcu() from Boqun.
o rcutorture and rcu-related tool, which are targeted for v6.4 from Boqun's tree.
o Other misc changes.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEcoCIrlGe4gjE06JJqA4nf2o45hAFAmQuBnIACgkQqA4nf2o4
5hACVRAAoXu7/gfh5Pjw9O4E4pCdPJKsZZVYrcrVGrq6NAxRn6M1SgurAdC5grj2
96x0waoGaiO82V0H5iJMcKdAVu67x9R8WaQ1JoxN75Efn8h9W4TguB87TV1gk0xS
eZ18b/CyEaM5mNb80DFFF4FLohy5737p/kNTMqXQdUyR1BsDl16iRMgjiBiFhNUx
yPo8Y2kC2U2OTbldZgaE7s9bQO3xxEcifx93sGWsAex/gx54FYNisiwSlCOSgOE+
XkYo/OKk8Xvr82tLVX8XQVEPCMJ+rxea8T5zSs8/alvsPq7gA8wW3y6fsoa3vUU/
+Gd+W+Q/OsONIDtp8rQAY1qsD0ScDpaR8052RSH0zTa7pj8HsQgE5PjZ+cJW0SEi
cKN+Oe8+ETqKald+xZ6PDf58O212VLrru3RpQWrOQcJ7fmKmfT4REK0RcbLgg4qT
CBgOo6eg+ub4pxq2y11LZJBNTv1/S7xAEzFE0kArew64KB2gyVud0VJRZVAJnEfe
93QQVDFrwK2bhgWQZ6J6IbTvGeQW0L93IibuaU6jhZPR283VtUIIvM7vrOylN7Fq
4jsae0T7YGYfKUhgTpm7rCnm8A/D3Ni8MY0sKYYgDSyKmZUsnpI5wpx1xke4lwwV
ErrY46RCFa+k8wscc6iWfB4cGXyyFHyu+wtyg0KpFn5JAzcfz4A=
=Rgbj
-----END PGP SIGNATURE-----
Merge tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux
Pull RCU updates from Joel Fernandes:
- Updates and additions to MAINTAINERS files, with Boqun being added to
the RCU entry and Zqiang being added as an RCU reviewer.
I have also transitioned from reviewer to maintainer; however, Paul
will be taking over sending RCU pull-requests for the next merge
window.
- Resolution of hotplug warning in nohz code, achieved by fixing
cpu_is_hotpluggable() through interaction with the nohz subsystem.
Tick dependency modifications by Zqiang, focusing on fixing usage of
the TICK_DEP_BIT_RCU_EXP bitmask.
- Avoid needless calls to the rcu-lazy shrinker for CONFIG_RCU_LAZY=n
kernels, fixed by Zqiang.
- Improvements to rcu-tasks stall reporting by Neeraj.
- Initial renaming of k[v]free_rcu() to k[v]free_rcu_mightsleep() for
increased robustness, affecting several components like mac802154,
drbd, vmw_vmci, tracing, and more.
A report by Eric Dumazet showed that the API could be unknowingly
used in an atomic context, so we'd rather make sure they know what
they're asking for by being explicit:
https://lore.kernel.org/all/20221202052847.2623997-1-edumazet@google.com/
- Documentation updates, including corrections to spelling,
clarifications in comments, and improvements to the srcu_size_state
comments.
- Better srcu_struct cache locality for readers, by adjusting the size
of srcu_struct in support of SRCU usage by Christoph Hellwig.
- Teach lockdep to detect deadlocks between srcu_read_lock() vs
synchronize_srcu() contributed by Boqun.
Previously lockdep could not detect such deadlocks, now it can.
- Integration of rcutorture and rcu-related tools, targeted for v6.4
from Boqun's tree, featuring new SRCU deadlock scenarios, test_nmis
module parameter, and more
- Miscellaneous changes, various code cleanups and comment improvements
* tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux: (71 commits)
checkpatch: Error out if deprecated RCU API used
mac802154: Rename kfree_rcu() to kvfree_rcu_mightsleep()
rcuscale: Rename kfree_rcu() to kfree_rcu_mightsleep()
ext4/super: Rename kfree_rcu() to kfree_rcu_mightsleep()
net/mlx5: Rename kfree_rcu() to kfree_rcu_mightsleep()
net/sysctl: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
lib/test_vmalloc.c: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
tracing: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
misc: vmw_vmci: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
drbd: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access
rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed
rcu-tasks: Report stalls during synchronize_srcu() in rcu_tasks_postscan()
rcu: Permit start_poll_synchronize_rcu_expedited() to be invoked early
rcu: Remove never-set needwake assignment from rcu_report_qs_rdp()
rcu: Register rcu-lazy shrinker only for CONFIG_RCU_LAZY=y kernels
rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check
rcu: Fix set/clear TICK_DEP_BIT_RCU_EXP bitmask race
rcu/trace: use strscpy() to instead of strncpy()
tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem
...
We tested and found an alarm caused by nbd_ioctl arg without verification.
The UBSAN warning calltrace like below:
UBSAN: Undefined behaviour in fs/buffer.c:1709:35
signed integer overflow:
-9223372036854775808 - 1 cannot be represented in type 'long long int'
CPU: 3 PID: 2523 Comm: syz-executor.0 Not tainted 4.19.90 #1
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x0/0x3f0 arch/arm64/kernel/time.c:78
show_stack+0x28/0x38 arch/arm64/kernel/traps.c:158
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x170/0x1dc lib/dump_stack.c:118
ubsan_epilogue+0x18/0xb4 lib/ubsan.c:161
handle_overflow+0x188/0x1dc lib/ubsan.c:192
__ubsan_handle_sub_overflow+0x34/0x44 lib/ubsan.c:206
__block_write_full_page+0x94c/0xa20 fs/buffer.c:1709
block_write_full_page+0x1f0/0x280 fs/buffer.c:2934
blkdev_writepage+0x34/0x40 fs/block_dev.c:607
__writepage+0x68/0xe8 mm/page-writeback.c:2305
write_cache_pages+0x44c/0xc70 mm/page-writeback.c:2240
generic_writepages+0xdc/0x148 mm/page-writeback.c:2329
blkdev_writepages+0x2c/0x38 fs/block_dev.c:2114
do_writepages+0xd4/0x250 mm/page-writeback.c:2344
The reason for triggering this warning is __block_write_full_page()
-> i_size_read(inode) - 1 overflow.
inode->i_size is assigned in __nbd_ioctl() -> nbd_set_size() -> bytesize.
We think it is necessary to limit the size of arg to prevent errors.
Moreover, __nbd_ioctl() -> nbd_add_socket(), arg will be cast to int.
Assuming the value of arg is 0x80000000000000001) (on a 64-bit machine),
it will become 1 after the coercion, which will return unexpected results.
Fix it by adding checks to prevent passing in too large numbers.
Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20230206145805.2645671-1-zhongjinghua@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 2d786e66c9 ("block: ublk: switch to ioctl command encoding")
starts to reset local variable of 'ret' as zero, then if any failure
happens when handling the three IO commands, 0 can be returned to ublk
server.
Fix it by returning -EINVAL in case of command handling failure.
Cc: Christoph Hellwig <hch@lst.de>
Fixes: 2d786e66c9 ("block: ublk: switch to ioctl command encoding")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230420091104.1092972-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All ublk commands(control, IO) should have taken ioctl command encoding
from the beginning, because ioctl command encoding defines each code
uniquely, so driver can figure out wrong command sent from userspace
easily; 2) it might help security subsystem for audit uring cmd[1].
Unfortunately we didn't do that way, and it could be one lesson for
ublk driver.
So switch to ioctl command encoding now, we still support commands encoded
in old way, but they become legacy definition. Any new command should take
ioctl encoding.
See ublksrv code for switching to ioctl command encoding in [2].
[1] https://lore.kernel.org/io-uring/CAHC9VhSVzujW9LOj5Km80AjU0EfAuukoLrxO6BEfnXeK_s6bAg@mail.gmail.com/
[2] https://github.com/ming1/ubdsrv/commits/ioctl_cmd_encoding
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ken Kurematsu <k.kurematsu@nskint.co.jp>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230418131810.855959-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Propagate read errors to the caller instead of dropping them on the floor,
and stop returning the somewhat dangerous 1 on success from
read_from_bdev*.
Link: https://lkml.kernel.org/r/20230411171459.567614-18-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently nothing waits for the synchronous reads before accessing the
data. Switch them to an on-stack bio and submit_bio_wait to make sure the
I/O has actually completed when the work item has been flushed. This also
removes the call to page_endio that would unlock a page that has never
been locked.
Drop the partial_io/sync flag, as chaining only makes sense for the
asynchronous reads of the entire page.
Link: https://lkml.kernel.org/r/20230411171459.567614-17-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
bio_alloc will never return a NULL bio when it is allowed to sleep, and
adding a single page to bio with a single vector also can't fail, so
switch to the asserting __bio_add_page variant and drop the error returns.
Link: https://lkml.kernel.org/r/20230411171459.567614-16-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
read_from_bdev always reads a whole page, so pass a page to it instead of
the bvec and remove the now pointless zram_bvec_read_from_bdev wrapper.
Link: https://lkml.kernel.org/r/20230411171459.567614-15-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Split the read/modify/write case into a separate helper.
Link: https://lkml.kernel.org/r/20230411171459.567614-14-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
__zram_bvec_write only extracts the page from __zram_bvec_write and always
expects a full page of input. Pass the page directly instead of the bvec
and rename the function to zram_write_page.
Link: https://lkml.kernel.org/r/20230411171459.567614-13-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Split the partial read into a separate helper.
Link: https://lkml.kernel.org/r/20230411171459.567614-12-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
writeback_store always reads a full page, so just call zram_read_page
directly and bypass the boune buffer handling.
Link: https://lkml.kernel.org/r/20230411171459.567614-11-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
__zram_bvec_read doesn't get passed a bvec, but always read a whole page.
Rename it to make the usage more clear.
Link: https://lkml.kernel.org/r/20230411171459.567614-10-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
There is no point in allocation a highmem page when we instantly need to
copy from it.
Link: https://lkml.kernel.org/r/20230411171459.567614-9-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Instead of having an outer loop in __zram_make_request and then branch out
for reads vs writes for each loop iteration in zram_bvec_rw, split the
main handler into separat zram_bio_read and zram_bio_write handlers that
also include the functionality formerly in zram_bvec_rw.
Link: https://lkml.kernel.org/r/20230411171459.567614-8-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When the low-level access fails, don't clear the idle flag or clear the
caches, and just return.
Link: https://lkml.kernel.org/r/20230411171459.567614-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Switch on the bio operation in zram_submit_bio and only call into
__zram_make_request for read and write operations.
Link: https://lkml.kernel.org/r/20230411171459.567614-6-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
bio_for_each_segment synthetize bvecs that never cross page boundaries, so
don't duplicate that work in an inner loop.
Link: https://lkml.kernel.org/r/20230411171459.567614-5-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Derive the index and offset variables inside the function, and complete
the bio directly in preparation for cleaning up the I/O path.
Link: https://lkml.kernel.org/r/20230411171459.567614-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
All bios hande to drivers from the block layer are checked against the
device size and for logical block alignment already (and have been since
long before zram was merged), so don't duplicate those checks.
Link: https://lkml.kernel.org/r/20230411171459.567614-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "zram I/O path cleanups and fixups", v3.
This series cleans up the zram I/O path, and fixes the handling of
synchronous I/O to the underlying device in the writeback_store function
or for > 4K PAGE_SIZE systems.
The fixes are at the end, as I could not fully reason about them being
safe before untangling the callchain.
This patch (of 17):
read_from_bdev_sync is currently only compiled for non-4k PAGE_SIZE, which
means it won't be built with the most common configurations.
Replace the ifdef with a check for the PAGE_SIZE in an if instead. The
check uses an extra symbol and IS_ENABLED to allow the compiler to
eliminate the dead code, leading to the same generated code size:
text data bss dec hex filename
16709 1428 12 18149 46e5 drivers/block/zram/zram_drv.o.old
16709 1428 12 18149 46e5 drivers/block/zram/zram_drv.o.new
Link: https://lkml.kernel.org/r/20230411171459.567614-1-hch@lst.de
Link: https://lkml.kernel.org/r/20230411171459.567614-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit 75a2d4226b ("driver core: class: mark the struct class for
sysfs callbacks as constant") changed the attribute to use
CLASS_ATTR_RO() which changed the permission from 0400 to 0444. But
this atribute is "special" in that reading it modifies the system state,
so it MUST be set to 0400 so that only root processes can muck around
with it.
Fix this all up, AND document this so that I don't change it again in
3-4 years when I stumble across it and wonder why it's an open-coded
_ATTR() macro.
Reported-by: Denis Efremov <efremov@linux.com>
Fixes: 75a2d4226b ("driver core: class: mark the struct class for sysfs callbacks as constant")
Link: https://lore.kernel.org/r/2023041810-angelic-conical-52d8@gregkh
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The null_blk driver has multiple driver-specific fault injection
mechanisms. Each fault injection configuration can only be specified by a
module parameter and cannot be reconfigured without reloading the driver.
Also, each configuration is common to all devices and is initialized every
time a new device is added.
This change adds the following subdirectories for each null_blk device.
/sys/kernel/config/nullb/<disk>/timeout_inject
/sys/kernel/config/nullb/<disk>/requeue_inject
/sys/kernel/config/nullb/<disk>/init_hctx_fault_inject
Each fault injection attribute can be dynamically set per device by a
corresponding file in these directories.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Link: https://lore.kernel.org/r/20230327143733.14599-3-akinobu.mita@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some last minute fixes - most of them for regressions.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmQz3aoPHG1zdEByZWRo
YXQuY29tAAoJECgfDbjSjVRp/MQIAJ0uzzyTAg7Bx/73M7JZckxgScyVOI/192mw
ylT/1EZrlZi8JOC1uf9gC+fHlg4jgS/0Yn6HSJlBH0CAJPRNpGckzA1aEr/gc9jS
2wWxCrUeO2hvlMN7HToesmZxu79xKDIoYIFnwVV/IDLTi9Y8Fh+hUIkRR16IBiXb
dgrI2qkYvyqGj5hJDnM9MaMVbTARYXb2q2SgiqGjh/TUNrkqDI0e0m1tj9eXNR90
uIz1lfbIH6JKiq2Zq/CK+h87dMOVBNoH+LZvMRKPx0beEfxC4P/1Re1Rag12kG1s
TE7n556BOsqBQ333+rIgtNSq2dDAepG+CASZsR6Dw4F+41G43Qo=
=WjYr
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio fixes from Michael Tsirkin:
"Some last minute fixes - most of them for regressions"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vdpa_sim_net: complete the initialization before register the device
vdpa/mlx5: Add and remove debugfs in setup/teardown driver
tools/virtio: fix typo in README instructions
vhost-scsi: Fix crash during LUN unmapping
vhost-scsi: Fix vhost_scsi struct use after free
virtio-blk: fix ZBD probe in kernels without ZBD support
virtio-blk: fix to match virtio spec
block size is one very key setting for block layer, and bad block size
could panic kernel easily.
Make sure that block size is set correctly.
Meantime if ublk_validate_params() fails, clear ub->params so that disk
is prevented from being added.
Fixes: 71f28f3136 ("ublk_drv: add io_uring based userspace block driver")
Reported-and-tested-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since SQE memory is shared with userspace, we should only be reading it
once. We cannot read it multiple times, particularly when it's read once
for validation and then read again for the actual use.
ublk_ch_uring_cmd() is safe when called as a retry operation, as the
memory backing is stable at that point. But for normal issue, we want
to ensure that we only read ublksrv_io_cmd once. Wrap the function in
a helper that reads the value into an on-stack copy of the struct.
Cc: stable@vger.kernel.org # 6.0+
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
MAX_ORDER is not inclusive: the maximum allocation order buddy allocator
can deliver is MAX_ORDER-1.
Fix MAX_ORDER usage in floppy code.
Also allocation buffer exactly PAGE_SIZE << MAX_ORDER bytes is okay. Fix
MAX_LEN check.
Link: https://lkml.kernel.org/r/20230315113133.11326-4-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Denis Efremov <efremov@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The kvfree_rcu() macro's single-argument form is deprecated. Therefore
switch to the new kvfree_rcu_mightsleep() variant. The goal is to
avoid accidental use of the single-argument forms, which can introduce
functionality bugs in atomic contexts and latency bugs in non-atomic
contexts.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
Begrudgingly-acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
When the kernel is built without support for zoned block devices,
virtio-blk probe needs to error out any host-managed device scans
to prevent such devices from appearing in the system as non-zoned.
The current virtio-blk code simply bypasses all ZBD checks if
CONFIG_BLK_DEV_ZONED is not defined and this leads to host-managed
block devices being presented as non-zoned in the OS. This is one of
the main problems this patch series is aimed to fix.
In this patch, make VIRTIO_BLK_F_ZONED feature defined even when
CONFIG_BLK_DEV_ZONED is not. This change makes the code compliant with
the voted revision of virtio-blk ZBD spec. Modify the probe code to
look at the situation when VIRTIO_BLK_F_ZONED is negotiated in a kernel
that is built without ZBD support. In this case, the code checks
the zoned model of the device and fails the probe is the device
is host-managed.
The patch also adds the comment to clarify that the call to perform
the zoned device probe is correctly placed after virtio_device ready().
Fixes: 95bfec41bd ("virtio-blk: add support for zoned block devices")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Message-Id: <20230330214953.1088216-3-dmitry.fomichev@wdc.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The merged patch series to support zoned block devices in virtio-blk
is not the most up to date version. The merged patch can be found at
https://lore.kernel.org/linux-block/20221016034127.330942-3-dmitry.fomichev@wdc.com/
but the latest and reviewed version is
https://lore.kernel.org/linux-block/20221110053952.3378990-3-dmitry.fomichev@wdc.com/
The reason is apparently that the correct mailing lists and
maintainers were not copied.
The differences between the two are mostly cleanups, but there is one
change that is very important in terms of compatibility with the
approved virtio-zbd specification.
Before it was approved, the OASIS virtio spec had a change in
VIRTIO_BLK_T_ZONE_APPEND request layout that is not reflected in the
current virtio-blk driver code. In the running code, the status is
the first byte of the in-header that is followed by some pad bytes
and the u64 that carries the sector at which the data has been written
to the zone back to the driver, aka the append sector.
This layout turned out to be problematic for implementing in QEMU and
the request status byte has been eventually made the last byte of the
in-header. The current code doesn't expect that and this causes the
append sector value always come as zero to the block layer. This needs
to be fixed ASAP.
Fixes: 95bfec41bd ("virtio-blk: add support for zoned block devices")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Message-Id: <20230330214953.1088216-2-dmitry.fomichev@wdc.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We need the fixes in here for testing, as well as the driver core
changes for documentation updates to build on.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
'struct ublk_map_data' is passed to ublk_copy_user_pages()
for copying data between userspace buffer and request pages.
Here what matters is userspace buffer address/len and 'struct request',
so replace ->io field with user buffer address, and rename max_bytes
as len.
Meantime remove 'ubq' field from ublk_map_data, since it isn't used
any more.
Then code becomes more readable.
Reviewed-by: Ziyang Zhang <ZiyangZhang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert the following pattern in several helpers
if (Z)
return true
return false
into:
return Z;
Reviewed-by: Ziyang Zhang <ZiyangZhang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add two helpers for checking if map/unmap is needed, since we may have
passthrough request which needs map or unmap in future, such as for
supporting report zones.
Meantime don't mark ublk_copy_user_pages as inline since this function
is a bit fat now.
Reviewed-by: Ziyang Zhang <ZiyangZhang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There isn't data in request of REQ_OP_FLUSH always, so don't consider
it in both ublk_map_io() and ublk_unmap_io().
Reviewed-by: Ziyang Zhang <ZiyangZhang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Simplify exit handling a bit, and prepare for supporting fused command.
Reviewed-by: Ziyang Zhang <ZiyangZhang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation to support multiple connections, we need to know which
one we need to modify the request state for.
Originally-from: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20230330102744.2128122-2-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass a peer device parameter through the bitmap I/O functions to the I/O
handlers. In after_state_ch(), set that parameter when queuing the
drbd_send_bitmap operation so that this operation knows where to send the
bitmap.
Signed-off-by: Andreas Gruenbacher <agruen@kernel.org>
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20230330102744.2128122-2-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace the deprecated API kmap_atomic() and kunmap_atomic() with
kmap_local_page() and kunmap_local() in null_flush_cache_page().
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20230330184926.64209-2-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use library helper memcpy_page() to copy source page into destination
instead of having duplicate code in copy_to_nullb() & copy_from_nullb().
In copy_from_nullb() also replace the memset call with zero_user().
Use library helper memset_page() to set the buffer to 0xFF instead of
having duplicate code.
This also removes deprecated API kmap_atomic() from copy_to_nullb()
copy_from_nullb() and nullb_fill_pattern(),
from :include/linux/highmem.h:
"kmap_atomic - Atomically map a page for temporary usage - Deprecated!"
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20230330184926.64209-2-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
struct class should never be modified in a sysfs callback as there is
nothing in the structure to modify, and frankly, the structure is almost
never used in a sysfs callback, so mark it as constant to allow struct
class to be moved to read-only memory.
While we are touching all class sysfs callbacks also mark the attribute
as constant as it can not be modified. The bonding code still uses this
structure so it can not be removed from the function callbacks.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Bartosz Golaszewski <brgl@bgdev.pl>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Russ Weight <russell.h.weight@intel.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steve French <sfrench@samba.org>
Cc: Vignesh Raghavendra <vigneshr@ti.com>
Cc: linux-cifs@vger.kernel.org
Cc: linux-gpio@vger.kernel.org
Cc: linux-mtd@lists.infradead.org
Cc: linux-rdma@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: netdev@vger.kernel.org
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20230325084537.3622280-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
LOOP_CONFIGURE is, as far as I understand it, supposed to be a way to
combine LOOP_SET_FD and LOOP_SET_STATUS64 into a single syscall. When
using LOOP_SET_FD+LOOP_SET_STATUS64, a single uevent would be sent for
each partition found on the loop device after the second ioctl(), but
when using LOOP_CONFIGURE, no such uevent was being sent.
In the old setup, uevents are disabled for LOOP_SET_FD, but not for
LOOP_SET_STATUS64. This makes sense, as it prevents uevents being
sent for a partially configured device during LOOP_SET_FD - they're
only sent at the end of LOOP_SET_STATUS64. But for LOOP_CONFIGURE,
uevents were disabled for the entire operation, so that final
notification was never issued. To fix this, reduce the critical
section to exclude the loop_reread_partitions() call, which causes
the uevents to be issued, to after uevents are re-enabled, matching
the behaviour of the LOOP_SET_FD+LOOP_SET_STATUS64 combination.
I noticed this because Busybox's losetup program recently changed from
using LOOP_SET_FD+LOOP_SET_STATUS64 to LOOP_CONFIGURE, and this broke
my setup, for which I want a notification from the kernel any time a
new partition becomes available.
Signed-off-by: Alyssa Ross <hi@alyssa.is>
[hch: reduced the critical section]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Fixes: 3448914e8c ("loop: Add LOOP_CONFIGURE ioctl")
Link: https://lore.kernel.org/r/20230320125430.55367-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
struct bus_type should never be modified in a sysfs callback as there is
nothing in the structure to modify, and frankly, the structure is almost
never used in a sysfs callback, so mark it as constant to allow struct
bus_type to be moved to read-only memory.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "James E.J. Bottomley" <jejb@linux.ibm.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Bounine <alex.bou9@gmail.com>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: Ben Widawsky <bwidawsk@kernel.org>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Harald Freudenberger <freude@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Hu Haowen <src.res@email.cn>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Laurentiu Tudor <laurentiu.tudor@nxp.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Stuart Yoder <stuyoder@gmail.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Yanteng Si <siyanteng@loongson.cn>
Acked-by: Ilya Dryomov <idryomov@gmail.com> # rbd
Acked-by: Ira Weiny <ira.weiny@intel.com> # cxl
Reviewed-by: Alex Shi <alexs@kernel.org>
Acked-by: Iwona Winiarska <iwona.winiarska@intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com> # pci
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com> # scsi
Link: https://lore.kernel.org/r/20230313182918.1312597-23-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
io_uring_cmd_done() currently assumes that the uring_lock is held
when invoked, and while it generally is, this is not guaranteed.
Pass in the issue_flags associated with it, so that we have
IO_URING_F_UNLOCKED available to be able to lock the CQ ring
appropriately when completing events.
Cc: stable@vger.kernel.org
Fixes: ee692a21e9 ("fs,io_uring: add infrastructure for uring-cmd")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
IO can be started before add_disk() returns, such as reading parititon table,
then the monitor work should work for making forward progress.
So mark device as LIVE before adding disk, meantime change to
DEAD if add_disk() fails.
Fixed: 71f28f3136 ("ublk_drv: add io_uring based userspace block driver")
Reviewed-by: Ziyang Zhang <ZiyangZhang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230318141231.55562-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The module pointer in class_create() never actually did anything, and it
shouldn't have been requred to be set as a parameter even if it did
something. So just remove it and fix up all callers of the function in
the kernel tree at the same time.
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Acked-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Link: https://lore.kernel.org/r/20230313181843.1207845-4-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There is no need to manually set the owner of a struct class, as the
registering function does it automatically, so remove all of the
explicit settings from various drivers that did so as it is unneeded.
This allows us to remove this pointer entirely from this structure going
forward.
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Link: https://lore.kernel.org/r/20230313181843.1207845-2-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In vdc_port_probe(), we should check the return value of mdesc_grab() as
it may return NULL, which can cause potential NPD bug.
Fixes: 43fdf27470 ("[SPARC64]: Abstract out mdesc accesses for better MD update handling.")
Signed-off-by: Liang He <windhl@126.com>
Link: https://lore.kernel.org/r/20230315062032.1741692-1-windhl@126.com
[axboe: style cleanup]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use a local struct request pointer variable to avoid having to
dereference struct blk_mq_queue_data multiple times. While at it, also
fix the function argument indentation and remove a useless "else" after
a return.
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20230314041106.19173-2-damien.lemoal@opensource.wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When injecting a fake timeout into the null_blk driver using
fail_io_timeout, the request timeout handler does not execute
blk_mq_complete_request(), so the complete callback is never executed
for a timedout request.
The null_blk driver also has a driver-specific fake timeout mechanism
which does not have this problem. Fix the problem with fail_io_timeout
by using the same meachanism as null_blk internal timeout feature, using
the fake_timeout field of null_blk commands.
Reported-by: Akinobu Mita <akinobu.mita@gmail.com>
Fixes: de3510e52b ("null_blk: fix command timeout completion handling")
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230314041106.19173-2-damien.lemoal@opensource.wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
do_req_filebacked() calls blk_mq_complete_request() synchronously or
asynchronously when using asynchronous I/O unless memory allocation fails.
Hence, modify loop_handle_cmd() such that it does not dereference 'cmd' nor
'rq' after do_req_filebacked() finished unless we are sure that the request
has not yet been completed. This patch fixes the following kernel crash:
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000054
Call trace:
css_put.42938+0x1c/0x1ac
loop_process_work+0xc8c/0xfd4
loop_rootcg_workfn+0x24/0x34
process_one_work+0x244/0x558
worker_thread+0x400/0x8fc
kthread+0x16c/0x1e0
ret_from_fork+0x10/0x20
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Fixes: c74d40e8b5 ("loop: charge i/o to mem and blk cg")
Fixes: bc07c10a36 ("block: loop: support DIO & AIO")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230314182155.80625-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the macro for checking presence of required attributes.
It has the advantage of reporting to the user which attr
was missing in a machine-readable format (extack).
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20230224021301.1630703-2-kuba@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
NBD doesn't have much to do with networking, allow users outside
init_net to access the family.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20230224021301.1630703-1-kuba@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
CONFIG_* switches should not be exposed in uapi headers, thus let's get
rid of the USE_WCACHING macro here (which was also named way to generic)
and integrate the logic directly in the only function that needs it.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmQB57MQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgputpEADVrc1OFzHOivJq+LJ3HS3ufhLBthtgu1Lp
sEHvDNp9tBGXMLkomuCYpAju5TBAEKC+AJTZyj9iS1j++ItoezdoP55YRIH7t2Or
UTy8ex3rLPGkQk6k3o8roWCyajTW/ZS+4fmk+NkVYMLsQBp9I+kFbxgJa5bbREdU
Z8b/9hcBGz58R8Kq+TEMp/bO7oCV4c8xWumrKER+MktDDx0kc5d+afWXoy7bEKFg
jLB3gleTM9HUpa9a2GPc4fxqdb0KanQdMtiyn/oplg0JcZLMiHfRbiRnsgQkjN0O
RVtUcdxXmOkQeFra4GXPiHmQBcIfE85wP4wxb8p/F2StYRhb1epzzeCXOhuNZvv4
dd6OSARgtzWt3OlHka4aC63H4kzs9SxJp0F2uwuPLV0fM91TP1oOTWV+53FrQr9Z
OQYyB8d9Il4K72NFLwU4ukJ1fPoCRHjpgAXIIkasEjaBftpJlMNnfblncTZTBumy
XumFVdKfvqc3OFt8LLKWqLDV0j3TknVeCMPKhsbRwQ0NG4vlNOSWaLkGJCDLJ7ga
ebf8AD5eaLCT9qyYquBuW5VBKZH5Z4rf5yHta9Dx+Omu0JTQYtTkiiM3UTdpDbtq
SObZ31UvLoYK2dOZcVgjhE2RgM/AV5jJcx7aHhT3UptavAehHbePgiNhuEEntlKv
L87kXJkSSQ==
=ezrg
-----END PGP SIGNATURE-----
Merge tag 'block-6.3-2023-03-03' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- NVMe pull request via Christoph:
- Don't access released socket during error recovery (Akinobu
Mita)
- Bring back auto-removal of deleted namespaces during sequential
scan (Christoph Hellwig)
- Fix an error code in nvme_auth_process_dhchap_challenge (Dan
Carpenter)
- Show well known discovery name (Daniel Wagner)
- Add a missing endianess conversion in effects masking (Keith
Busch)
- Fix for a regression introduced in blk-rq-qos during init in this
merge window (Breno)
- Reorder a few fields in struct blk_mq_tag_set, eliminating a few
holes and shrinking it (Christophe)
- Remove redundant bdev_get_queue() NULL checks (Juhyung)
- Add sed-opal single user mode support flag (Luca)
- Remove SQE128 check in ublk as it isn't needed, saving some memory
(Ming)
- Op specific segment checking for cloned requests (Uday)
- Exclusive open partition scan fixes (Yu)
- Loop offset/size checking before assigning them in the device (Zhong)
- Bio polling fixes (me)
* tag 'block-6.3-2023-03-03' of git://git.kernel.dk/linux:
blk-mq: enforce op-specific segment limits in blk_insert_cloned_request
nvme-fabrics: show well known discovery name
nvme-tcp: don't access released socket during error recovery
nvme-auth: fix an error code in nvme_auth_process_dhchap_challenge()
nvme: bring back auto-removal of deleted namespaces during sequential scan
blk-iocost: Pass gendisk to ioc_refresh_params
nvme: fix sparse warning on effects masking
block: be a bit more careful in checking for NULL bdev while polling
block: clear bio->bi_bdev when putting a bio back in the cache
loop: loop_set_status_from_info() check before assignment
ublk: remove check IO_URING_F_SQE128 in ublk_ch_uring_cmd
block: remove more NULL checks after bdev_get_queue()
blk-mq: Reorder fields in 'struct blk_mq_tag_set'
block: fix scan partition for exclusively open device again
block: Revert "block: Do not reread partition table on exclusively open device"
sed-opal: add support flag for SUM in status ioctl
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmQA0uETHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi92pB/4yZ7Go/7j2zb84N9nEYPCHV23v1vED
YGZIiWHYv6X3dJyTYpcU7Mn9TF00naTGDKi9NpTZjKOUIkibXPFJfbG7Dh4T2HhN
TKw9EbldCaXE1mR7o+g/mrVQFM1PIR1VbtIeszL3eD2qO0aXEGyBMvPfUNqFX/M7
lNWVjuglIaYUL235Uid/wt0zfmPDvtGD24fjpN0e22UQh/aBFnodIDpa/AapsFKp
yifzqe/ADbvgnHwOhMiEMG1gRFd3vywVfPDQmQ41oSMnf7yTtLWE9t47wTfyoTY5
IwZY2K1H51QJej/mObYJmClp/y81xSLXEydFdQ571MqZbDeDfQeM23/7
=cWWl
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-6.3-rc1' of https://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"Two small fixes from Xiubo and myself, marked for stable"
* tag 'ceph-for-6.3-rc1' of https://github.com/ceph/ceph-client:
rbd: avoid use-after-free in do_rbd_add() when rbd_dev_create() fails
ceph: update the time stamps and try to drop the suid/sgid
If getting an ID or setting up a work queue in rbd_dev_create() fails,
use-after-free on rbd_dev->rbd_client, rbd_dev->spec and rbd_dev->opts
is triggered in do_rbd_add(). The root cause is that the ownership of
these structures is transfered to rbd_dev prematurely and they all end
up getting freed when rbd_dev_create() calls rbd_dev_free() prior to
returning to do_rbd_add().
Found by Linux Verification Center (linuxtesting.org) with SVACE, an
incomplete patch submitted by Natalia Petrova <n.petrova@fintech.ru>.
Cc: stable@vger.kernel.org
Fixes: 1643dfa4c2 ("rbd: introduce a per-device ordered workqueue")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
device feature provisioning in ifcvf, mlx5
new SolidNET driver
support for zoned block device in virtio blk
numa support in virtio pmem
VIRTIO_F_RING_RESET support in vhost-net
more debugfs entries in mlx5
resume support in vdpa
completion batching in virtio blk
cleanup of dma api use in vdpa
now simulating more features in vdpa-sim
documentation, features, fixes all over the place
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmP0D98PHG1zdEByZWRo
YXQuY29tAAoJECgfDbjSjVRpV6IH/iecRgLMWWjp3n31IFdu31f/J4HpF7dczVjK
qtV98eJ1N2pkgeJkdCfmB5XszfvFBeAurrS7++FTHiJhrRfR3Z+2ml/Qtvh5DEyP
qxz6wOw6VVsi/txdUxM1wsxLeEmmzkmFdAmPM+FXeIjhWj76GOgy/4A3eaj6TgzV
W8ShsBve/UZ5qMOC3XbIscvdOrudHJ18tH90Tiz3NZfH1fAs5E4uWbU6Mrz9DJVr
canGvf4kAI9z8qram5HSgzPIXRJEYiF4q/eiStdtiiME8gL1mHLRZDNP1I1LeCAb
q6Q6RCRKi3Ek+LGdH6u+nR1Swu03N2b/g+vgKtv30kJo06oZVzw=
=EasV
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio updates from Michael Tsirkin:
- device feature provisioning in ifcvf, mlx5
- new SolidNET driver
- support for zoned block device in virtio blk
- numa support in virtio pmem
- VIRTIO_F_RING_RESET support in vhost-net
- more debugfs entries in mlx5
- resume support in vdpa
- completion batching in virtio blk
- cleanup of dma api use in vdpa
- now simulating more features in vdpa-sim
- documentation, features, fixes all over the place
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (64 commits)
vdpa/mlx5: support device features provisioning
vdpa/mlx5: make MTU/STATUS presence conditional on feature bits
vdpa: validate device feature provisioning against supported class
vdpa: validate provisioned device features against specified attribute
vdpa: conditionally read STATUS in config space
vdpa: fix improper error message when adding vdpa dev
vdpa/mlx5: Initialize CVQ iotlb spinlock
vdpa/mlx5: Don't clear mr struct on destroy MR
vdpa/mlx5: Directly assign memory key
tools/virtio: enable to build with retpoline
vringh: fix a typo in comments for vringh_kiov
vhost-vdpa: print warning when vhost_vdpa_alloc_domain fails
scsi: virtio_scsi: fix handling of kmalloc failure
vdpa: Fix a couple of spelling mistakes in some messages
vhost-net: support VIRTIO_F_RING_RESET
vhost-scsi: convert sysfs snprintf and sprintf to sysfs_emit
vdpa: mlx5: support per virtqueue dma device
vdpa: set dma mask for vDPA device
virtio-vdpa: support per vq dma device
vdpa: introduce get_vq_dma_device()
...
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
memfd creation time, with the option of sealing the state of the X bit.
- Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
thread-safe for pmd unshare") which addresses a rare race condition
related to PMD unsharing.
- Several folioification patch serieses from Matthew Wilcox, Vishal
Moola, Sidhartha Kumar and Lorenzo Stoakes
- Johannes Weiner has a series ("mm: push down lock_page_memcg()") which
does perform some memcg maintenance and cleanup work.
- SeongJae Park has added DAMOS filtering to DAMON, with the series
"mm/damon/core: implement damos filter". These filters provide users
with finer-grained control over DAMOS's actions. SeongJae has also done
some DAMON cleanup work.
- Kairui Song adds a series ("Clean up and fixes for swap").
- Vernon Yang contributed the series "Clean up and refinement for maple
tree".
- Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
adds to MGLRU an LRU of memcgs, to improve the scalability of global
reclaim.
- David Hildenbrand has added some userfaultfd cleanup work in the
series "mm: uffd-wp + change_protection() cleanups".
- Christoph Hellwig has removed the generic_writepages() library
function in the series "remove generic_writepages".
- Baolin Wang has performed some maintenance on the compaction code in
his series "Some small improvements for compaction".
- Sidhartha Kumar is doing some maintenance work on struct page in his
series "Get rid of tail page fields".
- David Hildenbrand contributed some cleanup, bugfixing and
generalization of pte management and of pte debugging in his series "mm:
support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap
PTEs".
- Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
flag in the series "Discard __GFP_ATOMIC".
- Sergey Senozhatsky has improved zsmalloc's memory utilization with his
series "zsmalloc: make zspage chain size configurable".
- Joey Gouly has added prctl() support for prohibiting the creation of
writeable+executable mappings. The previous BPF-based approach had
shortcomings. See "mm: In-kernel support for memory-deny-write-execute
(MDWE)".
- Waiman Long did some kmemleak cleanup and bugfixing in the series
"mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
- T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
"mm: multi-gen LRU: improve".
- Jiaqi Yan has provided some enhancements to our memory error
statistics reporting, mainly by presenting the statistics on a per-node
basis. See the series "Introduce per NUMA node memory error
statistics".
- Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
regression in compaction via his series "Fix excessive CPU usage during
compaction".
- Christoph Hellwig does some vmalloc maintenance work in the series
"cleanup vfree and vunmap".
- Christoph Hellwig has removed block_device_operations.rw_page() in ths
series "remove ->rw_page".
- We get some maple_tree improvements and cleanups in Liam Howlett's
series "VMA tree type safety and remove __vma_adjust()".
- Suren Baghdasaryan has done some work on the maintainability of our
vm_flags handling in the series "introduce vm_flags modifier functions".
- Some pagemap cleanup and generalization work in Mike Rapoport's series
"mm, arch: add generic implementation of pfn_valid() for FLATMEM" and
"fixups for generic implementation of pfn_valid()"
- Baoquan He has done some work to make /proc/vmallocinfo and
/proc/kcore better represent the real state of things in his series
"mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
- Jason Gunthorpe rationalized the GUP system's interface to the rest of
the kernel in the series "Simplify the external interface for GUP".
- SeongJae Park wishes to migrate people from DAMON's debugfs interface
over to its sysfs interface. To support this, we'll temporarily be
printing warnings when people use the debugfs interface. See the series
"mm/damon: deprecate DAMON debugfs interface".
- Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
and clean-ups" series.
- Huang Ying has provided a dramatic reduction in migration's TLB flush
IPI rates with the series "migrate_pages(): batch TLB flushing".
- Arnd Bergmann has some objtool fixups in "objtool warning fixes".
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY/PoPQAKCRDdBJ7gKXxA
jlvpAPsFECUBBl20qSue2zCYWnHC7Yk4q9ytTkPB/MMDrFEN9wD/SNKEm2UoK6/K
DmxHkn0LAitGgJRS/W9w81yrgig9tAQ=
=MlGs
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Daniel Verkamp has contributed a memfd series ("mm/memfd: add
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
memfd creation time, with the option of sealing the state of the X
bit.
- Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
thread-safe for pmd unshare") which addresses a rare race condition
related to PMD unsharing.
- Several folioification patch serieses from Matthew Wilcox, Vishal
Moola, Sidhartha Kumar and Lorenzo Stoakes
- Johannes Weiner has a series ("mm: push down lock_page_memcg()")
which does perform some memcg maintenance and cleanup work.
- SeongJae Park has added DAMOS filtering to DAMON, with the series
"mm/damon/core: implement damos filter".
These filters provide users with finer-grained control over DAMOS's
actions. SeongJae has also done some DAMON cleanup work.
- Kairui Song adds a series ("Clean up and fixes for swap").
- Vernon Yang contributed the series "Clean up and refinement for maple
tree".
- Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
adds to MGLRU an LRU of memcgs, to improve the scalability of global
reclaim.
- David Hildenbrand has added some userfaultfd cleanup work in the
series "mm: uffd-wp + change_protection() cleanups".
- Christoph Hellwig has removed the generic_writepages() library
function in the series "remove generic_writepages".
- Baolin Wang has performed some maintenance on the compaction code in
his series "Some small improvements for compaction".
- Sidhartha Kumar is doing some maintenance work on struct page in his
series "Get rid of tail page fields".
- David Hildenbrand contributed some cleanup, bugfixing and
generalization of pte management and of pte debugging in his series
"mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with
swap PTEs".
- Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
flag in the series "Discard __GFP_ATOMIC".
- Sergey Senozhatsky has improved zsmalloc's memory utilization with
his series "zsmalloc: make zspage chain size configurable".
- Joey Gouly has added prctl() support for prohibiting the creation of
writeable+executable mappings.
The previous BPF-based approach had shortcomings. See "mm: In-kernel
support for memory-deny-write-execute (MDWE)".
- Waiman Long did some kmemleak cleanup and bugfixing in the series
"mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
- T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
"mm: multi-gen LRU: improve".
- Jiaqi Yan has provided some enhancements to our memory error
statistics reporting, mainly by presenting the statistics on a
per-node basis. See the series "Introduce per NUMA node memory error
statistics".
- Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
regression in compaction via his series "Fix excessive CPU usage
during compaction".
- Christoph Hellwig does some vmalloc maintenance work in the series
"cleanup vfree and vunmap".
- Christoph Hellwig has removed block_device_operations.rw_page() in
ths series "remove ->rw_page".
- We get some maple_tree improvements and cleanups in Liam Howlett's
series "VMA tree type safety and remove __vma_adjust()".
- Suren Baghdasaryan has done some work on the maintainability of our
vm_flags handling in the series "introduce vm_flags modifier
functions".
- Some pagemap cleanup and generalization work in Mike Rapoport's
series "mm, arch: add generic implementation of pfn_valid() for
FLATMEM" and "fixups for generic implementation of pfn_valid()"
- Baoquan He has done some work to make /proc/vmallocinfo and
/proc/kcore better represent the real state of things in his series
"mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
- Jason Gunthorpe rationalized the GUP system's interface to the rest
of the kernel in the series "Simplify the external interface for
GUP".
- SeongJae Park wishes to migrate people from DAMON's debugfs interface
over to its sysfs interface. To support this, we'll temporarily be
printing warnings when people use the debugfs interface. See the
series "mm/damon: deprecate DAMON debugfs interface".
- Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
and clean-ups" series.
- Huang Ying has provided a dramatic reduction in migration's TLB flush
IPI rates with the series "migrate_pages(): batch TLB flushing".
- Arnd Bergmann has some objtool fixups in "objtool warning fixes".
* tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits)
include/linux/migrate.h: remove unneeded externs
mm/memory_hotplug: cleanup return value handing in do_migrate_range()
mm/uffd: fix comment in handling pte markers
mm: change to return bool for isolate_movable_page()
mm: hugetlb: change to return bool for isolate_hugetlb()
mm: change to return bool for isolate_lru_page()
mm: change to return bool for folio_isolate_lru()
objtool: add UACCESS exceptions for __tsan_volatile_read/write
kmsan: disable ftrace in kmsan core code
kasan: mark addr_has_metadata __always_inline
mm: memcontrol: rename memcg_kmem_enabled()
sh: initialize max_mapnr
m68k/nommu: add missing definition of ARCH_PFN_OFFSET
mm: percpu: fix incorrect size in pcpu_obj_full_size()
maple_tree: reduce stack usage with gcc-9 and earlier
mm: page_alloc: call panic() when memoryless node allocation fails
mm: multi-gen LRU: avoid futile retries
migrate_pages: move THP/hugetlb migration support check to simplify code
migrate_pages: batch flushing TLB
migrate_pages: share more code between _unmap and _move
...
In loop_set_status_from_info(), lo->lo_offset and lo->lo_sizelimit should
be checked before reassignment, because if an overflow error occurs, the
original correct value will be changed to the wrong value, and it will not
be changed back.
More, the original patch did not solve the problem, the value was set and
ioctl returned an error, but the subsequent io used the value in the loop
driver, which still caused an alarm:
loop_handle_cmd
do_req_filebacked
loff_t pos = ((loff_t) blk_rq_pos(rq) << 9) + lo->lo_offset;
lo_rw_aio
cmd->iocb.ki_pos = pos
Fixes: c490a0b5a4 ("loop: Check for overflow while configuring loop")
Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20230221095027.3656193-1-zhongjinghua@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Small cleanup of the pata_octeon driver to drop a useless platform
callback, from Uwe.
* Simplify ata_scsi_cmd_error_handler() code using the fact that
ap->ops->error_handler is NULL most of the time, from Wenchao.
* Several patches improving libata error handling. This is in
preparation for supporting the command duration limits (CDL)
feature. The changes allow handling corner cases of ATA NCQ errors
which do not happen with regular drives but will be triggered with
CDL drives. From Niklas.
* Simplify the qc_fill_rtf operation, from me.
* Improve SCSI command translation for the
REPORT_SUPPORTED_OPERATION_CODES command, from me.
* Cleanup of libata FUA handling. This falls short of enabling FUA for
ATA drives that support it by default as there were concerns that
old drives would break. The series howeverfixes several issues with
the FUA support to ensure that FUA is reported as being supported
only for drives that can handle all possible write cases (NCQ and
non-NCQ). A check in the block layer is also added to ensure that we
never see read FUA commands (current behavior). From me.
* Several patches to move the old PARIDE (parallel port IDE) driver to
libata as pata_parport. Given that this driver also needs protocol
modules, the driver code resides in its own pata_parport directoy
under drivers/ata. From Ondrej.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCY/VTnQAKCRDdoc3SxdoY
dk77AQCA1frczKhcOFe2PK/FsFAiO9Nlx/snk7V95JdjVG8GlwEAkey7mvbXMfX0
fDbqpaCkWFb6SvwxdMSATlqUvwEpSQ8=
=tqQP
-----END PGP SIGNATURE-----
Merge tag 'ata-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata
Pull ATA updates from Damien Le Moal:
- Small cleanup of the pata_octeon driver to drop a useless platform
callback (Uwe)
- Simplify ata_scsi_cmd_error_handler() code using the fact that
ap->ops->error_handler is NULL most of the time (Wenchao)
- Several patches improving libata error handling. This is in
preparation for supporting the command duration limits (CDL) feature.
The changes allow handling corner cases of ATA NCQ errors which do
not happen with regular drives but will be triggered with CDL drives
(Niklas)
- Simplify the qc_fill_rtf operation (me)
- Improve SCSI command translation for REPORT_SUPPORTED_OPERATION_CODES
command (me)
- Cleanup of libata FUA handling.
This falls short of enabling FUA for ATA drives that support it by
default as there were concerns that old drives would break. The
series however fixes several issues with the FUA support to ensure
that FUA is reported as being supported only for drives that can
handle all possible write cases (NCQ and non-NCQ). A check in the
block layer is also added to ensure that we never see read FUA
commands (current behavior) (me)
- Several patches to move the old PARIDE (parallel port IDE) driver to
libata as pata_parport. Given that this driver also needs protocol
modules, the driver code resides in its own pata_parport directoy
under drivers/ata (Ondrej)
* tag 'ata-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata:
ata: pata_parport: Fix ida_alloc return value error check
drivers/block: Move PARIDE protocol modules to drivers/ata/pata_parport
drivers/block: Remove PARIDE core and high-level protocols
ata: pata_parport: add driver (PARIDE replacement)
ata: libata: exclude FUA support for known buggy drives
ata: libata: Fix FUA handling in ata_build_rw_tf()
ata: libata: cleanup fua support detection
ata: libata: Rename and cleanup ata_rwcmd_protocol()
ata: libata: Introduce ata_ncq_supported()
block: add a sanity check for non-write flush/fua bios
ata: libata-scsi: improve ata_scsiop_maint_in()
ata: libata-scsi: do not overwrite SCSI ML and status bytes
ata: libata: move NCQ related ATA_DFLAGs
ata: libata: respect successfully completed commands during errors
ata: libata: read the shared status for successful NCQ commands once
ata: libata: simplify qc_fill_rtf port operation interface
ata: scsi: rename flag ATA_QCFLAG_FAILED to ATA_QCFLAG_EH
ata: libata-eh: Cleanup ata_scsi_cmd_error_handler()
ata: octeon: Drop empty platform remove function
sizeof(struct ublksrv_io_cmd) is 16bytes, which can be held in 64byte SQE,
so not necessary to check IO_URING_F_SQE128.
With this change, we get chance to save half SQ ring memory.
Fixed: 71f28f3136 ("ublk_drv: add io_uring based userspace block driver")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230220041413.1524335-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch adds completion batching to the IRQ path. It reuses batch
completion code of virtblk_poll(). It collects requests to io_comp_batch
and processes them all at once. It can boost up the performance by 2%.
To validate the performance improvement and stabilty, I did fio test with
4 vCPU VM and 12 vCPU VM respectively. Both VMs have 8GB ram and the same
number of HW queues as vCPU.
The fio cammad is as follows and I ran the fio 5 times and got IOPS average.
(io_uring, randread, direct=1, bs=512, iodepth=64 numjobs=2,4)
Test result shows about 2% improvement.
4 vcpu VM | numjobs=2 | numjobs=4
-----------------------------------------------------------
fio without patch | 367.2K IOPS | 397.6K IOPS
-----------------------------------------------------------
fio with patch | 372.8K IOPS | 407.7K IOPS
12 vcpu VM | numjobs=2 | numjobs=4
-----------------------------------------------------------
fio without patch | 363.6K IOPS | 374.8K IOPS
-----------------------------------------------------------
fio with patch | 373.8K IOPS | 385.3K IOPS
Signed-off-by: Suwan Kim <suwan.kim027@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Message-Id: <20221221145456.281218-3-suwan.kim027@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Driver should set req->state to MQ_RQ_COMPLETE after it finishes to process
req. But virtio-blk doesn't set MQ_RQ_COMPLETE after virtblk_poll() handles
req and req->state still remains MQ_RQ_IN_FLIGHT. Fortunately so far there
is no issue about it because blk_mq_end_request_batch() sets req->state to
MQ_RQ_IDLE.
In this patch, virblk_poll() calls blk_mq_complete_request_remote() to set
req->state to MQ_RQ_COMPLETE before it adds req to a batch completion list.
So it properly sets req->state after polling I/O is finished.
Fixes: 4e04005256 ("virtio-blk: support polling I/O")
Signed-off-by: Suwan Kim <suwan.kim027@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Message-Id: <20221221145456.281218-2-suwan.kim027@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmPvfncQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpob2EADXJxcr2jjYHm/7cjKkyuVX8fr80dNdMeuY
JFdsjG1k6Uj73BVhQQWYTcs/PsrWBHWRsv6uz4WgOELj55eXmf5Q0kJszyUeJW33
/DjqLvtoppVcYf80xE13wKvCfn73BjwQo6xkGM0qAYn15eaXiD/Ax3xC6eJlsBeK
PEw7EJyhacbSxZa/1D2B6+mqII1jUQWProTCc3udZ4JHi3WvdWa3Rda0qCqHl4a1
+K2aP2YTFIRPxBzfMNa/CafWVIFubTdht+4Ds6R60RImzB9e0VUBfcsiUyW5Zg7L
Fwv7ptXuWrALwVNdW56Oz1QikBxn2pdRR2HMLwKJW1MD8kP9r8LMm2jV5Rhiwe0B
OQsGRYkOzBvw+bxeP5fvk0iPGVMz6ActH4gkraA5QdLqayDaFYOadlhqz0uRo5SH
Fb42Vl658K/MHDSIk8U58TNkmrsIJsBGohXI9DOGINPPvv3XOPi4Q1HmXkGRmii0
y+lNU/QEGh7xXXew29SPP76uQpQaYfC7NxXCMw/OpOMwehzjsjshmM2lpxi8zsgt
PJUmfHv5qxCplNmTJXmUpmX7sS7550HUdu9FJb13DM+gzKg8bk9jWVuLrzqrVlG5
1hKWEl1+heg1heRfaIuJVLbPI0au6Sb4uqhih/PHyrP9TWIoAruDbDJM65GKTxyE
2uEgcHzHQw==
=poRc
-----END PGP SIGNATURE-----
Merge tag 'for-6.3/block-2023-02-16' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe updates via Christoph:
- Small improvements to the logging functionality (Amit Engel)
- Authentication cleanups (Hannes Reinecke)
- Cleanup and optimize the DMA mapping cod in the PCIe driver
(Keith Busch)
- Work around the command effects for Format NVM (Keith Busch)
- Misc cleanups (Keith Busch, Christoph Hellwig)
- Fix and cleanup freeing single sgl (Keith Busch)
- MD updates via Song:
- Fix a rare crash during the takeover process
- Don't update recovery_cp when curr_resync is ACTIVE
- Free writes_pending in md_stop
- Change active_io to percpu
- Updates to drbd, inching us closer to unifying the out-of-tree driver
with the in-tree one (Andreas, Christoph, Lars, Robert)
- BFQ update adding support for multi-actuator drives (Paolo, Federico,
Davide)
- Make brd compliant with REQ_NOWAIT (me)
- Fix for IOPOLL and queue entering, fixing stalled IO waiting on
timeouts (me)
- Fix for REQ_NOWAIT with multiple bios (me)
- Fix memory leak in blktrace cleanup (Greg)
- Clean up sbitmap and fix a potential hang (Kemeng)
- Clean up some bits in BFQ, and fix a bug in the request injection
(Kemeng)
- Clean up the request allocation and issue code, and fix some bugs
related to that (Kemeng)
- ublk updates and fixes:
- Add support for unprivileged ublk (Ming)
- Improve device deletion handling (Ming)
- Misc (Liu, Ziyang)
- s390 dasd fixes (Alexander, Qiheng)
- Improve utility of request caching and fixes (Anuj, Xiao)
- zoned cleanups (Pankaj)
- More constification for kobjs (Thomas)
- blk-iocost cleanups (Yu)
- Remove bio splitting from drivers that don't need it (Christoph)
- Switch blk-cgroups to use struct gendisk. Some of this is now
incomplete as select late reverts were done. (Christoph)
- Add bvec initialization helpers, and convert callers to use that
rather than open-coding it (Christoph)
- Misc fixes and cleanups (Jinke, Keith, Arnd, Bart, Li, Martin,
Matthew, Ulf, Zhong)
* tag 'for-6.3/block-2023-02-16' of git://git.kernel.dk/linux: (169 commits)
brd: use radix_tree_maybe_preload instead of radix_tree_preload
block: use proper return value from bio_failfast()
block: bio-integrity: Copy flags when bio_integrity_payload is cloned
block: Fix io statistics for cgroup in throttle path
brd: mark as nowait compatible
brd: check for REQ_NOWAIT and set correct page allocation mask
brd: return 0/-error from brd_insert_page()
block: sync mixed merged request's failfast with 1st bio's
Revert "blk-cgroup: pin the gendisk in struct blkcg_gq"
Revert "blk-cgroup: pass a gendisk to blkg_lookup"
Revert "blk-cgroup: delay blk-cgroup initialization until add_disk"
Revert "blk-cgroup: delay calling blkcg_exit_disk until disk_release"
Revert "blk-cgroup: move the cgroup information to struct gendisk"
nvme-pci: remove iod use_sgls
nvme-pci: fix freeing single sgl
block: ublk: check IO buffer based on flag need_get_data
s390/dasd: Fix potential memleak in dasd_eckd_init()
s390/dasd: sort out physical vs virtual pointers usage
block: Remove the ALLOC_CACHE_SLACK constant
block: make kobj_type structures constant
...
Unconditionally calling radix_tree_preload_end() results in a OOPS
message as the preload is only conditionally called for
gfpflags_allow_blocking().
[ 20.267323] BUG: using smp_processor_id() in preemptible [00000000] code: fio/416
[ 20.267837] caller is brd_insert_page.part.0+0xbe/0x190 [brd]
[ 20.269436] Call Trace:
[ 20.269598] <TASK>
[ 20.269742] dump_stack_lvl+0x32/0x50
[ 20.269982] check_preemption_disabled+0xd1/0xe0
[ 20.270289] brd_insert_page.part.0+0xbe/0x190 [brd]
[ 20.270664] brd_submit_bio+0x33f/0xf40 [brd]
Use radix_tree_maybe_preload() which does preload only if
gfpflags_allow_blocking() is true but also takes the lock. Therefore,
unconditionally calling radix_tree_preload_end() should not create any
issues and the message disappears.
Fixes: 6ded703c56 ("brd: check for REQ_NOWAIT and set correct page allocation mask")
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20230217121442.33914-1-p.raghav@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
By default, non-mq drivers do not support nowait. This causes io_uring
to use a slower path as the driver cannot be trust not to block. brd
can safely set the nowait flag, as worst case all it does is a NOIO
allocation.
For io_uring, this makes a substantial difference. Before:
submitter=0, tid=453, file=/dev/ram0, node=-1
polled=0, fixedbufs=1/0, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=440.03K, BW=1718MiB/s, IOS/call=32/31
IOPS=428.96K, BW=1675MiB/s, IOS/call=32/32
IOPS=442.59K, BW=1728MiB/s, IOS/call=32/31
IOPS=419.65K, BW=1639MiB/s, IOS/call=32/32
IOPS=426.82K, BW=1667MiB/s, IOS/call=32/31
and after:
submitter=0, tid=354, file=/dev/ram0, node=-1
polled=0, fixedbufs=1/0, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=3.37M, BW=13.15GiB/s, IOS/call=32/31
IOPS=3.45M, BW=13.46GiB/s, IOS/call=32/31
IOPS=3.43M, BW=13.42GiB/s, IOS/call=32/32
IOPS=3.43M, BW=13.39GiB/s, IOS/call=32/31
IOPS=3.43M, BW=13.38GiB/s, IOS/call=32/31
or about an 8x in difference. Now that brd is prepared to deal with
REQ_NOWAIT reads/writes, mark it as supporting that.
Cc: stable@vger.kernel.org # 5.10+
Link: https://lore.kernel.org/linux-block/20230203103005.31290-1-p.raghav@samsung.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If REQ_NOWAIT is set, then do a non-blocking allocation if the operation
is a write and we need to insert a new page. Currently REQ_NOWAIT cannot
be set as the queue isn't marked as supporting nowait, this change is in
preparation for allowing that.
radix_tree_preload() warns on attempting to call it with an allocation
mask that doesn't allow blocking. While that warning could arguably
be removed, we need to handle radix insertion failures anyway as they
are more likely if we cannot block to get memory.
Remove legacy BUG_ON()'s and turn them into proper errors instead, one
for the allocation failure and one for finding a page that doesn't
match the correct index.
Cc: stable@vger.kernel.org # 5.10+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It currently returns a page, but callers just check for NULL/page to
gauge success. Clean this up and return the appropriate error directly
instead.
Cc: stable@vger.kernel.org # 5.10+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
virtio blk returns a 64 bit append_sector in an input buffer,
in LE format. This field is not tagged as LE correctly, so
even though the generated code is ok, we get warnings from sparse:
drivers/block/virtio_blk.c:332:33: sparse: sparse: cast to restricted __le64
Make sparse happy by using the correct type.
Message-Id: <20221220125154.564265-1-mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
virtblk_result returns blk_status_t which is a bitwise restricted type,
so we are not supposed to stuff it in a plain int temporary variable.
All we do with it is pass it on to a function expecting blk_status_t so
the generated code is ok, but we get warnings from sparse:
drivers/block/virtio_blk.c:326:36: sparse: sparse: incorrect type in initializer (different base types) @@ expected int status @@
+got restricted blk_status_t @@
drivers/block/virtio_blk.c:334:33: sparse: sparse: incorrect type in argument 2 (different base types) @@ expected restricted
+blk_status_t [usertype] error @@ got int status @@
Make sparse happy by using the correct type.
Message-Id: <20221220124152.523531-1-mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
This patch adds support for Zoned Block Devices (ZBDs) to the kernel
virtio-blk driver.
The patch accompanies the virtio-blk ZBD support draft that is now
being proposed for standardization. The latest version of the draft is
linked at
https://github.com/oasis-tcs/virtio-spec/issues/143 .
The QEMU zoned device code that implements these protocol extensions
has been developed by Sam Li and it is currently in review at the QEMU
mailing list.
A number of virtblk request structure changes has been introduced to
accommodate the functionality that is specific to zoned block devices
and, most importantly, make room for carrying the Zoned Append sector
value from the device back to the driver along with the request status.
The zone-specific code in the patch is heavily influenced by NVMe ZNS
code in drivers/nvme/host/zns.c, but it is simpler because the proposed
virtio ZBD draft only covers the zoned device features that are
relevant to the zoned functionality provided by Linux block layer.
includes the following fixup:
virtio-blk: fix probe without CONFIG_BLK_DEV_ZONED
When building without CONFIG_BLK_DEV_ZONED, VIRTIO_BLK_F_ZONED
is excluded from array of driver features.
As a result virtio_has_feature panics in virtio_check_driver_offered_feature
since that by design verifies that a feature we are checking for
is listed in the feature array.
To fix, replace the call to virtio_has_feature with a stub.
Message-Id: <20221016034127.330942-3-dmitry.fomichev@wdc.com>
Co-developed-by: Stefan Hajnoczi <stefanha@gmail.com>
Signed-off-by: Stefan Hajnoczi <stefanha@gmail.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Message-Id: <20221220112340.518841-1-mst@redhat.com>
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Reported-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Debugged-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Currently, uring_cmd with UBLK_IO_FETCH_REQ or
UBLK_IO_COMMIT_AND_FETCH_REQ is always checked whether
userspace server has provided IO buffer even flag
UBLK_F_NEED_GET_DATA is configured.
This is a excessive check. If UBLK_F_NEED_GET_DATA is
configured, FETCH_RQ doesn't need to provide IO buffer;
COMMIT_AND_FETCH_REQ also doesn't need to do that if
the IO type is not READ.
Check ub_cmd->addr together with ublk_need_get_data()
and IO type in ublk_ch_uring_cmd().
With this fix, userspace server doesn't need to preserve
buffers for every ublk_io when flag UBLK_F_NEED_GET_DATA
is configured, in order to save memory.
Signed-off-by: Liu Xiaodong <xiaodong.liu@intel.com>
Fixes: c86019ff75 ("ublk_drv: add support for UBLK_IO_NEED_GET_DATA")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230210141356.112321-1-xiaodong.liu@intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Inside ublk_ctrl_del_dev(), when the device is removed, we wait
until the device number is freed with holding global lock of
ublk_ctl_mutex, this way isn't friendly from user viewpoint:
1) if device is in-use, the current delete command hangs in
ublk_ctrl_del_dev(), and user can't break from the handling
because wait_event() is used
2) global lock is held, so any new device can't be added and
other old devices can't be removed.
Improve the deleting handling by the following way, suggested by
Nadav:
1) wait without holding the global lock
2) replace wait_event() with wait_event_interruptible()
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Suggested-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230207150700.545530-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
queuedata is not referenced in ublk_drv and we can use driver_data
instead. Pass NULL to blk_mq_alloc_disk() as queuedata while allocating
ublk's gendisk.
Signed-off-by: Ziyang Zhang <ZiyangZhang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230207070839.370817-4-ZiyangZhang@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
WRITE_ZEROES won't return bytes returned just like FLUSH and DISCARD,
and we can end it directly. Add missing comment for it in
ublk_complete_rq().
Signed-off-by: Ziyang Zhang <ZiyangZhang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230207070839.370817-3-ZiyangZhang@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio_has_data() allows a NULL bio so the NULL check in
ublk_rq_has_data() is unnecessary.
Signed-off-by: Ziyang Zhang <ZiyangZhang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230207070839.370817-2-ZiyangZhang@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmPdRq8QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjcqEADcWlRjkcLzRpEMD9g3IyDShasT1JVeSvV6
xqDuA0kRF6DyObu82jE2wiZ49FRpeCUw6S6ZdVhvwGHgPpfLBuPWonFnTqxYAnSz
XCYnt4QdZHGiydIHVxkyP8Raz6d24kZawlUmbE7dcfksNziyGR5UjbCsk1HNJhmf
EvnLZ2EozZwsZLW/RRYZrh9Q8ccB8kJeX+JuUVw7sboNyJ+bW+x+7prlm3CKgopX
IiP69E6qIPe6RHkyLRdKgYgxRdcgeq6uJk/nuZ/6uPCcyrz+0QEtge3CkTe7zLkF
CPmbWlqngmNfNsS93nPTK2kHWTz8P2spo+UTkXIegSYBA8CIr9lDxazSFKT0B6zH
yIWzmQoE7YXRI5B21rlPvNGE/gPSy48mSn1ym/MCf+UyWGneRypeU/K//2Ww3UJK
F1Xl2c1v/EEr28qPuC8VQbAsQ56GOcZ6zW4Q0grxTYm0KzzJ2O5B3FEHdCWlS/x9
KY5v3a8a3nXg9rNio0ruXiyD5l7PE5nFESNrBFDS4kEfxk4cx50ZfgDH68d515/W
//EnNjx9nN20yF+LcKD70KJHxPdWaUXGT2c1+E/tdbrgUKReCpER+5hQc8+YxQML
DCbzr7LJjX5mmDQ5YI6Y09/L6luzFMjrnxpmXkL7nyWQlSYkMqus3vPtDcJ5Xk2J
shHBlzIcuw==
=/+rE
-----END PGP SIGNATURE-----
Merge tag 'block-6.2-2023-02-03' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
"A bit bigger than I'd like at this point, but mostly a bunch of little
fixes. In detail:
- NVMe pull request via Christoph:
- Fix a missing queue put in nvmet_fc_ls_create_association
(Amit Engel)
- Clear queue pointers on tag_set initialization failure
(Maurizio Lombardi)
- Use workqueue dedicated to authentication (Shin'ichiro
Kawasaki)
- Fix for an overflow in ublk (Liu)
- Fix for leaking a queue reference in block cgroups (Ming)
- Fix for a use-after-free in BFQ (Yu)"
* tag 'block-6.2-2023-02-03' of git://git.kernel.dk/linux:
blk-cgroup: don't update io stat for root cgroup
nvme-auth: use workqueue dedicated to authentication
nvme: clear the request_queue pointers on failure in nvme_alloc_io_tag_set
nvme: clear the request_queue pointers on failure in nvme_alloc_admin_tag_set
nvme-fc: fix a missing queue put in nvmet_fc_ls_create_association
block: Fix the blk_mq_destroy_queue() documentation
block: ublk: extending queue_size to fix overflow
block, bfq: fix uaf for bfqq in bic_set_bfqq()
Use the bvec_set_page helper to initialize bvecs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230203150634.3199647-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the bvec_set_virt helper to initialize the special_vec.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20230203150634.3199647-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the bvec_set_page helper to initialize the copy up bvec.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Link: https://lore.kernel.org/r/20230203150634.3199647-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The ->rw_page method is a special purpose bypass of the usual bio handling
path that is limited to single-page reads and writes and synchronous which
causes a lot of extra code in the drivers, callers and the block layer.
The only remaining user is the MM swap code. Switch that swap code to
simply submit a single-vec on-stack bio an synchronously wait on it based
on a newly added QUEUE_FLAG_SYNCHRONOUS flag set by the drivers that
currently implement ->rw_page instead. While this touches one extra cache
line and executes extra code, it simplifies the block layer and drivers
and ensures that all feastures are properly supported by all drivers, e.g.
right now ->rw_page bypassed cgroup writeback entirely.
[akpm@linux-foundation.org: fix comment typo, per Dan]
Link: https://lkml.kernel.org/r/20230125133436.447864-8-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Make the following minor changes which were reported by colleagues
while reviewing this code:
- Remove the parentheses from around the LOOP_DEFAULT_HW_Q_DEPTH
definition since these are superfluous.
- Accept other number formats than decimal, e.g. hexadecimal.
- Do not set hw_queue_depth to an out-of-range value, even if that value
won't be used.
- Use the LOOP_DEFAULT_HW_Q_DEPTH macro in the kernel module parameter
description to prevent that the description gets out of sync.
This patch has been tested as follows:
# modprobe -r loop
# modprobe loop hw_queue_depth=-1
modprobe: ERROR: could not insert 'loop': Invalid argument
# modprobe loop hw_queue_depth=0
modprobe: ERROR: could not insert 'loop': Invalid argument
# modprobe loop hw_queue_depth=1; cat /sys/module/loop/parameters/hw_queue_depth
1
# modprobe -r loop; modprobe loop; cat /sys/module/loop/parameters/hw_queue_depth hw_queue_depth=0x10
16
# modprobe -r loop; modprobe loop; cat /sys/module/loop/parameters/hw_queue_depth hw_queue_depth=128
128
# modprobe -r loop; modprobe loop hw_queue_depth=129; cat /sys/module/loop/parameters/hw_queue_depth
129
# modprobe -r loop; modprobe loop hw_queue_depth=$((1<<32))
modprobe: ERROR: could not insert 'loop': Numerical result out of range
See also commit ef44c50837 ("loop: allow user to set the queue
depth").
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20230130211347.832110-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>