Commit Graph

8160 Commits

Author SHA1 Message Date
Christoph Hellwig
718628adfc virtio_blk: split virtblk_probe
Split out a virtblk_read_limits helper that just reads the various
queue limits to separate it from the higher level probing logic.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240213073425.1621680-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-13 08:56:59 -07:00
Christoph Hellwig
27e32cd23f block: pass a queue_limits argument to blk_mq_alloc_disk
Pass a queue_limits to blk_mq_alloc_disk and apply it if non-NULL.  This
will allow allocating queues with valid queue limits instead of setting
the values one at a time later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240213073425.1621680-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-13 08:56:59 -07:00
Arnd Bergmann
fe0b1e9a73 drbd: fix function cast warnings in state machine
There are four state machines in drbd that use a common infrastructure, with
a cast to an incompatible function type in REMEMBER_STATE_CHANGE that clang-16
now warns about:

drivers/block/drbd/drbd_state.c:1632:3: error: cast from 'int (*)(struct sk_buff *, unsigned int, struct drbd_resource_state_change *, enum drbd_notification_type)' to 'typeof (last_func)' (aka 'int (*)(struct sk_buff *, unsigned int, void *, enum drbd_notification_type)') converts to incompatible function type [-Werror,-Wcast-function-type-strict]
 1632 |                 REMEMBER_STATE_CHANGE(notify_resource_state_change,
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 1633 |                                       resource_state_change, NOTIFY_CHANGE);
      |                                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/block/drbd/drbd_state.c:1619:17: note: expanded from macro 'REMEMBER_STATE_CHANGE'
 1619 |            last_func = (typeof(last_func))func; \
      |                        ^~~~~~~~~~~~~~~~~~~~~~~
drivers/block/drbd/drbd_state.c:1641:4: error: cast from 'int (*)(struct sk_buff *, unsigned int, struct drbd_connection_state_change *, enum drbd_notification_type)' to 'typeof (last_func)' (aka 'int (*)(struct sk_buff *, unsigned int, void *, enum drbd_notification_type)') converts to incompatible function type [-Werror,-Wcast-function-type-strict]
 1641 |                         REMEMBER_STATE_CHANGE(notify_connection_state_change,
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 1642 |                                               connection_state_change, NOTIFY_CHANGE);
      |                                               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Change these all to actually expect a void pointer to be passed, which
matches the caller.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20240213100354.457128-1-arnd@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-13 08:55:40 -07:00
Arnd Bergmann
7789bf0552 floppy: fix function pointer cast warnings
clang-16 complains about a control flow integrity (kcfi) violation
casting between incompatible pointers:

drivers/block/floppy.c:2001:11: error: cast from 'void (*)(void)' to 'done_f' (aka 'void (*)(int)') converts to incompatible function type [-Werror,-Wcast-function-type-strict]
 2001 |         .done           = (done_f)empty
      |                           ^~~~~~~~~~~~~

Just add another empty function with the correct prototype as a
workaround.

The warning is for code that was added before the start of the normal
git history, but I tracked it done to an early change in the reconstructed
linux-history.git.

Fixes: 598a477afe06 ("Import 1.1.41")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20240213095918.455478-1-arnd@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-13 08:55:33 -07:00
Linus Torvalds
a5b6244cf8 block-6.8-2024-02-10
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmXHhIoQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiKfD/4vi+EUfKGmhmXx0Sh8iDCt3g4gUH+FqtpV
 gqC4FG8gWraNi1WY98ZVtXhlaQ0ALKuPbe7pNfYpxTn8zRdtixrOBxYDvNJLAJLH
 Q2LnviBZBTrRar4d51O+d9eGC2FwbihwEW4asFn6kIlKjmJ9DT8qUYBhwLh5AcBd
 rQ2yhk9KOlQIIN/Z0gjiexXM2WFxWbeHBWKWzTgxJL7q81wAfdwMP0IkNLcASaWU
 P48YHkIhwqFfSk6QqPrjmLr5P08jd2xEbr3DA/4unuto9iQZPoS0h+k1kauA048w
 ZassSiBfIaOGuv0xQg2bYHwdoGazW0fNeyWNQjE6qDaC7ECE8oBKE0fMyhTZ5Xgo
 0d86bqlL5913PDzjp5DXeGvpSZ7hLV393TyV3yAspHosAA5cHBqNQMiJD1okM99N
 wKfmXnH2CXE29ckfRyaN+M4Ywg+gpbyMNIfQM7N9If4GuQcyxC3M67RkT2GrK2MB
 ZJ/af6pnh/d68mqteJB5gV5r8t163uoVFcbSjJhwlVVgp1Sp3+h04wsB2sDyp62h
 Guu9043fuT6zzT3EhySjPvmkKoNu6cjNofuTwNoaVVRcloRhTafn7V3sTlfjrpOP
 woWnGv5VOSnuunOGQ/bxLXvXbrnEQ5ziW1S4qwr+oi+FjP5Ae9eOsjzld4Vl7cZH
 ABqOf0UEsg==
 =0mLS
 -----END PGP SIGNATURE-----

Merge tag 'block-6.8-2024-02-10' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Keith:
     - Update a potentially stale firmware attribute (Maurizio)
     - Fixes for the recent verbose error logging (Keith, Chaitanya)
     - Protection information payload size fix for passthrough (Francis)

 - Fix for a queue freezing issue in virtblk (Yi)

 - blk-iocost underflow fix (Tejun)

 - blk-wbt task detection fix (Jan)

* tag 'block-6.8-2024-02-10' of git://git.kernel.dk/linux:
  virtio-blk: Ensure no requests in virtqueues before deleting vqs.
  blk-iocost: Fix an UBSAN shift-out-of-bounds warning
  nvme: use ns->head->pi_size instead of t10_pi_tuple structure size
  nvme-core: fix comment to reflect right functions
  nvme: move passthrough logging attribute to head
  blk-wbt: Fix detection of dirty-throttled tasks
  nvme-host: fix the updating of the firmware version
2024-02-10 08:02:48 -08:00
Shin'ichiro Kawasaki
14509b748f null_blk: add configfs variable shared_tags
Allow setting shared_tags through configfs, which could only be set as a
module parameter. For that purpose, delay tag_set initialization from
null_init() to null_add_dev(). Refer tag_set.ops as the flag to check if
tag_set is initialized or not.

The following parameters can not be set through configfs yet:

    timeout
    requeue
    init_hctx

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20240130042134.2463659-1-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08 11:31:23 -07:00
Yi Sun
4ce6e2db00 virtio-blk: Ensure no requests in virtqueues before deleting vqs.
Ensure no remaining requests in virtqueues before resetting vdev and
deleting virtqueues. Otherwise these requests will never be completed.
It may cause the system to become unresponsive.

Function blk_mq_quiesce_queue() can ensure that requests have become
in_flight status, but it cannot guarantee that requests have been
processed by the device. Virtqueues should never be deleted before
all requests become complete status.

Function blk_mq_freeze_queue() ensure that all requests in virtqueues
become complete status. And no requests can enter in virtqueues.

Signed-off-by: Yi Sun <yi.sun@unisoc.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Link: https://lore.kernel.org/r/20240129085250.1550594-1-yi.sun@unisoc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08 11:30:26 -07:00
Ricardo B. Marliere
052618c71c block: rbd: make rbd_bus_type const
Now that the driver core can properly handle constant struct bus_type,
move the rbd_bus_type variable to be a constant structure as well,
placing it into read-only memory which can not be modified at runtime.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Alex Elder <elder@linaro.org>
Link: https://lore.kernel.org/r/20240204-bus_cleanup-block-v1-1-fc77afd8d7cc@marliere.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-06 10:37:45 -07:00
Linus Torvalds
914e17088e block-6.8-2024-01-26
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmWz+D8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjprEACQAJWX6h+bSmbxst3AkXZoNLCPq7EsripI
 igbUcdrkKmnUL+4/r1qM6sE7I7nIILBWzSsFgawC91tq2/9KWq7P7FeuEHdiRmsS
 RwXuTdJiQBvrDy2k59I0vtiE+zrbrhmLGcdIg4xbKl8bg17nA6KaICf4oxCqkPZp
 R+WvcZJvfKpN8bnd3bjJOK2vwYL26HIPhUU0KyWPAc4Nx7K5pe9RT+4kWY4vmklY
 XqOxfnoPiKBKlXApaSDRTFITPhg/usq6vfPgtADvdGt7ieMbN1YplLNO82kHjUA9
 sp/biSBtOT12JiH+biRFcSjNPyBOzu13Opshi0Ou4uNTtyH8FXqXMVLGfl/uDRgW
 khkcY8rfRXWpbPrMmqlNExeZ47ZztSNKoqVXCV1yNUOTlGQvFbhiRDOUtG3O+wPD
 df8MrPc5P5jVVUGhDSTHZtXlYbkjsmodIVSsI/YFwZ3XB0tz9NqTZtO7INa0BZku
 llIgDzBhk1NPqlZCpaLNvEFUWmjoHqktwyDqe3BuWpwjUOZnLvVKic2RkK7Y8ufF
 MG6qn9VZ4qrb2rxmsM4yd7V3nNRpnmVWlN6feIP3Rlb+M2+nGAyAw+pZxJ6iF+Hi
 bdvbP6UnaA5iZqSASifjDl8WCSllymPDwMPZGDsaEi/5CW8JMpdYwlCrBrY19hlm
 8Wq6YGCFXQ==
 =A7z9
 -----END PGP SIGNATURE-----

Merge tag 'block-6.8-2024-01-26' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - RCU warning fix for md (Mikulas)

 - Fix for an aoe issue that lockdep rightfully complained about
   (Maksim)

 - Fix for an error code change in partitioning that caused a regression
   with some tools (Li)

 - Fix for a data direction warning with bi-direction commands
   (Christian)

* tag 'block-6.8-2024-01-26' of git://git.kernel.dk/linux:
  md: fix a suspicious RCU usage warning
  aoe: avoid potential deadlock at set_capacity
  block: Fix WARNING in _copy_from_iter
  block: Move checking GENHD_FL_NO_PART to bdev_add_partition()
2024-01-26 15:19:43 -08:00
Maksim Kiselev
e169bd4fb2 aoe: avoid potential deadlock at set_capacity
Move set_capacity() outside of the section procected by (&d->lock).
To avoid possible interrupt unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
[1] lock(&bdev->bd_size_lock);
                                local_irq_disable();
                            [2] lock(&d->lock);
                            [3] lock(&bdev->bd_size_lock);
   <Interrupt>
[4]  lock(&d->lock);

  *** DEADLOCK ***

Where [1](&bdev->bd_size_lock) hold by zram_add()->set_capacity().
[2]lock(&d->lock) hold by aoeblk_gdalloc(). And aoeblk_gdalloc()
is trying to acquire [3](&bdev->bd_size_lock) at set_capacity() call.
In this situation an attempt to acquire [4]lock(&d->lock) from
aoecmd_cfg_rsp() will lead to deadlock.

So the simplest solution is breaking lock dependency
[2](&d->lock) -> [3](&bdev->bd_size_lock) by moving set_capacity()
outside.

Signed-off-by: Maksim Kiselev <bigunclemax@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240124072436.3745720-2-bigunclemax@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-24 08:30:43 -07:00
Ilya Dryomov
ded080c86b rbd: don't move requests to the running list on errors
The running list is supposed to contain requests that are pinning the
exclusive lock, i.e. those that must be flushed before exclusive lock
is released.  When wake_lock_waiters() is called to handle an error,
requests on the acquiring list are failed with that error and no
flushing takes place.  Briefly moving them to the running list is not
only pointless but also harmful: if exclusive lock gets acquired
before all of their state machines are scheduled and go through
rbd_lock_del_request(), we trigger

    rbd_assert(list_empty(&rbd_dev->running_list));

in rbd_try_acquire_lock().

Cc: stable@vger.kernel.org
Fixes: 637cd06053 ("rbd: new exclusive lock wait/wake code")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2024-01-22 00:14:10 +01:00
Christophe JAILLET
cd30e8bde2 rbd: remove usage of the deprecated ida_simple_*() API
ida_alloc() and ida_free() should be preferred to the deprecated
ida_simple_get() and ida_simple_remove().

Note that the upper limit of ida_simple_get() is exclusive, while that
of ida_alloc_max() is inclusive, so 1 has been subtracted.

[ idryomov: tweak changelog ]

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-01-22 00:14:10 +01:00
Linus Torvalds
9d1694dc91 for-6.8/block-2024-01-18
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmWpoCgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqIUEADFvJdC2izkPzYzsOrMK5Rt1H7vaHGKhbA+
 zWCuQaa1xQd8bazq+NVnQpbzgclkE/WodTCNfNXcTTjzeQEmcZC888llP3Y9vwyP
 XfEKH7fSaeKvGigJLro1oPe3YV7/t89F5ol3BoZayfzJF8GEU9BXRWzgOkZzijnk
 xdm5wUyn/GknksMuQQraZ+U6bQRFLBOulzoaQeMD6Dosx+uRlM4WvAJawC+uOV6R
 qPT2BVSfYGzmgEKvoaphw0FMkUhFBMDHfXTpQBi5tIzTKOaof8tynYEGz0FHZWeh
 V0JEEp+3jLWFxFXeEcXgBVPJPE8J0DzGm9g17/uwC2Yhmlbw4FKZVRvGG+PpeUso
 D5aqhqm3w0x7HgZ7JKwy/aUctADYvjVcSVzPHTaFK0aCSYCIAXxqv4p7fOoxPqyx
 T32IUHTzGtkCdqzv/xFdtTYhTNM2vyzzbbWj5lXgCBqHsXOVbCh8UM2p+9ec2Umq
 Fo1XF9eoCDe6Sn4s15hJ5G4DEhKGOKkHluvRUdM+0selA5b0sNOeUqlAf2v+0ve3
 Pv3e3X4NPssNIEcsDHf5pc3zGC+LXRS0oFvfIvDESBjwXc3iHIMl+SkjyS57P4Fd
 RKrHEUUiACuCKO/IWqFYLiNBNHnP3RmV5gSxIZr9QJhFSwOzP+/+4++TCdF5vdAV
 amhv+0PdCw==
 =DLW9
 -----END PGP SIGNATURE-----

Merge tag 'for-6.8/block-2024-01-18' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Keith:
      - tcp, fc, and rdma target fixes (Maurizio, Daniel, Hannes,
        Christoph)
      - discard fixes and improvements (Christoph)
      - timeout debug improvements (Keith, Max)
      - various cleanups (Daniel, Max, Giuxen)
      - trace event string fixes (Arnd)
      - shadow doorbell setup on reset fix (William)
      - a write zeroes quirk for SK Hynix (Jim)

 - MD pull request via Song:
      - Sparse warning since v6.0 (Bart)
      - /proc/mdstat regression since v6.7 (Yu Kuai)

 - Use symbolic error value (Christian)

 - IO Priority documentation update (Christian)

 - Fix for accessing queue limits without having entered the queue
   (Christoph, me)

 - Fix for loop dio support (Christoph)

 - Move null_blk off deprecated ida interface (Christophe)

 - Ensure nbd initializes full msghdr (Eric)

 - Fix for a regression with the folio conversion, which is now easier
   to hit because of an unrelated change (Matthew)

 - Remove redundant check in virtio-blk (Li)

 - Fix for a potential hang in sbitmap (Ming)

 - Fix for partial zone appending (Damien)

 - Misc changes and fixes (Bart, me, Kemeng, Dmitry)

* tag 'for-6.8/block-2024-01-18' of git://git.kernel.dk/linux: (45 commits)
  Documentation: block: ioprio: Update schedulers
  loop: fix the the direct I/O support check when used on top of block devices
  blk-mq: Remove the hctx 'run' debugfs attribute
  nbd: always initialize struct msghdr completely
  block: Fix iterating over an empty bio with bio_for_each_folio_all
  block: bio-integrity: fix kcalloc() arguments order
  virtio_blk: remove duplicate check if queue is broken in virtblk_done
  sbitmap: remove stale comment in sbq_calc_wake_batch
  block: Correct a documentation comment in blk-cgroup.c
  null_blk: Remove usage of the deprecated ida_simple_xx() API
  block: ensure we hold a queue reference when using queue limits
  blk-mq: rename blk_mq_can_use_cached_rq
  block: print symbolic error name instead of error code
  blk-mq: fix IO hang from sbitmap wakeup race
  nvmet-rdma: avoid circular locking dependency on install_queue()
  nvmet-tcp: avoid circular locking dependency on install_queue()
  nvme-pci: set doorbell config before unquiescing
  block: fix partial zone append completion handling in req_bio_endio()
  block/iocost: silence warning on 'last_period' potentially being unused
  md/raid1: Use blk_opf_t for read and write operations
  ...
2024-01-18 18:22:40 -08:00
Christoph Hellwig
baa7d53607 loop: fix the the direct I/O support check when used on top of block devices
__loop_update_dio only checks the alignment requirement for block backed
file systems, but misses them for the case where the loop device is
created directly on top of another block device.  Due to this creating
a loop device with default option plus the direct I/O flag on a > 512 byte
sector size file system will lead to incorrect I/O being submitted to the
lower block device and a lot of error from the lock layer.  This can
be seen with xfstests generic/563.

Fix the code in __loop_update_dio by factoring the alignment check into
a helper, and calling that also for the struct block_device of a block
device inode.

Also remove the TODO comment talking about dynamically switching between
buffered and direct I/O, which is a would be a recipe for horrible
performance and occasional data loss.

Fixes: 2e5ab5f379 ("block: loop: prepare for supporing direct IO")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240117175901.871796-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-18 08:20:52 -07:00
Eric Dumazet
78fbb92af2 nbd: always initialize struct msghdr completely
syzbot complains that msg->msg_get_inq value can be uninitialized [1]

struct msghdr got many new fields recently, we should always make
sure their values is zero by default.

[1]
 BUG: KMSAN: uninit-value in tcp_recvmsg+0x686/0xac0 net/ipv4/tcp.c:2571
  tcp_recvmsg+0x686/0xac0 net/ipv4/tcp.c:2571
  inet_recvmsg+0x131/0x580 net/ipv4/af_inet.c:879
  sock_recvmsg_nosec net/socket.c:1044 [inline]
  sock_recvmsg+0x12b/0x1e0 net/socket.c:1066
  __sock_xmit+0x236/0x5c0 drivers/block/nbd.c:538
  nbd_read_reply drivers/block/nbd.c:732 [inline]
  recv_work+0x262/0x3100 drivers/block/nbd.c:863
  process_one_work kernel/workqueue.c:2627 [inline]
  process_scheduled_works+0x104e/0x1e70 kernel/workqueue.c:2700
  worker_thread+0xf45/0x1490 kernel/workqueue.c:2781
  kthread+0x3ed/0x540 kernel/kthread.c:388
  ret_from_fork+0x66/0x80 arch/x86/kernel/process.c:147
  ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242

Local variable msg created at:
  __sock_xmit+0x4c/0x5c0 drivers/block/nbd.c:513
  nbd_read_reply drivers/block/nbd.c:732 [inline]
  recv_work+0x262/0x3100 drivers/block/nbd.c:863

CPU: 1 PID: 7465 Comm: kworker/u5:1 Not tainted 6.7.0-rc7-syzkaller-00041-gf016f7547aee #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
Workqueue: nbd5-recv recv_work

Fixes: f94fd25cb0 ("tcp: pass back data left in socket after receive")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: stable@vger.kernel.org
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: nbd@other.debian.org
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240112132657.647112-1-edumazet@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-17 08:47:46 -07:00
Li RongQing
04036d49c4 virtio_blk: remove duplicate check if queue is broken in virtblk_done
virtqueue_enable_cb() will call virtqueue_poll() which will check if
queue is broken at beginning, so remove the virtqueue_is_broken() call

Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-15 09:33:41 -07:00
Christophe JAILLET
95931a245b null_blk: Remove usage of the deprecated ida_simple_xx() API
ida_alloc() and ida_free() should be preferred to the deprecated
ida_simple_get() and ida_simple_remove().

This is less verbose.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/bf257b1078475a415cdc3344c6a750842946e367.1705222845.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-14 07:37:44 -07:00
Linus Torvalds
4c72e2b8c4 for-6.8/io_uring-2024-01-08
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmWcOk0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpq2wEACgP1Gvfb2cR65B/6tkLJ6neKnYOPVSYx9F
 4Mv6KS306vmOsE67LqynhtAe/Hgfv0NKADN+mKLLV8IdwhneLAwFxRoHl8QPO2dW
 DIxohDMLIqzY/SOUBqBHMS8b5lEr79UVh8vXjAuuYCBTcHBndi4hj0S0zkwf5YiK
 Vma6ty+TnzmOv+c6+OCgZZlwFhkxD1pQBM5iOlFRAkdt3Er/2e8FqthK0/od7Fgy
 Ii9xdxYZU9KTsM4R+IWhqZ1rj1qUdtMPSGzFRbzFt3b6NdANRZT9MUlxvSkuHPIE
 P/9anpmgu41gd2JbHuyVnw+4jdZMhhZUPUYtphmQ5n35rcFZjv1gnJalH/Nw/DTE
 78bDxxP0+35bkuq5MwHfvA3imVsGnvnFx5MlZBoqvv+VK4S3Q6E2+ARQZdiG+2Is
 8Uyjzzt0nq2waUO3H1JLkgiM9J9LFqaYQLE68u569m4NCfRSpoZva9KVmNpre2K0
 9WdGy2gfzaYAQkGud2TULzeLiEbWfvZAL0o43jYXkVPRKmnffCEphf2X6C2IDV/3
 5ZR4bandRRC6DVnE+8n1MB06AYtpJPB/w3rplON+l/V5Gnb7wRNwiUrBr/F15OOy
 OPbAnP6k56wY/JRpgzdbNqwGi8EvGWX6t3Kqjp2mczhs2as8M193FKS2xppl1TYl
 BPyTsAfbdQ==
 =5v/U
 -----END PGP SIGNATURE-----

Merge tag 'for-6.8/io_uring-2024-01-08' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:
 "Mostly just come fixes and cleanups, but one feature as well. In
  detail:

   - Harden the check for handling IOPOLL based on return (Pavel)

   - Various minor optimizations (Pavel)

   - Drop remnants of SCM_RIGHTS fd passing support, now that it's no
     longer supported since 6.7 (me)

   - Fix for a case where bytes_done wasn't initialized properly on a
     failure condition for read/write requests (me)

   - Move the register related code to a separate file (me)

   - Add support for returning the provided ring buffer head (me)

   - Add support for adding a direct descriptor to the normal file table
     (me, Christian Brauner)

   - Fix for ensuring pending task_work for a ring with DEFER_TASKRUN is
     run even if we timeout waiting (me)"

* tag 'for-6.8/io_uring-2024-01-08' of git://git.kernel.dk/linux:
  io_uring: ensure local task_work is run on wait timeout
  io_uring/kbuf: add method for returning provided buffer ring head
  io_uring/rw: ensure io->bytes_done is always initialized
  io_uring: drop any code related to SCM_RIGHTS
  io_uring/unix: drop usage of io_uring socket
  io_uring/register: move io_uring_register(2) related code to register.c
  io_uring/openclose: add support for IORING_OP_FIXED_FD_INSTALL
  io_uring/cmd: inline io_uring_cmd_get_task
  io_uring/cmd: inline io_uring_cmd_do_in_task_lazy
  io_uring: split out cmd api into a separate header
  io_uring: optimise ltimeout for inline execution
  io_uring: don't check iopoll if request completes
2024-01-11 14:19:23 -08:00
Linus Torvalds
01d550f0fc for-6.8/block-2024-01-08
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmWcIOIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpn6hD/9oO7U75PuxUwYYHZ9Uzxpw6gQ0LEmeyJmE
 NQYCkfYHVq3IsgOdF7elI9v3qtr6v8V8CdB7cByrnn3DgwsMuiTKZZ0dK7vH37PO
 DX+/xn349e8oH7RdRo7f3m95g1YbHfpfnj0Rc4mjTDV72Jr/HlLTVgGTQg8DEnCR
 wBIFmeuBHHgeeLh87gsWLAP7ReReiy9V1uqpDFsko2/4BxRAM/8eedkwcAxD8aEy
 rd+dT/SBQj2cOdQMUeExT3gWjwzHh6ZHx3f1WCLK5fdck6BogH2hBUeri6F/H98L
 HoaXjBZYBTH68hB/mnO5I4g1ZlrVM74Vp7JPa3e1SFFtyEi6lsyrk2J3GoNh0E7r
 pXqH5kAcaJwBsBrbRGuvEyGbn9RLTaN5Gvseud0VE4oMruyodTniQaHXuIGackgz
 sMavMho4486EUWPaF7gIBdLNK1hO13w+IDZ4+3oBxhudMqdgZbk4iYpOCqQ7QY5G
 2vkzAE/sZ+aVNXeaIQOI8dE5clBy8gJ+6+t8dm3DY1r1xdbcnU40iZ8/fri3h69r
 vHs9bpQnVWZF0gEyEflY1pkcAPpIkvMmWCR7Ehy5YCkIfa+qfSL05o3dicpWovLP
 N+gCtpkhTK2AvmUWsUMypMLRvoSOImyCIiobrr3qNBaUdgRP8xKfUa72RuRp8cGl
 Vrj5oAiE3w==
 =YAfp
 -----END PGP SIGNATURE-----

Merge tag 'for-6.8/block-2024-01-08' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:
 "Pretty quiet round this time around. This contains:

   - NVMe updates via Keith:
        - nvme fabrics spec updates (Guixin, Max)
        - nvme target udpates (Guixin, Evan)
        - nvme attribute refactoring (Daniel)
        - nvme-fc numa fix (Keith)

   - MD updates via Song:
        - Fix/Cleanup RCU usage from conf->disks[i].rdev (Yu Kuai)
        - Fix raid5 hang issue (Junxiao Bi)
        - Add Yu Kuai as Reviewer of the md subsystem
        - Remove deprecated flavors (Song Liu)
        - raid1 read error check support (Li Nan)
        - Better handle events off-by-1 case (Alex Lyakas)

   - Efficiency improvements for passthrough (Kundan)

   - Support for mapping integrity data directly (Keith)

   - Zoned write fix (Damien)

   - rnbd fixes (Kees, Santosh, Supriti)

   - Default to a sane discard size granularity (Christoph)

   - Make the default max transfer size naming less confusing
     (Christoph)

   - Remove support for deprecated host aware zoned model (Christoph)

   - Misc fixes (me, Li, Matthew, Min, Ming, Randy, liyouhong, Daniel,
     Bart, Christoph)"

* tag 'for-6.8/block-2024-01-08' of git://git.kernel.dk/linux: (78 commits)
  block: Treat sequential write preferred zone type as invalid
  block: remove disk_clear_zoned
  sd: remove the !ZBC && blk_queue_is_zoned case in sd_read_block_characteristics
  drivers/block/xen-blkback/common.h: Fix spelling typo in comment
  blk-cgroup: fix rcu lockdep warning in blkg_lookup()
  blk-cgroup: don't use removal safe list iterators
  block: floor the discard granularity to the physical block size
  mtd_blkdevs: use the default discard granularity
  bcache: use the default discard granularity
  zram: use the default discard granularity
  null_blk: use the default discard granularity
  nbd: use the default discard granularity
  ubd: use the default discard granularity
  block: default the discard granularity to sector size
  bcache: discard_granularity should not be smaller than a sector
  block: remove two comments in bio_split_discard
  block: rename and document BLK_DEF_MAX_SECTORS
  loop: don't abuse BLK_DEF_MAX_SECTORS
  aoe: don't abuse BLK_DEF_MAX_SECTORS
  null_blk: don't cap max_hw_sectors to BLK_DEF_MAX_SECTORS
  ...
2024-01-11 13:58:04 -08:00
Linus Torvalds
fb46e22a9e Many singleton patches against the MM code. The patch series which
are included in this merge do the following:
 
 - Peng Zhang has done some mapletree maintainance work in the
   series
 
 	"maple_tree: add mt_free_one() and mt_attr() helpers"
 	"Some cleanups of maple tree"
 
 - In the series "mm: use memmap_on_memory semantics for dax/kmem"
   Vishal Verma has altered the interworking between memory-hotplug
   and dax/kmem so that newly added 'device memory' can more easily
   have its memmap placed within that newly added memory.
 
 - Matthew Wilcox continues folio-related work (including a few
   fixes) in the patch series
 
 	"Add folio_zero_tail() and folio_fill_tail()"
 	"Make folio_start_writeback return void"
 	"Fix fault handler's handling of poisoned tail pages"
 	"Convert aops->error_remove_page to ->error_remove_folio"
 	"Finish two folio conversions"
 	"More swap folio conversions"
 
 - Kefeng Wang has also contributed folio-related work in the series
 
 	"mm: cleanup and use more folio in page fault"
 
 - Jim Cromie has improved the kmemleak reporting output in the
   series "tweak kmemleak report format".
 
 - In the series "stackdepot: allow evicting stack traces" Andrey
   Konovalov to permits clients (in this case KASAN) to cause
   eviction of no longer needed stack traces.
 
 - Charan Teja Kalla has fixed some accounting issues in the page
   allocator's atomic reserve calculations in the series "mm:
   page_alloc: fixes for high atomic reserve caluculations".
 
 - Dmitry Rokosov has added to the samples/ dorectory some sample
   code for a userspace memcg event listener application.  See the
   series "samples: introduce cgroup events listeners".
 
 - Some mapletree maintanance work from Liam Howlett in the series
   "maple_tree: iterator state changes".
 
 - Nhat Pham has improved zswap's approach to writeback in the
   series "workload-specific and memory pressure-driven zswap
   writeback".
 
 - DAMON/DAMOS feature and maintenance work from SeongJae Park in
   the series
 
 	"mm/damon: let users feed and tame/auto-tune DAMOS"
 	"selftests/damon: add Python-written DAMON functionality tests"
 	"mm/damon: misc updates for 6.8"
 
 - Yosry Ahmed has improved memcg's stats flushing in the series
   "mm: memcg: subtree stats flushing and thresholds".
 
 - In the series "Multi-size THP for anonymous memory" Ryan Roberts
   has added a runtime opt-in feature to transparent hugepages which
   improves performance by allocating larger chunks of memory during
   anonymous page faults.
 
 - Matthew Wilcox has also contributed some cleanup and maintenance
   work against eh buffer_head code int he series "More buffer_head
   cleanups".
 
 - Suren Baghdasaryan has done work on Andrea Arcangeli's series
   "userfaultfd move option".  UFFDIO_MOVE permits userspace heap
   compaction algorithms to move userspace's pages around rather than
   UFFDIO_COPY'a alloc/copy/free.
 
 - Stefan Roesch has developed a "KSM Advisor", in the series
   "mm/ksm: Add ksm advisor".  This is a governor which tunes KSM's
   scanning aggressiveness in response to userspace's current needs.
 
 - Chengming Zhou has optimized zswap's temporary working memory
   use in the series "mm/zswap: dstmem reuse optimizations and
   cleanups".
 
 - Matthew Wilcox has performed some maintenance work on the
   writeback code, both code and within filesystems.  The series is
   "Clean up the writeback paths".
 
 - Andrey Konovalov has optimized KASAN's handling of alloc and
   free stack traces for secondary-level allocators, in the series
   "kasan: save mempool stack traces".
 
 - Andrey also performed some KASAN maintenance work in the series
   "kasan: assorted clean-ups".
 
 - David Hildenbrand has gone to town on the rmap code.  Cleanups,
   more pte batching, folio conversions and more.  See the series
   "mm/rmap: interface overhaul".
 
 - Kinsey Ho has contributed some maintenance work on the MGLRU
   code in the series "mm/mglru: Kconfig cleanup".
 
 - Matthew Wilcox has contributed lruvec page accounting code
   cleanups in the series "Remove some lruvec page accounting
   functions".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZZyF2wAKCRDdBJ7gKXxA
 jjWjAP42LHvGSjp5M+Rs2rKFL0daBQsrlvy6/jCHUequSdWjSgEAmOx7bc5fbF27
 Oa8+DxGM9C+fwqZ/7YxU2w/WuUmLPgU=
 =0NHs
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Many singleton patches against the MM code. The patch series which are
  included in this merge do the following:

   - Peng Zhang has done some mapletree maintainance work in the series

	'maple_tree: add mt_free_one() and mt_attr() helpers'
	'Some cleanups of maple tree'

   - In the series 'mm: use memmap_on_memory semantics for dax/kmem'
     Vishal Verma has altered the interworking between memory-hotplug
     and dax/kmem so that newly added 'device memory' can more easily
     have its memmap placed within that newly added memory.

   - Matthew Wilcox continues folio-related work (including a few fixes)
     in the patch series

	'Add folio_zero_tail() and folio_fill_tail()'
	'Make folio_start_writeback return void'
	'Fix fault handler's handling of poisoned tail pages'
	'Convert aops->error_remove_page to ->error_remove_folio'
	'Finish two folio conversions'
	'More swap folio conversions'

   - Kefeng Wang has also contributed folio-related work in the series

	'mm: cleanup and use more folio in page fault'

   - Jim Cromie has improved the kmemleak reporting output in the series
     'tweak kmemleak report format'.

   - In the series 'stackdepot: allow evicting stack traces' Andrey
     Konovalov to permits clients (in this case KASAN) to cause eviction
     of no longer needed stack traces.

   - Charan Teja Kalla has fixed some accounting issues in the page
     allocator's atomic reserve calculations in the series 'mm:
     page_alloc: fixes for high atomic reserve caluculations'.

   - Dmitry Rokosov has added to the samples/ dorectory some sample code
     for a userspace memcg event listener application. See the series
     'samples: introduce cgroup events listeners'.

   - Some mapletree maintanance work from Liam Howlett in the series
     'maple_tree: iterator state changes'.

   - Nhat Pham has improved zswap's approach to writeback in the series
     'workload-specific and memory pressure-driven zswap writeback'.

   - DAMON/DAMOS feature and maintenance work from SeongJae Park in the
     series

	'mm/damon: let users feed and tame/auto-tune DAMOS'
	'selftests/damon: add Python-written DAMON functionality tests'
	'mm/damon: misc updates for 6.8'

   - Yosry Ahmed has improved memcg's stats flushing in the series 'mm:
     memcg: subtree stats flushing and thresholds'.

   - In the series 'Multi-size THP for anonymous memory' Ryan Roberts
     has added a runtime opt-in feature to transparent hugepages which
     improves performance by allocating larger chunks of memory during
     anonymous page faults.

   - Matthew Wilcox has also contributed some cleanup and maintenance
     work against eh buffer_head code int he series 'More buffer_head
     cleanups'.

   - Suren Baghdasaryan has done work on Andrea Arcangeli's series
     'userfaultfd move option'. UFFDIO_MOVE permits userspace heap
     compaction algorithms to move userspace's pages around rather than
     UFFDIO_COPY'a alloc/copy/free.

   - Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm:
     Add ksm advisor'. This is a governor which tunes KSM's scanning
     aggressiveness in response to userspace's current needs.

   - Chengming Zhou has optimized zswap's temporary working memory use
     in the series 'mm/zswap: dstmem reuse optimizations and cleanups'.

   - Matthew Wilcox has performed some maintenance work on the writeback
     code, both code and within filesystems. The series is 'Clean up the
     writeback paths'.

   - Andrey Konovalov has optimized KASAN's handling of alloc and free
     stack traces for secondary-level allocators, in the series 'kasan:
     save mempool stack traces'.

   - Andrey also performed some KASAN maintenance work in the series
     'kasan: assorted clean-ups'.

   - David Hildenbrand has gone to town on the rmap code. Cleanups, more
     pte batching, folio conversions and more. See the series 'mm/rmap:
     interface overhaul'.

   - Kinsey Ho has contributed some maintenance work on the MGLRU code
     in the series 'mm/mglru: Kconfig cleanup'.

   - Matthew Wilcox has contributed lruvec page accounting code cleanups
     in the series 'Remove some lruvec page accounting functions'"

* tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits)
  mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
  mm, treewide: introduce NR_PAGE_ORDERS
  selftests/mm: add separate UFFDIO_MOVE test for PMD splitting
  selftests/mm: skip test if application doesn't has root privileges
  selftests/mm: conform test to TAP format output
  selftests: mm: hugepage-mmap: conform to TAP format output
  selftests/mm: gup_test: conform test to TAP format output
  mm/selftests: hugepage-mremap: conform test to TAP format output
  mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING
  mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large
  mm/memcontrol: remove __mod_lruvec_page_state()
  mm/khugepaged: use a folio more in collapse_file()
  slub: use a folio in __kmalloc_large_node
  slub: use folio APIs in free_large_kmalloc()
  slub: use alloc_pages_node() in alloc_slab_page()
  mm: remove inc/dec lruvec page state functions
  mm: ratelimit stat flush from workingset shrinker
  kasan: stop leaking stack trace handles
  mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE
  mm/mglru: add dummy pmd_dirty()
  ...
2024-01-09 11:18:47 -08:00
Kirill A. Shutemov
5e0a760b44 mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
commit 23baf831a3 ("mm, treewide: redefine MAX_ORDER sanely") has
changed the definition of MAX_ORDER to be inclusive.  This has caused
issues with code that was not yet upstream and depended on the previous
definition.

To draw attention to the altered meaning of the define, rename MAX_ORDER
to MAX_PAGE_ORDER.

Link: https://lkml.kernel.org/r/20231228144704.14033-2-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-08 15:27:15 -08:00
Linus Torvalds
5db8752c3b vfs-6.8.iov_iter
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUzBQAKCRCRxhvAZXjc
 ot+3AQCZw1PBD4azVxFMWH76qwlAGoVIFug4+ogKU/iUa4VLygEA2FJh1vLJw5iI
 LpgBEIUTPVkwtzinAW94iJJo1Vr7NAI=
 =p6PB
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.8.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs iov_iter cleanups from Christian Brauner:
 "This contains a minor cleanup. The patches drop an unused argument
  from import_single_range() allowing to replace import_single_range()
  with import_ubuf() and dropping import_single_range() completely"

* tag 'vfs-6.8.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  iov_iter: replace import_single_range() with import_ubuf()
  iov_iter: remove unused 'iov' argument from import_single_range()
2024-01-08 11:43:04 -08:00
Linus Torvalds
bb93c5ed45 vfs-6.8.rw
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUzXQAKCRCRxhvAZXjc
 ogOtAQDpqUp1zY4dV/dZisCJ5xarZTsSZ1AvgmcxZBtS0NhbdgEAshWvYGA9ryS/
 ChL5jjtjjZDLhRA//reoFHTQIrdp2w8=
 =bF+R
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.8.rw' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs rw updates from Christian Brauner:
 "This contains updates from Amir for read-write backing file helpers
  for stacking filesystems such as overlayfs:

   - Fanotify is currently in the process of introducing pre content
     events. Roughly, a new permission event will be added indicating
     that it is safe to write to the file being accessed. These events
     are used by hierarchical storage managers to e.g., fill the content
     of files on first access.

     During that work we noticed that our current permission checking is
     inconsistent in rw_verify_area() and remap_verify_area().
     Especially in the splice code permission checking is done multiple
     times. For example, one time for the whole range and then again for
     partial ranges inside the iterator.

     In addition, we mostly do permission checking before we call
     file_start_write() except for a few places where we call it after.
     For pre-content events we need such permission checking to be done
     before file_start_write(). So this is a nice reason to clean this
     all up.

     After this series, all permission checking is done before
     file_start_write().

     As part of this cleanup we also massaged the splice code a bit. We
     got rid of a few helpers because we are alredy drowning in special
     read-write helpers. We also cleaned up the return types for splice
     helpers.

   - Introduce generic read-write helpers for backing files. This lifts
     some overlayfs code to common code so it can be used by the FUSE
     passthrough work coming in over the next cycles. Make Amir and
     Miklos the maintainers for this new subsystem of the vfs"

* tag 'vfs-6.8.rw' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (30 commits)
  fs: fix __sb_write_started() kerneldoc formatting
  fs: factor out backing_file_mmap() helper
  fs: factor out backing_file_splice_{read,write}() helpers
  fs: factor out backing_file_{read,write}_iter() helpers
  fs: prepare for stackable filesystems backing file helpers
  fsnotify: optionally pass access range in file permission hooks
  fsnotify: assert that file_start_write() is not held in permission hooks
  fsnotify: split fsnotify_perm() into two hooks
  fs: use splice_copy_file_range() inline helper
  splice: return type ssize_t from all helpers
  fs: use do_splice_direct() for nfsd/ksmbd server-side-copy
  fs: move file_start_write() into direct_splice_actor()
  fs: fork splice_file_range() from do_splice_direct()
  fs: create {sb,file}_write_not_started() helpers
  fs: create file_write_started() helper
  fs: create __sb_write_started() helper
  fs: move kiocb_start_write() into vfs_iocb_iter_write()
  fs: move permission hook out of do_iter_read()
  fs: move permission hook out of do_iter_write()
  fs: move file_start_write() into vfs_iter_write()
  ...
2024-01-08 11:11:51 -08:00
liyouhong
e3d7581cb1 drivers/block/xen-blkback/common.h: Fix spelling typo in comment
Fix spelling typo in comment.

Reported-by: k2ci <kernel-bot@kylinos.cn>
Signed-off-by: liyouhong <liyouhong@kylinos.cn>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20231226095701.172080-1-liyouhong@kylinos.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-04 16:10:29 -07:00
Christoph Hellwig
3753039def zram: use the default discard granularity
The discard granularity now defaults to a single sector, so don't set
that value explicitly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231228075545.362768-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-29 08:44:12 -07:00
Christoph Hellwig
724325477f null_blk: use the default discard granularity
The discard granularity now defaults to a single sector, so don't set
that value explicitly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231228075545.362768-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-29 08:44:12 -07:00
Christoph Hellwig
1e2ab2e8a9 nbd: use the default discard granularity
The discard granularity now defaults to a single sector, so don't set
that value explicitly.  Also don't bother clearing it as a discard
granularity without discard_sectors doesn't mean anything.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231228075545.362768-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-29 08:44:12 -07:00
Christoph Hellwig
3d77976c3a loop: don't abuse BLK_DEF_MAX_SECTORS
BLK_DEF_MAX_SECTORS despite the confusing name is the default cap for
the max_sectors limits.  Don't use it to initialize max_hw_setors, which
is a hardware / driver capacility.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231227092305.279567-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-27 10:46:01 -07:00
Christoph Hellwig
3888b2ee62 aoe: don't abuse BLK_DEF_MAX_SECTORS
BLK_DEF_MAX_SECTORS despite the confusing name is the default cap for
the max_sectors limits.  Don't use it to initialize max_hw_setors, which
is a hardware / driver capacility.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231227092305.279567-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-27 10:46:01 -07:00
Christoph Hellwig
9a9525de86 null_blk: don't cap max_hw_sectors to BLK_DEF_MAX_SECTORS
null_blk has some rather odd capping of the max_hw_sectors value to
BLK_DEF_MAX_SECTORS, which doesn't make sense - max_hw_sector is the
hardware limit, and BLK_DEF_MAX_SECTORS despite the confusing name is the
default cap for the max_sectors field used for normal file system I/O.

Remove all the capping, and simply leave it to the block layer or
user to take up or not all of that for file system I/O.

Fixes: ea17fd354c ("null_blk: Allow controlling max_hw_sectors limit")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231227092305.279567-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-27 10:46:01 -07:00
Christoph Hellwig
34c7db44b4 loop: don't update discard limits from loop_set_status
loop_set_status doesn't change anything relevant to the discard and
write_zeroes setting, so don't bother calling loop_config_discard.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231227082020.249427-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-27 10:43:46 -07:00
Randy Dunlap
8aabc11c8f drbd: actlog: fix kernel-doc warnings and spelling
Fix all kernel-doc warnings in drbd_actlog.c:

drbd_actlog.c:963: warning: No description found for return value of 'drbd_rs_begin_io'
drbd_actlog.c:1015: warning: Function parameter or member 'peer_device' not described in 'drbd_try_rs_begin_io'
drbd_actlog.c:1015: warning: Excess function parameter 'device' description in 'drbd_try_rs_begin_io'
drbd_actlog.c:1015: warning: No description found for return value of 'drbd_try_rs_begin_io'
drbd_actlog.c:1197: warning: No description found for return value of 'drbd_rs_del_all'

Fix one spelling error (s/ore/or/).

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
Cc: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Cc:  <drbd-dev@lists.linbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc:  <linux-block@vger.kernel.org>
Link: https://lore.kernel.org/r/20231222061909.8791-1-rdunlap@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-22 07:18:35 -07:00
Christoph Hellwig
d73e93b4df block: simplify disk_set_zoned
Only use disk_set_zoned to actually enable zoned device support.
For clearing it, call disk_clear_zoned, which is renamed from
disk_clear_zone_settings and now directly clears the zoned flag as
well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20231217165359.604246-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-19 20:17:43 -07:00
Christoph Hellwig
7437bb73f0 block: remove support for the host aware zone model
When zones were first added the SCSI and ATA specs, two different
models were supported (in addition to the drive managed one that
is invisible to the host):

 - host managed where non-conventional zones there is strict requirement
   to write at the write pointer, or else an error is returned
 - host aware where a write point is maintained if writes always happen
   at it, otherwise it is left in an under-defined state and the
   sequential write preferred zones behave like conventional zones
   (probably very badly performing ones, though)

Not surprisingly this lukewarm model didn't prove to be very useful and
was finally removed from the ZBC and SBC specs (NVMe never implemented
it).  Due to to the easily disappearing write pointer host software
could never rely on the write pointer to actually be useful for say
recovery.

Fortunately only a few HDD prototypes shipped using this model which
never made it to mass production.  Drop the support before it is too
late.  Note that any such host aware prototype HDD can still be used
with Linux as we'll now treat it as a conventional HDD.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20231217165359.604246-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-19 20:17:43 -07:00
Christoph Hellwig
a971ed8002 virtio_blk: remove the broken zone revalidation support
virtblk_revalidate_zones is called unconditionally from
virtblk_config_changed_work from the virtio config_changed callback.

virtblk_revalidate_zones is a bit odd in that it re-clears the zoned
state for host aware or non-zoned devices, which isn't needed unless the
zoned mode changed - but a zone mode change to a host managed model isn't
handled at all, and virtio_blk also doesn't handle any other config
change except for a capacity change is handled (and even if it was
the upper layers above virtio_blk wouldn't handle it very well).

But even the useful case of a size change that would add or remove
zones isn't handled properly as blk_revalidate_disk_zones expects the
device capacity to cover all zones, but the capacity is only updated
after virtblk_revalidate_zones.

As this code appears to be entirely untested and is getting in the way
remove it for now, but it can be readded in a fixed version with
proper test coverage if needed.

Fixes: 95bfec41bd ("virtio-blk: add support for zoned block devices")
Fixes: f1ba4e674f ("virtio-blk: fix to match virtio spec")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20231217165359.604246-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-19 20:17:43 -07:00
Christoph Hellwig
77360cadaa virtio_blk: cleanup zoned device probing
Move reading and checking the zoned model from virtblk_probe_zoned_device
into the caller, leaving only the code to perform the actual setup for
host managed zoned devices in virtblk_probe_zoned_device.

This allows to share the model reading and sharing between builds with
and without CONFIG_BLK_DEV_ZONED, and improve it for the
!CONFIG_BLK_DEV_ZONED case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20231217165359.604246-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-19 20:17:43 -07:00
Kees Cook
9e4bf6a08d block/rnbd-srv: Check for unlikely string overflow
Since "dev_search_path" can technically be as large as PATH_MAX,
there was a risk of truncation when copying it and a second string
into "full_path" since it was also PATH_MAX sized. The W=1 builds were
reporting this warning:

drivers/block/rnbd/rnbd-srv.c: In function 'process_msg_open.isra':
drivers/block/rnbd/rnbd-srv.c:616:51: warning: '%s' directive output may be truncated writing up to 254 bytes into a region of size between 0 and 4095 [-Wformat-truncation=]
  616 |                 snprintf(full_path, PATH_MAX, "%s/%s",
      |                                                   ^~
In function 'rnbd_srv_get_full_path',
    inlined from 'process_msg_open.isra' at drivers/block/rnbd/rnbd-srv.c:721:14: drivers/block/rnbd/rnbd-srv.c:616:17: note: 'snprintf' output between 2 and 4351 bytes into a destination of size 4096
  616 |                 snprintf(full_path, PATH_MAX, "%s/%s",
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  617 |                          dev_search_path, dev_name);
      |                          ~~~~~~~~~~~~~~~~~~~~~~~~~~

To fix this, unconditionally check for truncation (as was already done
for the case where "%SESSNAME%" was present).

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202312100355.lHoJPgKy-lkp@intel.com/
Cc: Md. Haris Iqbal <haris.iqbal@ionos.com>
Cc: Jack Wang <jinpu.wang@ionos.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc:  <linux-block@vger.kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20231212214738.work.169-kees@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-13 08:15:48 -07:00
Pavel Begunkov
b66509b849 io_uring: split out cmd api into a separate header
linux/io_uring.h is slowly becoming a rubbish bin where we put
anything exposed to other subsystems. For instance, the task exit
hooks and io_uring cmd infra are completely orthogonal and don't need
each other's definitions. Start cleaning it up by splitting out all
command bits into a new header file.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7ec50bae6e21f371d3850796e716917fc141225a.1701391955.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-12 07:42:52 -07:00
Sergey Senozhatsky
73829b7134 zram: use kmap_local_page()
Use kmap_local_page() instead of kmap_atomic() which has been
deprecated.

Link: https://lkml.kernel.org/r/20231128083845.848008-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 16:51:55 -08:00
Sergey Senozhatsky
2e16898d0d zram: tweak writeback config help
Writeback is for incompressible and idle zram pages.

Link: https://lkml.kernel.org/r/20231115024223.4133148-2-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dmytro Maluka <dmaluka@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 16:51:41 -08:00
Sergey Senozhatsky
a7a0350583 zram: split memory-tracking and ac-time tracking
ZRAM_MEMORY_TRACKING enables two features:
- per-entry ac-time tracking
- debugfs interface

The latter one is the reason why memory-tracking depends on DEBUG_FS,
while the former one is used far beyond debugging these days.  Namely
ac-time is used for fine grained writeback of idle entries (pages).

Move ac-time tracking under its own config option so that it can be
enabled (along with writeback) on systems without DEBUG_FS.

[senozhatsky@chromium.org: ifdef fixup, per Dmytro]
  Link: https://lkml.kernel.org/r/20231117013543.540280-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20231115024223.4133148-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dmytro Maluka <dmaluka@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 16:51:40 -08:00
Jens Axboe
9fd7874c0e
iov_iter: replace import_single_range() with import_ubuf()
With the removal of the 'iov' argument to import_single_range(), the two
functions are now fully identical. Convert the import_single_range()
callers to import_ubuf(), and remove the former fully.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20231204174827.1258875-3-axboe@kernel.dk
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-05 11:57:37 +01:00
Jens Axboe
6ac805d138
iov_iter: remove unused 'iov' argument from import_single_range()
It is entirely unused, just get rid of it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20231204174827.1258875-2-axboe@kernel.dk
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-05 11:57:34 +01:00
Stefan Hajnoczi
b8e0792449 virtio_blk: fix snprintf truncation compiler warning
Commit 4e04005256 ("virtio-blk: support polling I/O") triggers the
following gcc 13 W=1 warnings:

drivers/block/virtio_blk.c: In function ‘init_vq’:
drivers/block/virtio_blk.c:1077:68: warning: ‘%d’ directive output may be truncated writing between 1 and 11 bytes into a region of size 7 [-Wformat-truncation=]
 1077 |                 snprintf(vblk->vqs[i].name, VQ_NAME_LEN, "req_poll.%d", i);
      |                                                                    ^~
drivers/block/virtio_blk.c:1077:58: note: directive argument in the range [-2147483648, 65534]
 1077 |                 snprintf(vblk->vqs[i].name, VQ_NAME_LEN, "req_poll.%d", i);
      |                                                          ^~~~~~~~~~~~~
drivers/block/virtio_blk.c:1077:17: note: ‘snprintf’ output between 11 and 21 bytes into a destination of size 16
 1077 |                 snprintf(vblk->vqs[i].name, VQ_NAME_LEN, "req_poll.%d", i);
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is a false positive because the lower bound -2147483648 is
incorrect. The true range of i is [0, num_vqs - 1] where 0 < num_vqs <
65536.

The code mixes int, unsigned short, and unsigned int types in addition
to using "%d" for an unsigned value. Use unsigned short and "%u"
consistently to solve the compiler warning.

Cc: Suwan Kim <suwan.kim027@gmail.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202312041509.DIyvEt9h-lkp@intel.com/
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20231204140743.1487843-1-stefanha@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-12-04 09:43:53 -05:00
Supriti Singh
70d85bec8f block/rnbd: use %pe to print errors
While printing error, replace %ld by %pe. %pe prints a string
whereas %ld would print an error code.

Signed-off-by: Supriti Singh <supriti.singh@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Link: https://lore.kernel.org/r/20231124213422.113449-3-haris.iqbal@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-27 09:11:12 -07:00
Santosh Pradhan
fadf3dffe5 block/rnbd: add support for REQ_OP_WRITE_ZEROES
Remove REQ_OP_WRITE_SAME in favour of REQ_OP_WRITE_ZEROES.

Signed-off-by: Santosh Pradhan <santosh.pradhan@ionos.com>
Reviewed-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Link: https://lore.kernel.org/r/20231124213422.113449-2-haris.iqbal@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-27 09:11:12 -07:00
Amir Goldstein
269aed7014 fs: move file_start_write() into vfs_iter_write()
All the callers of vfs_iter_write() call file_start_write() just before
calling vfs_iter_write() except for target_core_file's fd_do_rw().

Move file_start_write() from the callers into vfs_iter_write().
fd_do_rw() calls vfs_iter_write() with a non-regular file, so
file_start_write() is a no-op.

This is needed for fanotify "pre content" events.

Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231122122715.2561213-11-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-24 10:09:51 +01:00
Li Nan
98c598afc2 nbd: pass nbd_sock to nbd_read_reply() instead of index
If a socket is processing ioctl 'NBD_SET_SOCK', config->socks might be
krealloc in nbd_add_socket(), and a garbage request is received now, a UAF
may occurs.

  T1
  nbd_ioctl
   __nbd_ioctl
    nbd_add_socket
     blk_mq_freeze_queue
				T2
  				recv_work
  				 nbd_read_reply
  				  sock_xmit
     krealloc config->socks
				   def config->socks

Pass nbd_sock to nbd_read_reply(). And introduce a new function
sock_xmit_recv(), which differs from sock_xmit only in the way it get
socket.

==================================================================
BUG: KASAN: use-after-free in sock_xmit+0x525/0x550
Read of size 8 at addr ffff8880188ec428 by task kworker/u12:1/18779

Workqueue: knbd4-recv recv_work
Call Trace:
 __dump_stack
 dump_stack+0xbe/0xfd
 print_address_description.constprop.0+0x19/0x170
 __kasan_report.cold+0x6c/0x84
 kasan_report+0x3a/0x50
 sock_xmit+0x525/0x550
 nbd_read_reply+0xfe/0x2c0
 recv_work+0x1c2/0x750
 process_one_work+0x6b6/0xf10
 worker_thread+0xdd/0xd80
 kthread+0x30a/0x410
 ret_from_fork+0x22/0x30

Allocated by task 18784:
 kasan_save_stack+0x1b/0x40
 kasan_set_track
 set_alloc_info
 __kasan_kmalloc
 __kasan_kmalloc.constprop.0+0xf0/0x130
 slab_post_alloc_hook
 slab_alloc_node
 slab_alloc
 __kmalloc_track_caller+0x157/0x550
 __do_krealloc
 krealloc+0x37/0xb0
 nbd_add_socket
 +0x2d3/0x880
 __nbd_ioctl
 nbd_ioctl+0x584/0x8e0
 __blkdev_driver_ioctl
 blkdev_ioctl+0x2a0/0x6e0
 block_ioctl+0xee/0x130
 vfs_ioctl
 __do_sys_ioctl
 __se_sys_ioctl+0x138/0x190
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x61/0xc6

Freed by task 18784:
 kasan_save_stack+0x1b/0x40
 kasan_set_track+0x1c/0x30
 kasan_set_free_info+0x20/0x40
 __kasan_slab_free.part.0+0x13f/0x1b0
 slab_free_hook
 slab_free_freelist_hook
 slab_free
 kfree+0xcb/0x6c0
 krealloc+0x56/0xb0
 nbd_add_socket+0x2d3/0x880
 __nbd_ioctl
 nbd_ioctl+0x584/0x8e0
 __blkdev_driver_ioctl
 blkdev_ioctl+0x2a0/0x6e0
 block_ioctl+0xee/0x130
 vfs_ioctl
 __do_sys_ioctl
 __se_sys_ioctl+0x138/0x190
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x61/0xc6

Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230911023308.3467802-1-linan666@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-21 07:42:04 -07:00
Chengming Zhou
53f2bca260 block/null_blk: Fix double blk_mq_start_request() warning
When CONFIG_BLK_DEV_NULL_BLK_FAULT_INJECTION is enabled, null_queue_rq()
would return BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE for the request,
which has been marked as MQ_RQ_IN_FLIGHT by blk_mq_start_request().

Then null_queue_rqs() put these requests in the rqlist, return back to
the block layer core, which would try to queue them individually again,
so the warning in blk_mq_start_request() triggered.

Fix it by splitting the null_queue_rq() into two parts: the first is the
preparation of request, the second is the handling of request. We put
the blk_mq_start_request() after the preparation part, which may fail
and return back to the block layer core.

The throttling also belongs to the preparation part, so move it before
blk_mq_start_request(). And change the return type of null_handle_cmd()
to void, since it always return BLK_STS_OK now.

Reported-by:  <syzbot+fcc47ba2476570cbbeb0@syzkaller.appspotmail.com>
Closes: https://lore.kernel.org/all/0000000000000e6aac06098aee0c@google.com/
Fixes: d78bfa1346 ("block/null_blk: add queue_rqs() support")
Suggested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Link: https://lore.kernel.org/r/20231120032521.1012037-1-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-20 10:26:26 -07:00
Li Nan
c2da049f41 nbd: fix null-ptr-dereference while accessing 'nbd->config'
Memory reordering may occur in nbd_genl_connect(), causing config_refs
to be set to 1 while nbd->config is still empty. Opening nbd at this
time will cause null-ptr-dereference.

   T1                      T2
   nbd_open
    nbd_get_config_unlocked
                 	   nbd_genl_connect
                 	    nbd_alloc_and_init_config
                 	     //memory reordered
                  	     refcount_set(&nbd->config_refs, 1)  // 2
     nbd->config
      ->null point
			     nbd->config = config  // 1

Fix it by adding smp barrier to guarantee the execution sequence.

Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20231116162316.1740402-4-linan666@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-20 10:16:44 -07:00
Li Nan
3123ac7792 nbd: factor out a helper to get nbd_config without holding 'config_lock'
There are no functional changes, just to make code cleaner and prepare
to fix null-ptr-dereference while accessing 'nbd->config'.

Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20231116162316.1740402-3-linan666@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-20 10:16:44 -07:00
Li Nan
1b59860540 nbd: fold nbd config initialization into nbd_alloc_config()
There are no functional changes, make the code cleaner and prepare to
fix null-ptr-dereference while accessing 'nbd->config'.

Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20231116162316.1740402-2-linan666@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-20 10:16:44 -07:00
Li Lingfeng
327462725b nbd: fix uaf in nbd_open
Commit 4af5f2e030 ("nbd: use blk_mq_alloc_disk and
blk_cleanup_disk") cleans up disk by blk_cleanup_disk() and it won't set
disk->private_data as NULL as before. UAF may be triggered in nbd_open()
if someone tries to open nbd device right after nbd_put() since nbd has
been free in nbd_dev_remove().

Fix this by implementing ->free_disk and free private data in it.

Fixes: 4af5f2e030 ("nbd: use blk_mq_alloc_disk and blk_cleanup_disk")
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20231107103435.2074904-1-lilingfeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-07 08:14:57 -07:00
Linus Torvalds
77fa2fbe87 vhost,virtio,vdpa: features, fixes, cleanups
vdpa/mlx5:
 	VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK
 	new maintainer
 vdpa:
 	support for vq descriptor mappings
 	decouple reset of iotlb mapping from device reset
 
 fixes, cleanups all over the place
 
 Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmVCUMYPHG1zdEByZWRo
 YXQuY29tAAoJECgfDbjSjVRp4L0H/RKcnNXPRqzhhBI1XVQ11Z8CO8WjovcmJalu
 ADHNEGmvuWnY79fp9eLiZ4iVaTx1qbzqIB5Q500DJ65jh71W7UQ8ww6CGjNUoRGs
 Zoe4G09WoOf4bvDZZzVV7ml/AzMdsHWSZK8pxY3QI9CsC9Zfp9hg20QYxPylCqYx
 SIJx7w2MkoojfmtOHRx1WUxaQz99yfU4Z0C5PxtRE1HGN6/a1aY0P0CAl5jq8uCK
 U5sCRsfCmP7VKlspeEddMiPA35ADbCiysSobCbwGVQEs5cHpMUX7KWa+oV0tF/PY
 9uyJb2rJy6zG3tXmL4XNib665ZR86HX6qiWRfm2nBQQStuHaJyg=
 =mXgo
 -----END PGP SIGNATURE-----

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio updates from Michael Tsirkin:
 "vhost,virtio,vdpa: features, fixes, cleanups.

  vdpa/mlx5:
   - VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK
   - new maintainer

  vdpa:
   - support for vq descriptor mappings
   - decouple reset of iotlb mapping from device reset

  and fixes, cleanups all over the place"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (34 commits)
  vdpa_sim: implement .reset_map support
  vdpa/mlx5: implement .reset_map driver op
  vhost-vdpa: clean iotlb map during reset for older userspace
  vdpa: introduce .compat_reset operation callback
  vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
  vhost-vdpa: reset vendor specific mapping to initial state in .release
  vdpa: introduce .reset_map operation callback
  virtio_pci: add check for common cfg size
  virtio-blk: fix implicit overflow on virtio_max_dma_size
  virtio_pci: add build offset check for the new common cfg items
  virtio: add definition of VIRTIO_F_NOTIF_CONFIG_DATA feature bit
  vduse: make vduse_class constant
  vhost-scsi: Spelling s/preceeding/preceding/g
  virtio: kdoc for struct virtio_pci_modern_device
  vdpa: Update sysfs ABI documentation
  MAINTAINERS: Add myself as mlx5_vdpa driver
  virtio-balloon: correct the comment of virtballoon_migratepage()
  mlx5_vdpa: offer VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK
  vdpa/mlx5: Update cvq iotlb mapping on ASID change
  vdpa/mlx5: Make iotlb helper functions more generic
  ...
2023-11-05 09:02:32 -10:00
Linus Torvalds
90d624af2e for-6.7/block-2023-10-30
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmU/vjMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqVcEADaNf6X7LVKKrdQ4sA38dBZYGM3kNz0SCYV
 vkjQAs0Fyylbu6EhYOLO/R+UCtpytLlnbr4NmFDbhaEG4OJcwoDLDxpMQ7Gda58v
 4RBXAiIlhZX3g99/ebvtNtVEvQa9gF4h8k2n/gKsG+PoS+cbkKAI0Na2duI1d/pL
 B5nQ31VAHhsyjUv1nIPLrQS6lsL7ZTFvH8L6FLcEVM03poy8PE2H6kN7WoyXwtfo
 LN3KK0Nu7B0Wx2nDx0ffisxcDhbChGs7G2c9ndPTvxg6/4HW+2XSeNUwTxXYpyi2
 ZCD+AHCzMB/w6GNNWFw4xfau5RrZ4c4HdBnmyR6+fPb1u6nGzjgquzFyLyLu5MkA
 n/NvOHP1Cbd3QIXG1TnBi2kDPkQ5FOIAjFSe9IZAGT4dUkZ63wBoDil1jCgMLuCR
 C+AFPLhiIg3cFvu9+fdZ6BkCuZYESd3YboBtRKeMionEexrPTKt4QWqIoVJgd/Y7
 nwvR8jkIBpVgQZT8ocYqhSycLCYV2lGqEBSq4rlRiEb/W1G9Awmg8UTGuUYFSC1G
 vGPCwhGi+SBsbo84aPCfSdUkKDlruNWP0GwIFxo0hsiTOoHP+7UWeenJ2Jw5lNPt
 p0Y72TEDDaSMlE4cJx6IWdWM/B+OWzCyRyl3uVcy7bToEsVhIbBSSth7+sh2n7Cy
 WgH1lrtMzg==
 =sace
 -----END PGP SIGNATURE-----

Merge tag 'for-6.7/block-2023-10-30' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - Improvements to the queue_rqs() support, and adding null_blk support
   for that as well (Chengming)

 - Series improving badblocks support (Coly)

 - Key store support for sed-opal (Greg)

 - IBM partition string handling improvements (Jan)

 - Make number of ublk devices supported configurable (Mike)

 - Cancelation improvements for ublk (Ming)

 - MD pull requests via Song:
     - Handle timeout in md-cluster, by Denis Plotnikov
     - Cleanup pers->prepare_suspend, by Yu Kuai
     - Rewrite mddev_suspend(), by Yu Kuai
     - Simplify md_seq_ops, by Yu Kuai
     - Reduce unnecessary locking array_state_store(), by Mariusz
       Tkaczyk
     - Make rdev add/remove independent from daemon thread, by Yu Kuai
     - Refactor code around quiesce() and mddev_suspend(), by Yu Kuai

 - NVMe pull request via Keith:
     - nvme-auth updates (Mark)
     - nvme-tcp tls (Hannes)
     - nvme-fc annotaions (Kees)

 - Misc cleanups and improvements (Jiapeng, Joel)

* tag 'for-6.7/block-2023-10-30' of git://git.kernel.dk/linux: (95 commits)
  block: ublk_drv: Remove unused function
  md: cleanup pers->prepare_suspend()
  nvme-auth: allow mixing of secret and hash lengths
  nvme-auth: use transformed key size to create resp
  nvme-auth: alloc nvme_dhchap_key as single buffer
  nvmet-tcp: use 'spin_lock_bh' for state_lock()
  powerpc/pseries: PLPKS SED Opal keystore support
  block: sed-opal: keystore access for SED Opal keys
  block:sed-opal: SED Opal keystore
  ublk: simplify aborting request
  ublk: replace monitor with cancelable uring_cmd
  ublk: quiesce request queue when aborting queue
  ublk: rename mm_lock as lock
  ublk: move ublk_cancel_dev() out of ub->mutex
  ublk: make sure io cmd handled in submitter task context
  ublk: don't get ublk device reference in ublk_abort_queue()
  ublk: Make ublks_max configurable
  ublk: Limit dev_id/ub_number values
  md-cluster: check for timeout while a new disk adding
  nvme: rework NVME_AUTH Kconfig selection
  ...
2023-11-01 12:30:07 -10:00
zhenwei pi
fafb51a67f virtio-blk: fix implicit overflow on virtio_max_dma_size
The following codes have an implicit conversion from size_t to u32:
(u32)max_size = (size_t)virtio_max_dma_size(vdev);

This may lead overflow, Ex (size_t)4G -> (u32)0. Once
virtio_max_dma_size() has a larger size than U32_MAX, use U32_MAX
instead.

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Message-Id: <20230904061045.510460-1-pizhenwei@bytedance.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-11-01 09:19:59 -04:00
Linus Torvalds
befaa609f4 hardening updates for v6.7-rc1
- Add LKDTM test for stuck CPUs (Mark Rutland)
 
 - Improve LKDTM selftest behavior under UBSan (Ricardo Cañuelo)
 
 - Refactor more 1-element arrays into flexible arrays (Gustavo A. R. Silva)
 
 - Analyze and replace strlcpy and strncpy uses (Justin Stitt, Azeem Shaikh)
 
 - Convert group_info.usage to refcount_t (Elena Reshetova)
 
 - Add __counted_by annotations (Kees Cook, Gustavo A. R. Silva)
 
 - Add Kconfig fragment for basic hardening options (Kees Cook, Lukas Bulwahn)
 
 - Fix randstruct GCC plugin performance mode to stay in groups (Kees Cook)
 
 - Fix strtomem() compile-time check for small sources (Kees Cook)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmU/3cUWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJsEoEACBGPSiOmfSWdH3TOnIG270PD24
 jGjg8KFv7RC/JTOdYmpLl0okdlGT9LvjN/ToSSDEw3PIayxoXUdhkbYy0MYtiV3m
 yz2ozDTzJuplQX/W2fPE+nXSzIwHao2zjPPFjHnT7lt8IIjhgjiOtLfZ2gGUkW99
 Mdu2aWh3u0r4tC8OS23++yN5ibRc5l72efsjDWjZ0aPXnxE1bjmLMiIPiizpndIf
 beasPuDBs98sJVYouemCwnsPXuXOPz3Q1Cpo/fTd+TMTJCLSemCQZCTuOBU0acI/
 ZjLCgCaJU1yIYKBMtrIN4G9kITZniXX3/Nm4o6NQMVlcCqMeNaHuflomqWoqWfhE
 UPbRo2eghZOaMNiCKLLvZDIqPrh1IcsiEl6Ef3W4hICc42GTK96IuGisIvDXwQ4N
 /SzTOupJuN42noh3z1M3XuZy5RoXJ99IYDNY5CTKf9IdqvA0bbGkU3nb1gZH/xw9
 BjTqKzR/7K1kTXuSgagDZ1Wceej9pZxhX7E3IHYsP8ZOvKug3EeL4yybVwQ3HRfq
 Qnzcp/qPB9cOkLSQXveRTFTsj2mX28Gixct/iDuc1jIYwGQlY1gI6dcUcqby6ptM
 BrQti7eR2NH2+T3aE2UVCIWsZVhx7NaSF+z8JxfAuu56jicc4xJVsi8zrNveWX5M
 m2VXyBl3121BVtKi4w==
 =0iVF
 -----END PGP SIGNATURE-----

Merge tag 'hardening-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardening updates from Kees Cook:
 "One of the more voluminous set of changes is for adding the new
  __counted_by annotation[1] to gain run-time bounds checking of
  dynamically sized arrays with UBSan.

   - Add LKDTM test for stuck CPUs (Mark Rutland)

   - Improve LKDTM selftest behavior under UBSan (Ricardo Cañuelo)

   - Refactor more 1-element arrays into flexible arrays (Gustavo A. R.
     Silva)

   - Analyze and replace strlcpy and strncpy uses (Justin Stitt, Azeem
     Shaikh)

   - Convert group_info.usage to refcount_t (Elena Reshetova)

   - Add __counted_by annotations (Kees Cook, Gustavo A. R. Silva)

   - Add Kconfig fragment for basic hardening options (Kees Cook, Lukas
     Bulwahn)

   - Fix randstruct GCC plugin performance mode to stay in groups (Kees
     Cook)

   - Fix strtomem() compile-time check for small sources (Kees Cook)"

* tag 'hardening-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (56 commits)
  hwmon: (acpi_power_meter) replace open-coded kmemdup_nul
  reset: Annotate struct reset_control_array with __counted_by
  kexec: Annotate struct crash_mem with __counted_by
  virtio_console: Annotate struct port_buffer with __counted_by
  ima: Add __counted_by for struct modsig and use struct_size()
  MAINTAINERS: Include stackleak paths in hardening entry
  string: Adjust strtomem() logic to allow for smaller sources
  hardening: x86: drop reference to removed config AMD_IOMMU_V2
  randstruct: Fix gcc-plugin performance mode to stay in group
  mailbox: zynqmp: Annotate struct zynqmp_ipi_pdata with __counted_by
  drivers: thermal: tsens: Annotate struct tsens_priv with __counted_by
  irqchip/imx-intmux: Annotate struct intmux_data with __counted_by
  KVM: Annotate struct kvm_irq_routing_table with __counted_by
  virt: acrn: Annotate struct vm_memory_region_batch with __counted_by
  hwmon: Annotate struct gsc_hwmon_platform_data with __counted_by
  sparc: Annotate struct cpuinfo_tree with __counted_by
  isdn: kcapi: replace deprecated strncpy with strscpy_pad
  isdn: replace deprecated strncpy with strscpy
  NFS/flexfiles: Annotate struct nfs4_ff_layout_segment with __counted_by
  nfs41: Annotate struct nfs4_file_layout_dsaddr with __counted_by
  ...
2023-10-30 19:09:55 -10:00
Christoph Hellwig
6e57236ed6
block: move bdev_mark_dead out of disk_check_media_change
disk_check_media_change is mostly called from ->open where it makes
little sense to mark the file system on the device as dead, as we
are just opening it.  So instead of calling bdev_mark_dead from
disk_check_media_change move it into the few callers that are not
in an open instance.  This avoid calling into bdev_mark_dead and
thus taking s_umount with open_mutex held.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231017184823.1383356-4-hch@lst.de
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-28 13:29:23 +02:00
Jan Kara
eed993a091
zram: Convert to use bdev_open_by_dev()
Convert zram to use bdev_open_by_dev() and pass the handle around.

CC: Minchan Kim <minchan@kernel.org>
CC: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-8-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-28 13:29:17 +02:00
Jan Kara
436d3705bf
xen/blkback: Convert to bdev_open_by_dev()
Convert xen/blkback to use bdev_open_by_dev() and pass the
handle around.

CC: xen-devel@lists.xenproject.org
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-7-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-28 13:29:17 +02:00
Jan Kara
c2114f11a3
rnbd-srv: Convert to use bdev_open_by_path()
Convert rnbd-srv to use bdev_open_by_path() and pass the handle
around.

CC: Jack Wang <jinpu.wang@ionos.com>
CC: "Md. Haris Iqbal" <haris.iqbal@ionos.com>
Acked-by: "Md. Haris Iqbal" <haris.iqbal@ionos.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-6-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-28 13:29:17 +02:00
Jan Kara
7ac86df899
pktcdvd: Convert to bdev_open_by_dev()
Convert pktcdvd to use bdev_open_by_dev().

Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-5-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-28 13:29:17 +02:00
Jan Kara
75e27d3734
drdb: Convert to use bdev_open_by_path()
Convert drdb to use bdev_open_by_path().

CC: drbd-dev@lists.linbit.com
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-4-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-28 13:29:16 +02:00
Jiapeng Chong
411957553b block: ublk_drv: Remove unused function
The function are defined in the ublk_drv.c file, but not called
elsewhere, so delete the unused function.

drivers/block/ublk_drv.c:1211:20: warning: unused function 'ublk_abort_io_cmds'.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=6938
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Fixes: b4e1353f46 ("ublk: simplify aborting request")
Reviewed-by: Ming Lei <ming.lei@rehdat.com>
Link: https://lore.kernel.org/r/20231019030444.53680-1-jiapeng.chong@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-19 05:24:02 -06:00
Ming Lei
b4e1353f46 ublk: simplify aborting request
Now ublk_abort_queue() is run exclusively with ublk_queue_rq() and the
ubq_daemon task, so simplify aborting request:

- set UBLK_IO_FLAG_ABORTED in ublk_abort_queue() just for aborting
this request

- abort request in ublk_queue_rq() if ubq->canceling is set

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231009093324.957829-8-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-17 08:27:56 -06:00
Ming Lei
216c8f5ef0 ublk: replace monitor with cancelable uring_cmd
Monitor work actually introduces one extra context for handling abort, this
way is easy to cause race, and also introduce extra delay when handling
aborting.

Now we start to support cancelable uring_cmd, so use it instead:

1) this cancel callback is either run from the uring cmd submission task
context or called after the io_uring context is exit, so the callback is
run exclusively with ublk_ch_uring_cmd() and __ublk_rq_task_work().

2) the previous patch freezes request queue when calling ublk_abort_queue(),
which is now completely exclusive with ublk_queue_rq() and
ublk_ch_uring_cmd()/__ublk_rq_task_work().

3) in timeout handler, if all IOs are in-flight, then all uring commands
are completed, uring command canceling can't help us to provide forward
progress any more, so call ublk_abort_requests() in timeout handler.

This way simplifies aborting queue, and is helpful for adding new feature,
such as, relax the limit of using single task for handling one queue.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231009093324.957829-7-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-17 08:27:56 -06:00
Ming Lei
bd23f6c2c2 ublk: quiesce request queue when aborting queue
So far aborting queue ends request when the ubq daemon is exiting, and
it can be run concurrently with ublk_queue_rq(), this way is fragile and
we depend on the tricky usage of UBLK_IO_FLAG_ABORTED for avoiding such
race.

Quiesce queue when aborting queue, and the two code paths can be run
completely exclusively, then it becomes easier to add new ublk feature,
such as relaxing single same task limit for each queue.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231009093324.957829-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-17 08:27:56 -06:00
Ming Lei
28dde8c062 ublk: rename mm_lock as lock
Rename mm_lock field of ublk_device as lock, so that this lock can be reused
for protecting access of ub->ub_disk, which will be used for simplifying
ublk_abort_queue() by quiesce queue in next patch.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231009093324.957829-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-17 08:27:56 -06:00
Ming Lei
85248d670b ublk: move ublk_cancel_dev() out of ub->mutex
ublk_cancel_dev() just calls ublk_cancel_queue() to cancel all pending
io commands after ublk request queue is idle. The only protection is just
the read & write of ubq->nr_io_ready and avoid duplicated command cancel,
so add one per-queue lock with cancel flag for providing this protection,
meantime move ublk_cancel_dev() out of ub->mutex.

Then we needn't to call io_uring_cmd_complete_in_task() to cancel
pending command. And the same cancel logic will be re-used for
cancelable uring command.

This patch basically reverts commit ac5902f84b ("ublk: fix AB-BA lockdep warning").

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231009093324.957829-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-17 08:27:55 -06:00
Ming Lei
3421c7f68b ublk: make sure io cmd handled in submitter task context
In well-done ublk server implementation, ublk io command won't be
linked into any link chain. Meantime they are always handled in no-wait
style, so basically io cmd is always handled in submitter task context.

However, the server may set IOSQE_ASYNC, or io command is linked to one
chain mistakenly, then we may still run into io-wq context and
ctx->uring_lock isn't held.

So in case of IO_URING_F_UNLOCKED, schedule this command by
io_uring_cmd_complete_in_task to force running it in submitter task. Then
ublk_ch_uring_cmd_local() is guaranteed to run with context uring_lock held,
and we needn't to worry about sync among submission code path any more.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231009093324.957829-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-17 08:27:55 -06:00
Ming Lei
8ed90e370f ublk: don't get ublk device reference in ublk_abort_queue()
ublk_abort_queue() is called in ublk_daemon_monitor_work(), in which
it is guaranteed that the device is live because monitor work is
canceled when removing device, so no need to get the device reference.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231009093324.957829-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-17 08:27:55 -06:00
Mike Christie
6eba24aeb5 ublk: Make ublks_max configurable
We are converting tcmu applications to ublk, but have systems with up
to 1k devices. This patch allows us to configure the ublks_max from
userspace with the ublks_max modparam.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231012150600.6198-3-michael.christie@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-17 08:27:24 -06:00
Mike Christie
56974c07f2 ublk: Limit dev_id/ub_number values
The dev_id/ub_number is used for the ublk dev's char device's minor
number so it has to fit into MINORMASK. This patch adds checks to prevent
userspace from passing a number that's too large and limits what can be
allocated by the ublk_index_idr for the case where userspace has the
kernel allocate the dev_id/ub_number.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231012150600.6198-2-michael.christie@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-17 08:27:24 -06:00
Linus Torvalds
fc5b94f1cb block-6.6-2023-10-06
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmUgNUIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgppleEADNnAobhJCLJWw2uSwkfIqap4pL+LFvA0B5
 qdhOJb1U7t3mmu6/VhSEpvRFQiSvMHzhXBWHepQfii0Iu7ubkOwlEhPEirZJLBdo
 2Qt6RPxijfhKRHbrujNzjrCUAIti8zchaBrYmYQtB7paKCtTStPLOezBgfOutFl7
 TRLF+GBpsISuQUASSIFssxf7RpZTzS9pdlrS09BeViROdwwqYHXNUIl0kxEFu9BM
 SjElcby5ZJj5drg0SRl5ePG/XCp4Sr/rFDQJI2S/d++qwJS8egsYPQAe7Tbpw4y5
 9FBhs5hKypRkdTjbOVa23lqdfHcSIEyRM2lUL12vaVMA5m+mZ7SePt9qX4Vbg074
 2QJ1YVm4c/UFe6uieUcmDgJNgUMEyeEyo4SUj9gl/gn8wQE1dz5qB5PF1YWEcHa7
 KJ1zBbxBKw4ETLBdt24wCHH8HyA7SO8RCk8EMxFv7ENvQSzXzo+mdvrUYpp04Mu8
 eNV8VvqpMvBuWuh9uAO4CS1GIFQW7DijrProbyh9aOthu7850/ahnwS3b6+jkZYP
 3A7vaHKpU3VyGLYULAE9lZhmeCxmFNHzClgzX2Wve4Z94Xoq7Sa8xksV3JBx5NzY
 5Rt5rfdTZArt+LPL4CrdwlxEnT3F5yTSKXp82lWo6P1ds/x7WEYXKf0B4OgAGANk
 6tk9rK3ZBA==
 =NWf+
 -----END PGP SIGNATURE-----

Merge tag 'block-6.6-2023-10-06' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:
 "Just two minor fixes, for nbd and md"

* tag 'block-6.6-2023-10-06' of git://git.kernel.dk/linux:
  nbd: don't call blk_mark_disk_dead nbd_clear_sock_ioctl
  md/raid5: release batch_last before waiting for another stripe_head
2023-10-06 15:43:19 -07:00
Christoph Hellwig
07a1141ff1 nbd: don't call blk_mark_disk_dead nbd_clear_sock_ioctl
blk_mark_disk_dead is the proper interface to shut down a block
device, but it also makes the disk unusable forever.

nbd_clear_sock_ioctl on the other hand wants to shut down the file
system, but allow the block device to be used again when when connected
to another socket.  Switch nbd to use disk_force_media_change and
nbd_bdev_reset to go back to a behavior of the old __invalidate_device
call, with the added benefit of incrementing the device generation
as there is no guarantee the old content comes back when the device
is reconnected.

Reported-by: Samuel Holland <samuel.holland@sifive.com>
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 0c1c9a27ce ("nbd: call blk_mark_disk_dead in nbd_clear_sock_ioctl")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Samuel Holland <samuel.holland@sifive.com>
Link: https://lore.kernel.org/r/20231003153106.1331363-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-03 18:27:44 -06:00
Justin Stitt
5e9b7cfc20 aoe: replace strncpy with strscpy
`strncpy` is deprecated for use on NUL-terminated destination strings [1].

`aoe_iflist` is expected to be NUL-terminated which is evident by its
use with string apis later on like `strspn`:
| 	p = aoe_iflist + strspn(aoe_iflist, WHITESPACE);

It also seems `aoe_iflist` does not need to be NUL-padded which means
`strscpy` [2] is a suitable replacement due to the fact that it
guarantees NUL-termination on the destination buffer while not
unnecessarily NUL-padding.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://github.com/KSPP/linux/issues/90
Cc: linux-hardening@vger.kernel.org
Cc: Kees Cook <keescook@chromium.org>
Cc: Xu Panda <xu.panda@zte.com.cn>
Cc: Yang Yang <yang.yang29@zte.com>
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230919-strncpy-drivers-block-aoe-aoenet-c-v2-1-3d5d158410e9@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-03 18:23:48 -06:00
Justin Stitt
e1f2760ba3 null_blk: replace strncpy with strscpy
`strncpy` is deprecated for use on NUL-terminated destination strings [1].

We should favor a more robust and less ambiguous interface.

We expect that both `nullb->disk_name` and `disk->disk_name` be
NUL-terminated:
|     snprintf(nullb->disk_name, sizeof(nullb->disk_name),
|              "%s", config_item_name(&dev->group.cg_item));
...
|       pr_info("disk %s created\n", nullb->disk_name);

It seems like NUL-padding may be required due to __assign_disk_name()
utilizing a memcpy as opposed to a `str*cpy` api.
| static inline void __assign_disk_name(char *name, struct gendisk *disk)
| {
| 	if (disk)
| 		memcpy(name, disk->disk_name, DISK_NAME_LEN);
| 	else
| 		memset(name, 0, DISK_NAME_LEN);
| }

Then we go and print it with `__print_disk_name` which wraps `nullb_trace_disk_name()`.
| #define __print_disk_name(name) nullb_trace_disk_name(p, name)

This function obviously expects a NUL-terminated string.
| const char *nullb_trace_disk_name(struct trace_seq *p, char *name)
| {
| 	const char *ret = trace_seq_buffer_ptr(p);
|
| 	if (name && *name)
| 		trace_seq_printf(p, "disk=%s, ", name);
| 	trace_seq_putc(p, 0);
|
| 	return ret;
| }

>From the above, we need both 1) a NUL-terminated string and 2) a
NUL-padded string. So, let's use strscpy_pad() as per Kees' suggestion
from v1.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://github.com/KSPP/linux/issues/90
Cc: linux-hardening@vger.kernel.org
Cc: Kees Cook <keescook@chromium.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230919-strncpy-drivers-block-null_blk-main-c-v3-1-10cf0a87a2c3@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-03 18:23:02 -06:00
Kees Cook
150849c5e2 drbd: Annotate struct fifo_buffer with __counted_by
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).

As found with Coccinelle[1], add __counted_by for struct fifo_buffer.

[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci

Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
Cc: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: drbd-dev@lists.linbit.com
Cc: linux-block@vger.kernel.org
Reviewed-by: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20230915200316.never.707-kees@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-10-02 09:48:52 -07:00
Ilya Dryomov
0b207d02bd rbd: take header_rwsem in rbd_dev_refresh() only when updating
rbd_dev_refresh() has been holding header_rwsem across header and
parent info read-in unnecessarily for ages.  With commit 870611e487
("rbd: get snapshot context after exclusive lock is ensured to be
held"), the potential for deadlocks became much more real owning to
a) header_rwsem now nesting inside lock_rwsem and b) rw_semaphores
not allowing new readers after a writer is registered.

For example, assuming that I/O request 1, I/O request 2 and header
read-in request all target the same OSD:

1. I/O request 1 comes in and gets submitted
2. watch error occurs
3. rbd_watch_errcb() takes lock_rwsem for write, clears owner_cid and
   releases lock_rwsem
4. after reestablishing the watch, rbd_reregister_watch() calls
   rbd_dev_refresh() which takes header_rwsem for write and submits
   a header read-in request
5. I/O request 2 comes in: after taking lock_rwsem for read in
   __rbd_img_handle_request(), it blocks trying to take header_rwsem
   for read in rbd_img_object_requests()
6. another watch error occurs
7. rbd_watch_errcb() blocks trying to take lock_rwsem for write
8. I/O request 1 completion is received by the messenger but can't be
   processed because lock_rwsem won't be granted anymore
9. header read-in request completion can't be received, let alone
   processed, because the messenger is stranded

Change rbd_dev_refresh() to take header_rwsem only for actually
updating rbd_dev->header.  Header and parent info read-in don't need
any locking.

Cc: stable@vger.kernel.org # 0b035401c570: rbd: move rbd_dev_refresh() definition
Cc: stable@vger.kernel.org # 510a7330c82a: rbd: decouple header read-in from updating rbd_dev->header
Cc: stable@vger.kernel.org # c10311776f0a: rbd: decouple parent info read-in from updating rbd_dev
Cc: stable@vger.kernel.org
Fixes: 870611e487 ("rbd: get snapshot context after exclusive lock is ensured to be held")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2023-09-26 10:33:19 +02:00
Ilya Dryomov
c10311776f rbd: decouple parent info read-in from updating rbd_dev
Unlike header read-in, parent info read-in is already decoupled in
get_parent_info(), but it's buried in rbd_dev_v2_parent_info() along
with the processing logic.

Separate the initial read-in and update read-in logic into
rbd_dev_setup_parent() and rbd_dev_update_parent() respectively and
have rbd_dev_v2_parent_info() just populate struct parent_image_info
(i.e. what get_parent_info() did).  Some existing QoI issues, like
flatten of a standalone clone being disregarded on refresh, remain.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2023-09-26 10:31:33 +02:00
Ilya Dryomov
510a7330c8 rbd: decouple header read-in from updating rbd_dev->header
Make rbd_dev_header_info() populate a passed struct rbd_image_header
instead of rbd_dev->header and introduce rbd_dev_update_header() for
updating mutable fields in rbd_dev->header upon refresh.  The initial
read-in of both mutable and immutable fields in rbd_dev_image_probe()
passes in rbd_dev->header so no update step is required there.

rbd_init_layout() is now called directly from rbd_dev_image_probe()
instead of individually in format 1 and format 2 implementations.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2023-09-26 10:31:31 +02:00
Ilya Dryomov
0b035401c5 rbd: move rbd_dev_refresh() definition
Move rbd_dev_refresh() definition further down to avoid having to
move struct parent_image_info definition in the next commit.  This
spares some forward declarations too.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2023-09-26 10:31:15 +02:00
Chengming Zhou
d78bfa1346 block/null_blk: add queue_rqs() support
Add batched mq_ops.queue_rqs() support in null_blk for testing. The
implementation is much easy since null_blk doesn't have commit_rqs().

We simply handle each request one by one, if errors are encountered,
leave them in the passed in list and return back.

There is about 3.6% improvement in IOPS of fio/t/io_uring on null_blk
with hw_queue_depth=256 on my test VM, from 1.09M to 1.13M.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230913151616.3164338-6-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-22 08:52:13 -06:00
Chengming Zhou
217b613a53 blk-mq: update driver tags request table when start request
Now we update driver tags request table in blk_mq_get_driver_tag(),
so the driver that support queue_rqs() have to update that inflight
table by itself.

Move it to blk_mq_start_request(), which is a better place where
we setup the deadline for request timeout check. And it's just
where the request becomes inflight.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230913151616.3164338-5-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-22 08:52:13 -06:00
Linus Torvalds
7402e635ed block-6.6-2023-09-08
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmT7NSUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpuIuEACxQj0JjIVwgp8VlUcMAEFap7GH940qs8is
 tnRrGS48G/z3WZ0zizGQ1hF2m9oOJ1NwZ1bBrGJlVBP8KSgyR9IxI6WVt0JBZgg8
 DNQt+v5FoX4v5ZbFr/NTqz/6YnPWVwyTkJs4JIi9YLmpKkd58N+51n3wl56xyrp7
 8OcjHPH1ZI0qGar2poPXI7EKaweSzJW7nr3/cHujYoCfAGn7eZbHNgY0wdAtWdhM
 oGFABgMKTFKA0rCSaP8uo7GIqov+KXvFUNOuC6eNjtaUvXMecPLEJxvDhhi52fml
 LEH1a9bmtazu6bufrbI2D4SmXLNBcqnQw5d/1pn9oRO/UU6AAi6E7I/L1iodKWav
 bkr2i+YMnhUqVDFRwy+j78bcGRlLPWupvj68RdpvAvzhdW/85p9nH66CJfTnjQ3q
 xJJKoUREd/8C6ktm5xHYyETxcL5q6Vdz1w5GmsCyOWWpsXsEbh0GHn6pRDkLWJf4
 OQhO4CbxIARGE5NRK8MYl/J/581yS1cIdGZAg7hsVHzTGDSt3h/Hlh6ONpWCcvXD
 SJI39W83oBSOnsbRG4FU6Bnt/WHTD0yi07cr867l9HEq63JOK/OduMjXA/tdRfwP
 fZU3RfFYcS8+h9UHgrSgvCk03fCjq1HsWCDfpc6sUuRG1fH6jSNFdkCKqhCoU9Py
 K2IKi1IE0Q==
 =4SLW
 -----END PGP SIGNATURE-----

Merge tag 'block-6.6-2023-09-08' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - Fix null_blk polled IO timeout handling (Chengming)

 - Regression fix for swapped arguments in drbd bvec_set_page()
   (Christoph)

 - String length handling fix for s390 dasd (Heiko)

 - Fixes for blk-throttle accounting (Yu)

 - Fix page pinning issue for same page segments (Christoph)

 - Remove redundant file_remove_privs() call (Christoph)

 - Fix a regression in partition handling for devices not supporting
   partitions (Li)

* tag 'block-6.6-2023-09-08' of git://git.kernel.dk/linux:
  drbd: swap bvec_set_page len and offset
  block: fix pin count management when merging same-page segments
  null_blk: fix poll request timeout handling
  s390/dasd: fix string length handling
  block: don't add or resize partition on the disk with GENHD_FL_NO_PART
  block: remove the call to file_remove_privs in blkdev_write_iter
  blk-throttle: consider 'carryover_ios/bytes' in throtl_trim_slice()
  blk-throttle: use calculate_io/bytes_allowed() for throtl_trim_slice()
  blk-throttle: fix wrong comparation while 'carryover_ios/bytes' is negative
  blk-throttle: print signed value 'carryover_bytes/ios' for user
2023-09-08 21:39:54 -07:00
Linus Torvalds
7ba2090ca6 Mixed with some fixes and cleanups, this brings in reasonably complete
fscrypt support to CephFS!  The list of things which don't work with
 encryption should be fairly short, mostly around the edges: fallocate
 (not supported well in CephFS to begin with), copy_file_range (requires
 re-encryption), non-default striping patterns.
 
 This was a multi-year effort principally by Jeff Layton with assistance
 from Xiubo Li, Luís Henriques and others, including several dependant
 changes in the MDS, netfs helper library and fscrypt framework itself.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmT4pl4THGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi5kzB/4sMgzZyUa3T1vA/G2pPvEkyy1qDxsW
 y+o4dDMWA9twcrBVpNuGd54wbXpmO/LAekHEdorjayH+f0zf10MsnP1ePz9WB3NG
 jr7RRujb+Gpd2OFYJXGSEbd3faTg8M2kpGCCrVe7SFNoyu8z9NwFItwWMog5aBjX
 ODGQrq+kA4ARA6xIqwzF5gP0zr+baT9rWhQdm7Xo9itWdosnbyDLJx1dpEfLuqBX
 te3SmifDzedn3Gw73hdNo/+ybw0kHARoK+RmXCTsoDDQw+JsoO9KxZF5Q8QcDELq
 2woPNp0Hl+Dm4MkzGnPxv56Qj8ZDViS59syXC0CfGRmu4nzF1Rw+0qn5
 =/WlE
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.6-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "Mixed with some fixes and cleanups, this brings in reasonably complete
  fscrypt support to CephFS! The list of things which don't work with
  encryption should be fairly short, mostly around the edges: fallocate
  (not supported well in CephFS to begin with), copy_file_range
  (requires re-encryption), non-default striping patterns.

  This was a multi-year effort principally by Jeff Layton with
  assistance from Xiubo Li, Luís Henriques and others, including several
  dependant changes in the MDS, netfs helper library and fscrypt
  framework itself"

* tag 'ceph-for-6.6-rc1' of https://github.com/ceph/ceph-client: (53 commits)
  ceph: make num_fwd and num_retry to __u32
  ceph: make members in struct ceph_mds_request_args_ext a union
  rbd: use list_for_each_entry() helper
  libceph: do not include crypto/algapi.h
  ceph: switch ceph_lookup/atomic_open() to use new fscrypt helper
  ceph: fix updating i_truncate_pagecache_size for fscrypt
  ceph: wait for OSD requests' callbacks to finish when unmounting
  ceph: drop messages from MDS when unmounting
  ceph: update documentation regarding snapshot naming limitations
  ceph: prevent snapshot creation in encrypted locked directories
  ceph: add support for encrypted snapshot names
  ceph: invalidate pages when doing direct/sync writes
  ceph: plumb in decryption during reads
  ceph: add encryption support to writepage and writepages
  ceph: add read/modify/write to ceph_sync_write
  ceph: align data in pages in ceph_sync_write
  ceph: don't use special DIO path for encrypted inodes
  ceph: add truncate size handling support for fscrypt
  ceph: add object version support for sync read
  libceph: allow ceph_osdc_new_request to accept a multi-op read
  ...
2023-09-06 12:10:15 -07:00
Christoph Böhmwalder
4b9c2edaf7 drbd: swap bvec_set_page len and offset
bvec_set_page has the following signature:

static inline void bvec_set_page(struct bio_vec *bv, struct page *page,
		unsigned int len, unsigned int offset)

However, the usage in DRBD swaps the len and offset parameters. This
leads to a bvec with length=0 instead of the intended length=4096, which
causes sock_sendmsg to return -EIO.

This leaves DRBD unable to transmit any pages and thus completely
broken.

Swapping the parameters fixes the regression.

Fixes: eeac7405c7 ("drbd: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage()")
Reported-by: Serguei Ivantsov <manowar@gsc-game.com>
Link: https://lore.kernel.org/regressions/CAKH+VT3YLmAn0Y8=q37UTDShqxDLsqPcQ4hBMzY7HPn7zNx+RQ@mail.gmail.com/
Cc: stable@vger.kernel.org
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20230906133034.948817-1-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-06 07:33:03 -06:00
Chengming Zhou
5a26e45edb null_blk: fix poll request timeout handling
When doing io_uring benchmark on /dev/nullb0, it's easy to crash the
kernel if poll requests timeout triggered, as reported by David. [1]

BUG: kernel NULL pointer dereference, address: 0000000000000008
Workqueue: kblockd blk_mq_timeout_work
RIP: 0010:null_timeout_rq+0x4e/0x91
Call Trace:
 ? null_timeout_rq+0x4e/0x91
 blk_mq_handle_expired+0x31/0x4b
 bt_iter+0x68/0x84
 ? bt_tags_iter+0x81/0x81
 __sbitmap_for_each_set.constprop.0+0xb0/0xf2
 ? __blk_mq_complete_request_remote+0xf/0xf
 bt_for_each+0x46/0x64
 ? __blk_mq_complete_request_remote+0xf/0xf
 ? percpu_ref_get_many+0xc/0x2a
 blk_mq_queue_tag_busy_iter+0x14d/0x18e
 blk_mq_timeout_work+0x95/0x127
 process_one_work+0x185/0x263
 worker_thread+0x1b5/0x227

This is indeed a race problem between null_timeout_rq() and null_poll().

null_poll()				null_timeout_rq()
  spin_lock(&nq->poll_lock)
  list_splice_init(&nq->poll_list, &list)
  spin_unlock(&nq->poll_lock)

  while (!list_empty(&list))
    req = list_first_entry()
    list_del_init()
    ...
    blk_mq_add_to_batch()
    // req->rq_next = NULL
					spin_lock(&nq->poll_lock)

					// rq->queuelist->next == NULL
					list_del_init(&rq->queuelist)

					spin_unlock(&nq->poll_lock)

Fix these problems by setting requests state to MQ_RQ_COMPLETE under
nq->poll_lock protection, in which null_timeout_rq() can safely detect
this race and early return.

Note this patch just fix the kernel panic when request timeout happen.

[1] https://lore.kernel.org/all/3893581.1691785261@warthog.procyon.org.uk/

Fixes: 0a593fbbc2 ("null_blk: poll queue support")
Reported-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Link: https://lore.kernel.org/r/20230901120306.170520-2-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-01 08:18:25 -06:00
Jinjie Ruan
cd59cdefc2 rbd: use list_for_each_entry() helper
Convert list_for_each() to list_for_each_entry() so that the tmp
list_head pointer and list_entry() call are no longer needed, which
can reduce a few lines of code. No functional changed.

Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-30 11:51:55 +02:00
Linus Torvalds
3d3dfeb3ae for-6.6/block-2023-08-28
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmTs08EQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqa4EACu/zKE+omGXBV0Q7kEpVsChjp0ElGtSDIJ
 tJfTuvnWqQjrqRv4ksmZvGdx8SkqFuXri4/7oBXlsaqeUVbIQdWJUpLErBye6nxa
 lUb6nXOFWwyG94cMRYs71lN0loosjb7aiVw7oVLAIhntq3p3doFl/cyy3ndMZrUE
 pZbsrWSt4QiOKhcO0TtIjfAwsr31AN51qFiNNITEiZl3UjXfkGRCK81X0yM2N8zZ
 7Y0h1ldPBsZ/olNWeRyaW1uB64nKM0buR7/nDxCV/NI05nndJ34bIgo/JIj4xy0v
 SiBj2+y86+oMJZt17yYENwOQdtX3hbyESGuVm9dCrO0t9/byVQxkUk0OMm65BM/l
 l2d+gmMQZTbHziqfLlgq9i3i9+B4C2hsb7iBpuo7SW/FPbM45POgi3lpiZycaZyu
 krQo1qwL4KSGXzGN9CabEuKDcJcXqLxqMDOyEDA3R5Kz06V9tNuM+Di/mr4vuZHK
 sVHUfHuWBO9ionLlGPdc3fH/CuMqic8SHjumiAm2menBZV6cSzRDxpm6H4CyLt7y
 tWmw7BNU7dfHFGd+Jw0Ld49sAuEybszEXq6qYv5uYBVfJNqDvOvEeVoQp0RN2jJA
 AG30hymcZgxn9n7gkIgkPQDgIGUjnzUR8B2mE2UFU1CYVHXYXAXU55CCI5oeTkbs
 d0Y/zCZf1A==
 =p1bd
 -----END PGP SIGNATURE-----

Merge tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:
 "Pretty quiet round for this release. This contains:

   - Add support for zoned storage to ublk (Andreas, Ming)

   - Series improving performance for drivers that mark themselves as
     needing a blocking context for issue (Bart)

   - Cleanup the flush logic (Chengming)

   - sed opal keyring support (Greg)

   - Fixes and improvements to the integrity support (Jinyoung)

   - Add some exports for bcachefs that we can hopefully delete again in
     the future (Kent)

   - deadline throttling fix (Zhiguo)

   - Series allowing building the kernel without buffer_head support
     (Christoph)

   - Sanitize the bio page adding flow (Christoph)

   - Write back cache fixes (Christoph)

   - MD updates via Song:
      - Fix perf regression for raid0 large sequential writes (Jan)
      - Fix split bio iostat for raid0 (David)
      - Various raid1 fixes (Heinz, Xueshi)
      - raid6test build fixes (WANG)
      - Deprecate bitmap file support (Christoph)
      - Fix deadlock with md sync thread (Yu)
      - Refactor md io accounting (Yu)
      - Various non-urgent fixes (Li, Yu, Jack)

   - Various fixes and cleanups (Arnd, Azeem, Chengming, Damien, Li,
     Ming, Nitesh, Ruan, Tejun, Thomas, Xu)"

* tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linux: (113 commits)
  block: use strscpy() to instead of strncpy()
  block: sed-opal: keyring support for SED keys
  block: sed-opal: Implement IOC_OPAL_REVERT_LSP
  block: sed-opal: Implement IOC_OPAL_DISCOVERY
  blk-mq: prealloc tags when increase tagset nr_hw_queues
  blk-mq: delete redundant tagset map update when fallback
  blk-mq: fix tags leak when shrink nr_hw_queues
  ublk: zoned: support REQ_OP_ZONE_RESET_ALL
  md: raid0: account for split bio in iostat accounting
  md/raid0: Fix performance regression for large sequential writes
  md/raid0: Factor out helper for mapping and submitting a bio
  md raid1: allow writebehind to work on any leg device set WriteMostly
  md/raid1: hold the barrier until handle_read_error() finishes
  md/raid1: free the r1bio before waiting for blocked rdev
  md/raid1: call free_r1bio() before allow_barrier() in raid_end_bio_io()
  blk-cgroup: Fix NULL deref caused by blkg_policy_data being installed before init
  drivers/rnbd: restore sysfs interface to rnbd-client
  md/raid5-cache: fix null-ptr-deref for r5l_flush_stripe_to_raid()
  raid6: test: only check for Altivec if building on powerpc hosts
  raid6: test: make sure all intermediate and artifact files are .gitignored
  ...
2023-08-29 20:21:42 -07:00
Linus Torvalds
bd6c11bc43 Networking changes for 6.6.
Core
 ----
 
  - Increase size limits for to-be-sent skb frag allocations. This
    allows tun, tap devices and packet sockets to better cope with large
    writes operations.
 
  - Store netdevs in an xarray, to simplify iterating over netdevs.
 
  - Refactor nexthop selection for multipath routes.
 
  - Improve sched class lifetime handling.
 
  - Add backup nexthop ID support for bridge.
 
  - Implement drop reasons support in openvswitch.
 
  - Several data races annotations and fixes.
 
  - Constify the sk parameter of routing functions.
 
  - Prepend kernel version to netconsole message.
 
 Protocols
 ---------
 
  - Implement support for TCP probing the peer being under memory
    pressure.
 
  - Remove hard coded limitation on IPv6 specific info placement
    inside the socket struct.
 
  - Get rid of sysctl_tcp_adv_win_scale and use an auto-estimated
    per socket scaling factor.
 
  - Scaling-up the IPv6 expired route GC via a separated list of
    expiring routes.
 
  - In-kernel support for the TLS alert protocol.
 
  - Better support for UDP reuseport with connected sockets.
 
  - Add NEXT-C-SID support for SRv6 End.X behavior, reducing the SR
    header size.
 
  - Get rid of additional ancillary per MPTCP connection struct socket.
 
  - Implement support for BPF-based MPTCP packet schedulers.
 
  - Format MPTCP subtests selftests results in TAP.
 
  - Several new SMC 2.1 features including unique experimental options,
    max connections per lgr negotiation, max links per lgr negotiation.
 
 BPF
 ---
 
  - Multi-buffer support in AF_XDP.
 
  - Add multi uprobe BPF links for attaching multiple uprobes
    and usdt probes, which is significantly faster and saves extra fds.
 
  - Implement an fd-based tc BPF attach API (TCX) and BPF link support on
    top of it.
 
  - Add SO_REUSEPORT support for TC bpf_sk_assign.
 
  - Support new instructions from cpu v4 to simplify the generated code and
    feature completeness, for x86, arm64, riscv64.
 
  - Support defragmenting IPv(4|6) packets in BPF.
 
  - Teach verifier actual bounds of bpf_get_smp_processor_id()
    and fix perf+libbpf issue related to custom section handling.
 
  - Introduce bpf map element count and enable it for all program types.
 
  - Add a BPF hook in sys_socket() to change the protocol ID
    from IPPROTO_TCP to IPPROTO_MPTCP to cover migration for legacy.
 
  - Introduce bpf_me_mcache_free_rcu() and fix OOM under stress.
 
  - Add uprobe support for the bpf_get_func_ip helper.
 
  - Check skb ownership against full socket.
 
  - Support for up to 12 arguments in BPF trampoline.
 
  - Extend link_info for kprobe_multi and perf_event links.
 
 Netfilter
 ---------
 
  - Speed-up process exit by aborting ruleset validation if a
    fatal signal is pending.
 
  - Allow NLA_POLICY_MASK to be used with BE16/BE32 types.
 
 Driver API
 ----------
 
  - Page pool optimizations, to improve data locality and cache usage.
 
  - Introduce ndo_hwtstamp_get() and ndo_hwtstamp_set() to avoid the need
    for raw ioctl() handling in drivers.
 
  - Simplify genetlink dump operations (doit/dumpit) providing them
    the common information already populated in struct genl_info.
 
  - Extend and use the yaml devlink specs to [re]generate the split ops.
 
  - Introduce devlink selective dumps, to allow SF filtering SF based on
    handle and other attributes.
 
  - Add yaml netlink spec for netlink-raw families, allow route, link and
    address related queries via the ynl tool.
 
  - Remove phylink legacy mode support.
 
  - Support offload LED blinking to phy.
 
  - Add devlink port function attributes for IPsec.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - Broadcom ASP 2.0 (72165) ethernet controller
    - MediaTek MT7988 SoC
    - Texas Instruments AM654 SoC
    - Texas Instruments IEP driver
    - Atheros qca8081 phy
    - Marvell 88Q2110 phy
    - NXP TJA1120 phy
 
  - WiFi:
    - MediaTek mt7981 support
 
  - Can:
    - Kvaser SmartFusion2 PCI Express devices
    - Allwinner T113 controllers
    - Texas Instruments tcan4552/4553 chips
 
  - Bluetooth:
    - Intel Gale Peak
    - Qualcomm WCN3988 and WCN7850
    - NXP AW693 and IW624
    - Mediatek MT2925
 
 Drivers
 -------
 
  - Ethernet NICs:
    - nVidia/Mellanox:
      - mlx5:
        - support UDP encapsulation in packet offload mode
        - IPsec packet offload support in eswitch mode
        - improve aRFS observability by adding new set of counters
        - extends MACsec offload support to cover RoCE traffic
        - dynamic completion EQs
      - mlx4:
        - convert to use auxiliary bus instead of custom interface logic
    - Intel
      - ice:
        - implement switchdev bridge offload, even for LAG interfaces
        - implement SRIOV support for LAG interfaces
      - igc:
        - add support for multiple in-flight TX timestamps
    - Broadcom:
      - bnxt:
        - use the unified RX page pool buffers for XDP and non-XDP
        - use the NAPI skb allocation cache
    - OcteonTX2:
      - support Round Robin scheduling HTB offload
      - TC flower offload support for SPI field
    - Freescale:
      -  add XDP_TX feature support
    - AMD:
      - ionic: add support for PCI FLR event
      - sfc:
        - basic conntrack offload
        - introduce eth, ipv4 and ipv6 pedit offloads
    - ST Microelectronics:
      - stmmac: maximze PTP timestamping resolution
 
  - Virtual NICs:
    - Microsoft vNIC:
      - batch ringing RX queue doorbell on receiving packets
      - add page pool for RX buffers
    - Virtio vNIC:
      - add per queue interrupt coalescing support
    - Google vNIC:
      - add queue-page-list mode support
 
  - Ethernet high-speed switches:
    - nVidia/Mellanox (mlxsw):
      - add port range matching tc-flower offload
      - permit enslavement to netdevices with uppers
 
  - Ethernet embedded switches:
    - Marvell (mv88e6xxx):
      - convert to phylink_pcs
    - Renesas:
      - r8A779fx: add speed change support
      - rzn1: enables vlan support
 
  - Ethernet PHYs:
    - convert mv88e6xxx to phylink_pcs
 
  - WiFi:
    - Qualcomm Wi-Fi 7 (ath12k):
      - extremely High Throughput (EHT) PHY support
    - RealTek (rtl8xxxu):
      - enable AP mode for: RTL8192FU, RTL8710BU (RTL8188GU),
        RTL8192EU and RTL8723BU
    - RealTek (rtw89):
      - Introduce Time Averaged SAR (TAS) support
 
  - Connector:
    - support for event filtering
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmTt1ZoSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkgFUP/REFaYWdWUvAzmWeezyx9dqgZMfSOjWq
 9QvySiA94OAOcjIYkb7wfzQ5BBAZqaBQ/f8XqWwS1EDDDEBs8sP1cxmABKwW7Hsr
 qFRu2sOqLzKBk223d0jIgEocfQaFpGbF71gXoTlDivBjBi5UxWm9bF0XnbYWcKgO
 /QEvzNosi9uNdi85Fzmv62J6YzAdidEpwGsM7X2CfejwNRmStxAEg/NwvRR0Hyiq
 OJCo97omEgTRaUle8nc64PDx33u4h5kQ1BkaeHEv0rbE3hftFC2YPKn/InmqSFGz
 6ew2xnrGPR37LCuAiCcIIv6yR7K0eu0iYJ7jXwZxBDqxGavEPuwWGBoCP6qFiitH
 ZLWhIrAUrdmSbySkTOCONhJ475qFAuQoYHYpZnX/bJZUHlSsb/9lwDJYJQGpVfd1
 /daqJVSb7lhaifmNO1iNd/ibCIXq9zapwtkRwA897M8GkZBTsnVvazFld1Em+Se3
 Bx6DSDUVBqVQ9fpZG2IAGD6odDwOzC1lF2IoceFvK9Ff6oE0psI+A0qNLMkHxZbW
 Qlo7LsNe53hpoCC+yHTfXX7e/X8eNt0EnCGOQJDusZ0Nr3K7H4LKFA0i8UBUK05n
 4lKnnaSQW7GQgdofLWt103OMDR9GoDxpFsm7b1X9+AEk6Fz6tq50wWYeMZETUKYP
 DCW8VGFOZjZM
 =9CsR
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Paolo Abeni:
 "Core:

   - Increase size limits for to-be-sent skb frag allocations. This
     allows tun, tap devices and packet sockets to better cope with
     large writes operations

   - Store netdevs in an xarray, to simplify iterating over netdevs

   - Refactor nexthop selection for multipath routes

   - Improve sched class lifetime handling

   - Add backup nexthop ID support for bridge

   - Implement drop reasons support in openvswitch

   - Several data races annotations and fixes

   - Constify the sk parameter of routing functions

   - Prepend kernel version to netconsole message

  Protocols:

   - Implement support for TCP probing the peer being under memory
     pressure

   - Remove hard coded limitation on IPv6 specific info placement inside
     the socket struct

   - Get rid of sysctl_tcp_adv_win_scale and use an auto-estimated per
     socket scaling factor

   - Scaling-up the IPv6 expired route GC via a separated list of
     expiring routes

   - In-kernel support for the TLS alert protocol

   - Better support for UDP reuseport with connected sockets

   - Add NEXT-C-SID support for SRv6 End.X behavior, reducing the SR
     header size

   - Get rid of additional ancillary per MPTCP connection struct socket

   - Implement support for BPF-based MPTCP packet schedulers

   - Format MPTCP subtests selftests results in TAP

   - Several new SMC 2.1 features including unique experimental options,
     max connections per lgr negotiation, max links per lgr negotiation

  BPF:

   - Multi-buffer support in AF_XDP

   - Add multi uprobe BPF links for attaching multiple uprobes and usdt
     probes, which is significantly faster and saves extra fds

   - Implement an fd-based tc BPF attach API (TCX) and BPF link support
     on top of it

   - Add SO_REUSEPORT support for TC bpf_sk_assign

   - Support new instructions from cpu v4 to simplify the generated code
     and feature completeness, for x86, arm64, riscv64

   - Support defragmenting IPv(4|6) packets in BPF

   - Teach verifier actual bounds of bpf_get_smp_processor_id() and fix
     perf+libbpf issue related to custom section handling

   - Introduce bpf map element count and enable it for all program types

   - Add a BPF hook in sys_socket() to change the protocol ID from
     IPPROTO_TCP to IPPROTO_MPTCP to cover migration for legacy

   - Introduce bpf_me_mcache_free_rcu() and fix OOM under stress

   - Add uprobe support for the bpf_get_func_ip helper

   - Check skb ownership against full socket

   - Support for up to 12 arguments in BPF trampoline

   - Extend link_info for kprobe_multi and perf_event links

  Netfilter:

   - Speed-up process exit by aborting ruleset validation if a fatal
     signal is pending

   - Allow NLA_POLICY_MASK to be used with BE16/BE32 types

  Driver API:

   - Page pool optimizations, to improve data locality and cache usage

   - Introduce ndo_hwtstamp_get() and ndo_hwtstamp_set() to avoid the
     need for raw ioctl() handling in drivers

   - Simplify genetlink dump operations (doit/dumpit) providing them the
     common information already populated in struct genl_info

   - Extend and use the yaml devlink specs to [re]generate the split ops

   - Introduce devlink selective dumps, to allow SF filtering SF based
     on handle and other attributes

   - Add yaml netlink spec for netlink-raw families, allow route, link
     and address related queries via the ynl tool

   - Remove phylink legacy mode support

   - Support offload LED blinking to phy

   - Add devlink port function attributes for IPsec

  New hardware / drivers:

   - Ethernet:
      - Broadcom ASP 2.0 (72165) ethernet controller
      - MediaTek MT7988 SoC
      - Texas Instruments AM654 SoC
      - Texas Instruments IEP driver
      - Atheros qca8081 phy
      - Marvell 88Q2110 phy
      - NXP TJA1120 phy

   - WiFi:
      - MediaTek mt7981 support

   - Can:
      - Kvaser SmartFusion2 PCI Express devices
      - Allwinner T113 controllers
      - Texas Instruments tcan4552/4553 chips

   - Bluetooth:
      - Intel Gale Peak
      - Qualcomm WCN3988 and WCN7850
      - NXP AW693 and IW624
      - Mediatek MT2925

  Drivers:

   - Ethernet NICs:
      - nVidia/Mellanox:
         - mlx5:
            - support UDP encapsulation in packet offload mode
            - IPsec packet offload support in eswitch mode
            - improve aRFS observability by adding new set of counters
            - extends MACsec offload support to cover RoCE traffic
            - dynamic completion EQs
         - mlx4:
            - convert to use auxiliary bus instead of custom interface
              logic
      - Intel
         - ice:
            - implement switchdev bridge offload, even for LAG
              interfaces
            - implement SRIOV support for LAG interfaces
         - igc:
            - add support for multiple in-flight TX timestamps
      - Broadcom:
         - bnxt:
            - use the unified RX page pool buffers for XDP and non-XDP
            - use the NAPI skb allocation cache
      - OcteonTX2:
         - support Round Robin scheduling HTB offload
         - TC flower offload support for SPI field
      - Freescale:
         - add XDP_TX feature support
      - AMD:
         - ionic: add support for PCI FLR event
         - sfc:
            - basic conntrack offload
            - introduce eth, ipv4 and ipv6 pedit offloads
      - ST Microelectronics:
         - stmmac: maximze PTP timestamping resolution

   - Virtual NICs:
      - Microsoft vNIC:
         - batch ringing RX queue doorbell on receiving packets
         - add page pool for RX buffers
      - Virtio vNIC:
         - add per queue interrupt coalescing support
      - Google vNIC:
         - add queue-page-list mode support

   - Ethernet high-speed switches:
      - nVidia/Mellanox (mlxsw):
         - add port range matching tc-flower offload
         - permit enslavement to netdevices with uppers

   - Ethernet embedded switches:
      - Marvell (mv88e6xxx):
         - convert to phylink_pcs
      - Renesas:
         - r8A779fx: add speed change support
         - rzn1: enables vlan support

   - Ethernet PHYs:
      - convert mv88e6xxx to phylink_pcs

   - WiFi:
      - Qualcomm Wi-Fi 7 (ath12k):
         - extremely High Throughput (EHT) PHY support
      - RealTek (rtl8xxxu):
         - enable AP mode for: RTL8192FU, RTL8710BU (RTL8188GU),
           RTL8192EU and RTL8723BU
      - RealTek (rtw89):
         - Introduce Time Averaged SAR (TAS) support

   - Connector:
      - support for event filtering"

* tag 'net-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1806 commits)
  net: ethernet: mtk_wed: minor change in wed_{tx,rx}info_show
  net: ethernet: mtk_wed: add some more info in wed_txinfo_show handler
  net: stmmac: clarify difference between "interface" and "phy_interface"
  r8152: add vendor/device ID pair for D-Link DUB-E250
  devlink: move devlink_notify_register/unregister() to dev.c
  devlink: move small_ops definition into netlink.c
  devlink: move tracepoint definitions into core.c
  devlink: push linecard related code into separate file
  devlink: push rate related code into separate file
  devlink: push trap related code into separate file
  devlink: use tracepoint_enabled() helper
  devlink: push region related code into separate file
  devlink: push param related code into separate file
  devlink: push resource related code into separate file
  devlink: push dpipe related code into separate file
  devlink: move and rename devlink_dpipe_send_and_alloc_skb() helper
  devlink: push shared buffer related code into separate file
  devlink: push port related code into separate file
  devlink: push object register/unregister notifications into separate helpers
  inet: fix IP_TRANSPARENT error handling
  ...
2023-08-29 11:33:01 -07:00
Linus Torvalds
511fb5bafe v6.6-vfs.super
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXpbgAKCRCRxhvAZXjc
 oi8PAQCtXelGZHmTcmevsO8p4Qz7hFpkonZ/TnxKf+RdnlNgPgD+NWi+LoRBpaAj
 xk4z8SqJaTTP4WXrG5JZ6o7EQkUL8gE=
 =2e9I
 -----END PGP SIGNATURE-----

Merge tag 'v6.6-vfs.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull superblock updates from Christian Brauner:
 "This contains the super rework that was ready for this cycle. The
  first part changes the order of how we open block devices and allocate
  superblocks, contains various cleanups, simplifications, and a new
  mechanism to wait on superblock state changes.

  This unblocks work to ultimately limit the number of writers to a
  block device. Jan has already scheduled follow-up work that will be
  ready for v6.7 and allows us to restrict the number of writers to a
  given block device. That series builds on this work right here.

  The second part contains filesystem freezing updates.

  Overview:

  The generic superblock changes are rougly organized as follows
  (ignoring additional minor cleanups):

   (1) Removal of the bd_super member from struct block_device.

       This was a very odd back pointer to struct super_block with
       unclear rules. For all relevant places we have other means to get
       the same information so just get rid of this.

   (2) Simplify rules for superblock cleanup.

       Roughly, everything that is allocated during fs_context
       initialization and that's stored in fs_context->s_fs_info needs
       to be cleaned up by the fs_context->free() implementation before
       the superblock allocation function has been called successfully.

       After sget_fc() returned fs_context->s_fs_info has been
       transferred to sb->s_fs_info at which point sb->kill_sb() if
       fully responsible for cleanup. Adhering to these rules means that
       cleanup of sb->s_fs_info in fill_super() is to be avoided as it's
       brittle and inconsistent.

       Cleanup shouldn't be duplicated between sb->put_super() as
       sb->put_super() is only called if sb->s_root has been set aka
       when the filesystem has been successfully born (SB_BORN). That
       complexity should be avoided.

       This also means that block devices are to be closed in
       sb->kill_sb() instead of sb->put_super(). More details in the
       lower section.

   (3) Make it possible to lookup or create a superblock before opening
       block devices

       There's a subtle dependency on (2) as some filesystems did rely
       on fill_super() to be called in order to correctly clean up
       sb->s_fs_info. All these filesystems have been fixed.

   (4) Switch most filesystem to follow the same logic as the generic
       mount code now does as outlined in (3).

   (5) Use the superblock as the holder of the block device. We can now
       easily go back from block device to owning superblock.

   (6) Export and extend the generic fs_holder_ops and use them as
       holder ops everywhere and remove the filesystem specific holder
       ops.

   (7) Call from the block layer up into the filesystem layer when the
       block device is removed, allowing to shut down the filesystem
       without risk of deadlocks.

   (8) Get rid of get_super().

       We can now easily go back from the block device to owning
       superblock and can call up from the block layer into the
       filesystem layer when the device is removed. So no need to wade
       through all registered superblock to find the owning superblock
       anymore"

Link: https://lore.kernel.org/lkml/20230824-prall-intakt-95dbffdee4a0@brauner/

* tag 'v6.6-vfs.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (47 commits)
  super: use higher-level helper for {freeze,thaw}
  super: wait until we passed kill super
  super: wait for nascent superblocks
  super: make locking naming consistent
  super: use locking helpers
  fs: simplify invalidate_inodes
  fs: remove get_super
  block: call into the file system for ioctl BLKFLSBUF
  block: call into the file system for bdev_mark_dead
  block: consolidate __invalidate_device and fsync_bdev
  block: drop the "busy inodes on changed media" log message
  dasd: also call __invalidate_device when setting the device offline
  amiflop: don't call fsync_bdev in FDFMTBEG
  floppy: call disk_force_media_change when changing the format
  block: simplify the disk_force_media_change interface
  nbd: call blk_mark_disk_dead in nbd_clear_sock_ioctl
  xfs use fs_holder_ops for the log and RT devices
  xfs: drop s_umount over opening the log and RT devices
  ext4: use fs_holder_ops for the log device
  ext4: drop s_umount over opening the log device
  ...
2023-08-28 11:04:18 -07:00
Jakub Kicinski
57ce6427e0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

include/net/inet_sock.h
  f866fbc842 ("ipv4: fix data-races around inet->inet_id")
  c274af2242 ("inet: introduce inet->inet_flags")
https://lore.kernel.org/all/679ddff6-db6e-4ff6-b177-574e90d0103d@tessares.net/

Adjacent changes:

drivers/net/bonding/bond_alb.c
  e74216b8de ("bonding: fix macvlan over alb bond support")
  f11e5bd159 ("bonding: support balance-alb with openvswitch")

drivers/net/ethernet/broadcom/bgmac.c
  d6499f0b7c ("net: bgmac: Return PTR_ERR() for fixed_phy_register()")
  23a14488ea ("net: bgmac: Fix return value check for fixed_phy_register()")

drivers/net/ethernet/broadcom/genet/bcmmii.c
  32bbe64a13 ("net: bcmgenet: Fix return value check for fixed_phy_register()")
  acf50d1adb ("net: bcmgenet: Return PTR_ERR() for fixed_phy_register()")

net/sctp/socket.c
  f866fbc842 ("ipv4: fix data-races around inet->inet_id")
  b09bde5c35 ("inet: move inet->mc_loop to inet->inet_frags")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-24 10:51:39 -07:00
Christian Brauner
3fb5a6562a New code for 6.6:
* Allow the kernel to initiate a freeze of a filesystem.  The kernel
    and userspace can both hold a freeze on a filesystem at the same
    time; the freeze is not lifted until /both/ holders lift it.  This
    will enable us to fix a longstanding bug in XFS online fsck.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZLVnJwAKCRBKO3ySh0YR
 pqVIAP9u9CZEJ2Zcc7YpBj1MLUQGr2xBmz8RJEVJbQHKVgYcQwEA9BNb4eH4i2Af
 K7Qp0OGNgyzZw37lN23Uf/SDuBK2QgM=
 =seMl
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXo2AAKCRCRxhvAZXjc
 ojDfAQDguc2saF8WLeXtn2O0pGOW8vTrhpwiFHNI6hwdzf07/AD+LGBpFEqYKyX5
 NHPzdR7YYpJoTsQzR4JFJVZqN9Q1xgU=
 =wDq0
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.6-merge-2' of ssh://gitolite.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull filesystem freezing updates from Darrick Wong:

New code for 6.6:

 * Allow the kernel to initiate a freeze of a filesystem.  The kernel
   and userspace can both hold a freeze on a filesystem at the same
   time; the freeze is not lifted until /both/ holders lift it.  This
   will enable us to fix a longstanding bug in XFS online fsck.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Message-Id: <20230822182604.GB11286@frogsfrogsfrogs>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-23 13:06:55 +02:00
Christoph Hellwig
2c0326c587 amiflop: don't call fsync_bdev in FDFMTBEG
FDFMTBEG is used by fdformat to calibrate before formatting a disk.
Neither the atari nor PC floppy driver sync data, which also seems
a bit pointless for a disk hat is about to get formatted.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Message-Id: <20230811100828.1897174-11-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-21 14:35:31 +02:00
Christoph Hellwig
a47145f236 floppy: call disk_force_media_change when changing the format
While changing the format of a floppy isn't strictly speaking a media
change, the effects are the same in that the content of the media
changes and the diskseq should be increased and uevent should be
sent.  Switch from calling __invalidate_device to
disk_force_media_change to do so.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Message-Id: <20230811100828.1897174-10-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-21 14:35:31 +02:00
Christoph Hellwig
ab6860f62b block: simplify the disk_force_media_change interface
Hard code the events to DISK_EVENT_MEDIA_CHANGE as that is the only
useful use case, and drop the superfluous return value.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Message-Id: <20230811100828.1897174-9-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-21 14:35:30 +02:00
Christoph Hellwig
0c1c9a27ce nbd: call blk_mark_disk_dead in nbd_clear_sock_ioctl
nbd_clear_sock_ioctl kills the socket and with that the block
device.  Instead of just invalidating file system buffers,
mark the device as dead, which will also invalidate the buffers
as part of the proper shutdown sequence.  This also includes
invalidating partitions if there are any.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Message-Id: <20230811100828.1897174-8-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-21 14:35:30 +02:00
Ming Lei
851e06297f ublk: zoned: support REQ_OP_ZONE_RESET_ALL
There isn't any reason to not support REQ_OP_ZONE_RESET_ALL given everything
is actually handled in userspace, not mention it is pretty easy to support
RESET_ALL.

So enable REQ_OP_ZONE_RESET_ALL and let userspace handle it.

Verified by 'tools/zbc_reset_zone -all /dev/ublkb0' in libzbc[1] with
libublk-rs based ublk-zoned target prototype[2], follows command line
for creating ublk-zoned:

	cargo run --example zoned -- add -1 1024	# add $dev_id $DEV_SIZE

[1] https://github.com/westerndigitalcorporation/libzbc
[2] https://github.com/ming1/libublk-rs/tree/zoned.v2

Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Andreas Hindborg <a.hindborg@samsung.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230810124326.321472-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-20 20:24:34 -06:00
Linus Torvalds
2383ffc41a block-6.5-2023-08-19
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmTg19oQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmLjD/wKs0hA4JOQDqZPZK1p1aBU4f0vXwQxFGlL
 +gcnO4/MIB/5Ud+T+SXYuMrimLws7xsVbymcGatiRjH8LTfJVXFhuAzLILi0AcHw
 nhzjOUEzHokUex+tZLZRZxmavR+9SyGJoFNIbh+mY8JOLdNVzFDSqnLWO+D02Q2R
 OOBupA0mLRelYODEm2rI4xlQndwfrOAoAyEv+R7Ug0F6bFSno36QOg64pmZVI0Fl
 eudORXnIRYdtUajv+kNATWoqBbq/UCuBJdk0veM07Try6ZGRXRh6dQSA+GRh93pE
 Zg3JAHj4MKwlP3/wglw3SzoeECHpZrKQavIQQe9pTWKP4xGI/jdbVBcyFE0ERc66
 HijMo6CLeAzpOI1nEv+QhD8ntr4polEiWL4EVLuoXE9fVI1mYzavqmqrsDHeOHeF
 IJHadXZwsTG243msDvqedy0RFBwAkpnK0XdQuDtMnSa7UHwWWbxwUOwO5p4COJ3g
 vmrCfPQr7TTgkOtAXoMnwOZ1troEGxa/2CdUKaTdVG8RkMeM2qy8tmBBTV9Bx6+i
 rwQbB/JJm5SE6DX309TRaR6w+5YiwR6e7ECKx5hdYXia7M3OxlBBvl1NOfiWjWE3
 abC38/FReHLmFKHaDaN2AM1vLy+duc4NEc/yMQ4FDcfj/hUHQCoZBPYUsvlC+a4e
 Ws4qoMLU8A==
 =LnzH
 -----END PGP SIGNATURE-----

Merge tag 'block-6.5-2023-08-19' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:
 "Main thing here is the fix for the regression in flush handling which
  caused IO hangs/stalls for a few reporters. Hopefully that should all
  be sorted out now. Outside of that, just a few minor fixes for issues
  that were introduced in this cycle"

* tag 'block-6.5-2023-08-19' of git://git.kernel.dk/linux:
  blk-mq: release scheduler resource when request completes
  blk-crypto: dynamically allocate fallback profile
  blk-cgroup: hold queue_lock when removing blkg->q_node
  drivers/rnbd: restore sysfs interface to rnbd-client
2023-08-19 17:31:46 +02:00
Li Zhijian
6548fce058 drivers/rnbd: restore sysfs interface to rnbd-client
Commit 137380c0ec renamed 'rnbd-client' to 'rnbd_client', this changed
sysfs interface to /sys/devices/virtual/rnbd_client/ctl/map_device
from /sys/devices/virtual/rnbd-client/ctl/map_device.

CC: Ivan Orlov <ivan.orlov0322@gmail.com>
CC: "Md. Haris Iqbal" <haris.iqbal@ionos.com>
CC: Jack Wang <jinpu.wang@ionos.com>
Fixes: 137380c0ec ("block/rnbd: make all 'class' structures const")
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20230816022210.2501228-1-lizhijian@fujitsu.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-18 15:00:39 -06:00
Jakub Kicinski
7ff57803d2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/ethernet/sfc/tc.c
  fa165e1949 ("sfc: don't unregister flow_indr if it was never registered")
  3bf969e88a ("sfc: add MAE table machinery for conntrack table")
https://lore.kernel.org/all/20230818112159.7430e9b4@canb.auug.org.au/

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-18 12:44:56 -07:00
Li Zhijian
f4283bc7e3 drivers/rnbd: restore sysfs interface to rnbd-client
Commit 137380c0ec renamed 'rnbd-client' to 'rnbd_client', this changed
sysfs interface to /sys/devices/virtual/rnbd_client/ctl/map_device
from /sys/devices/virtual/rnbd-client/ctl/map_device.

CC: Ivan Orlov <ivan.orlov0322@gmail.com>
CC: "Md. Haris Iqbal" <haris.iqbal@ionos.com>
CC: Jack Wang <jinpu.wang@ionos.com>
Fixes: 137380c0ec ("block/rnbd: make all 'class' structures const")
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20230816022210.2501228-1-lizhijian@fujitsu.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-16 09:34:47 -06:00
Jakub Kicinski
bffcc6882a genetlink: remove userhdr from struct genl_info
Only three families use info->userhdr today and going forward
we discourage using fixed headers in new families.
So having the pointer to user header in struct genl_info
is an overkill. Compute the header pointer at runtime.

Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15 14:54:44 -07:00
Ruan Jinjie
66a6a5d0ec ublk: Switch to memdup_user_nul() helper
Use memdup_user_nul() helper instead of open-coding
to simplify the code.

Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230815114815.1551171-1-ruanjinjie@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-15 10:06:54 -06:00
Linus Torvalds
360e694282 block-6.5-2023-08-11
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmTWfLQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpg3nEACROhaeX6cpeDCSTqDVDW/ontbyn15eX7ep
 tGPLn/TVtKv2AztIobEinS08MdywqBO/VcB7XkxQV9Ov4JqCHIAKhndWI6/HqD9P
 DH3h6tE5JA8RQlNw1aHRrqWWIl1lpDQI6263um1tB2TuaxRa4xuR560jju0VZzAm
 9541ceKlJT8Qc7yG0aiiCv6Bxz+b6Htv3DqCf1mY2yznl3BpN52RQHKhiA0sfnlF
 WKqNsvSJ9/kz3vJbNpFucO7ch8a7W+MzmBx0vf2ickTBpL/3hbhUOrE7dGeKI9rS
 cWh1HaULWqjnKY1uxF9nnapZxm8QoxkT/5T0DgmprKjwuZivfLASAhYpHBc3mT1S
 eQQ0AK8hqx7sPnPeO/kxWtxM2nzRLkeVd19ClbIwux/zDbRrpHWk2/wgnSUUd3/H
 HBbjbgPWbkgLvTOUKhIA5VPBcgkC1efom1+ePzkH/H4TRRuVJwg6s6utGXdgc1PX
 +B4TA8GtXRH/7L0tsblFyJRmd0Y6G7gYE/yy0DYZTMie3oaWrKx3lmz48AQUtEzh
 DG46VRA4wnthHRlw3mkLP7C6z4PJvK9WWBiK11eZ9VfJMF643FNpXQ3/bviR9pfF
 kXdwYXoi1mlnsQ0VUhu2f+JeV4hHalrjwD/VE2H0E8Ogb4ezmJteLyiZKcw5xwaA
 Hmtmbb7Qxw==
 =+1Vt
 -----END PGP SIGNATURE-----

Merge tag 'block-6.5-2023-08-11' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Keith:
      - Fixes for request_queue state (Ming)
      - Another uuid quirk (August)

 - RCU poll fix for NVMe (Ming)

 - Fix for an IO stall with polled IO (me)

 - Fix for blk-iocost stats enable/disable accounting (Chengming)

 - Regression fix for large pages for zram (Christoph)

* tag 'block-6.5-2023-08-11' of git://git.kernel.dk/linux:
  nvme: core: don't hold rcu read lock in nvme_ns_chr_uring_cmd_iopoll
  blk-iocost: fix queue stats accounting
  block: don't make REQ_POLLED imply REQ_NOWAIT
  block: get rid of unused plug->nowait flag
  zram: take device and not only bvec offset into account
  nvme-pci: add NVME_QUIRK_BOGUS_NID for Samsung PM9B1 256G and 512G
  nvme-rdma: fix potential unbalanced freeze & unfreeze
  nvme-tcp: fix potential unbalanced freeze & unfreeze
  nvme: fix possible hang when removing a controller during error recovery
2023-08-11 12:14:08 -07:00
Ming Lei
e24721e441 ublk: fix 'warn: variable dereferenced before check 'req'' from Smatch
The added check of 'req_op(req) == REQ_OP_ZONE_APPEND' should have been
done after the request is confirmed as valid.

Actually here, the request should always been true, so add one
WARN_ON_ONCE(!req), meantime move the zone_append check after
checking the request.

Cc: Andreas Hindborg <a.hindborg@samsung.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Fixes: 29802d7ca3 ("ublk: enable zoned storage support")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230811135216.420404-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11 08:13:23 -06:00
Arnd Bergmann
d21fed50c5 swim3: mark swim3_init() static
This is the module init function, which by definition is used only
locally, so mark it static to avoid a warning:

drivers/block/swim3.c:1280:5: error: no previous prototype for 'swim3_init' [-Werror=missing-prototypes]

Reviewed-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-10 08:32:59 -06:00
Li Zetao
c8659bbb15 ublk: Fix signedness bug returning warning
There are two warnings reported by smatch:

drivers/block/ublk_drv.c:445 ublk_setup_iod_zoned() warn:
	signedness bug returning '(-95)'
drivers/block/ublk_drv.c:963 ublk_setup_iod() warn:
	signedness bug returning '(-5)'

The type of "blk_status_t" is either be a u32 or u8, but this two
functions return a negative value when not supported or failed. Use
the error code of the blk module to fix these warnings.

Fixes: 29802d7ca3 ("ublk: enable zoned storage support")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202308100201.TCRhgdvN-lkp@intel.com/
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230810084836.3535322-1-lizetao1@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-10 07:19:23 -06:00
Andreas Hindborg
29802d7ca3 ublk: enable zoned storage support
Add zoned storage support to ublk: report_zones and operations:
 - REQ_OP_ZONE_OPEN
 - REQ_OP_ZONE_CLOSE
 - REQ_OP_ZONE_FINISH
 - REQ_OP_ZONE_RESET
 - REQ_OP_ZONE_APPEND

The zone append feature uses the `addr` field of `struct ublksrv_io_cmd` to
communicate ALBA back to the kernel. Therefore ublk must be used with the
user copy feature (UBLK_F_USER_COPY) for zoned storage support to be
available. Without this feature, ublk will not allow zoned storage support.

Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230804114610.179530-4-nmi@metaspace.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-08 15:45:53 -06:00
Andreas Hindborg
1a6e88b959 ublk: move check for empty address field on command submission
In preparation for zoned storage support, move the check for empty `addr`
field into the command handler case statement. Note that the check makes no
sense for `UBLK_IO_NEED_GET_DATA` because the `addr` field must always be
set for this command.

Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230804114610.179530-3-nmi@metaspace.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-08 15:45:53 -06:00
Andreas Hindborg
9d4ed6d462 ublk: add helper to check if device supports user copy
This will be used by ublk zoned storage support.

Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230804114610.179530-2-nmi@metaspace.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-08 15:45:53 -06:00
Christoph Hellwig
95848dcb9d zram: take device and not only bvec offset into account
Commit af8b04c637 ("zram: simplify bvec iteration in
__zram_make_request") changed the bio iteration in zram to rely on the
implicit capping to page boundaries in bio_for_each_segment.  But it
failed to care for the fact zram not only care about the page alignment
of the bio payload, but also the page alignment into the device.  For
buffered I/O and swap those are the same, but for direct I/O or kernel
internal I/O like XFS log buffer writes they can differ.

Fix this by open coding bio_for_each_segment and limiting the bvec len
so that it never crosses over a page alignment boundary in the device
in addition to the payload boundary already taken care of by
bio_iter_iovec.

Cc: stable@vger.kernel.org
Fixes: af8b04c637 ("zram: simplify bvec iteration in __zram_make_request")
Reported-by: Dusty Mabe <dusty@dustymabe.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Link: https://lore.kernel.org/r/20230805055537.147835-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-05 16:13:15 -06:00
Ilya Dryomov
9d01e07fd1 rbd: prevent busy loop when requesting exclusive lock
Due to rbd_try_acquire_lock() effectively swallowing all but
EBLOCKLISTED error from rbd_try_lock() ("request lock anyway") and
rbd_request_lock() returning ETIMEDOUT error not only for an actual
notify timeout but also when the lock owner doesn't respond, a busy
loop inside of rbd_acquire_lock() between rbd_try_acquire_lock() and
rbd_request_lock() is possible.

Requesting the lock on EBUSY error (returned by get_lock_owner_info()
if an incompatible lock or invalid lock owner is detected) makes very
little sense.  The same goes for ETIMEDOUT error (might pop up pretty
much anywhere if osd_request_timeout option is set) and many others.

Just fail I/O requests on rbd_dev->acquiring_list immediately on any
error from rbd_try_lock().

Cc: stable@vger.kernel.org # 588159009d5b: rbd: retrieve and check lock owner twice before blocklisting
Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2023-08-02 09:07:34 +02:00
Linus Torvalds
e62e26d3e9 A patch to reduce the potential for erroneous RBD exclusive lock
blocklisting (fencing) with a couple of prerequisites and a fixup to
 prevent metrics from being sent to the MDS even just once after that
 has been disabled by the user.  All marked for stable.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmTD7MgTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi3SBB/4nHdfSQwy0z2+PM766sUKxSlmaRw8X
 4AJyGAIGj5BnHHhtluwLpEYfrh3wfyCRaNYgS64jdsudbUBxPKIIWn2lEFxtyWbC
 w0R2uEc+NGJLJOYfJ+lBP06Q2r6qk7N6OGNy6qLaN+v6xJ8WPw7H3fJBLVhnPgMq
 7lkACRN+0P5Xt6ZJ57kbWWiFQ+vjv7bbDa0P9zMl6uCgoYsIpvrskqygx+gHbdsq
 IcnpsHu3F0ycYAJT5eJ5GcCcThvwbNjWdbJy1fERah7U/LNcX/S3To9V5LPmydOQ
 tYAWMlC/1a99fr+jTYF0Pu5GLUdK0UMKeX04ZN3SKON4pORurpypw3n8
 =rw72
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.5-rc4' of https://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "A patch to reduce the potential for erroneous RBD exclusive lock
  blocklisting (fencing) with a couple of prerequisites and a fixup to
  prevent metrics from being sent to the MDS even just once after that
  has been disabled by the user. All marked for stable"

* tag 'ceph-for-6.5-rc4' of https://github.com/ceph/ceph-client:
  rbd: retrieve and check lock owner twice before blocklisting
  rbd: harden get_lock_owner_info() a bit
  rbd: make get_lock_owner_info() return a single locker or NULL
  ceph: never send metrics if disable_send_metrics is set
2023-07-28 10:47:24 -07:00
Linus Torvalds
818680d154 block-6.5-2023-07-28
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmTD2VcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqzgEADT9lc0fr8wnp2fu/7NhkdsaPtuoUKiH7ju
 qxWExXIzxxKYM6OdCdYho9SvIjtdblraZOGqrEni8KMcKnR6vDd8BjGzPJ2BJQe2
 GhU0VLVqAw2uoaLhWIZ3m+CyKn7rggoCwpTAUFQG7vL6vSW+ExkWs6g4XcrWhq5V
 XCPJnNRc2o6OdFnv1xBNy1YRrP2ME+HaghJdDNxVM4R9hOv4PVvYrW7P2fK1gurD
 o3JgTAQg6//16ZBAMcZDkwuGGFVdvG16EiE/0z3QdP4WeDqSFqqxYtk8m/LqAx31
 FswUI+YDmKJHSfbLCh6pW4j81xH+svVi0LiANdlxCMGg3zouft+wh/KGdQJC/rcb
 ocFgUCOL4P4208wXkuUjtcAI0zmEMPAIoRQirOzDExja0bI4LJrp3GKJ78U4MbDo
 iBM5qgkH+MqwD/jXMOY5MnEEBzqceW7Q5Al6emE093kRenIYulEL6ZFy05vfqG1+
 NuQcvSJ9f0mgTDUGr0bXytshTG4vSHtKZ9UfAgh5FzPBnPYpwRbxlq6i/7+DxU6k
 33FTzNLuRwDSEo33Uj75KwK1glizofYj6lrqYxVw/9jkVAEEzpjr6H8Su1+dTt8L
 ZbXSQKdxsAxGdbIMeA1lUfUmqQHi1ouTLsBzwkceKCeN3UXmzGOnNBXmvoVXSv0A
 v4RNV+WHCA==
 =TD54
 -----END PGP SIGNATURE-----

Merge tag 'block-6.5-2023-07-28' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:
 "A few fixes that should go into the current kernel release, mainly:

   - Set of fixes for dasd (Stefan)

   - Handle interruptible waits returning because of a signal for ublk
     (Ming)"

* tag 'block-6.5-2023-07-28' of git://git.kernel.dk/linux:
  ublk: return -EINTR if breaking from waiting for existed users in DEL_DEV
  ublk: fail to recover device if queue setup is interrupted
  ublk: fail to start device if queue setup is interrupted
  block: Fix a source code comment in include/uapi/linux/blkzoned.h
  s390/dasd: print copy pair message only for the correct error
  s390/dasd: fix hanging device after request requeue
  s390/dasd: use correct number of retries for ERP requests
  s390/dasd: fix hanging device after quiesce/resume
2023-07-28 10:23:41 -07:00
Ming Lei
3e9dce80db ublk: return -EINTR if breaking from waiting for existed users in DEL_DEV
If user interrupts wait_event_interruptible() in ublk_ctrl_del_dev(),
return -EINTR and let user know what happens.

Fixes: 0abe39dec0 ("block: ublk: improve handling device deletion")
Reported-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/r/20230726144502.566785-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-27 07:17:36 -06:00
Ming Lei
0c0cbd4ebc ublk: fail to recover device if queue setup is interrupted
In ublk_ctrl_end_recovery(), if wait_for_completion_interruptible() is
interrupted by signal, queues aren't setup successfully yet, so we
have to fail UBLK_CMD_END_USER_RECOVERY, otherwise kernel oops can be
triggered.

Fixes: c732a852b4 ("ublk_drv: add START_USER_RECOVERY and END_USER_RECOVERY support")
Reported-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/r/20230726144502.566785-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-27 07:17:36 -06:00
Ming Lei
53e7d08f6d ublk: fail to start device if queue setup is interrupted
In ublk_ctrl_start_dev(), if wait_for_completion_interruptible() is
interrupted by signal, queues aren't setup successfully yet, so we
have to fail UBLK_CMD_START_DEV, otherwise kernel oops can be triggered.

Reported by German when working on qemu-storage-deamon which requires
single thread ublk daemon.

Fixes: 71f28f3136 ("ublk_drv: add io_uring based userspace block driver")
Reported-by: German Maglione <gmaglione@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230726144502.566785-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-27 07:17:36 -06:00
Ilya Dryomov
588159009d rbd: retrieve and check lock owner twice before blocklisting
An attempt to acquire exclusive lock can race with the current lock
owner closing the image:

1. lock is held by client123, rbd_lock() returns -EBUSY
2. get_lock_owner_info() returns client123 instance details
3. client123 closes the image, lock is released
4. find_watcher() returns 0 as there is no matching watcher anymore
5. client123 instance gets erroneously blocklisted

Particularly impacted is mirror snapshot scheduler in snapshot-based
mirroring since it happens to open and close images a lot (images are
opened only for as long as it takes to take the next mirror snapshot,
the same client instance is used for all images).

To reduce the potential for erroneous blocklisting, retrieve the lock
owner again after find_watcher() returns 0.  If it's still there, make
sure it matches the previously detected lock owner.

Cc: stable@vger.kernel.org # f38cb9d9c204: rbd: make get_lock_owner_info() return a single locker or NULL
Cc: stable@vger.kernel.org # 8ff2c64c9765: rbd: harden get_lock_owner_info() a bit
Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2023-07-26 15:08:21 +02:00
Ilya Dryomov
8ff2c64c97 rbd: harden get_lock_owner_info() a bit
- we want the exclusive lock type, so test for it directly
- use sscanf() to actually parse the lock cookie and avoid admitting
  invalid handles
- bail if locker has a blank address

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2023-07-26 15:08:09 +02:00
Ilya Dryomov
f38cb9d9c2 rbd: make get_lock_owner_info() return a single locker or NULL
Make the "num_lockers can be only 0 or 1" assumption explicit and
simplify the API by getting rid of output parameters in preparation
for calling get_lock_owner_info() twice before blocklisting.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
2023-07-26 15:08:08 +02:00
Linus Torvalds
f036d67c02 block-6.5-2023-07-21
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmS629wQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpv/7D/99ysE5ZszmjxNOmyy1lGfqtQnaTLuToRsl
 wB16umIPAFfye5r4TV8l9GZuUyI7FU8LySglu0Y0qMKmCp+kJKLh90kB281Co4Dn
 yp1AbqlTorAlG4ElQJBRaQr4kaqqvI2tzeVmFdUhIE1oX2e9OX/O+YKa8k1JfsKI
 oecChQgodlPxX3wusItgiyvZKl2q2+mivg5E6cqiGIgP3uF8fmOQCbio4Vm8ZSxb
 TO8JEfBTiXslR+CvJD3Gi96pzexN1qCUed8/7FDiIUufhETmwqSIOo89GxzGAQ6O
 7o/83IkqgXPHjKLYs3R4/jhHPXZmXmvDZHWIiSg+KLOFqxxWmRPNJ6V6igIBP8SG
 eu5PTA7SDGtvIXePpu38FTPmSiUW7MbGhnjqY8u64Je6MaQ8l28KN7xkFtmxV+n4
 hgB0gr6uKBnXMKZHobk0yJeUUI/L/0ESzbVPDHY8JM/rQCsp1eSNQDpZoVjPWZmg
 lMGYmOq57oPA20LVch7U3gUFhD4CJ7c3e2/EzJdJVjsTveTYieBCEESQErFbMcEr
 VuRZSAGnPyXQ4yF4wG93x4sDye28ZFS/Q9c6Q3DCUxctDkCz4eY1+vmdX+NJXwDA
 aYXCyyKzk18udbKvV0QvTuDTb6PrJDPxbFagCveibPTtP4XDMv1LvpdZPUPJ/HGX
 4xA1mrsGJA==
 =e2OR
 -----END PGP SIGNATURE-----

Merge tag 'block-6.5-2023-07-21' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - Fix for loop regressions (Mauricio)

 - Fix a potential stall with batched wakeups in sbitmap (David)

 - Fix for stall with recursive plug flushes (Ross)

 - Skip accounting of empty requests for blk-iocost (Chengming)

 - Remove a dead field in struct blk_mq_hw_ctx (Chengming)

* tag 'block-6.5-2023-07-21' of git://git.kernel.dk/linux:
  loop: do not enforce max_loop hard limit by (new) default
  loop: deprecate autoloading callback loop_probe()
  sbitmap: fix batching wakeup
  blk-iocost: skip empty flush bio in iocost
  blk-mq: delete dead struct blk_mq_hw_ctx->queued field
  blk-mq: Fix stall due to recursive flush plug
2023-07-22 11:05:15 -07:00
Mauricio Faria de Oliveira
bb5faa99f0 loop: do not enforce max_loop hard limit by (new) default
Problem:

The max_loop parameter is used for 2 different purposes:

1) initial number of loop devices to pre-create on init
2) maximum number of loop devices to add on access/open()

Historically, its default value (zero) caused 1) to create non-zero
number of devices (CONFIG_BLK_DEV_LOOP_MIN_COUNT), and no hard limit on
2) to add devices with autoloading.

However, the default value changed in commit 85c5019771 ("loop: Fix
the max_loop commandline argument treatment when it is set to 0") to
CONFIG_BLK_DEV_LOOP_MIN_COUNT, for max_loop=0 not to pre-create devices.

That does improve 1), but unfortunately it breaks 2), as the default
behavior changed from no-limit to hard-limit.

Example:

For example, this userspace code broke for N >= CONFIG, if the user
relied on the default value 0 for max_loop:

    mknod("/dev/loopN");
    open("/dev/loopN");  // now fails with ENXIO

Though affected users may "fix" it with (loop.)max_loop=0, this means to
require a kernel parameter change on stable kernel update (that commit
Fixes: an old commit in stable).

Solution:

The original semantics for the default value in 2) can be applied if the
parameter is not set (ie, default behavior).

This still keeps the intended function in 1) and 2) if set, and that
commit's intended improvement in 1) if max_loop=0.

Before 85c50197716c:
  - default:     1) CONFIG devices   2) no limit
  - max_loop=0:  1) CONFIG devices   2) no limit
  - max_loop=X:  1) X devices        2) X limit

After 85c50197716c:
  - default:     1) CONFIG devices   2) CONFIG limit (*)
  - max_loop=0:  1) 0 devices (*)    2) no limit
  - max_loop=X:  1) X devices        2) X limit

This commit:
  - default:     1) CONFIG devices   2) no limit (*)
  - max_loop=0:  1) 0 devices        2) no limit
  - max_loop=X:  1) X devices        2) X limit

Future:

The issue/regression from that commit only affects code under the
CONFIG_BLOCK_LEGACY_AUTOLOAD deprecation guard, thus the fix too is
contained under it.

Once that deprecated functionality/code is removed, the purpose 2) of
max_loop (hard limit) is no longer in use, so the module parameter
description can be changed then.

Tests:

Linux 6.4-rc7
CONFIG_BLK_DEV_LOOP_MIN_COUNT=8
CONFIG_BLOCK_LEGACY_AUTOLOAD=y

- default (original)

	# ls -1 /dev/loop*
	/dev/loop-control
	/dev/loop0
	...
	/dev/loop7

	# ./test-loop
	open: /dev/loop8: No such device or address

- default (patched)

	# ls -1 /dev/loop*
	/dev/loop-control
	/dev/loop0
	...
	/dev/loop7

	# ./test-loop
	#

- max_loop=0 (original & patched):

	# ls -1 /dev/loop*
	/dev/loop-control

	# ./test-loop
	#

- max_loop=8 (original & patched):

	# ls -1 /dev/loop*
	/dev/loop-control
	/dev/loop0
	...
	/dev/loop7

	# ./test-loop
	open: /dev/loop8: No such device or address

- max_loop=0 (patched; CONFIG_BLOCK_LEGACY_AUTOLOAD is not set)

	# ls -1 /dev/loop*
	/dev/loop-control

	# ./test-loop
	open: /dev/loop8: No such device or address

Fixes: 85c5019771 ("loop: Fix the max_loop commandline argument treatment when it is set to 0")
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230720143033.841001-3-mfo@canonical.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-21 13:20:57 -06:00
Mauricio Faria de Oliveira
23881aec85 loop: deprecate autoloading callback loop_probe()
The 'probe' callback in __register_blkdev() is only used under the
CONFIG_BLOCK_LEGACY_AUTOLOAD deprecation guard.

The loop_probe() function is only used for that callback, so guard it
too, accordingly.

See commit fbdee71bb5 ("block: deprecate autoloading based on dev_t").

Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230720143033.841001-2-mfo@canonical.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-21 13:20:48 -06:00
Thomas Weißschuh
ffe357c868 nbd: automatically load module on genl access
Add a module alias to nbd.ko that allows the generic netlink core to
automatically load the module when netlink messages for nbd are
received.

This frees the user from manually having to load the module before using
nbd functionality via netlink.
If the system policy allows it this can even be used to load the nbd
module from containers which would otherwise not have access to the
necessary module files to do a normal "modprobe nbd".

For example this avoids the following error when using nbd-client:

$ nbd-client localhost 10809 /dev/nbd0
...
Error: Couldn't resolve the nbd netlink family, make sure the nbd module is loaded and your nbd driver supports the netlink interface.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Josef Bacik <josef@toxicpadna.com>
Link: https://lore.kernel.org/r/20230713-b4-nbd-genl-v3-1-226cbddba04b@weissschuh.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17 08:18:17 -06:00
Martin K. Petersen
e96277a570 Merge branch '6.5/scsi-staging' into 6.5/scsi-fixes
Pull in the currently staged SCSI fixes for 6.5.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-07-11 12:15:15 -04:00
Damien Le Moal
a3d96ed215 scsi: block: virtio_blk: Set zone limits before revalidating zones
In virtblk_probe_zoned_device(), execute blk_queue_chunk_sectors() and
blk_queue_max_zone_append_sectors() to respectively set the zoned device
zone size and maximum zone append sector limit before executing
blk_revalidate_disk_zones(). This is to allow the block layer zone
reavlidation to check these device characteristics prior to checking all
zones of the device.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230703024812.76778-5-dlemoal@kernel.org
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-07-05 21:58:10 -04:00
Damien Le Moal
a442b899fe scsi: block: nullblk: Set zone limits before revalidating zones
In null_register_zoned_dev(), execute blk_queue_chunk_sectors() and
blk_queue_max_zone_append_sectors() to respectively set the zoned device
zone size and maximum zone append sector limit before executing
blk_revalidate_disk_zones(). This is to allow the block layer zone
reavlidation to check these device characteristics prior to checking all
zones of the device.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230703024812.76778-4-dlemoal@kernel.org
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-07-05 21:58:10 -04:00
Linus Torvalds
3a8a670eee Networking changes for 6.5.
Core
 ----
 
  - Rework the sendpage & splice implementations. Instead of feeding
    data into sockets page by page extend sendmsg handlers to support
    taking a reference on the data, controlled by a new flag called
    MSG_SPLICE_PAGES. Rework the handling of unexpected-end-of-file
    to invoke an additional callback instead of trying to predict what
    the right combination of MORE/NOTLAST flags is.
    Remove the MSG_SENDPAGE_NOTLAST flag completely.
 
  - Implement SCM_PIDFD, a new type of CMSG type analogous to
    SCM_CREDENTIALS, but it contains pidfd instead of plain pid.
 
  - Enable socket busy polling with CONFIG_RT.
 
  - Improve reliability and efficiency of reporting for ref_tracker.
 
  - Auto-generate a user space C library for various Netlink families.
 
 Protocols
 ---------
 
  - Allow TCP to shrink the advertised window when necessary, prevent
    sk_rcvbuf auto-tuning from growing the window all the way up to
    tcp_rmem[2].
 
  - Use per-VMA locking for "page-flipping" TCP receive zerocopy.
 
  - Prepare TCP for device-to-device data transfers, by making sure
    that payloads are always attached to skbs as page frags.
 
  - Make the backoff time for the first N TCP SYN retransmissions
    linear. Exponential backoff is unnecessarily conservative.
 
  - Create a new MPTCP getsockopt to retrieve all info (MPTCP_FULL_INFO).
 
  - Avoid waking up applications using TLS sockets until we have
    a full record.
 
  - Allow using kernel memory for protocol ioctl callbacks, paving
    the way to issuing ioctls over io_uring.
 
  - Add nolocalbypass option to VxLAN, forcing packets to be fully
    encapsulated even if they are destined for a local IP address.
 
  - Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure
    in-kernel ECMP implementation (e.g. Open vSwitch) select the same
    link for all packets. Support L4 symmetric hashing in Open vSwitch.
 
  - PPPoE: make number of hash bits configurable.
 
  - Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client
    (ipconfig).
 
  - Add layer 2 miss indication and filtering, allowing higher layers
    (e.g. ACL filters) to make forwarding decisions based on whether
    packet matched forwarding state in lower devices (bridge).
 
  - Support matching on Connectivity Fault Management (CFM) packets.
 
  - Hide the "link becomes ready" IPv6 messages by demoting their
    printk level to debug.
 
  - HSR: don't enable promiscuous mode if device offloads the proto.
 
  - Support active scanning in IEEE 802.15.4.
 
  - Continue work on Multi-Link Operation for WiFi 7.
 
 BPF
 ---
 
  - Add precision propagation for subprogs and callbacks. This allows
    maintaining verification efficiency when subprograms are used,
    or in fact passing the verifier at all for complex programs,
    especially those using open-coded iterators.
 
  - Improve BPF's {g,s}setsockopt() length handling. Previously BPF
    assumed the length is always equal to the amount of written data.
    But some protos allow passing a NULL buffer to discover what
    the output buffer *should* be, without writing anything.
 
  - Accept dynptr memory as memory arguments passed to helpers.
 
  - Add routing table ID to bpf_fib_lookup BPF helper.
 
  - Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands.
 
  - Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark
    maps as read-only).
 
  - Show target_{obj,btf}_id in tracing link fdinfo.
 
  - Addition of several new kfuncs (most of the names are self-explanatory):
    - Add a set of new dynptr kfuncs: bpf_dynptr_adjust(),
      bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size()
      and bpf_dynptr_clone().
    - bpf_task_under_cgroup()
    - bpf_sock_destroy() - force closing sockets
    - bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs
 
 Netfilter
 ---------
 
  - Relax set/map validation checks in nf_tables. Allow checking
    presence of an entry in a map without using the value.
 
  - Increase ip_vs_conn_tab_bits range for 64BIT builds.
 
  - Allow updating size of a set.
 
  - Improve NAT tuple selection when connection is closing.
 
 Driver API
 ----------
 
  - Integrate netdev with LED subsystem, to allow configuring HW
    "offloaded" blinking of LEDs based on link state and activity
    (i.e. packets coming in and out).
 
  - Support configuring rate selection pins of SFP modules.
 
  - Factor Clause 73 auto-negotiation code out of the drivers, provide
    common helper routines.
 
  - Add more fool-proof helpers for managing lifetime of MDIO devices
    associated with the PCS layer.
 
  - Allow drivers to report advanced statistics related to Time Aware
    scheduler offload (taprio).
 
  - Allow opting out of VF statistics in link dump, to allow more VFs
    to fit into the message.
 
  - Split devlink instance and devlink port operations.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - Synopsys EMAC4 IP support (stmmac)
    - Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches
    - Marvell 88E6250 7 port switches
    - Microchip LAN8650/1 Rev.B0 PHYs
    - MediaTek MT7981/MT7988 built-in 1GE PHY driver
 
  - WiFi:
    - Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps
    - Realtek RTL8723DS (SDIO variant)
    - Realtek RTL8851BE
 
  - CAN:
    - Fintek F81604
 
 Drivers
 -------
 
  - Ethernet NICs:
    - Intel (100G, ice):
      - support dynamic interrupt allocation
      - use meta data match instead of VF MAC addr on slow-path
    - nVidia/Mellanox:
      - extend link aggregation to handle 4, rather than just 2 ports
      - spawn sub-functions without any features by default
    - OcteonTX2:
      - support HTB (Tx scheduling/QoS) offload
      - make RSS hash generation configurable
      - support selecting Rx queue using TC filters
    - Wangxun (ngbe/txgbe):
      - add basic Tx/Rx packet offloads
      - add phylink support (SFP/PCS control)
    - Freescale/NXP (enetc):
      - report TAPRIO packet statistics
    - Solarflare/AMD:
      - support matching on IP ToS and UDP source port of outer header
      - VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6
      - add devlink dev info support for EF10
 
  - Virtual NICs:
    - Microsoft vNIC:
      - size the Rx indirection table based on requested configuration
      - support VLAN tagging
    - Amazon vNIC:
      - try to reuse Rx buffers if not fully consumed, useful for ARM
        servers running with 16kB pages
    - Google vNIC:
      - support TCP segmentation of >64kB frames
 
  - Ethernet embedded switches:
    - Marvell (mv88e6xxx):
      - enable USXGMII (88E6191X)
    - Microchip:
     - lan966x: add support for Egress Stage 0 ACL engine
     - lan966x: support mapping packet priority to internal switch
       priority (based on PCP or DSCP)
 
  - Ethernet PHYs:
    - Broadcom PHYs:
      - support for Wake-on-LAN for BCM54210E/B50212E
      - report LPI counter
    - Microsemi PHYs: support RGMII delay configuration (VSC85xx)
    - Micrel PHYs: receive timestamp in the frame (LAN8841)
    - Realtek PHYs: support optional external PHY clock
    - Altera TSE PCS: merge the driver into Lynx PCS which it is
      a variant of
 
  - CAN: Kvaser PCIEcan:
    - support packet timestamping
 
  - WiFi:
    - Intel (iwlwifi):
      - major update for new firmware and Multi-Link Operation (MLO)
      - configuration rework to drop test devices and split
        the different families
      - support for segmented PNVM images and power tables
      - new vendor entries for PPAG (platform antenna gain) feature
    - Qualcomm 802.11ax (ath11k):
      - Multiple Basic Service Set Identifier (MBSSID) and
        Enhanced MBSSID Advertisement (EMA) support in AP mode
      - support factory test mode
    - RealTek (rtw89):
      - add RSSI based antenna diversity
      - support U-NII-4 channels on 5 GHz band
    - RealTek (rtl8xxxu):
      - AP mode support for 8188f
      - support USB RX aggregation for the newer chips
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmSbJM4ACgkQMUZtbf5S
 IrtoDhAAhEim1+LBIKf4lhPcVdZ2p/TkpnwTz5jsTwSeRBAxTwuNJ2fQhFXg13E3
 MnRq6QaEp8G4/tA/gynLvQop+FEZEnv+horP0zf/XLcC8euU7UrKdrpt/4xxdP07
 IL/fFWsoUGNO+L9LNaHwBo8g7nHvOkPscHEBHc2Xrvzab56TJk6vPySfLqcpKlNZ
 CHWDwTpgRqNZzSKiSpoMVd9OVMKUXcPYHpDmfEJ5l+e8vTXmZzOLHrSELHU5nP5f
 mHV7gxkDCTshoGcaed7UTiOvgu1p6E5EchDJxiLaSUbgsd8SZ3u4oXwRxgj33RK/
 fB2+UaLrRt/DdlHvT/Ph8e8Ygu77yIXMjT49jsfur/zVA0HEA2dFb7V6QlsYRmQp
 J25pnrdXmE15llgqsC0/UOW5J1laTjII+T2T70UOAqQl4LWYAQDG4WwsAqTzU0KY
 dueydDouTp9XC2WYrRUEQxJUzxaOaazskDUHc5c8oHp/zVBT+djdgtvVR9+gi6+7
 yy4elI77FlEEqL0ItdU/lSWINayAlPLsIHkMyhSGKX0XDpKjeycPqkNx4UterXB/
 JKIR5RBWllRft+igIngIkKX0tJGMU0whngiw7d1WLw25wgu4sB53hiWWoSba14hv
 tXMxwZs5iGaPcT38oRVMZz8I1kJM4Dz3SyI7twVvi4RUut64EG4=
 =9i4I
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking changes from Jakub Kicinski:
 "WiFi 7 and sendpage changes are the biggest pieces of work for this
  release. The latter will definitely require fixes but I think that we
  got it to a reasonable point.

  Core:

   - Rework the sendpage & splice implementations

     Instead of feeding data into sockets page by page extend sendmsg
     handlers to support taking a reference on the data, controlled by a
     new flag called MSG_SPLICE_PAGES

     Rework the handling of unexpected-end-of-file to invoke an
     additional callback instead of trying to predict what the right
     combination of MORE/NOTLAST flags is

     Remove the MSG_SENDPAGE_NOTLAST flag completely

   - Implement SCM_PIDFD, a new type of CMSG type analogous to
     SCM_CREDENTIALS, but it contains pidfd instead of plain pid

   - Enable socket busy polling with CONFIG_RT

   - Improve reliability and efficiency of reporting for ref_tracker

   - Auto-generate a user space C library for various Netlink families

  Protocols:

   - Allow TCP to shrink the advertised window when necessary, prevent
     sk_rcvbuf auto-tuning from growing the window all the way up to
     tcp_rmem[2]

   - Use per-VMA locking for "page-flipping" TCP receive zerocopy

   - Prepare TCP for device-to-device data transfers, by making sure
     that payloads are always attached to skbs as page frags

   - Make the backoff time for the first N TCP SYN retransmissions
     linear. Exponential backoff is unnecessarily conservative

   - Create a new MPTCP getsockopt to retrieve all info
     (MPTCP_FULL_INFO)

   - Avoid waking up applications using TLS sockets until we have a full
     record

   - Allow using kernel memory for protocol ioctl callbacks, paving the
     way to issuing ioctls over io_uring

   - Add nolocalbypass option to VxLAN, forcing packets to be fully
     encapsulated even if they are destined for a local IP address

   - Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure
     in-kernel ECMP implementation (e.g. Open vSwitch) select the same
     link for all packets. Support L4 symmetric hashing in Open vSwitch

   - PPPoE: make number of hash bits configurable

   - Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client
     (ipconfig)

   - Add layer 2 miss indication and filtering, allowing higher layers
     (e.g. ACL filters) to make forwarding decisions based on whether
     packet matched forwarding state in lower devices (bridge)

   - Support matching on Connectivity Fault Management (CFM) packets

   - Hide the "link becomes ready" IPv6 messages by demoting their
     printk level to debug

   - HSR: don't enable promiscuous mode if device offloads the proto

   - Support active scanning in IEEE 802.15.4

   - Continue work on Multi-Link Operation for WiFi 7

  BPF:

   - Add precision propagation for subprogs and callbacks. This allows
     maintaining verification efficiency when subprograms are used, or
     in fact passing the verifier at all for complex programs,
     especially those using open-coded iterators

   - Improve BPF's {g,s}setsockopt() length handling. Previously BPF
     assumed the length is always equal to the amount of written data.
     But some protos allow passing a NULL buffer to discover what the
     output buffer *should* be, without writing anything

   - Accept dynptr memory as memory arguments passed to helpers

   - Add routing table ID to bpf_fib_lookup BPF helper

   - Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands

   - Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark
     maps as read-only)

   - Show target_{obj,btf}_id in tracing link fdinfo

   - Addition of several new kfuncs (most of the names are
     self-explanatory):
      - Add a set of new dynptr kfuncs: bpf_dynptr_adjust(),
        bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size()
        and bpf_dynptr_clone().
      - bpf_task_under_cgroup()
      - bpf_sock_destroy() - force closing sockets
      - bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs

  Netfilter:

   - Relax set/map validation checks in nf_tables. Allow checking
     presence of an entry in a map without using the value

   - Increase ip_vs_conn_tab_bits range for 64BIT builds

   - Allow updating size of a set

   - Improve NAT tuple selection when connection is closing

  Driver API:

   - Integrate netdev with LED subsystem, to allow configuring HW
     "offloaded" blinking of LEDs based on link state and activity
     (i.e. packets coming in and out)

   - Support configuring rate selection pins of SFP modules

   - Factor Clause 73 auto-negotiation code out of the drivers, provide
     common helper routines

   - Add more fool-proof helpers for managing lifetime of MDIO devices
     associated with the PCS layer

   - Allow drivers to report advanced statistics related to Time Aware
     scheduler offload (taprio)

   - Allow opting out of VF statistics in link dump, to allow more VFs
     to fit into the message

   - Split devlink instance and devlink port operations

  New hardware / drivers:

   - Ethernet:
      - Synopsys EMAC4 IP support (stmmac)
      - Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches
      - Marvell 88E6250 7 port switches
      - Microchip LAN8650/1 Rev.B0 PHYs
      - MediaTek MT7981/MT7988 built-in 1GE PHY driver

   - WiFi:
      - Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps
      - Realtek RTL8723DS (SDIO variant)
      - Realtek RTL8851BE

   - CAN:
      - Fintek F81604

  Drivers:

   - Ethernet NICs:
      - Intel (100G, ice):
         - support dynamic interrupt allocation
         - use meta data match instead of VF MAC addr on slow-path
      - nVidia/Mellanox:
         - extend link aggregation to handle 4, rather than just 2 ports
         - spawn sub-functions without any features by default
      - OcteonTX2:
         - support HTB (Tx scheduling/QoS) offload
         - make RSS hash generation configurable
         - support selecting Rx queue using TC filters
      - Wangxun (ngbe/txgbe):
         - add basic Tx/Rx packet offloads
         - add phylink support (SFP/PCS control)
      - Freescale/NXP (enetc):
         - report TAPRIO packet statistics
      - Solarflare/AMD:
         - support matching on IP ToS and UDP source port of outer
           header
         - VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6
         - add devlink dev info support for EF10

   - Virtual NICs:
      - Microsoft vNIC:
         - size the Rx indirection table based on requested
           configuration
         - support VLAN tagging
      - Amazon vNIC:
         - try to reuse Rx buffers if not fully consumed, useful for ARM
           servers running with 16kB pages
      - Google vNIC:
         - support TCP segmentation of >64kB frames

   - Ethernet embedded switches:
      - Marvell (mv88e6xxx):
         - enable USXGMII (88E6191X)
      - Microchip:
         - lan966x: add support for Egress Stage 0 ACL engine
         - lan966x: support mapping packet priority to internal switch
           priority (based on PCP or DSCP)

   - Ethernet PHYs:
      - Broadcom PHYs:
         - support for Wake-on-LAN for BCM54210E/B50212E
         - report LPI counter
      - Microsemi PHYs: support RGMII delay configuration (VSC85xx)
      - Micrel PHYs: receive timestamp in the frame (LAN8841)
      - Realtek PHYs: support optional external PHY clock
      - Altera TSE PCS: merge the driver into Lynx PCS which it is a
        variant of

   - CAN: Kvaser PCIEcan:
      - support packet timestamping

   - WiFi:
      - Intel (iwlwifi):
         - major update for new firmware and Multi-Link Operation (MLO)
         - configuration rework to drop test devices and split the
           different families
         - support for segmented PNVM images and power tables
         - new vendor entries for PPAG (platform antenna gain) feature
      - Qualcomm 802.11ax (ath11k):
         - Multiple Basic Service Set Identifier (MBSSID) and Enhanced
           MBSSID Advertisement (EMA) support in AP mode
         - support factory test mode
      - RealTek (rtw89):
         - add RSSI based antenna diversity
         - support U-NII-4 channels on 5 GHz band
      - RealTek (rtl8xxxu):
         - AP mode support for 8188f
         - support USB RX aggregation for the newer chips"

* tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1602 commits)
  net: scm: introduce and use scm_recv_unix helper
  af_unix: Skip SCM_PIDFD if scm->pid is NULL.
  net: lan743x: Simplify comparison
  netlink: Add __sock_i_ino() for __netlink_diag_dump().
  net: dsa: avoid suspicious RCU usage for synced VLAN-aware MAC addresses
  Revert "af_unix: Call scm_recv() only after scm_set_cred()."
  phylink: ReST-ify the phylink_pcs_neg_mode() kdoc
  libceph: Partially revert changes to support MSG_SPLICE_PAGES
  net: phy: mscc: fix packet loss due to RGMII delays
  net: mana: use vmalloc_array and vcalloc
  net: enetc: use vmalloc_array and vcalloc
  ionic: use vmalloc_array and vcalloc
  pds_core: use vmalloc_array and vcalloc
  gve: use vmalloc_array and vcalloc
  octeon_ep: use vmalloc_array and vcalloc
  net: usb: qmi_wwan: add u-blox 0x1312 composition
  perf trace: fix MSG_SPLICE_PAGES build error
  ipvlan: Fix return value of ipvlan_queue_xmit()
  netfilter: nf_tables: fix underflow in chain reference counter
  netfilter: nf_tables: unbind non-anonymous set if rule construction fails
  ...
2023-06-28 16:43:10 -07:00
Linus Torvalds
6e17c6de3d - Yosry Ahmed brought back some cgroup v1 stats in OOM logs.
- Yosry has also eliminated cgroup's atomic rstat flushing.
 
 - Nhat Pham adds the new cachestat() syscall.  It provides userspace
   with the ability to query pagecache status - a similar concept to
   mincore() but more powerful and with improved usability.
 
 - Mel Gorman provides more optimizations for compaction, reducing the
   prevalence of page rescanning.
 
 - Lorenzo Stoakes has done some maintanance work on the get_user_pages()
   interface.
 
 - Liam Howlett continues with cleanups and maintenance work to the maple
   tree code.  Peng Zhang also does some work on maple tree.
 
 - Johannes Weiner has done some cleanup work on the compaction code.
 
 - David Hildenbrand has contributed additional selftests for
   get_user_pages().
 
 - Thomas Gleixner has contributed some maintenance and optimization work
   for the vmalloc code.
 
 - Baolin Wang has provided some compaction cleanups,
 
 - SeongJae Park continues maintenance work on the DAMON code.
 
 - Huang Ying has done some maintenance on the swap code's usage of
   device refcounting.
 
 - Christoph Hellwig has some cleanups for the filemap/directio code.
 
 - Ryan Roberts provides two patch series which yield some
   rationalization of the kernel's access to pte entries - use the provided
   APIs rather than open-coding accesses.
 
 - Lorenzo Stoakes has some fixes to the interaction between pagecache
   and directio access to file mappings.
 
 - John Hubbard has a series of fixes to the MM selftesting code.
 
 - ZhangPeng continues the folio conversion campaign.
 
 - Hugh Dickins has been working on the pagetable handling code, mainly
   with a view to reducing the load on the mmap_lock.
 
 - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from
   128 to 8.
 
 - Domenico Cerasuolo has improved the zswap reclaim mechanism by
   reorganizing the LRU management.
 
 - Matthew Wilcox provides some fixups to make gfs2 work better with the
   buffer_head code.
 
 - Vishal Moola also has done some folio conversion work.
 
 - Matthew Wilcox has removed the remnants of the pagevec code - their
   functionality is migrated over to struct folio_batch.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJejewAKCRDdBJ7gKXxA
 joggAPwKMfT9lvDBEUnJagY7dbDPky1cSYZdJKxxM2cApGa42gEA6Cl8HRAWqSOh
 J0qXCzqaaN8+BuEyLGDVPaXur9KirwY=
 =B7yQ
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull mm updates from Andrew Morton:

 - Yosry Ahmed brought back some cgroup v1 stats in OOM logs

 - Yosry has also eliminated cgroup's atomic rstat flushing

 - Nhat Pham adds the new cachestat() syscall. It provides userspace
   with the ability to query pagecache status - a similar concept to
   mincore() but more powerful and with improved usability

 - Mel Gorman provides more optimizations for compaction, reducing the
   prevalence of page rescanning

 - Lorenzo Stoakes has done some maintanance work on the
   get_user_pages() interface

 - Liam Howlett continues with cleanups and maintenance work to the
   maple tree code. Peng Zhang also does some work on maple tree

 - Johannes Weiner has done some cleanup work on the compaction code

 - David Hildenbrand has contributed additional selftests for
   get_user_pages()

 - Thomas Gleixner has contributed some maintenance and optimization
   work for the vmalloc code

 - Baolin Wang has provided some compaction cleanups,

 - SeongJae Park continues maintenance work on the DAMON code

 - Huang Ying has done some maintenance on the swap code's usage of
   device refcounting

 - Christoph Hellwig has some cleanups for the filemap/directio code

 - Ryan Roberts provides two patch series which yield some
   rationalization of the kernel's access to pte entries - use the
   provided APIs rather than open-coding accesses

 - Lorenzo Stoakes has some fixes to the interaction between pagecache
   and directio access to file mappings

 - John Hubbard has a series of fixes to the MM selftesting code

 - ZhangPeng continues the folio conversion campaign

 - Hugh Dickins has been working on the pagetable handling code, mainly
   with a view to reducing the load on the mmap_lock

 - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment
   from 128 to 8

 - Domenico Cerasuolo has improved the zswap reclaim mechanism by
   reorganizing the LRU management

 - Matthew Wilcox provides some fixups to make gfs2 work better with the
   buffer_head code

 - Vishal Moola also has done some folio conversion work

 - Matthew Wilcox has removed the remnants of the pagevec code - their
   functionality is migrated over to struct folio_batch

* tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits)
  mm/hugetlb: remove hugetlb_set_page_subpool()
  mm: nommu: correct the range of mmap_sem_read_lock in task_mem()
  hugetlb: revert use of page_cache_next_miss()
  Revert "page cache: fix page_cache_next/prev_miss off by one"
  mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
  mm: memcg: rename and document global_reclaim()
  mm: kill [add|del]_page_to_lru_list()
  mm: compaction: convert to use a folio in isolate_migratepages_block()
  mm: zswap: fix double invalidate with exclusive loads
  mm: remove unnecessary pagevec includes
  mm: remove references to pagevec
  mm: rename invalidate_mapping_pagevec to mapping_try_invalidate
  mm: remove struct pagevec
  net: convert sunrpc from pagevec to folio_batch
  i915: convert i915_gpu_error to use a folio_batch
  pagevec: rename fbatch_count()
  mm: remove check_move_unevictable_pages()
  drm: convert drm_gem_put_pages() to use a folio_batch
  i915: convert shmem_sg_free_table() to use a folio_batch
  scatterlist: add sg_set_folio()
  ...
2023-06-28 10:28:11 -07:00
Linus Torvalds
a0433f8cae for-6.5/block-2023-06-23
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSV8dwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpilGD/9Yys1oxIXJpRf00fzrylAlBthRxMjFQVWw
 zAut106hAQiBHvU8IkmGA3MvEFVHxtzwYhHI7IR8K3aZBIqscweCqmVI9JyogJw9
 U9Twnzel47VmuKdM94FeoN+hbj1fP8EWTjzmy67/zEEfFCdmHvNlMi3lSrGYIpFy
 39LxTB99Y4UarM5PtWbes37GYYljzMSWKuo4AfBkvq1eQa+sZ0Vq2xAABKq3UM7f
 apqhgHtkJooRePDP0eQp+kAyyVMgW2jIK+oIdJDxNF3CKTu2w40RzaYz6fp+jVSU
 H4R/xS59GW4/xql+VBJDh/qJg9K62DPPYjlW8BmSR8+IjvfFpsyH3/MacE50CD3P
 20fs/Mnj49H79fDrQEHJI53cOOb2EmUitbwLbvOcColNTPpt8loBtdQxjF2RMU8R
 Nyort9DJPFclYCxky1LYg1CNEC2Ln4Zy/jD47wPvqRmOQphOoVlV/hPnOEqvjaZC
 49Vn70W2DeE9cXvYI7ha+XIg6/oj+Gs3iusEbV08Ci7EAtXgI+ZUUsQ97K8UNiUh
 h2lqSJtuI7lBpYP9sf+BeCch5UCC+xGYyTdoM5f58lehWBBPtbs0g7S9RyRyOYxe
 n+yxEUo3dAGzJ/xsKAjinbZfeWIpr0b1TkAh4w3Cq/BKzRr9Bp8lBAxYuancbQ+Y
 1ADPteUOTA==
 =zP4Y
 -----END PGP SIGNATURE-----

Merge tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe pull request via Keith:
      - Various cleanups all around (Irvin, Chaitanya, Christophe)
      - Better struct packing (Christophe JAILLET)
      - Reduce controller error logs for optional commands (Keith)
      - Support for >=64KiB block sizes (Daniel Gomez)
      - Fabrics fixes and code organization (Max, Chaitanya, Daniel
        Wagner)

 - bcache updates via Coly:
      - Fix a race at init time (Mingzhe Zou)
      - Misc fixes and cleanups (Andrea, Thomas, Zheng, Ye)

 - use page pinning in the block layer for dio (David)

 - convert old block dio code to page pinning (David, Christoph)

 - cleanups for pktcdvd (Andy)

 - cleanups for rnbd (Guoqing)

 - use the unchecked __bio_add_page() for the initial single page
   additions (Johannes)

 - fix overflows in the Amiga partition handling code (Michael)

 - improve mq-deadline zoned device support (Bart)

 - keep passthrough requests out of the IO schedulers (Christoph, Ming)

 - improve support for flush requests, making them less special to deal
   with (Christoph)

 - add bdev holder ops and shutdown methods (Christoph)

 - fix the name_to_dev_t() situation and use cases (Christoph)

 - decouple the block open flags from fmode_t (Christoph)

 - ublk updates and cleanups, including adding user copy support (Ming)

 - BFQ sanity checking (Bart)

 - convert brd from radix to xarray (Pankaj)

 - constify various structures (Thomas, Ivan)

 - more fine grained persistent reservation ioctl capability checks
   (Jingbo)

 - misc fixes and cleanups (Arnd, Azeem, Demi, Ed, Hengqi, Hou, Jan,
   Jordy, Li, Min, Yu, Zhong, Waiman)

* tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux: (266 commits)
  scsi/sg: don't grab scsi host module reference
  ext4: Fix warning in blkdev_put()
  block: don't return -EINVAL for not found names in devt_from_devname
  cdrom: Fix spectre-v1 gadget
  block: Improve kernel-doc headers
  blk-mq: don't insert passthrough request into sw queue
  bsg: make bsg_class a static const structure
  ublk: make ublk_chr_class a static const structure
  aoe: make aoe_class a static const structure
  block/rnbd: make all 'class' structures const
  block: fix the exclusive open mask in disk_scan_partitions
  block: add overflow checks for Amiga partition support
  block: change all __u32 annotations to __be32 in affs_hardblocks.h
  block: fix signed int overflow in Amiga partition support
  block: add capacity validation in bdev_add_partition()
  block: fine-granular CAP_SYS_ADMIN for Persistent Reservation
  block: disallow Persistent Reservation on partitions
  reiserfs: fix blkdev_put() warning from release_journal_dev()
  block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions()
  block: document the holder argument to blkdev_get_by_path
  ...
2023-06-26 12:47:20 -07:00
David Howells
eeac7405c7 drbd: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage()
Use sendmsg() conditionally with MSG_SPLICE_PAGES in _drbd_send_page()
rather than calling sendpage() or _drbd_no_send_page().

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Philipp Reisner <philipp.reisner@linbit.com>
cc: Lars Ellenberg <lars.ellenberg@linbit.com>
cc: "Christoph Böhmwalder" <christoph.boehmwalder@linbit.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: drbd-dev@lists.linbit.com
Link: https://lore.kernel.org/r/20230623225513.2732256-11-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-24 15:50:13 -07:00
Ivan Orlov
2eefd399d2 ublk: make ublk_chr_class a static const structure
Now that the driver core allows for struct class to be in read-only
memory, move the ublk_chr_class structure to be declared at build time
placing it into read-only memory, instead of having to be dynamically
allocated at boot time.

Cc: Ming Lei <ming.lei@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20230620180129.645646-7-gregkh@linuxfoundation.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-21 07:46:13 -06:00
Ivan Orlov
65d7a37d4e aoe: make aoe_class a static const structure
Now that the driver core allows for struct class to be in read-only
memory, move the aoe_class structure to be declared at build time
placing it into read-only memory, instead of having to be dynamically
allocated at boot time.

Cc: Justin Sanders <justin@coraid.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20230620180129.645646-6-gregkh@linuxfoundation.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-21 07:45:19 -06:00
Ivan Orlov
137380c0ec block/rnbd: make all 'class' structures const
Now that the driver core allows for struct class to be in read-only
memory, making all 'class' structures to be declared at build time
placing them into read-only memory, instead of having to be dynamically
allocated at load time.

Cc: "Md. Haris Iqbal" <haris.iqbal@ionos.com>
Cc: Jack Wang <jinpu.wang@ionos.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20230620180129.645646-5-gregkh@linuxfoundation.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-21 07:45:19 -06:00
Michael S. Tsirkin
afd384f0db Revert "virtio-blk: support completion batching for the IRQ path"
This reverts commit 07b679f70d.

This change appears to have broken things...
We now see applications hanging during disk accesses.
e.g.
multi-port virtio-blk device running in h/w (FPGA)
Host running a simple 'fio' test.
[global]
thread=1
direct=1
ioengine=libaio
norandommap=1
group_reporting=1
bs=4K
rw=read
iodepth=128
runtime=1
numjobs=4
time_based
[job0]
filename=/dev/vda
[job1]
filename=/dev/vdb
[job2]
filename=/dev/vdc
...
[job15]
filename=/dev/vdp

i.e. 16 disks; 4 queues per disk; simple burst of 4KB reads
This is repeatedly run in a loop.

After a few, normally <10 seconds, fio hangs.
With 64 queues (16 disks), failure occurs within a few seconds; with 8 queues (2 disks) it may take ~hour before hanging.
Last message:
fio-3.19
Starting 8 threads
Jobs: 1 (f=1): [_(7),R(1)][68.3%][eta 03h:11m:06s]
I think this means at the end of the run 1 queue was left incomplete.

'diskstats' (run while fio is hung) shows no outstanding transactions.
e.g.
$ cat /proc/diskstats
...
252       0 vda 1843140071 0 14745120568 712568645 0 0 0 0 0 3117947 712568645 0 0 0 0 0 0
252      16 vdb 1816291511 0 14530332088 704905623 0 0 0 0 0 3117711 704905623 0 0 0 0 0 0
...

Other stats (in the h/w, and added to the virtio-blk driver ([a]virtio_queue_rq(), [b]virtblk_handle_req(), [c]virtblk_request_done()) all agree, and show every request had a completion, and that virtblk_request_done() never gets called.
e.g.
PF= 0                         vq=0           1           2           3
[a]request_count     -   839416590   813148916   105586179    84988123
[b]completion1_count -   839416590   813148916   105586179    84988123
[c]completion2_count -           0           0           0           0

PF= 1                         vq=0           1           2           3
[a]request_count     -   823335887   812516140   104582672    75856549
[b]completion1_count -   823335887   812516140   104582672    75856549
[c]completion2_count -           0           0           0           0

i.e. the issue is after the virtio-blk driver.

This change was introduced in kernel 6.3.0.
I am seeing this using 6.3.3.
If I run with an earlier kernel (5.15), it does not occur.
If I make a simple patch to the 6.3.3 virtio-blk driver, to skip the blk_mq_add_to_batch()call, it does not fail.
e.g.
kernel 5.15 - this is OK
virtio_blk.c,virtblk_done() [irq handler]
                 if (likely(!blk_should_fake_timeout(req->q))) {
                          blk_mq_complete_request(req);
                 }

kernel 6.3.3 - this fails
virtio_blk.c,virtblk_handle_req() [irq handler]
                 if (likely(!blk_should_fake_timeout(req->q))) {
                          if (!blk_mq_complete_request_remote(req)) {
                                  if (!blk_mq_add_to_batch(req, iob, virtblk_vbr_status(vbr), virtblk_complete_batch)) {
                                           virtblk_request_done(req);    //this never gets called... so blk_mq_add_to_batch() must always succeed
                                   }
                          }
                 }

If I do, kernel 6.3.3 - this is OK
virtio_blk.c,virtblk_handle_req() [irq handler]
                 if (likely(!blk_should_fake_timeout(req->q))) {
                          if (!blk_mq_complete_request_remote(req)) {
                                   virtblk_request_done(req); //force this here...
                                  if (!blk_mq_add_to_batch(req, iob, virtblk_vbr_status(vbr), virtblk_complete_batch)) {
                                           virtblk_request_done(req);    //this never gets called... so blk_mq_add_to_batch() must always succeed
                                   }
                          }
                 }

Perhaps you might like to fix/test/revert this change...
Martin

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202306090826.C1fZmdMe-lkp@intel.com/
Cc: Suwan Kim <suwan.kim027@gmail.com>
Tested-by: edliaw@google.com
Reported-by: "Roberts, Martin" <martin.roberts@intel.com>
Message-Id: <336455b4f630f329380a8f53ee8cad3868764d5c.1686295549.git.mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-06-21 04:14:28 -04:00
Christoph Hellwig
9a7933f3ac swim: fix a missing FMODE_ -> BLK_OPEN_ conversion in floppy_open
Fix a missing conversion to the new BLK_OPEN constant in swim.

Fixes: 05bdb99653 ("block: replace fmode_t with a block-specific type for block open flags")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230620043051.707196-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-20 07:16:04 -06:00
Sergey Senozhatsky
cb0551adf9 zram: further limit recompression threshold
Recompression threshold should be below huge-size-class watermark.  Any
object larger than huge-size-class is a "huge object" and occupies a
whole physical page on the zsmalloc side, in other words it's
incompressible, as far as zsmalloc is concerned.

Link: https://lkml.kernel.org/r/20230614141338.3480029-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Suggested-by: Brian Geffon <bgeffon@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:19:33 -07:00
Pankaj Raghav
6dd4423f3f brd: use cond_resched instead of cond_resched_rcu
The body of the loop is run without RCU lock held. Use the regular
cond_resched() instead of cond_resched_rcu().

Fixes: 786bb02458 ("brd: use XArray instead of radix-tree to index backing pages")
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20230614133538.1279369-1-p.raghav@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-14 11:13:07 -06:00
Christoph Hellwig
3dbd53c7be swim3: fix the floppy_locked_ioctl prototype
Add back the accidentally dropped mode parameter.

Fixes: b60f7635788a ("swim3: fix the floppy_locked_ioctl prototype")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230613154309.327557-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-13 09:45:40 -06:00
Christoph Hellwig
05bdb99653 block: replace fmode_t with a block-specific type for block open flags
The only overlap between the block open flags mapped into the fmode_t and
other uses of fmode_t are FMODE_READ and FMODE_WRITE.  Define a new
blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and
->ioctl and stop abusing fmode_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jack Wang <jinpu.wang@ionos.com>		[rnbd]
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-28-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:05 -06:00
Christoph Hellwig
99b0778081 rnbd-srv: replace sess->open_flags with a "bool readonly"
Stop passing the fmode_t around and just use a simple bool to track if
an export is read-only.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20230608110258.189493-24-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:04 -06:00
Christoph Hellwig
2736e8eeb0 block: use the holder as indication for exclusive opens
The current interface for exclusive opens is rather confusing as it
requires both the FMODE_EXCL flag and a holder.  Remove the need to pass
FMODE_EXCL and just key off the exclusive open off a non-NULL holder.

For blkdev_put this requires adding the holder argument, which provides
better debug checking that only the holder actually releases the hold,
but at the same time allows removing the now superfluous mode argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: David Sterba <dsterba@suse.com>		[btrfs]
Acked-by: Jack Wang <jinpu.wang@ionos.com>		[rnbd]
Link: https://lore.kernel.org/r/20230608110258.189493-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:04 -06:00
Christoph Hellwig
5ee607675d rnbd-srv: don't pass a holder for non-exclusive blkdev_get_by_path
Passing a holder to blkdev_get_by_path when FMODE_EXCL isn't set doesn't
make sense, so pass NULL instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20230608110258.189493-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:04 -06:00
Christoph Hellwig
ae220766d8 block: remove the unused mode argument to ->release
The mode argument to the ->release block_device_operation is never used,
so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Jack Wang <jinpu.wang@ionos.com>			[rnbd]
Link: https://lore.kernel.org/r/20230608110258.189493-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:04 -06:00
Christoph Hellwig
d32e2bf837 block: pass a gendisk to ->open
->open is only called on the whole device.  Make that explicit by
passing a gendisk instead of the block_device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Jack Wang <jinpu.wang@ionos.com>		[rnbd]
Link: https://lore.kernel.org/r/20230608110258.189493-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:04 -06:00
Christoph Hellwig
444aa2c58c block: pass a gendisk on bdev_check_media_change
bdev_check_media_change should only ever be called for the whole device.
Pass a gendisk to make that explicit and rename the function to
disk_check_media_change.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:03 -06:00
Guoqing Jiang
fece685cc7 block/rnbd-srv: make process_msg_sess_info returns void
Change the return type to void given it always returns 0.

Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20230524070026.2932-9-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-11 19:48:42 -06:00
Guoqing Jiang
d3fc0b4664 block/rnbd-srv: init err earlier in rnbd_srv_init_module
With this, we can remove several lines of code.

Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20230524070026.2932-8-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-11 19:48:42 -06:00