block-6.2-2022-12-29

-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmOt35IQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpq4QD/9nGWlCdLRLyyiysUWhLwTmsZt7PSebG3KD
 CmCyEt+o8n2PdxBe7Xq8glppvuQTwJwOYynMXcAWd0IBxYUnAkCDF4PTbmdiIiVY
 fzci1UydIXw/HOVft/2IbIC0+Apo+UJ9WVqqhwm7ya0lAkLQYuT7iWmn1pxFdbcI
 hi9ZbaghxtZXSQP4ZtKG+a8tQ99HTsf76xqCM6DdMCVOUH6/V1f5g67iSkYLCL3Q
 V9bAq7U2VEXFdRC9m5yPG7KGUBRllE4etBvVAIIcAQBAgEktyvgvas5luwu5j+W0
 R2z8KXp2X4BWGW+R45hpt2cdyfcJy24+6QnAGNQAs/3Muq1IfEMwmJ5tyR/y8HiS
 0RvIv/BOwDMDOaM9YuW0beyHQMu+bwhtf+C453r1gsKmnL912+ElMzuqUpditkjr
 d4nL5aUTk5iM38jzJpQylZSY+20wyUnOmxCxETpeSMaRrYY75PLOVCJLNncJuZtQ
 GFtqUzMPVURLMGnxyJZLiG+qbGVXh9f7B7OStKDPhBJvqoZ2cQpwTzywmYxQOv+0
 OO1DdmMDtUWNpuBN2U4HOzLElmB034OM3Fcia529IhLoXK/x57n9mXW0D0HeOd84
 /EYSsmsT+spv7psKBNjhXkZwgVpVgsYOu8eUjRKYUmrYLEbTk+fGUtia3rBd4wjl
 uNMuRhRtUA==
 =cqhz
 -----END PGP SIGNATURE-----

Merge tag 'block-6.2-2022-12-29' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:
 "Mostly just NVMe, but also a single fixup for BFQ for a regression
  that happened during the merge window. In detail:

   - NVMe pull requests via Christoph:
      - Fix doorbell buffer value endianness (Klaus Jensen)
      - Fix Linux vs NVMe page size mismatch (Keith Busch)
      - Fix a potential memory access beyond the allocation limit
        (Keith Busch)
      - Fix a multipath vs blktrace NULL pointer dereference (Yanjun
        Zhang)
      - Fix various problems in handling the Command Supported and
        Effects log (Christoph Hellwig)
      - Don't allow unprivileged passthrough of commands that don't
        transfer data but modify logical block content (Christoph
        Hellwig)
      - Add a features and quirks policy document (Christoph Hellwig)
      - Fix some really nasty code that was correct but made smatch
        complain (Sagi Grimberg)

   - Use-after-free regression in BFQ from this merge window (Yu)"

* tag 'block-6.2-2022-12-29' of git://git.kernel.dk/linux:
  nvme-auth: fix smatch warning complaints
  nvme: consult the CSE log page for unprivileged passthrough
  nvme: also return I/O command effects from nvme_command_effects
  nvmet: don't defer passthrough commands with trivial effects to the workqueue
  nvmet: set the LBCC bit for commands that modify data
  nvmet: use NVME_CMD_EFFECTS_CSUPP instead of open coding it
  nvme: fix the NVME_CMD_EFFECTS_CSE_MASK definition
  docs, nvme: add a feature and quirk policy document
  nvme-pci: update sqsize when adjusting the queue depth
  nvme: fix setting the queue depth in nvme_alloc_io_tag_set
  block, bfq: fix uaf for bfqq in bfq_exit_icq_bfqq
  nvme: fix multipath crash caused by flush request when blktrace is enabled
  nvme-pci: fix page size checks
  nvme-pci: fix mempool alloc size
  nvme-pci: fix doorbell buffer value endianness
Linus Torvalds 2022-12-29 16:57:29 -08:00
commit bff687b3da
12 changed files with 185 additions and 58 deletions


@ -104,3 +104,4 @@ to do something different in the near future.
../riscv/patch-acceptance
../driver-api/media/maintainer-entry-profile
../driver-api/vfio-pci-device-specific-driver-acceptance
../nvme/feature-and-quirk-policy


@ -0,0 +1,77 @@
.. SPDX-License-Identifier: GPL-2.0
===================================
Linux NVMe feature and quirk policy
===================================
This file explains the policy used to decide what is supported by the
Linux NVMe driver and what is not.
Introduction
============
NVM Express is an open collection of standards and information.
The Linux NVMe host driver in drivers/nvme/host/ supports devices
implementing the NVM Express (NVMe) family of specifications, which
currently consists of a number of documents:
- the NVMe Base specification
- various Command Set specifications (e.g. NVM Command Set)
- various Transport specifications (e.g. PCIe, Fibre Channel, RDMA, TCP)
- the NVMe Management Interface specification
See https://nvmexpress.org/developers/ for the NVMe specifications.
Supported features
==================
NVMe is a large suite of specifications, and contains features that are only
useful or suitable for specific use-cases. It is important to note that Linux
does not aim to implement every feature in the specification. Every additional
feature implemented introduces more code, more maintenance and potentially more
bugs. Hence there is an inherent tradeoff between functionality and
maintainability of the NVMe host driver.
Any feature implemented in the Linux NVMe host driver must meet the
following requirements:
1. The feature is specified in a release version of an official NVMe
specification, or in a ratified Technical Proposal (TP) that is
available on the NVMe website. If the feature is not directly related to
the on-wire protocol, it must not contradict any of the NVMe specifications.
2. Does not conflict with the Linux architecture, nor the design of the
NVMe host driver.
3. Has a clear, indisputable value-proposition and a wide consensus across
the community.
Vendor specific extensions are generally not supported in the NVMe host
driver.
It is strongly recommended to work with the Linux NVMe and block layer
maintainers and get feedback on specification changes that are intended
to be used by the Linux NVMe host driver in order to avoid conflict at a
later stage.
Quirks
======
Sometimes implementations of open standards fail to correctly implement parts
of the standards. Linux uses identifier-based quirks to work around such
implementation bugs. The intent of quirks is to deal with widely available
hardware, usually consumer, which Linux users can't use without these quirks.
Typically these implementations are not or only superficially tested with Linux
by the hardware manufacturer.
The Linux NVMe maintainers decide ad hoc whether to quirk implementations
based on the impact of the problem on Linux users and how it affects
maintainability of the driver. In general, quirks are a last resort, used
only if no firmware updates or other workarounds are available from the vendor.
Quirks will not be added to the Linux kernel for hardware that isn't available
on the mass market. Hardware that fails qualification for enterprise Linux
distributions, ChromeOS, Android or other consumers of the Linux kernel
should be fixed before it is shipped instead of relying on Linux quirks.
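
The "identifier-based quirks" described above can be pictured as a lookup table keyed on the device identity. The following is a minimal user-space sketch under stated assumptions, not the driver's actual table: the `sketch_` names, the flag names and the vendor/device IDs are all hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical quirk flags; the real driver defines its own set. */
enum sketch_quirk {
	SKETCH_QUIRK_NONE          = 0,
	SKETCH_QUIRK_NO_DEEP_SLEEP = 1 << 0,
	SKETCH_QUIRK_SHARED_TAGS   = 1 << 1,
};

struct sketch_quirk_entry {
	uint16_t vendor;   /* made-up vendor ID */
	uint16_t device;   /* made-up device ID */
	uint32_t quirks;   /* OR of sketch_quirk flags */
};

/* Made-up identifiers, for illustration only. */
static const struct sketch_quirk_entry sketch_quirk_table[] = {
	{ 0x1234, 0x0001, SKETCH_QUIRK_NO_DEEP_SLEEP },
	{ 0x1234, 0x0002, SKETCH_QUIRK_NO_DEEP_SLEEP | SKETCH_QUIRK_SHARED_TAGS },
};

/* Return the quirk flags for a device, or 0 if it is not listed. */
static uint32_t sketch_lookup_quirks(const struct sketch_quirk_entry *table,
				     size_t n, uint16_t vendor, uint16_t device)
{
	size_t i;

	assert(table != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (table[i].vendor == vendor && table[i].device == device)
			return table[i].quirks;
	}
	return SKETCH_QUIRK_NONE;
}
```

A device that is not listed gets no special handling, which matches the policy's stance that quirks are the exception rather than the default.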


@ -14916,6 +14916,7 @@ L: linux-nvme@lists.infradead.org
S: Supported
W: http://git.infradead.org/nvme.git
T: git://git.infradead.org/nvme.git
F: Documentation/nvme/
F: drivers/nvme/host/
F: drivers/nvme/common/
F: include/linux/nvme*


@ -5317,8 +5317,8 @@ static void bfq_exit_icq_bfqq(struct bfq_io_cq *bic, bool is_sync)
unsigned long flags;
spin_lock_irqsave(&bfqd->lock, flags);
bfq_exit_bfqq(bfqd, bfqq);
bic_set_bfqq(bic, NULL, is_sync);
bfq_exit_bfqq(bfqd, bfqq);
spin_unlock_irqrestore(&bfqd->lock, flags);
}
}
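
The shortlog describes this hunk as a use-after-free fix. A plausible reading is that the queue is now detached from the `bfq_io_cq` before it is released, because the detach step still touches the old queue. The sketch below models that ordering in user space. All `sketch_` types and functions are hypothetical, and the `released` flag stands in for an actual free so the model stays safe to run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_holder;

struct sketch_queue {
	int refs;
	bool released;               /* stands in for freeing the object */
	struct sketch_holder *owner; /* back-pointer to the holder */
};

struct sketch_holder {
	struct sketch_queue *queue;
};

/* Swap the holder's queue; this touches the OLD queue to clear its owner. */
static void sketch_set_queue(struct sketch_holder *h, struct sketch_queue *q)
{
	struct sketch_queue *old = h->queue;

	if (old) {
		assert(!old->released); /* would be a use-after-free */
		old->owner = NULL;
	}
	h->queue = q;
	if (q)
		q->owner = h;
}

/* Drop one reference; the queue is "released" when the count hits zero. */
static void sketch_put_queue(struct sketch_queue *q)
{
	assert(q->refs > 0);
	if (--q->refs == 0)
		q->released = true;
}

/* Safe teardown order: detach first, release second. */
static void sketch_exit_queue(struct sketch_holder *h)
{
	struct sketch_queue *q = h->queue;

	if (!q)
		return;
	sketch_set_queue(h, NULL);
	sketch_put_queue(q);
}
```

Calling `sketch_put_queue()` before `sketch_set_queue(h, NULL)` would trip the assertion in this model, which is the kind of ordering problem the fix is meant to avoid.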


@ -953,7 +953,7 @@ int nvme_auth_init_ctrl(struct nvme_ctrl *ctrl)
goto err_free_dhchap_secret;
if (!ctrl->opts->dhchap_secret && !ctrl->opts->dhchap_ctrl_secret)
return ret;
return 0;
ctrl->dhchap_ctxs = kvcalloc(ctrl_max_dhchaps(ctrl),
sizeof(*chap), GFP_KERNEL);


@ -1074,6 +1074,18 @@ static u32 nvme_known_admin_effects(u8 opcode)
return 0;
}
static u32 nvme_known_nvm_effects(u8 opcode)
{
switch (opcode) {
case nvme_cmd_write:
case nvme_cmd_write_zeroes:
case nvme_cmd_write_uncor:
return NVME_CMD_EFFECTS_LBCC;
default:
return 0;
}
}
u32 nvme_command_effects(struct nvme_ctrl *ctrl, struct nvme_ns *ns, u8 opcode)
{
u32 effects = 0;
@ -1081,16 +1093,24 @@ u32 nvme_command_effects(struct nvme_ctrl *ctrl, struct nvme_ns *ns, u8 opcode)
if (ns) {
if (ns->head->effects)
effects = le32_to_cpu(ns->head->effects->iocs[opcode]);
if (ns->head->ids.csi == NVME_CAP_CSS_NVM)
effects |= nvme_known_nvm_effects(opcode);
if (effects & ~(NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC))
dev_warn_once(ctrl->device,
"IO command:%02x has unhandled effects:%08x\n",
"IO command:%02x has unusual effects:%08x\n",
opcode, effects);
return 0;
}
if (ctrl->effects)
effects = le32_to_cpu(ctrl->effects->acs[opcode]);
effects |= nvme_known_admin_effects(opcode);
/*
* NVME_CMD_EFFECTS_CSE_MASK causes a freeze of all I/O queues,
* which would deadlock when done on an I/O command. Note that
* we already warn about an unusual effect above.
*/
effects &= ~NVME_CMD_EFFECTS_CSE_MASK;
} else {
if (ctrl->effects)
effects = le32_to_cpu(ctrl->effects->acs[opcode]);
effects |= nvme_known_admin_effects(opcode);
}
return effects;
}
@ -4926,7 +4946,7 @@ int nvme_alloc_io_tag_set(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set,
memset(set, 0, sizeof(*set));
set->ops = ops;
set->queue_depth = ctrl->sqsize + 1;
set->queue_depth = min_t(unsigned, ctrl->sqsize, BLK_MQ_MAX_DEPTH - 1);
/*
* Some Apple controllers require tags to be unique across admin and
* the (only) I/O queue, so reserve the first 32 tags of the I/O queue.


@ -11,6 +11,8 @@
static bool nvme_cmd_allowed(struct nvme_ns *ns, struct nvme_command *c,
fmode_t mode)
{
u32 effects;
if (capable(CAP_SYS_ADMIN))
return true;
@ -43,11 +45,29 @@ static bool nvme_cmd_allowed(struct nvme_ns *ns, struct nvme_command *c,
}
/*
* Only allow I/O commands that transfer data to the controller if the
* special file is open for writing, but always allow I/O commands that
* transfer data from the controller.
* Check if the controller provides a Commands Supported and Effects log
* and marks this command as supported. If not, reject unprivileged
* passthrough.
*/
if (nvme_is_write(c))
effects = nvme_command_effects(ns->ctrl, ns, c->common.opcode);
if (!(effects & NVME_CMD_EFFECTS_CSUPP))
return false;
/*
* Don't allow passthrough for commands that have intrusive (or unknown)
* effects.
*/
if (effects & ~(NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC |
NVME_CMD_EFFECTS_UUID_SEL |
NVME_CMD_EFFECTS_SCOPE_MASK))
return false;
/*
* Only allow I/O commands that transfer data to the controller or that
* change the logical block contents if the file descriptor is open for
* writing.
*/
if (nvme_is_write(c) || (effects & NVME_CMD_EFFECTS_LBCC))
return mode & FMODE_WRITE;
return true;
}
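
The effects-based part of the check above can be exercised in isolation. The sketch below is a user-space model under stated assumptions: it covers only the three effects tests shown in the added lines (not the `CAP_SYS_ADMIN` shortcut or the elided checks), the `SKETCH_` names are local stand-ins, and the bit positions are assumed to match the `NVME_CMD_EFFECTS_*` enum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins; bit positions assumed to match NVME_CMD_EFFECTS_*. */
#define SKETCH_CSUPP      (UINT32_C(1) << 0)
#define SKETCH_LBCC       (UINT32_C(1) << 1)
#define SKETCH_CSE_MASK   (UINT32_C(0x7) << 16)   /* bits 18:16 */
#define SKETCH_UUID_SEL   (UINT32_C(1) << 19)
#define SKETCH_SCOPE_MASK (UINT32_C(0xfff) << 20) /* bits 31:20 */

/*
 * Decide whether an unprivileged passthrough command may proceed,
 * given its effects word, whether it transfers data to the controller,
 * and whether the file descriptor was opened for writing.
 */
static bool sketch_cmd_allowed(uint32_t effects, bool is_write,
			       bool opened_for_write)
{
	const uint32_t benign = SKETCH_CSUPP | SKETCH_LBCC |
				SKETCH_UUID_SEL | SKETCH_SCOPE_MASK;

	/* Command submission effects must never count as benign. */
	assert((benign & SKETCH_CSE_MASK) == 0);

	/* Not marked as supported in the log: reject. */
	if (!(effects & SKETCH_CSUPP))
		return false;

	/* Any intrusive or unknown effect bit: reject. */
	if (effects & ~benign)
		return false;

	/* Writes and block-content changes need a writable descriptor. */
	if (is_write || (effects & SKETCH_LBCC))
		return opened_for_write;

	return true;
}
```

For example, a command reported as `SKETCH_CSUPP | SKETCH_LBCC` is refused on a read-only descriptor even when it transfers no data, which is the case the commit message calls out.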


@ -893,7 +893,7 @@ static inline void nvme_trace_bio_complete(struct request *req)
{
struct nvme_ns *ns = req->q->queuedata;
if (req->cmd_flags & REQ_NVME_MPATH)
if ((req->cmd_flags & REQ_NVME_MPATH) && req->bio)
trace_block_bio_complete(ns->head->disk->queue, req->bio);
}


@ -36,7 +36,7 @@
#define SQ_SIZE(q) ((q)->q_depth << (q)->sqes)
#define CQ_SIZE(q) ((q)->q_depth * sizeof(struct nvme_completion))
#define SGES_PER_PAGE (PAGE_SIZE / sizeof(struct nvme_sgl_desc))
#define SGES_PER_PAGE (NVME_CTRL_PAGE_SIZE / sizeof(struct nvme_sgl_desc))
/*
* These can be higher, but we need to ensure that any command doesn't
@ -144,9 +144,9 @@ struct nvme_dev {
mempool_t *iod_mempool;
/* shadow doorbell buffer support: */
u32 *dbbuf_dbs;
__le32 *dbbuf_dbs;
dma_addr_t dbbuf_dbs_dma_addr;
u32 *dbbuf_eis;
__le32 *dbbuf_eis;
dma_addr_t dbbuf_eis_dma_addr;
/* host memory buffer support: */
@ -208,10 +208,10 @@ struct nvme_queue {
#define NVMEQ_SQ_CMB 1
#define NVMEQ_DELETE_ERROR 2
#define NVMEQ_POLLED 3
u32 *dbbuf_sq_db;
u32 *dbbuf_cq_db;
u32 *dbbuf_sq_ei;
u32 *dbbuf_cq_ei;
__le32 *dbbuf_sq_db;
__le32 *dbbuf_cq_db;
__le32 *dbbuf_sq_ei;
__le32 *dbbuf_cq_ei;
struct completion delete_done;
};
@ -343,11 +343,11 @@ static inline int nvme_dbbuf_need_event(u16 event_idx, u16 new_idx, u16 old)
}
/* Update dbbuf and return true if an MMIO is required */
static bool nvme_dbbuf_update_and_check_event(u16 value, u32 *dbbuf_db,
volatile u32 *dbbuf_ei)
static bool nvme_dbbuf_update_and_check_event(u16 value, __le32 *dbbuf_db,
volatile __le32 *dbbuf_ei)
{
if (dbbuf_db) {
u16 old_value;
u16 old_value, event_idx;
/*
* Ensure that the queue is written before updating
@ -355,8 +355,8 @@ static bool nvme_dbbuf_update_and_check_event(u16 value, u32 *dbbuf_db,
*/
wmb();
old_value = *dbbuf_db;
*dbbuf_db = value;
old_value = le32_to_cpu(*dbbuf_db);
*dbbuf_db = cpu_to_le32(value);
/*
* Ensure that the doorbell is updated before reading the event
@ -366,7 +366,8 @@ static bool nvme_dbbuf_update_and_check_event(u16 value, u32 *dbbuf_db,
*/
mb();
if (!nvme_dbbuf_need_event(*dbbuf_ei, value, old_value))
event_idx = le32_to_cpu(*dbbuf_ei);
if (!nvme_dbbuf_need_event(event_idx, value, old_value))
return false;
}
@ -380,9 +381,9 @@ static bool nvme_dbbuf_update_and_check_event(u16 value, u32 *dbbuf_db,
*/
static int nvme_pci_npages_prp(void)
{
unsigned nprps = DIV_ROUND_UP(NVME_MAX_KB_SZ + NVME_CTRL_PAGE_SIZE,
NVME_CTRL_PAGE_SIZE);
return DIV_ROUND_UP(8 * nprps, PAGE_SIZE - 8);
unsigned max_bytes = (NVME_MAX_KB_SZ * 1024) + NVME_CTRL_PAGE_SIZE;
unsigned nprps = DIV_ROUND_UP(max_bytes, NVME_CTRL_PAGE_SIZE);
return DIV_ROUND_UP(8 * nprps, NVME_CTRL_PAGE_SIZE - 8);
}
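
The arithmetic change in `nvme_pci_npages_prp()` is easier to see with numbers. The sketch below reproduces both formulas in user space. The `sketch_` names are hypothetical, and the inputs used in the usage note (4096 KiB maximum transfer, 4096-byte controller and host pages) are assumptions for illustration, since the diff does not show the values of the constants.

```c
#include <assert.h>

static unsigned sketch_div_round_up(unsigned n, unsigned d)
{
	assert(d != 0);
	return (n + d - 1) / d;
}

/* Formula as it reads after the change: sizes are in bytes throughout. */
static unsigned sketch_npages_prp(unsigned max_kb, unsigned ctrl_page_size)
{
	unsigned max_bytes, nprps;

	assert(ctrl_page_size > 8);
	/* Convert KiB to bytes and allow for an unaligned start. */
	max_bytes = max_kb * 1024 + ctrl_page_size;
	nprps = sketch_div_round_up(max_bytes, ctrl_page_size);
	/* 8 bytes per entry; 8 bytes per list page are set aside. */
	return sketch_div_round_up(8 * nprps, ctrl_page_size - 8);
}

/* Formula as it read before: the KiB count is added to a byte count. */
static unsigned sketch_npages_prp_old(unsigned max_kb,
				      unsigned ctrl_page_size,
				      unsigned host_page_size)
{
	unsigned nprps;

	assert(host_page_size > 8);
	nprps = sketch_div_round_up(max_kb + ctrl_page_size, ctrl_page_size);
	return sketch_div_round_up(8 * nprps, host_page_size - 8);
}
```

With the assumed inputs, the old formula yields 2 PRP entries and a single list page, while the new one yields 1025 entries and 3 list pages, which illustrates why the shortlog calls this a mempool allocation size fix.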
/*
@ -392,7 +393,7 @@ static int nvme_pci_npages_prp(void)
static int nvme_pci_npages_sgl(void)
{
return DIV_ROUND_UP(NVME_MAX_SEGS * sizeof(struct nvme_sgl_desc),
PAGE_SIZE);
NVME_CTRL_PAGE_SIZE);
}
static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
@ -708,7 +709,7 @@ static void nvme_pci_sgl_set_seg(struct nvme_sgl_desc *sge,
sge->length = cpu_to_le32(entries * sizeof(*sge));
sge->type = NVME_SGL_FMT_LAST_SEG_DESC << 4;
} else {
sge->length = cpu_to_le32(PAGE_SIZE);
sge->length = cpu_to_le32(NVME_CTRL_PAGE_SIZE);
sge->type = NVME_SGL_FMT_SEG_DESC << 4;
}
}
@ -2332,10 +2333,12 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
if (dev->cmb_use_sqes) {
result = nvme_cmb_qdepth(dev, nr_io_queues,
sizeof(struct nvme_command));
if (result > 0)
if (result > 0) {
dev->q_depth = result;
else
dev->ctrl.sqsize = result - 1;
} else {
dev->cmb_use_sqes = false;
}
}
do {
@ -2536,7 +2539,6 @@ static int nvme_pci_enable(struct nvme_dev *dev)
dev->q_depth = min_t(u32, NVME_CAP_MQES(dev->ctrl.cap) + 1,
io_queue_depth);
dev->ctrl.sqsize = dev->q_depth - 1; /* 0's based queue depth */
dev->db_stride = 1 << NVME_CAP_STRIDE(dev->ctrl.cap);
dev->dbs = dev->bar + 4096;
@ -2577,7 +2579,7 @@ static int nvme_pci_enable(struct nvme_dev *dev)
dev_warn(dev->ctrl.device, "IO queue depth clamped to %d\n",
dev->q_depth);
}
dev->ctrl.sqsize = dev->q_depth - 1; /* 0's based queue depth */
nvme_map_cmb(dev);
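
The doorbell buffer change earlier in this file's diff treats the shadow doorbell and event index slots as little-endian values shared with the device. The sketch below shows that conversion in user space. The `sketch_` helpers are hypothetical stand-ins for the kernel's `cpu_to_le32()` and `le32_to_cpu()`, and the memory barriers of the real function are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-in for cpu_to_le32(): lay the bytes out LSB first. */
static uint32_t sketch_cpu_to_le32(uint32_t v)
{
	uint8_t b[4];
	uint32_t out;

	b[0] = (uint8_t)(v & 0xff);
	b[1] = (uint8_t)((v >> 8) & 0xff);
	b[2] = (uint8_t)((v >> 16) & 0xff);
	b[3] = (uint8_t)((v >> 24) & 0xff);
	memcpy(&out, b, sizeof(out));
	return out;
}

/* Portable stand-in for le32_to_cpu(): read the bytes LSB first. */
static uint32_t sketch_le32_to_cpu(uint32_t v)
{
	uint8_t b[4];

	memcpy(b, &v, sizeof(b));
	return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/*
 * Store a new doorbell value in device byte order and return the
 * previous value in CPU byte order. No barriers in this sketch.
 */
static uint16_t sketch_dbbuf_update(uint32_t *dbbuf_db, uint16_t value)
{
	uint16_t old;

	assert(dbbuf_db != NULL);
	old = (uint16_t)sketch_le32_to_cpu(*dbbuf_db);
	*dbbuf_db = sketch_cpu_to_le32(value);
	return old;
}
```

On a little-endian host both helpers are effectively no-ops, so a missing conversion would only show up on big-endian machines.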


@ -164,26 +164,31 @@ out:
static void nvmet_get_cmd_effects_nvm(struct nvme_effects_log *log)
{
log->acs[nvme_admin_get_log_page] = cpu_to_le32(1 << 0);
log->acs[nvme_admin_identify] = cpu_to_le32(1 << 0);
log->acs[nvme_admin_abort_cmd] = cpu_to_le32(1 << 0);
log->acs[nvme_admin_set_features] = cpu_to_le32(1 << 0);
log->acs[nvme_admin_get_features] = cpu_to_le32(1 << 0);
log->acs[nvme_admin_async_event] = cpu_to_le32(1 << 0);
log->acs[nvme_admin_keep_alive] = cpu_to_le32(1 << 0);
log->acs[nvme_admin_get_log_page] =
log->acs[nvme_admin_identify] =
log->acs[nvme_admin_abort_cmd] =
log->acs[nvme_admin_set_features] =
log->acs[nvme_admin_get_features] =
log->acs[nvme_admin_async_event] =
log->acs[nvme_admin_keep_alive] =
cpu_to_le32(NVME_CMD_EFFECTS_CSUPP);
log->iocs[nvme_cmd_read] = cpu_to_le32(1 << 0);
log->iocs[nvme_cmd_write] = cpu_to_le32(1 << 0);
log->iocs[nvme_cmd_flush] = cpu_to_le32(1 << 0);
log->iocs[nvme_cmd_dsm] = cpu_to_le32(1 << 0);
log->iocs[nvme_cmd_write_zeroes] = cpu_to_le32(1 << 0);
log->iocs[nvme_cmd_read] =
log->iocs[nvme_cmd_flush] =
log->iocs[nvme_cmd_dsm] =
cpu_to_le32(NVME_CMD_EFFECTS_CSUPP);
log->iocs[nvme_cmd_write] =
log->iocs[nvme_cmd_write_zeroes] =
cpu_to_le32(NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC);
}
static void nvmet_get_cmd_effects_zns(struct nvme_effects_log *log)
{
log->iocs[nvme_cmd_zone_append] = cpu_to_le32(1 << 0);
log->iocs[nvme_cmd_zone_mgmt_send] = cpu_to_le32(1 << 0);
log->iocs[nvme_cmd_zone_mgmt_recv] = cpu_to_le32(1 << 0);
log->iocs[nvme_cmd_zone_append] =
log->iocs[nvme_cmd_zone_mgmt_send] =
cpu_to_le32(NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC);
log->iocs[nvme_cmd_zone_mgmt_recv] =
cpu_to_le32(NVME_CMD_EFFECTS_CSUPP);
}
static void nvmet_execute_get_log_cmd_effects_ns(struct nvmet_req *req)


@ -334,14 +334,13 @@ static void nvmet_passthru_execute_cmd(struct nvmet_req *req)
}
/*
* If there are effects for the command we are about to execute, or
* an end_req function we need to use nvme_execute_passthru_rq()
* synchronously in a work item seeing the end_req function and
* nvme_passthru_end() can't be called in the request done callback
* which is typically in interrupt context.
* If a command needs post-execution fixups, or there are any
* non-trivial effects, make sure to execute the command synchronously
* in a workqueue so that nvme_passthru_end gets called.
*/
effects = nvme_command_effects(ctrl, ns, req->cmd->common.opcode);
if (req->p.use_workqueue || effects) {
if (req->p.use_workqueue ||
(effects & ~(NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC))) {
INIT_WORK(&req->p.work, nvmet_passthru_execute_cmd_work);
req->p.rq = rq;
queue_work(nvmet_wq, &req->p.work);


@ -7,6 +7,7 @@
#ifndef _LINUX_NVME_H
#define _LINUX_NVME_H
#include <linux/bits.h>
#include <linux/types.h>
#include <linux/uuid.h>
@ -639,8 +640,9 @@ enum {
NVME_CMD_EFFECTS_NCC = 1 << 2,
NVME_CMD_EFFECTS_NIC = 1 << 3,
NVME_CMD_EFFECTS_CCC = 1 << 4,
NVME_CMD_EFFECTS_CSE_MASK = 3 << 16,
NVME_CMD_EFFECTS_CSE_MASK = GENMASK(18, 16),
NVME_CMD_EFFECTS_UUID_SEL = 1 << 19,
NVME_CMD_EFFECTS_SCOPE_MASK = GENMASK(31, 20),
};
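
The mask change in the enum above is small but easy to misread: `3 << 16` sets only bits 17:16, while `GENMASK(18, 16)` covers bits 18:16. The sketch below checks this in user space. `sketch_genmask32()` is a hypothetical stand-in for the kernel's `GENMASK()` macro, limited to 32-bit values.

```c
#include <assert.h>
#include <stdint.h>

/* Build a 32-bit mask with bits l..h (inclusive) set. */
static uint32_t sketch_genmask32(unsigned h, unsigned l)
{
	assert(l <= h && h < 32);
	return (uint32_t)(UINT32_MAX >> (31 - h)) &
	       (uint32_t)(UINT32_MAX << l);
}

/* Old and new definitions of the CSE mask, side by side. */
static uint32_t sketch_cse_mask_old(void)
{
	return UINT32_C(3) << 16;        /* 0x00030000: bits 17:16 */
}

static uint32_t sketch_cse_mask_new(void)
{
	return sketch_genmask32(18, 16); /* 0x00070000: bits 18:16 */
}
```

With the old value, an effects word that has only bit 18 set would not be matched by the mask. The new `NVME_CMD_EFFECTS_SCOPE_MASK` corresponds to `sketch_genmask32(31, 20)`, which sits directly above the UUID selection bit at position 19.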
struct nvme_effects_log {