The find.h APIs are designed to be used only on unsigned long arguments.
This can technically result in a over-read, but it is harmless in this
case. Regardless, fix it to avoid the warning seen under -Warray-bounds,
which we'd like to enable globally:
In file included from ./include/linux/bitmap.h:9,
from drivers/iommu/intel/iommu.c:17:
drivers/iommu/intel/iommu.c: In function 'domain_context_mapping_one':
./include/linux/find.h:119:37: warning: array subscript 'long unsigned int[0]' is partly outside array bounds of 'int[1]' [-Warray-bounds]
119 | unsigned long val = *addr & GENMASK(size - 1, 0);
| ^~~~~
drivers/iommu/intel/iommu.c:2115:18: note: while referencing 'max_pde'
2115 | int pds, max_pde;
| ^~~~~~~
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Yury Norov <yury.norov@gmail.com>
Link: https://lore.kernel.org/r/20211215232432.2069605-1-keescook@chromium.org
Signed-off-by: Joerg Roedel <jroedel@suse.de>
When supporting only the .map and .unmap callbacks of iommu_ops,
the IOMMU driver can make assumptions about the size and alignment
used for mappings based on the driver provided pgsize_bitmap. VT-d
previously used essentially PAGE_MASK for this bitmap as any power
of two mapping was acceptably filled by native page sizes.
However, with the .map_pages and .unmap_pages interface we're now
getting page-size and count arguments. If we simply combine these
as (page-size * count) and make use of the previous map/unmap
functions internally, any size and alignment assumptions are very
different.
As an example, a given vfio device assignment VM will often create
a 4MB mapping at IOVA pfn [0x3fe00 - 0x401ff]. On a system that
does not support IOMMU super pages, the unmap_pages interface will
ask to unmap 1024 4KB pages at the base IOVA. dma_pte_clear_level()
will recurse down to level 2 of the page table where the first half
of the pfn range exactly matches the entire pte level. We clear the
pte, increment the pfn by the level size, but (oops) the next pte is
on a new page, so we exit the loop an pop back up a level. When we
then update the pfn based on that higher level, we seem to assume
that the previous pfn value was at the start of the level. In this
case the level size is 256K pfns, which we add to the base pfn and
get a results of 0x7fe00, which is clearly greater than 0x401ff,
so we're done. Meanwhile we never cleared the ptes for the remainder
of the range. When the VM remaps this range, we're overwriting valid
ptes and the VT-d driver complains loudly, as reported by the user
report linked below.
The fix for this seems relatively simple, if each iteration of the
loop in dma_pte_clear_level() is assumed to clear to the end of the
level pte page, then our next pfn should be calculated from level_pfn
rather than our working pfn.
Fixes: 3f34f12597 ("iommu/vt-d: Implement map/unmap_pages() iommu_ops callback")
Reported-by: Ajay Garg <ajaygargnsit@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Link: https://lore.kernel.org/all/20211002124012.18186-1-ajaygargnsit@gmail.com/
Link: https://lore.kernel.org/r/163659074748.1617923.12716161410774184024.stgit@omen
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20211126135556.397932-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The __domain_mapping() always removes the pages in the range from
'iov_pfn' to 'end_pfn', but the 'end_pfn' is always the last pfn
of the range that the caller wants to map.
This would introduce too many duplicated removing and leads the
map operation take too long, for example:
Map iova=0x100000,nr_pages=0x7d61800
iov_pfn: 0x100000, end_pfn: 0x7e617ff
iov_pfn: 0x140000, end_pfn: 0x7e617ff
iov_pfn: 0x180000, end_pfn: 0x7e617ff
iov_pfn: 0x1c0000, end_pfn: 0x7e617ff
iov_pfn: 0x200000, end_pfn: 0x7e617ff
...
it takes about 50ms in total.
We can reduce the cost by recalculate the 'end_pfn' and limit it
to the boundary of the end of this pte page.
Map iova=0x100000,nr_pages=0x7d61800
iov_pfn: 0x100000, end_pfn: 0x13ffff
iov_pfn: 0x140000, end_pfn: 0x17ffff
iov_pfn: 0x180000, end_pfn: 0x1bffff
iov_pfn: 0x1c0000, end_pfn: 0x1fffff
iov_pfn: 0x200000, end_pfn: 0x23ffff
...
it only need 9ms now.
This also removes a meaningless BUG_ON() in __domain_mapping().
Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Tested-by: Liujunjie <liujunjie23@huawei.com>
Link: https://lore.kernel.org/r/20211008000433.1115-1-longpeng2@huawei.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20211014053839.727419-10-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The IOMMU VT-d implementation uses the first level for GPA->HPA translation
by default. Although both the first level and the second level could handle
the DMA translation, they're different in some way. For example, the second
level translation has separate controls for the Access/Dirty page tracking.
With the first level translation, there's no such control. On the other
hand, the second level translation has the page-level control for forcing
snoop, but the first level only has global control with pasid granularity.
This uses the second level for GPA->HPA translation so that we can provide
a consistent hardware interface for use cases like dirty page tracking for
live migration.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20210926114535.923263-1-baolu.lu@linux.intel.com
Link: https://lore.kernel.org/r/20211014053839.727419-6-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
pasid_mutex and dev->iommu->param->lock are held while unbinding mm is
flushing IO page fault workqueue and waiting for all page fault works to
finish. But an in-flight page fault work also need to hold the two locks
while unbinding mm are holding them and waiting for the work to finish.
This may cause an ABBA deadlock issue as shown below:
idxd 0000:00:0a.0: unbind PASID 2
======================================================
WARNING: possible circular locking dependency detected
5.14.0-rc7+ #549 Not tainted [ 186.615245] ----------
dsa_test/898 is trying to acquire lock:
ffff888100d854e8 (¶m->lock){+.+.}-{3:3}, at:
iopf_queue_flush_dev+0x29/0x60
but task is already holding lock:
ffffffff82b2f7c8 (pasid_mutex){+.+.}-{3:3}, at:
intel_svm_unbind+0x34/0x1e0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (pasid_mutex){+.+.}-{3:3}:
__mutex_lock+0x75/0x730
mutex_lock_nested+0x1b/0x20
intel_svm_page_response+0x8e/0x260
iommu_page_response+0x122/0x200
iopf_handle_group+0x1c2/0x240
process_one_work+0x2a5/0x5a0
worker_thread+0x55/0x400
kthread+0x13b/0x160
ret_from_fork+0x22/0x30
-> #1 (¶m->fault_param->lock){+.+.}-{3:3}:
__mutex_lock+0x75/0x730
mutex_lock_nested+0x1b/0x20
iommu_report_device_fault+0xc2/0x170
prq_event_thread+0x28a/0x580
irq_thread_fn+0x28/0x60
irq_thread+0xcf/0x180
kthread+0x13b/0x160
ret_from_fork+0x22/0x30
-> #0 (¶m->lock){+.+.}-{3:3}:
__lock_acquire+0x1134/0x1d60
lock_acquire+0xc6/0x2e0
__mutex_lock+0x75/0x730
mutex_lock_nested+0x1b/0x20
iopf_queue_flush_dev+0x29/0x60
intel_svm_drain_prq+0x127/0x210
intel_svm_unbind+0xc5/0x1e0
iommu_sva_unbind_device+0x62/0x80
idxd_cdev_release+0x15a/0x200 [idxd]
__fput+0x9c/0x250
____fput+0xe/0x10
task_work_run+0x64/0xa0
exit_to_user_mode_prepare+0x227/0x230
syscall_exit_to_user_mode+0x2c/0x60
do_syscall_64+0x48/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
other info that might help us debug this:
Chain exists of:
¶m->lock --> ¶m->fault_param->lock --> pasid_mutex
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(pasid_mutex);
lock(¶m->fault_param->lock);
lock(pasid_mutex);
lock(¶m->lock);
*** DEADLOCK ***
2 locks held by dsa_test/898:
#0: ffff888100cc1cc0 (&group->mutex){+.+.}-{3:3}, at:
iommu_sva_unbind_device+0x53/0x80
#1: ffffffff82b2f7c8 (pasid_mutex){+.+.}-{3:3}, at:
intel_svm_unbind+0x34/0x1e0
stack backtrace:
CPU: 2 PID: 898 Comm: dsa_test Not tainted 5.14.0-rc7+ #549
Hardware name: Intel Corporation Kabylake Client platform/KBL S
DDR4 UD IMM CRB, BIOS KBLSE2R1.R00.X050.P01.1608011715 08/01/2016
Call Trace:
dump_stack_lvl+0x5b/0x74
dump_stack+0x10/0x12
print_circular_bug.cold+0x13d/0x142
check_noncircular+0xf1/0x110
__lock_acquire+0x1134/0x1d60
lock_acquire+0xc6/0x2e0
? iopf_queue_flush_dev+0x29/0x60
? pci_mmcfg_read+0xde/0x240
__mutex_lock+0x75/0x730
? iopf_queue_flush_dev+0x29/0x60
? pci_mmcfg_read+0xfd/0x240
? iopf_queue_flush_dev+0x29/0x60
mutex_lock_nested+0x1b/0x20
iopf_queue_flush_dev+0x29/0x60
intel_svm_drain_prq+0x127/0x210
? intel_pasid_tear_down_entry+0x22e/0x240
intel_svm_unbind+0xc5/0x1e0
iommu_sva_unbind_device+0x62/0x80
idxd_cdev_release+0x15a/0x200
pasid_mutex protects pasid and svm data mapping data. It's unnecessary
to hold pasid_mutex while flushing the workqueue. To fix the deadlock
issue, unlock pasid_pasid during flushing the workqueue to allow the works
to be handled.
Fixes: d5b9e4bfe0 ("iommu/vt-d: Report prq to io-pgfault framework")
Reported-and-tested-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: https://lore.kernel.org/r/20210826215918.4073446-1-fenghua.yu@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20210828070622.2437559-3-baolu.lu@linux.intel.com
[joro: Removed timing information from kernel log messages]
Signed-off-by: Joerg Roedel <jroedel@suse.de>
This fixes improper iotlb invalidation in intel_pasid_tear_down_entry().
When a PASID was used as nested mode, released and reused, the following
error message will appear:
[ 180.187556] Unexpected page request in Privilege Mode
[ 180.187565] Unexpected page request in Privilege Mode
[ 180.279933] Unexpected page request in Privilege Mode
[ 180.279937] Unexpected page request in Privilege Mode
Per chapter 6.5.3.3 of VT-d spec 3.3, when tear down a pasid entry, the
software should use Domain selective IOTLB flush if the PGTT of the pasid
entry is SL only or Nested, while for the pasid entries whose PGTT is FL
only or PT using PASID-based IOTLB flush is enough.
Fixes: 2cd1311a26 ("iommu/vt-d: Add set domain DOMAIN_ATTR_NESTING attr")
Signed-off-by: Kumar Sanjay K <sanjay.k.kumar@intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Tested-by: Yi Sun <yi.y.sun@intel.com>
Link: https://lore.kernel.org/r/20210817042425.1784279-1-yi.l.liu@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20210817124321.1517985-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
A PASID reference is increased whenever a device is bound to an mm (and
its PASID) successfully (i.e. the device's sdev user count is increased).
But the reference is not dropped every time the device is unbound
successfully from the mm (i.e. the device's sdev user count is decreased).
The reference is dropped only once by calling intel_svm_free_pasid() when
there isn't any device bound to the mm. intel_svm_free_pasid() drops the
reference and only frees the PASID on zero reference.
Fix the issue by dropping the PASID reference and freeing the PASID when
no reference on successful unbinding the device by calling
intel_svm_free_pasid() .
Fixes: 4048377414 ("iommu/vt-d: Use iommu_sva_alloc(free)_pasid() helpers")
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: https://lore.kernel.org/r/20210813181345.1870742-1-fenghua.yu@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20210817124321.1517985-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The pgsize bitmap is used to advertise the page sizes our hardware supports
to the IOMMU core, which will then use this information to split physically
contiguous memory regions it is mapping into page sizes that we support.
Traditionally the IOMMU core just handed us the mappings directly, after
making sure the size is an order of a 4KiB page and that the mapping has
natural alignment. To retain this behavior, we currently advertise that we
support all page sizes that are an order of 4KiB.
We are about to utilize the new IOMMU map/unmap_pages APIs. We could change
this to advertise the real page sizes we support.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20210720020615.4144323-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Make IOMMU_DEFAULT_LAZY default for when INTEL_IOMMU config is set,
as is current behaviour.
Also delete global flag intel_iommu_strict:
- In intel_iommu_setup(), call iommu_set_dma_strict(true) directly. Also
remove the print, as iommu_subsys_init() prints the mode and we have
already marked this param as deprecated.
- For cap_caching_mode() check in intel_iommu_setup(), call
iommu_set_dma_strict(true) directly; also reword the accompanying print
with a level downgrade and also add the missing '\n'.
- For Ironlake GPU, again call iommu_set_dma_strict(true) directly and
keep the accompanying print.
[jpg: Remove intel_iommu_strict]
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/1626088340-5838-5-git-send-email-john.garry@huawei.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Passing a 64-bit address width to iommu_setup_dma_ops() is valid on
virtual platforms, but isn't currently possible. The overflow check in
iommu_dma_init_domain() prevents this even when @dma_base isn't 0. Pass
a limit address instead of a size, so callers don't have to fake a size
to work around the check.
The base and limit parameters are being phased out, because:
* they are redundant for x86 callers. dma-iommu already reserves the
first page, and the upper limit is already in domain->geometry.
* they can now be obtained from dev->dma_range_map on Arm.
But removing them on Arm isn't completely straightforward so is left for
future work. As an intermediate step, simplify the x86 callers by
passing dummy limits.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/20210618152059.1194210-5-jean-philippe@linaro.org
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The assignment of iommu from info->iommu occurs before info is null checked
hence leading to a potential null pointer dereference issue. Fix this by
assigning iommu and checking if iommu is null after null checking info.
Addresses-Coverity: ("Dereference before null check")
Fixes: 4c82b88696 ("iommu/vt-d: Allocate/register iopf queue for sva devices")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20210611135024.32781-1-colin.king@canonical.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
A recent commit broke the build on 32-bit x86. The linker throws these
messages:
ld: drivers/iommu/intel/perf.o: in function `dmar_latency_snapshot':
perf.c:(.text+0x40c): undefined reference to `__udivdi3'
ld: perf.c:(.text+0x458): undefined reference to `__udivdi3'
The reason are the 64-bit divides in dmar_latency_snapshot(). Use the
div_u64() helper function for those.
Fixes: 55ee5e67a5 ("iommu/vt-d: Add common code for dmar latency performance monitors")
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Link: https://lore.kernel.org/r/20210610083120.29224-1-joro@8bytes.org