linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-10 22:21:40 +00:00

Author	SHA1	Message	Date
Yi Liu	89d63875d8	iommufd: Flow user flags for domain allocation to domain_alloc_user() Extends iommufd_hw_pagetable_alloc() to accept user flags, the uAPI will provide the flags. Link: https://lore.kernel.org/r/20230928071528.26258-4-yi.l.liu@intel.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-10 13:31:24 -03:00
Yi Liu	7975b72208	iommufd: Use the domain_alloc_user() op for domain allocation Make IOMMUFD use iommu_domain_alloc_user() by default for iommu_domain creation. IOMMUFD needs to support iommu_domain allocation with parameters from userspace in nested support, and a driver is expected to implement everything under this op. If the iommu driver doesn't provide domain_alloc_user callback then IOMMUFD falls back to use iommu_domain_alloc() with an UNMANAGED type if possible. Link: https://lore.kernel.org/r/20230928071528.26258-3-yi.l.liu@intel.com Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Co-developed-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-10 13:31:24 -03:00
Jason Gunthorpe	1c68cbc64f	iommu: Add IOMMU_DOMAIN_PLATFORM This is used when the iommu driver is taking control of the dma_ops, currently only on S390 and power spapr. It is designed to preserve the original ops->detach_dev() semantic that these S390 was built around. Provide an opaque domain type and a 'default_domain' ops value that allows the driver to trivially force any single domain as the default domain. Update iommufd selftest to use this instead of set_platform_dma_ops Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/2-v8-81230027b2fa+9d-iommu_all_defdom_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>	2023-09-25 11:40:52 +02:00
Jason Gunthorpe	df31b29847	iommu: Add iommu_ops->identity_domain This allows a driver to set a global static to an IDENTITY domain and the core code will automatically use it whenever an IDENTITY domain is requested. By making it always available it means the IDENTITY can be used in error handling paths to force the iommu driver into a known state. Devices implementing global static identity domains should avoid failing their attach_dev ops. To make global static domains simpler allow drivers to omit their free function and update the iommufd selftest. Convert rockchip to use the new mechanism. Tested-by: Steven Price <steven.price@arm.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/1-v8-81230027b2fa+9d-iommu_all_defdom_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>	2023-09-25 11:40:51 +02:00
Linus Torvalds	4debf77169	iommufd for 6.6 This includes a shared branch with VFIO: - Enhance VFIO_DEVICE_GET_PCI_HOT_RESET_INFO so it can work with iommufd FDs, not just group FDs. This removes the last place in the uAPI that required the group fd. - Give VFIO a new device node /dev/vfio/devices/vfioX (the so called cdev node) which is very similar to the FD from VFIO_GROUP_GET_DEVICE_FD. The cdev is associated with the struct device that the VFIO driver is bound to and shows up in sysfs in the normal way. - Add a cdev IOCTL VFIO_DEVICE_BIND_IOMMUFD which allows a newly opened /dev/vfio/devices/vfioX to be associated with an IOMMUFD, this replaces the VFIO_GROUP_SET_CONTAINER flow. - Add cdev IOCTLs VFIO_DEVICE_[AT\|DE]TACH_IOMMUFD_PT to allow the IOMMU translation the vfio_device is associated with to be changed. This is a significant new feature for VFIO as previously each vfio_device was fixed to a single translation. The translation is under the control of iommufd, so it can be any of the different translation modes that iommufd is learning to create. At this point VFIO has compilation options to remove the legacy interfaces and in modern mode it behaves like a normal driver subsystem. The /dev/vfio/iommu and /dev/vfio/groupX nodes are not present and each vfio_device only has a /dev/vfio/devices/vfioX cdev node that represents the device. On top of this is built some of the new iommufd functionality: - IOMMU_HWPT_ALLOC allows userspace to directly create the low level IO Page table objects and affiliate them with IOAS objects that hold the translation mapping. This is the basic functionality for the normal IOMMU_DOMAIN_PAGING domains. - VFIO_DEVICE_ATTACH_IOMMUFD_PT can be used to replace the current translation. This is wired up to through all the layers down to the driver so the driver has the ability to implement a hitless replacement. This is necessary to fully support guest behaviors when emulating HW (eg guest atomic change of translation) - IOMMU_GET_HW_INFO returns information about the IOMMU driver HW that owns a VFIO device. This includes support for the Intel iommu, and patches have been posted for all the other server IOMMU. Along the way are a number of internal items: - New iommufd kapis iommufd_ctx_has_group(), iommufd_device_to_ictx(), iommufd_device_to_id(), iommufd_access_detach(), iommufd_ctx_from_fd(), iommufd_device_replace() - iommufd now internally tracks iommu_groups as it needs some per-group data - Reorganize how the internal hwpt allocation flows to have more robust locking - Improve the access interfaces to support detach and replace of an IOAS from an access - New selftests and a rework of how the selftests creates a mock iommu driver to be more like a real iommu driver -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZO/QDQAKCRCFwuHvBreF YZ2iAP4hNEF6MJLRI2A28V3I/80f3x9Ed3Cirp/Q8ZdVEE+HYQD8DFaafJ0y3iPQ 5mxD4ZrZ9KfUns/gUqCT5oPHjrcvSAM= =EQCw -----END PGP SIGNATURE----- Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd Pull iommufd updates from Jason Gunthorpe: "On top of the vfio updates is built some new iommufd functionality: - IOMMU_HWPT_ALLOC allows userspace to directly create the low level IO Page table objects and affiliate them with IOAS objects that hold the translation mapping. This is the basic functionality for the normal IOMMU_DOMAIN_PAGING domains. - VFIO_DEVICE_ATTACH_IOMMUFD_PT can be used to replace the current translation. This is wired up to through all the layers down to the driver so the driver has the ability to implement a hitless replacement. This is necessary to fully support guest behaviors when emulating HW (eg guest atomic change of translation) - IOMMU_GET_HW_INFO returns information about the IOMMU driver HW that owns a VFIO device. This includes support for the Intel iommu, and patches have been posted for all the other server IOMMU. Along the way are a number of internal items: - New iommufd kernel APIs: iommufd_ctx_has_group(), iommufd_device_to_ictx(), iommufd_device_to_id(), iommufd_access_detach(), iommufd_ctx_from_fd(), iommufd_device_replace() - iommufd now internally tracks iommu_groups as it needs some per-group data - Reorganize how the internal hwpt allocation flows to have more robust locking - Improve the access interfaces to support detach and replace of an IOAS from an access - New selftests and a rework of how the selftests creates a mock iommu driver to be more like a real iommu driver" Link: https://lore.kernel.org/lkml/ZO%2FTe6LU1ENf58ZW@nvidia.com/ * tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (34 commits) iommufd/selftest: Don't leak the platform device memory when unloading the module iommu/vt-d: Implement hw_info for iommu capability query iommufd/selftest: Add coverage for IOMMU_GET_HW_INFO ioctl iommufd: Add IOMMU_GET_HW_INFO iommu: Add new iommu op to get iommu hardware information iommu: Move dev_iommu_ops() to private header iommufd: Remove iommufd_ref_to_users() iommufd/selftest: Make the mock iommu driver into a real driver vfio: Support IO page table replacement iommufd/selftest: Add IOMMU_TEST_OP_ACCESS_REPLACE_IOAS coverage iommufd: Add iommufd_access_replace() API iommufd: Use iommufd_access_change_ioas in iommufd_access_destroy_object iommufd: Add iommufd_access_change_ioas(_id) helpers iommufd: Allow passing in iopt_access_list_id to iopt_remove_access() vfio: Do not allow !ops->dma_unmap in vfio_pin/unpin_pages() iommufd/selftest: Add a selftest for IOMMU_HWPT_ALLOC iommufd/selftest: Return the real idev id from selftest mock_domain iommufd: Add IOMMU_HWPT_ALLOC iommufd/selftest: Test iommufd_device_replace() iommufd: Make destroy_rwsem use a lock class per object type ...	2023-08-30 20:41:37 -07:00
Linus Torvalds	ec0e2dc810	VFIO updates for v6.6-rc1 - VFIO direct character device (cdev) interface support. This extracts the vfio device fd from the container and group model, and is intended to be the native uAPI for use with IOMMUFD. (Yi Liu) - Enhancements to the PCI hot reset interface in support of cdev usage. (Yi Liu) - Fix a potential race between registering and unregistering vfio files in the kvm-vfio interface and extend use of a lock to avoid extra drop and acquires. (Dmitry Torokhov) - A new vfio-pci variant driver for the AMD/Pensando Distributed Services Card (PDS) Ethernet device, supporting live migration. (Brett Creeley) - Cleanups to remove redundant owner setup in cdx and fsl bus drivers, and simplify driver init/exit in fsl code. (Li Zetao) - Fix uninitialized hole in data structure and pad capability structures for alignment. (Stefan Hajnoczi) -----BEGIN PGP SIGNATURE----- iQJPBAABCAA5FiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmTvnDUbHGFsZXgud2ls bGlhbXNvbkByZWRoYXQuY29tAAoJECObm247sIsimEEP/AzG+VRcu5LfYbLGLe0z zB8ts6G7S78wXlmfN/LYi3v92XWvMMcm+vYF8oNAMfr1YL5sibWN6UtQfY1KCr7h nWKdQdqjajJ5yDDZnOFdhqHJGNfmZw6+fey8Z0j8zRI2oymK4DncWWX3g/7L1SNr 9tIexGJef+mOdAmC94yOut3YviAaZ+f95T/xrdXHzzoNr50DD0+PD6AJdKJfKggP vhiC/DAYH3Fofaa6tRasgWuKCYWdjZLR/kxgNpeEmW6kZnbq/dnzZ+kgn4HH1f9G 8p7UKVARR6FfG5aLheWu6Y9PDaKnfnqu8y/hobuE/ivXcmqqK+a6xSxrjgbVs8WJ 94SYnTBRoTlDJaKWa7GxqdgzJnV+s5ZyAgPhjzdi6mLTPWGzkuLhFWGtYL+LZAQ6 pNeZSM6CFBk+bva/xT0nNPCXxPh+/j/Y0G18FREj8aPFc03HrJQqz0RLydvTnoDz nX/by5KdzMSVSVLPr4uDMtAsgxsGqWiFcp7QMw1HhhlLWxqmYbA+mLZaqyMZUUOx 6b/P8WXT9P2I+qPVKWQ5CWyqpsEqm6P+72yg6LOM9kINvgwDhOa7cagMXIuMWYMH Rf97FL+K8p1eIy6AnvRHgFBMM5185uG+0YcJyVqtucDr/k8T/Om6ujAI6JbWtNe6 cLgaVAqKOYqCR4HC9bfVGSbd =eKSR -----END PGP SIGNATURE----- Merge tag 'vfio-v6.6-rc1' of https://github.com/awilliam/linux-vfio Pull VFIO updates from Alex Williamson: - VFIO direct character device (cdev) interface support. This extracts the vfio device fd from the container and group model, and is intended to be the native uAPI for use with IOMMUFD (Yi Liu) - Enhancements to the PCI hot reset interface in support of cdev usage (Yi Liu) - Fix a potential race between registering and unregistering vfio files in the kvm-vfio interface and extend use of a lock to avoid extra drop and acquires (Dmitry Torokhov) - A new vfio-pci variant driver for the AMD/Pensando Distributed Services Card (PDS) Ethernet device, supporting live migration (Brett Creeley) - Cleanups to remove redundant owner setup in cdx and fsl bus drivers, and simplify driver init/exit in fsl code (Li Zetao) - Fix uninitialized hole in data structure and pad capability structures for alignment (Stefan Hajnoczi) * tag 'vfio-v6.6-rc1' of https://github.com/awilliam/linux-vfio: (53 commits) vfio/pds: Send type for SUSPEND_STATUS command vfio/pds: fix return value in pds_vfio_get_lm_file() pds_core: Fix function header descriptions vfio: align capability structures vfio/type1: fix cap_migration information leak vfio/fsl-mc: Use module_fsl_mc_driver macro to simplify the code vfio/cdx: Remove redundant initialization owner in vfio_cdx_driver vfio/pds: Add Kconfig and documentation vfio/pds: Add support for firmware recovery vfio/pds: Add support for dirty page tracking vfio/pds: Add VFIO live migration support vfio/pds: register with the pds_core PF pds_core: Require callers of register/unregister to pass PF drvdata vfio/pds: Initial support for pds VFIO driver vfio: Commonize combine_ranges for use in other VFIO drivers kvm/vfio: avoid bouncing the mutex when adding and deleting groups kvm/vfio: ensure kvg instance stays around in kvm_vfio_group_add() docs: vfio: Add vfio device cdev description vfio: Compile vfio_group infrastructure optionally vfio: Move the IOMMU_CAP_CACHE_COHERENCY check in __vfio_register_dev() ...	2023-08-30 20:36:01 -07:00
Yang Yingliang	eb501c2d96	iommufd/selftest: Don't leak the platform device memory when unloading the module It should call platform_device_unregister() instead of platform_device_del() to unregister and free the device. Fixes: `23a1b46f15` ("iommufd/selftest: Make the mock iommu driver into a real driver") Link: https://lore.kernel.org/r/20230816081318.1232865-1-yangyingliang@huawei.com Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-08-18 12:56:24 -03:00
Nicolin Chen	af4fde93c3	iommufd/selftest: Add coverage for IOMMU_GET_HW_INFO ioctl Add a mock_domain_hw_info function and an iommu_test_hw_info data structure. This allows to test the IOMMU_GET_HW_INFO ioctl passing the test_reg value for the mock_dev. Link: https://lore.kernel.org/r/20230818101033.4100-5-yi.l.liu@intel.com Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-08-18 12:52:15 -03:00
Yi Liu	55dd4023ce	iommufd: Add IOMMU_GET_HW_INFO Under nested IOMMU translation, userspace owns the stage-1 translation table (e.g. the stage-1 page table of Intel VT-d or the context table of ARM SMMUv3, and etc.). Stage-1 translation tables are vendor specific, and need to be compatible with the underlying IOMMU hardware. Hence, userspace should know the IOMMU hardware capability before creating and configuring the stage-1 translation table to kernel. This adds IOMMU_GET_HW_INFO ioctl to query the IOMMU hardware information (a.k.a capability) for a given device. The returned data is vendor specific, userspace needs to decode it with the structure by the output @out_data_type field. As only physical devices have IOMMU hardware, so this will return error if the given device is not a physical device. Link: https://lore.kernel.org/r/20230818101033.4100-4-yi.l.liu@intel.com Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Co-developed-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-08-18 12:52:15 -03:00
Stefan Hajnoczi	a881b49694	vfio: align capability structures The VFIO_DEVICE_GET_INFO, VFIO_DEVICE_GET_REGION_INFO, and VFIO_IOMMU_GET_INFO ioctls fill in an info struct followed by capability structs: +------+---------+---------+-----+ \| info \| caps[0] \| caps[1] \| ... \| +------+---------+---------+-----+ Both the info and capability struct sizes are not always multiples of sizeof(u64), leaving u64 fields in later capability structs misaligned. Userspace applications currently need to handle misalignment manually in order to support CPU architectures and programming languages with strict alignment requirements. Make life easier for userspace by ensuring alignment in the kernel. This is done by padding info struct definitions and by copying out zeroes after capability structs that are not aligned. The new layout is as follows: +------+---------+---+---------+-----+ \| info \| caps[0] \| 0 \| caps[1] \| ... \| +------+---------+---+---------+-----+ In this example caps[0] has a size that is not multiples of sizeof(u64), so zero padding is added to align the subsequent structure. Adding zero padding between structs does not break the uapi. The memory layout is specified by the info.cap_offset and caps[i].next fields filled in by the kernel. Applications use these field values to locate structs and are therefore unaffected by the addition of zero padding. Note that code that copies out info structs with padding is updated to always zero the struct and copy out as many bytes as userspace requested. This makes the code shorter and avoids potential information leaks by ensuring padding is initialized. Originally-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230809203144.2880050-1-stefanha@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-08-17 12:17:44 -06:00
Jason Gunthorpe	65aaca1134	iommufd: Remove iommufd_ref_to_users() This no longer has any callers, remove the function Kevin noticed that after commit `99f98a7c0d` ("iommufd: IOMMUFD_DESTROY should not increase the refcount") there was only one other user and it turns out the rework in commit `9227da7816` ("iommufd: Add iommufd_access_change_ioas(_id) helpers") got rid of the last one. Link: https://lore.kernel.org/r/0-v1-abb31bedd888+c1-iommufd_ref_to_users_jgg@nvidia.com Suggested-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-08-15 10:03:14 -03:00
Jason Gunthorpe	a35762dd14	Linux 6.5-rc6 -----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmTZISMeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGP+kH/RJWesO8dQ1b2jRh v1dexbytGUykROpmHBnJKDznwsSBnhDlI9Tu62dumWKRrCzwZto8Hag1QC2jYrra x7f3W087HdTSh3j5B92kGK/ZXgm4NwjVI078ujSv/e+qJMB3behpdL7uUkFUeeVV OaDhlSL4ILlyVOYPX3sHMiPutmZcXxe8/25o4aylpBrzlClKen7OODRz6gIwyVOR Nufgi/H5bkB4rDLOVI87HrxQMSpCtyGJtjTB78e/aRvIwYhJq16iuq+uBqOxQqgr anlg1nJ3r6/LphiT9H63xNFwIJDxtL7I1V8CQ9Jyvf/O4MNGSaM7sHw2l8ujTxU9 hf4GYyY= =loC2 -----END PGP SIGNATURE----- Merge tag 'v6.5-rc6' into iommufd for-next Required for following patches. Resolve merge conflict by using the hunk from the for-next branch and shifting the iommufd_object_deref_user() into iommufd_hw_pagetable_put() Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-08-15 10:01:26 -03:00
Jason Gunthorpe	23a1b46f15	iommufd/selftest: Make the mock iommu driver into a real driver I've avoided doing this because there is no way to make this happen without an intrusion into the core code. Up till now this has avoided needing the core code's probe path with some hackery - but now that default domains are becoming mandatory it is unavoidable. This became a serious problem when the core code stopped allowing partially registered iommu drivers in commit `14891af379` ("iommu: Move the iommu driver sysfs setup into iommu_init/deinit_device()") which breaks the selftest. That series was developed along with a second series that contained this patch so it was not noticed. Make it so that iommufd selftest can create a real iommu driver and bind it only to is own private bus. Add iommu_device_register_bus() as a core code helper to make this possible. It simply sets the right pointers and registers the notifier block. The mock driver then works like any normal driver should, with probe triggered by the bus ops When the bus->iommu_ops stuff is fully unwound we can probably do better here and remove this special case. Link: https://lore.kernel.org/r/15-v6-e8114faedade+425-iommu_all_defdom_jgg@nvidia.com Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-08-14 20:13:45 -03:00
Nicolin Chen	c154660b6e	iommufd/selftest: Add IOMMU_TEST_OP_ACCESS_REPLACE_IOAS coverage Add a new IOMMU_TEST_OP_ACCESS_REPLACE_IOAS to allow replacing the access->ioas, corresponding to the iommufd_access_replace() helper. Then add replace coverage as a part of user_copy test case, which basically repeats the copy test after replacing the old ioas with a new one. Link: https://lore.kernel.org/r/a4897f93d41c34b972213243b8dbf4c3832842e4.1690523699.git.nicolinc@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-07-28 13:31:24 -03:00
Nicolin Chen	70c16123d8	iommufd: Add iommufd_access_replace() API Taking advantage of the new iommufd_access_change_ioas_id helper, add an iommufd_access_replace() API for the VFIO emulated pathway to use. Link: https://lore.kernel.org/r/a3267b924fd5f45e0d3a1dd13a9237e923563862.1690523699.git.nicolinc@nvidia.com Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-07-28 13:31:24 -03:00
Nicolin Chen	6129b59fcd	iommufd: Use iommufd_access_change_ioas in iommufd_access_destroy_object Update iommufd_access_destroy_object() to call the new iommufd_access_change_ioas() helper. It is impossible to legitimately race iommufd_access_destroy_object() with iommufd_access_change_ioas() as iommufd_access_destroy_object() is only called once the refcount reache zero, so any concurrent iommufd_access_change_ioas() is already UAFing the memory. Link: https://lore.kernel.org/r/f9fbeca2cde7f8515da18d689b3e02a6a40a5e14.1690523699.git.nicolinc@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-07-28 13:31:24 -03:00
Nicolin Chen	9227da7816	iommufd: Add iommufd_access_change_ioas(_id) helpers The complication of the mutex and refcount will be amplified after we introduce the replace support for accesses. So, add a preparatory change of a constitutive helper iommufd_access_change_ioas() and its wrapper iommufd_access_change_ioas_id(). They can simply take care of existing iommufd_access_attach() and iommufd_access_detach(), properly sequencing the refcount puts so that they are truely at the end of the sequence after we know the IOAS pointer is not required any more. Link: https://lore.kernel.org/r/da0c462532193b447329c4eb975a596f47e49b70.1690523699.git.nicolinc@nvidia.com Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-07-28 13:31:24 -03:00
Nicolin Chen	5d5c85ff62	iommufd: Allow passing in iopt_access_list_id to iopt_remove_access() This is a preparatory change for ioas replacement support for accesses. The replacement routine does an iopt_add_access() for a new IOAS first and then iopt_remove_access() for the old IOAS upon the success of the first call. However, the first call overrides the iopt_access_list_id in the access struct, resulting in iopt_remove_access() being unable to work on the old IOAS. Add an iopt_access_list_id as a parameter to iopt_remove_access, so the replacement routine can save the id before it gets overwritten. Pass the id in iopt_remove_access() for a proper cleanup. The existing callers should just pass in access->iopt_access_list_id. Link: https://lore.kernel.org/r/7bb939b9e0102da0c099572bb3de78ab7622221e.1690523699.git.nicolinc@nvidia.com Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-07-28 13:31:24 -03:00
Jason Gunthorpe	b7c822fa6b	iommufd: Set end correctly when doing batch carry Even though the test suite covers this it somehow became obscured that this wasn't working. The test iommufd_ioas.mock_domain.access_domain_destory would blow up rarely. end should be set to 1 because this just pushed an item, the carry, to the pfns list. Sometimes the test would blow up with: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP CPU: 5 PID: 584 Comm: iommufd Not tainted 6.5.0-rc1-dirty #1236 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:batch_unpin+0xa2/0x100 [iommufd] Code: 17 48 81 fe ff ff 07 00 77 70 48 8b 15 b7 be 97 e2 48 85 d2 74 14 48 8b 14 fa 48 85 d2 74 0b 40 0f b6 f6 48 c1 e6 04 48 01 f2 <48> 8b 3a 48 c1 e0 06 89 ca 48 89 de 48 83 e7 f0 48 01 c7 e8 96 dc RSP: 0018:ffffc90001677a58 EFLAGS: 00010246 RAX: 00007f7e2646f000 RBX: 0000000000000000 RCX: 0000000000000001 RDX: 0000000000000000 RSI: 00000000fefc4c8d RDI: 0000000000fefc4c RBP: ffffc90001677a80 R08: 0000000000000048 R09: 0000000000000200 R10: 0000000000030b98 R11: ffffffff81f3bb40 R12: 0000000000000001 R13: ffff888101f75800 R14: ffffc90001677ad0 R15: 00000000000001fe FS: 00007f9323679740(0000) GS:ffff8881ba540000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000000105ede003 CR4: 00000000003706a0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? show_regs+0x5c/0x70 ? __die+0x1f/0x60 ? page_fault_oops+0x15d/0x440 ? lock_release+0xbc/0x240 ? exc_page_fault+0x4a4/0x970 ? asm_exc_page_fault+0x27/0x30 ? batch_unpin+0xa2/0x100 [iommufd] ? batch_unpin+0xba/0x100 [iommufd] __iopt_area_unfill_domain+0x198/0x430 [iommufd] ? __mutex_lock+0x8c/0xb80 ? __mutex_lock+0x6aa/0xb80 ? xa_erase+0x28/0x30 ? iopt_table_remove_domain+0x162/0x320 [iommufd] ? lock_release+0xbc/0x240 iopt_area_unfill_domain+0xd/0x10 [iommufd] iopt_table_remove_domain+0x195/0x320 [iommufd] iommufd_hw_pagetable_destroy+0xb3/0x110 [iommufd] iommufd_object_destroy_user+0x8e/0xf0 [iommufd] iommufd_device_detach+0xc5/0x140 [iommufd] iommufd_selftest_destroy+0x1f/0x70 [iommufd] iommufd_object_destroy_user+0x8e/0xf0 [iommufd] iommufd_destroy+0x3a/0x50 [iommufd] iommufd_fops_ioctl+0xfb/0x170 [iommufd] __x64_sys_ioctl+0x40d/0x9a0 do_syscall_64+0x3c/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 Link: https://lore.kernel.org/r/3-v1-85aacb2af554+bc-iommufd_syz3_jgg@nvidia.com Cc: <stable@vger.kernel.org> Fixes: `f394576eb1` ("iommufd: PFN handling for iopt_pages") Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Reported-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-07-27 11:27:20 -03:00
Jason Gunthorpe	99f98a7c0d	iommufd: IOMMUFD_DESTROY should not increase the refcount syzkaller found a race where IOMMUFD_DESTROY increments the refcount: obj = iommufd_get_object(ucmd->ictx, cmd->id, IOMMUFD_OBJ_ANY); if (IS_ERR(obj)) return PTR_ERR(obj); iommufd_ref_to_users(obj); /* See iommufd_ref_to_users() */ if (!iommufd_object_destroy_user(ucmd->ictx, obj)) As part of the sequence to join the two existing primitives together. Allowing the refcount the be elevated without holding the destroy_rwsem violates the assumption that all temporary refcount elevations are protected by destroy_rwsem. Racing IOMMUFD_DESTROY with iommufd_object_destroy_user() will cause spurious failures: WARNING: CPU: 0 PID: 3076 at drivers/iommu/iommufd/device.c:477 iommufd_access_destroy+0x18/0x20 drivers/iommu/iommufd/device.c:478 Modules linked in: CPU: 0 PID: 3076 Comm: syz-executor.0 Not tainted 6.3.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/03/2023 RIP: 0010:iommufd_access_destroy+0x18/0x20 drivers/iommu/iommufd/device.c:477 Code: e8 3d 4e 00 00 84 c0 74 01 c3 0f 0b c3 0f 1f 44 00 00 f3 0f 1e fa 48 89 fe 48 8b bf a8 00 00 00 e8 1d 4e 00 00 84 c0 74 01 c3 <0f> 0b c3 0f 1f 44 00 00 41 57 41 56 41 55 4c 8d ae d0 00 00 00 41 RSP: 0018:ffffc90003067e08 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff888109ea0300 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00000000ffffffff RBP: 0000000000000004 R08: 0000000000000000 R09: ffff88810bbb3500 R10: ffff88810bbb3e48 R11: 0000000000000000 R12: ffffc90003067e88 R13: ffffc90003067ea8 R14: ffff888101249800 R15: 00000000fffffffe FS: 00007ff7254fe6c0(0000) GS:ffff888237c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000555557262da8 CR3: 000000010a6fd000 CR4: 0000000000350ef0 Call Trace: <TASK> iommufd_test_create_access drivers/iommu/iommufd/selftest.c:596 [inline] iommufd_test+0x71c/0xcf0 drivers/iommu/iommufd/selftest.c:813 iommufd_fops_ioctl+0x10f/0x1b0 drivers/iommu/iommufd/main.c:337 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl fs/ioctl.c:856 [inline] __x64_sys_ioctl+0x84/0xc0 fs/ioctl.c:856 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x38/0x80 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd The solution is to not increment the refcount on the IOMMUFD_DESTROY path at all. Instead use the xa_lock to serialize everything. The refcount check == 1 and xa_erase can be done under a single critical region. This avoids the need for any refcount incrementing. It has the downside that if userspace races destroy with other operations it will get an EBUSY instead of waiting, but this is kind of racing is already dangerous. Fixes: `2ff4bed7fe` ("iommufd: File descriptor, context, kconfig and makefiles") Link: https://lore.kernel.org/r/2-v1-85aacb2af554+bc-iommufd_syz3_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reported-by: syzbot+7574ebfe589049630608@syzkaller.appspotmail.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-07-27 11:27:19 -03:00
Jason Gunthorpe	6583c865de	iommufd/selftest: Add a selftest for IOMMU_HWPT_ALLOC Test the basic flow. Link: https://lore.kernel.org/r/19-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-07-26 10:20:41 -03:00
Jason Gunthorpe	7a467e02b3	iommufd/selftest: Return the real idev id from selftest mock_domain Now that we actually call iommufd_device_bind() we can return the idev_id from that function to userspace for use in other APIs. Link: https://lore.kernel.org/r/18-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-07-26 10:20:36 -03:00
Jason Gunthorpe	7074d7bd67	iommufd: Add IOMMU_HWPT_ALLOC This allows userspace to manually create HWPTs on IOAS's and then use those HWPTs as inputs to iommufd_device_attach/replace(). Following series will extend this to allow creating iommu_domains with driver specific parameters. Link: https://lore.kernel.org/r/17-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-07-26 10:20:31 -03:00
Nicolin Chen	fa1ffdb9e2	iommufd/selftest: Test iommufd_device_replace() Allow the selftest to call the function on the mock idev, add some tests to exercise it. Link: https://lore.kernel.org/r/16-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-07-26 10:20:26 -03:00
Jason Gunthorpe	83f7bc6fdf	iommufd: Make destroy_rwsem use a lock class per object type The selftest invokes things like replace under the object lock of its idev which protects the idev in a similar way to a real user. Unfortunately this triggers lockdep. A lock class per type will solve the problem. Link: https://lore.kernel.org/r/15-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-07-26 10:20:21 -03:00
Jason Gunthorpe	e88d4ec154	iommufd: Add iommufd_device_replace() Replace allows all the devices in a group to move in one step to a new HWPT. Further, the HWPT move is done without going through a blocking domain so that the IOMMU driver can implement some level of non-distruption to ongoing DMA if that has meaning for it (eg for future special driver domains) Replace uses a lot of the same logic as normal attach, except the actual domain change over has different restrictions, and we are careful to sequence things so that failure is going to leave everything the way it was, and not get trapped in a blocking domain or something if there is ENOMEM. Link: https://lore.kernel.org/r/14-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-07-26 10:20:16 -03:00
Jason Gunthorpe	ea2d6124b5	iommufd: Reorganize iommufd_device_attach into iommufd_device_change_pt The code flow for first time attaching a PT and replacing a PT is very similar except for the lowest do_attach step. Reorganize this so that the do_attach step is a function pointer. Replace requires destroying the old HWPT once it is replaced. This destruction cannot be done under all the locks that are held in the function pointer, so the signature allows returning a HWPT which will be destroyed by the caller after everything is unlocked. Link: https://lore.kernel.org/r/12-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-07-26 10:20:06 -03:00
Jason Gunthorpe	31422dff18	iommufd: Fix locking around hwpt allocation Due to the auto_domains mechanism the ioas->mutex must be held until the hwpt is completely setup by iommufd_object_abort_and_destroy() or iommufd_object_finalize(). This prevents a concurrent iommufd_device_auto_get_domain() from seeing an incompletely initialized object through the ioas->hwpt_list. To make this more consistent move the unlock until after finalize. Fixes: `e8d5721003` ("iommufd: Add kAPI toward external drivers for physical devices") Link: https://lore.kernel.org/r/11-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-07-26 10:20:02 -03:00
Jason Gunthorpe	70eadc7fc7	iommufd: Allow a hwpt to be aborted after allocation During creation the hwpt must have the ioas->mutex held until the object is finalized. This means we need to be able to call iommufd_object_abort_and_destroy() while holding the mutex. Since iommufd_hw_pagetable_destroy() also needs the mutex this is problematic. Fix it by creating a special abort op for the object that can assume the caller is holding the lock, as required by the contract. The next patch will add another iommufd_object_abort_and_destroy() for a hwpt. Fixes: `e8d5721003` ("iommufd: Add kAPI toward external drivers for physical devices") Link: https://lore.kernel.org/r/10-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-07-26 10:19:57 -03:00
Jason Gunthorpe	17bad52708	iommufd: Add enforced_cache_coherency to iommufd_hw_pagetable_alloc() Logically the HWPT should have the coherency set properly for the device that it is being created for when it is created. This was happening implicitly if the immediate_attach was set because iommufd_hw_pagetable_attach() does it as the first thing. Do it unconditionally so !immediate_attach works properly. Link: https://lore.kernel.org/r/9-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-07-26 10:19:52 -03:00
Jason Gunthorpe	d03f1336fd	iommufd: Move putting a hwpt to a helper function Next patch will need to call this from two places. Link: https://lore.kernel.org/r/8-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-07-26 10:19:47 -03:00
Jason Gunthorpe	1d149ab2e0	iommufd: Make sw_msi_start a group global The sw_msi_start is only set by the ARM drivers and it is always constant. Due to the way vfio/iommufd allow domains to be re-used between devices we have a built in assumption that there is only one value for sw_msi_start and it is global to the system. To make replace simpler where we may not reparse the iommu_get_resv_regions() move the sw_msi_start to the iommufd_group so it is always available once any HWPT has been attached. Link: https://lore.kernel.org/r/7-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-07-26 10:19:42 -03:00
Jason Gunthorpe	269c5238c5	iommufd: Use the iommufd_group to avoid duplicate MSI setup This only needs to be done once per group, not once per device. The once per device was a way to make the device list work. Since we are abandoning this we can optimize things a bit. Link: https://lore.kernel.org/r/6-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-07-26 10:19:37 -03:00
Jason Gunthorpe	34f327a985	iommufd: Keep track of each device's reserved regions instead of groups The driver facing API in the iommu core makes the reserved regions per-device. An algorithm in the core code consolidates the regions of all the devices in a group to return the group view. To allow for devices to be hotplugged into the group iommufd would re-load the entire group's reserved regions for each device, just in case they changed. Further iommufd already has to deal with duplicated/overlapping reserved regions as it must union all the groups together. Thus simplify all of this to just use the device reserved regions interface directly from the iommu driver. Link: https://lore.kernel.org/r/5-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com Suggested-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-07-26 10:19:32 -03:00
Jason Gunthorpe	91a2e17e24	iommufd: Replace the hwpt->devices list with iommufd_group The devices list was used as a simple way to avoid having per-group information. Now that this seems to be unavoidable, just commit to per-group information fully and remove the devices list from the HWPT. The iommufd_group stores the currently assigned HWPT for the entire group and we can manage the per-device attach/detach with a list in the iommufd_group. For destruction the flow is organized to make the following patches easier, the actual call to iommufd_object_destroy_user() is done at the top of the call chain without holding any locks. The HWPT to be destroyed is returned out from the locked region to make this possible. Later patches create locking that requires this. Link: https://lore.kernel.org/r/3-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-07-26 10:19:22 -03:00
Jason Gunthorpe	3a3329a7f1	iommufd: Add iommufd_group When the hwpt to device attachment is fairly static we could get away with the simple approach of keeping track of the groups via a device list. But with replace this is infeasible. Add an automatically managed struct that is 1:1 with the iommu_group per-ictx so we can store the necessary tracking information there. Link: https://lore.kernel.org/r/2-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-07-26 10:19:17 -03:00
Jason Gunthorpe	d525a5b8cf	iommufd: Move isolated msi enforcement to iommufd_device_bind() With the recent rework this no longer needs to be done at domain attachment time, we know if the device is usable by iommufd when we bind it. The value of msi_device_has_isolated_msi() is not allowed to change while a driver is bound. Link: https://lore.kernel.org/r/1-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-07-26 10:16:43 -03:00
Yi Liu	c1cce6d079	vfio: Compile vfio_group infrastructure optionally vfio_group is not needed for vfio device cdev, so with vfio device cdev introduced, the vfio_group infrastructures can be compiled out if only cdev is needed. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718135551.6592-26-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:20:50 -06:00
Yi Liu	1c9dc07487	iommufd: Add iommufd_ctx_from_fd() It's common to get a reference to the iommufd context from a given file descriptor. So adds an API for it. Existing users of this API are compiled only when IOMMUFD is enabled, so no need to have a stub for the IOMMUFD disabled case. Tested-by: Yanting Jiang <yanting.jiang@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718135551.6592-21-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:19:53 -06:00
Nicolin Chen	e23a6217f3	iommufd/device: Add iommufd_access_detach() API Previously, the detach routine is only done by the destroy(). And it was called by vfio_iommufd_emulated_unbind() when the device runs close(), so all the mappings in iopt were cleaned in that setup, when the call trace reaches this detach() routine. Now, there's a need of a detach uAPI, meaning that it does not only need a new iommufd_access_detach() API, but also requires access->ops->unmap() call as a cleanup. So add one. However, leaving that unprotected can introduce some potential of a race condition during the pin_/unpin_pages() call, where access->ioas->iopt is getting referenced. So, add an ioas_lock to protect the context of iopt referencings. Also, to allow the iommufd_access_unpin_pages() callback to happen via this unmap() call, add an ioas_unpin pointer, so the unpin routine won't be affected by the "access->ioas = NULL" trick. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718135551.6592-15-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:19:14 -06:00
Yi Liu	78d3df457a	iommufd: Add helper to retrieve iommufd_ctx and devid This is needed by the vfio-pci driver to report affected devices in the hot-reset for a given device. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718105542.4138-6-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:17:55 -06:00
Yi Liu	86b0a96c29	iommufd: Add iommufd_ctx_has_group() This adds the helper to check if any device within the given iommu_group has been bound with the iommufd_ctx. This is helpful for the checking on device ownership for the devices which have not been bound but cannot be bound to any other iommufd_ctx as the iommu_group has been bound. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718105542.4138-5-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:17:52 -06:00
Yi Liu	eda175dfe2	iommufd: Reserve all negative IDs in the iommufd xarray With this reservation, IOMMUFD users can encode the negative IDs for specific purposes. e.g. VFIO needs two reserved values to tell userspace the ID returned is not valid but has other meaning. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718105542.4138-4-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2023-07-25 10:17:48 -06:00
Linus Torvalds	31929ae008	iommufd for 6.5 Just two RC syzkaller fixes, both for the same basic issue, using the area pointer during an access forced unmap while the locks protecting it were let go. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZJmGygAKCRCFwuHvBreF YVNSAQC7SgejTvwD6EYXr8AUDko1v0G0M/o60OrWIuC7xWiFPQD/RDwtItRLzf4h i+YCfMtn/7IB/uV/sRTF4m0HzudcDAM= =0fm4 -----END PGP SIGNATURE----- Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd Pull iommufd updates from Jason Gunthorpe: "Just two syzkaller fixes, both for the same basic issue: using the area pointer during an access forced unmap while the locks protecting it were let go" * tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: iommufd: Call iopt_area_contig_done() under the lock iommufd: Do not access the area pointer after unlocking	2023-06-29 20:57:27 -07:00
Jason Gunthorpe	dbe245cdf5	iommufd: Call iopt_area_contig_done() under the lock The iter internally holds a pointer to the area and iopt_area_contig_done() will dereference it. The pointer is not valid outside the iova_rwsem. syzkaller reports: BUG: KASAN: slab-use-after-free in iommufd_access_unpin_pages+0x363/0x370 Read of size 8 at addr ffff888022286e20 by task syz-executor669/5771 CPU: 0 PID: 5771 Comm: syz-executor669 Not tainted 6.4.0-rc5-syzkaller-00313-g4c605260bc60 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/25/2023 Call Trace: <TASK> dump_stack_lvl+0xd9/0x150 print_address_description.constprop.0+0x2c/0x3c0 kasan_report+0x11c/0x130 iommufd_access_unpin_pages+0x363/0x370 iommufd_test_access_unmap+0x24b/0x390 iommufd_access_notify_unmap+0x24c/0x3a0 iopt_unmap_iova_range+0x4c4/0x5f0 iopt_unmap_all+0x27/0x50 iommufd_ioas_unmap+0x3d0/0x490 iommufd_fops_ioctl+0x317/0x4b0 __x64_sys_ioctl+0x197/0x210 do_syscall_64+0x39/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7fec1dae3b19 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fec1da74308 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007fec1db6b438 RCX: 00007fec1dae3b19 RDX: 0000000020000100 RSI: 0000000000003b86 RDI: 0000000000000003 RBP: 00007fec1db6b430 R08: 00007fec1da74700 R09: 0000000000000000 R10: 00007fec1da74700 R11: 0000000000000246 R12: 00007fec1db6b43c R13: 00007fec1db39074 R14: 6d6f692f7665642f R15: 0000000000022000 </TASK> Allocated by task 5770: kasan_save_stack+0x22/0x40 kasan_set_track+0x25/0x30 __kasan_kmalloc+0xa2/0xb0 iopt_alloc_area_pages+0x94/0x560 iopt_map_user_pages+0x205/0x4e0 iommufd_ioas_map+0x329/0x5f0 iommufd_fops_ioctl+0x317/0x4b0 __x64_sys_ioctl+0x197/0x210 do_syscall_64+0x39/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd Freed by task 5770: kasan_save_stack+0x22/0x40 kasan_set_track+0x25/0x30 kasan_save_free_info+0x2e/0x40 ____kasan_slab_free+0x160/0x1c0 slab_free_freelist_hook+0x8b/0x1c0 __kmem_cache_free+0xaf/0x2d0 iopt_unmap_iova_range+0x288/0x5f0 iopt_unmap_all+0x27/0x50 iommufd_ioas_unmap+0x3d0/0x490 iommufd_fops_ioctl+0x317/0x4b0 __x64_sys_ioctl+0x197/0x210 do_syscall_64+0x39/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd The parallel unmap free'd iter->area the instant the lock was released. Fixes: `51fe6141f0` ("iommufd: Data structure to provide IOVA to PFN mapping") Link: https://lore.kernel.org/r/2-v2-9a03761d445d+54-iommufd_syz2_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reported-by: syzbot+6c8d756f238a75fc3eb8@syzkaller.appspotmail.com Closes: https://lore.kernel.org/r/000000000000905eba05fe38e9f2@google.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-06-26 09:00:23 -03:00
Jason Gunthorpe	804ca14d04	iommufd: Do not access the area pointer after unlocking A concurrent unmap can trigger freeing of the area pointers while we are generating an unmapping notification for accesses. syzkaller reports: BUG: KASAN: slab-use-after-free in iopt_unmap_iova_range+0x5ba/0x5f0 Read of size 4 at addr ffff888075996184 by task syz-executor.2/31160 CPU: 1 PID: 31160 Comm: syz-executor.2 Not tainted 6.4.0-rc5-syzkaller-00313-g4c605260bc60 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/25/2023 Call Trace: <TASK> dump_stack_lvl+0xd9/0x150 print_address_description.constprop.0+0x2c/0x3c0 kasan_report+0x11c/0x130 iopt_unmap_iova_range+0x5ba/0x5f0 iopt_unmap_all+0x27/0x50 iommufd_ioas_unmap+0x3d0/0x490 iommufd_fops_ioctl+0x317/0x4b0 __x64_sys_ioctl+0x197/0x210 do_syscall_64+0x39/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f0812c8c169 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 19 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f0813914168 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007f0812dabf80 RCX: 00007f0812c8c169 RDX: 0000000020000100 RSI: 0000000000003b86 RDI: 0000000000000005 RBP: 00007f0812ce7ca1 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007f0812ecfb1f R14: 00007f0813914300 R15: 0000000000022000 </TASK> Allocated by task 31160: kasan_save_stack+0x22/0x40 kasan_set_track+0x25/0x30 __kasan_kmalloc+0xa2/0xb0 iopt_alloc_area_pages+0x94/0x560 iopt_map_user_pages+0x205/0x4e0 iommufd_ioas_map+0x329/0x5f0 iommufd_fops_ioctl+0x317/0x4b0 __x64_sys_ioctl+0x197/0x210 do_syscall_64+0x39/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd Freed by task 31161: kasan_save_stack+0x22/0x40 kasan_set_track+0x25/0x30 kasan_save_free_info+0x2e/0x40 ____kasan_slab_free+0x160/0x1c0 slab_free_freelist_hook+0x8b/0x1c0 __kmem_cache_free+0xaf/0x2d0 iopt_unmap_iova_range+0x288/0x5f0 iopt_unmap_all+0x27/0x50 iommufd_ioas_unmap+0x3d0/0x490 iommufd_fops_ioctl+0x317/0x4b0 __x64_sys_ioctl+0x197/0x210 do_syscall_64+0x39/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd The buggy address belongs to the object at ffff888075996100 which belongs to the cache kmalloc-cg-192 of size 192 The buggy address is located 132 bytes inside of freed 192-byte region [ffff888075996100, ffff8880759961c0) The buggy address belongs to the physical page: page:ffffea0001d66580 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x75996 memcg:ffff88801f1c2701 flags: 0xfff00000000200(slab\|node=0\|zone=1\|lastcpupid=0x7ff) page_type: 0xffffffff() raw: 00fff00000000200 ffff88801244ddc0 dead000000000122 0000000000000000 raw: 0000000000000000 0000000080100010 00000001ffffffff ffff88801f1c2701 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 0, migratetype Unmovable, gfp_mask 0x112cc0(GFP_USER\|__GFP_NOWARN\|__GFP_NORETRY), pid 31157, tgid 31154 (syz-executor.0), ts 1984547323469, free_ts 1983933451331 post_alloc_hook+0x2db/0x350 get_page_from_freelist+0xf41/0x2c00 __alloc_pages+0x1cb/0x4a0 alloc_pages+0x1aa/0x270 allocate_slab+0x25f/0x390 ___slab_alloc+0xa91/0x1400 __slab_alloc.constprop.0+0x56/0xa0 __kmem_cache_alloc_node+0x136/0x320 kmalloc_trace+0x26/0xe0 iommufd_test+0x1328/0x2c20 iommufd_fops_ioctl+0x317/0x4b0 __x64_sys_ioctl+0x197/0x210 do_syscall_64+0x39/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd page last free stack trace: free_unref_page_prepare+0x62e/0xcb0 free_unref_page_list+0xe3/0xa70 release_pages+0xcd8/0x1380 tlb_batch_pages_flush+0xa8/0x1a0 tlb_finish_mmu+0x14b/0x7e0 exit_mmap+0x2b2/0x930 __mmput+0x128/0x4c0 mmput+0x60/0x70 do_exit+0x9b0/0x29b0 do_group_exit+0xd4/0x2a0 get_signal+0x2318/0x25b0 arch_do_signal_or_restart+0x79/0x5c0 exit_to_user_mode_prepare+0x11f/0x240 syscall_exit_to_user_mode+0x1d/0x50 do_syscall_64+0x46/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd Precompute what is needed to call the access function and do not check the area's num_accesses again as the pointer may not be valid anymore. Use a counter instead. Fixes: `51fe6141f0` ("iommufd: Data structure to provide IOVA to PFN mapping") Link: https://lore.kernel.org/r/1-v2-9a03761d445d+54-iommufd_syz2_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reported-by: syzbot+1ad12d16afca0e7d2dde@syzkaller.appspotmail.com Closes: https://lore.kernel.org/r/0000000000001d40fc05fe385332@google.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-06-26 09:00:23 -03:00
Lorenzo Stoakes	0b295316b3	mm/gup: remove unused vmas parameter from pin_user_pages_remote() No invocation of pin_user_pages_remote() uses the vmas parameter, so remove it. This forms part of a larger patch set eliminating the use of the vmas parameters altogether. Link: https://lkml.kernel.org/r/28f000beb81e45bf538a2aaa77c90f5482b67a32.1684350871.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian König <christian.koenig@amd.com> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-06-09 16:25:25 -07:00
Jason Gunthorpe	692d42d411	Merge branch 'iommufd/for-rc' into for-next The following selftest patch requires both the bug fixes and the improvements of the selftest framework. * iommufd/for-rc: iommufd: Do not corrupt the pfn list when doing batch carry iommufd: Fix unpinning of pages when an access is present iommufd: Check for uptr overflow Linux 6.3-rc5 Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-04-04 11:04:30 -03:00
Tom Rix	c52159b5be	iommufd/selftest: Set varaiable mock_iommu_device storage-class-specifier to static smatch reports: drivers/iommu/iommufd/selftest.c:295:21: warning: symbol 'mock_iommu_device' was not declared. Should it be static? This variable is only used in one file so it should be static. Fixes: `65c619ae06` ("iommufd/selftest: Make selftest create a more complete mock device") Link: https://lore.kernel.org/r/20230404002317.1912530-1-trix@redhat.com Signed-off-by: Tom Rix <trix@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-04-04 11:02:39 -03:00
Jason Gunthorpe	13a0d1ae7e	iommufd: Do not corrupt the pfn list when doing batch carry If batch->end is 0 then setting npfns[0] before computing the new value of pfns will fail to adjust the pfn and result in various page accounting corruptions. It should be ordered after. This seems to result in various kinds of page meta-data corruption related failures: WARNING: CPU: 1 PID: 527 at mm/gup.c:75 try_grab_folio+0x503/0x740 Modules linked in: CPU: 1 PID: 527 Comm: repro Not tainted 6.3.0-rc2-eeac8ede1755+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:try_grab_folio+0x503/0x740 Code: e3 01 48 89 de e8 6d c1 dd ff 48 85 db 0f 84 7c fe ff ff e8 4f bf dd ff 49 8d 47 ff 48 89 45 d0 e9 73 fe ff ff e8 3d bf dd ff <0f> 0b 31 db e9 d0 fc ff ff e8 2f bf dd ff 48 8b 5d c8 31 ff 48 89 RSP: 0018:ffffc90000f37908 EFLAGS: 00010046 RAX: 0000000000000000 RBX: 00000000fffffc02 RCX: ffffffff81504c26 RDX: 0000000000000000 RSI: ffff88800d030000 RDI: 0000000000000002 RBP: ffffc90000f37948 R08: 000000000003ca24 R09: 0000000000000008 R10: 000000000003ca00 R11: 0000000000000023 R12: ffffea000035d540 R13: 0000000000000001 R14: 0000000000000000 R15: ffffea000035d540 FS: 00007fecbf659740(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000200011c3 CR3: 000000000ef66006 CR4: 0000000000770ee0 PKRU: 55555554 Call Trace: <TASK> internal_get_user_pages_fast+0xd32/0x2200 pin_user_pages_fast+0x65/0x90 pfn_reader_user_pin+0x376/0x390 pfn_reader_next+0x14a/0x7b0 pfn_reader_first+0x140/0x1b0 iopt_area_fill_domain+0x74/0x210 iopt_table_add_domain+0x30e/0x6e0 iommufd_device_selftest_attach+0x7f/0x140 iommufd_test+0x10ff/0x16f0 iommufd_fops_ioctl+0x206/0x330 __x64_sys_ioctl+0x10e/0x160 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc Cc: <stable@vger.kernel.org> Fixes: `f394576eb1` ("iommufd: PFN handling for iopt_pages") Link: https://lore.kernel.org/r/3-v1-ceab6a4d7d7a+94-iommufd_syz_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reported-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-04-04 09:10:55 -03:00
Jason Gunthorpe	727c28c1ce	iommufd: Fix unpinning of pages when an access is present syzkaller found that the calculation of batch_last_index should use 'start_index' since at input to this function the batch is either empty or it has already been adjusted to cross any accesses so it will start at the point we are unmapping from. Getting this wrong causes the unmap to run over the end of the pages which corrupts pages that were never mapped. In most cases this triggers the num pinned debugging: WARNING: CPU: 0 PID: 557 at drivers/iommu/iommufd/pages.c:294 __iopt_area_unfill_domain+0x152/0x560 Modules linked in: CPU: 0 PID: 557 Comm: repro Not tainted 6.3.0-rc2-eeac8ede1755 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:__iopt_area_unfill_domain+0x152/0x560 Code: d2 0f ff 44 8b 64 24 54 48 8b 44 24 48 31 ff 44 89 e6 48 89 44 24 38 e8 fc d3 0f ff 45 85 e4 0f 85 eb 01 00 00 e8 0e d2 0f ff <0f> 0b e8 07 d2 0f ff 48 8b 44 24 38 89 5c 24 58 89 18 8b 44 24 54 RSP: 0018:ffffc9000108baf0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 00000000ffffffff RCX: ffffffff821e3f85 RDX: 0000000000000000 RSI: ffff88800faf0000 RDI: 0000000000000002 RBP: ffffc9000108bd18 R08: 000000000003ca25 R09: 0000000000000014 R10: 000000000003ca00 R11: 0000000000000024 R12: 0000000000000004 R13: 0000000000000801 R14: 00000000000007ff R15: 0000000000000800 FS: 00007f3499ce1740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000243 CR3: 00000000179c2001 CR4: 0000000000770ef0 PKRU: 55555554 Call Trace: <TASK> iopt_area_unfill_domain+0x32/0x40 iopt_table_remove_domain+0x23f/0x4c0 iommufd_device_selftest_detach+0x3a/0x90 iommufd_selftest_destroy+0x55/0x70 iommufd_object_destroy_user+0xce/0x130 iommufd_destroy+0xa2/0xc0 iommufd_fops_ioctl+0x206/0x330 __x64_sys_ioctl+0x10e/0x160 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc Also add some useful WARN_ON sanity checks. Cc: <stable@vger.kernel.org> Fixes: `8d160cd4d5` ("iommufd: Algorithms for PFN storage") Link: https://lore.kernel.org/r/2-v1-ceab6a4d7d7a+94-iommufd_syz_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reported-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-04-04 09:10:55 -03:00
Jason Gunthorpe	e439570133	iommufd: Check for uptr overflow syzkaller found that setting up a map with a user VA that wraps past zero can trigger WARN_ONs, particularly from pin_user_pages weirdly returning 0 due to invalid arguments. Prevent creating a pages with a uptr and size that would math overflow. WARNING: CPU: 0 PID: 518 at drivers/iommu/iommufd/pages.c:793 pfn_reader_user_pin+0x2e6/0x390 Modules linked in: CPU: 0 PID: 518 Comm: repro Not tainted 6.3.0-rc2-eeac8ede1755+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:pfn_reader_user_pin+0x2e6/0x390 Code: b1 11 e9 25 fe ff ff e8 28 e4 0f ff 31 ff 48 89 de e8 2e e6 0f ff 48 85 db 74 0a e8 14 e4 0f ff e9 4d ff ff ff e8 0a e4 0f ff <0f> 0b bb f2 ff ff ff e9 3c ff ff ff e8 f9 e3 0f ff ba 01 00 00 00 RSP: 0018:ffffc90000f9fa30 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff821e2b72 RDX: 0000000000000000 RSI: ffff888014184680 RDI: 0000000000000002 RBP: ffffc90000f9fa78 R08: 00000000000000ff R09: 0000000079de6f4e R10: ffffc90000f9f790 R11: ffff888014185418 R12: ffffc90000f9fc60 R13: 0000000000000002 R14: ffff888007879800 R15: 0000000000000000 FS: 00007f4227555740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000043 CR3: 000000000e748005 CR4: 0000000000770ef0 PKRU: 55555554 Call Trace: <TASK> pfn_reader_next+0x14a/0x7b0 ? interval_tree_double_span_iter_update+0x11a/0x140 pfn_reader_first+0x140/0x1b0 iopt_pages_rw_slow+0x71/0x280 ? __this_cpu_preempt_check+0x20/0x30 iopt_pages_rw_access+0x2b2/0x5b0 iommufd_access_rw+0x19f/0x2f0 iommufd_test+0xd11/0x16f0 ? write_comp_data+0x2f/0x90 iommufd_fops_ioctl+0x206/0x330 __x64_sys_ioctl+0x10e/0x160 ? __pfx_iommufd_fops_ioctl+0x10/0x10 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc Cc: <stable@vger.kernel.org> Fixes: `8d160cd4d5` ("iommufd: Algorithms for PFN storage") Link: https://lore.kernel.org/r/1-v1-ceab6a4d7d7a+94-iommufd_syz_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reported-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-04-04 09:10:55 -03:00
Jason Gunthorpe	9fdf791612	Merge branch 'vfio_mdev_ops' into iommufd.git for-next Yi Liu says =================== The .bind_iommufd op of vfio emulated devices are either empty or does nothing. This is different with the vfio physical devices, to add vfio device cdev, need to make them act the same. This series first makes the .bind_iommufd op of vfio emulated devices to create iommufd_access, this introduces a new iommufd API. Then let the driver that does not provide .bind_iommufd op to use the vfio emulated iommufd op set. This makes all vfio device drivers have consistent iommufd operations, which is good for adding new device uAPIs in the device cdev =================== * branch 'vfio_mdev_ops': vfio: Check the presence for iommufd callbacks in __vfio_register_dev() vfio/mdev: Uses the vfio emulated iommufd ops set in the mdev sample drivers vfio-iommufd: Make vfio_iommufd_emulated_bind() return iommufd_access ID vfio-iommufd: No need to record iommufd_ctx in vfio_device iommufd: Create access in vfio_iommufd_emulated_bind() iommu/iommufd: Pass iommufd_ctx pointer in iommufd_get_ioas() Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-03-31 13:43:57 -03:00
Yi Liu	632fda7f91	vfio-iommufd: Make vfio_iommufd_emulated_bind() return iommufd_access ID vfio device cdev needs to return iommufd_access ID to userspace if bind_iommufd succeeds. Link: https://lore.kernel.org/r/20230327093351.44505-5-yi.l.liu@intel.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-03-31 13:43:32 -03:00
Nicolin Chen	54b47585db	iommufd: Create access in vfio_iommufd_emulated_bind() There are needs to created iommufd_access prior to have an IOAS and set IOAS later. Like the vfio device cdev needs to have an iommufd object to represent the bond (iommufd_access) and IOAS replacement. Moves the iommufd_access_create() call into vfio_iommufd_emulated_bind(), making it symmetric with the __vfio_iommufd_access_destroy() call in the vfio_iommufd_emulated_unbind(). This means an access is created/destroyed by the bind()/unbind(), and the vfio_iommufd_emulated_attach_ioas() only updates the access->ioas pointer. Since vfio_iommufd_emulated_bind() does not provide ioas_id, drop it from the argument list of iommufd_access_create(). Instead, add a new access API iommufd_access_attach() to set the access->ioas pointer. Also, set vdev->iommufd_attached accordingly, similar to the physical pathway. Link: https://lore.kernel.org/r/20230327093351.44505-3-yi.l.liu@intel.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-03-31 13:43:31 -03:00
Yi Liu	325de95029	iommu/iommufd: Pass iommufd_ctx pointer in iommufd_get_ioas() No need to pass the iommufd_ucmd pointer. Link: https://lore.kernel.org/r/20230327093351.44505-2-yi.l.liu@intel.com Signed-off-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-03-29 16:52:41 -03:00
Jason Gunthorpe	fd8c1a4aee	iommufd/selftest: Catch overflow of uptr and length syzkaller hits a WARN_ON when trying to have a uptr close to UINTPTR_MAX: WARNING: CPU: 1 PID: 393 at drivers/iommu/iommufd/selftest.c:403 iommufd_test+0xb19/0x16f0 Modules linked in: CPU: 1 PID: 393 Comm: repro Not tainted 6.2.0-c9c3395d5e3d #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:iommufd_test+0xb19/0x16f0 Code: 94 c4 31 ff 44 89 e6 e8 a5 54 17 ff 45 84 e4 0f 85 bb 0b 00 00 41 be fb ff ff ff e8 31 53 17 ff e9 a0 f7 ff ff e8 27 53 17 ff <0f> 0b 41 be 8 RSP: 0018:ffffc90000eabdc0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff8214c487 RDX: 0000000000000000 RSI: ffff88800f5c8000 RDI: 0000000000000002 RBP: ffffc90000eabe48 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000001 R11: 0000000000000000 R12: 00000000cd2b0000 R13: 00000000cd2af000 R14: 0000000000000000 R15: ffffc90000eabe68 FS: 00007f94d76d5740(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000043 CR3: 0000000006880006 CR4: 0000000000770ee0 PKRU: 55555554 Call Trace: <TASK> ? write_comp_data+0x2f/0x90 iommufd_fops_ioctl+0x1ef/0x310 __x64_sys_ioctl+0x10e/0x160 ? __pfx_iommufd_fops_ioctl+0x10/0x10 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc Check that the user memory range doesn't overflow. Fixes: `f4b20bb34c` ("iommufd: Add kernel support for testing iommufd") Link: https://lore.kernel.org/r/0-v1-95390ed1df8d+8f-iommufd_mock_overflow_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reported-by: Pengfei Xu <pengfei.xu@intel.com> Link: https://lore.kernel.org/r/Y/hOiilV1wJvu/Hv@xpf.sh.intel.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-03-10 15:29:59 -04:00
Jason Gunthorpe	65c619ae06	iommufd/selftest: Make selftest create a more complete mock device iommufd wants to use more infrastructure, like the iommu_group, that the mock device does not support. Create a more complete mock device that can go through the whole cycle of ownership, blocking domain, and has an iommu_group. This requires creating a real struct device on a real bus to be able to connect it to a iommu_group. Unfortunately we cannot formally attach the mock iommu driver as an actual driver as the iommu core does not allow more than one driver or provide a general way for busses to link to iommus. This can be solved with a little hack to open code the dev_iommus struct. With this infrastructure things work exactly the same as the normal domain path, including the auto domains mechanism and direct attach of hwpts. As the created hwpt is now an autodomain it is no longer required to destroy it and trying to do so will trigger a failure. Link: https://lore.kernel.org/r/11-v3-ae9c2975a131+2e1e8-iommufd_hwpt_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-03-06 13:06:11 -04:00
Jason Gunthorpe	2cfdeaa07b	iommufd/selftest: Rename the sefltest 'device_id' to 'stdev_id' It is too confusing now that we have the 'dev_id' as part of the main interface. Make it clear this is the special selftest device object. This object is analogous to the VFIO device FD. Link: https://lore.kernel.org/r/7-v3-ae9c2975a131+2e1e8-iommufd_hwpt_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-03-06 10:51:58 -04:00
Jason Gunthorpe	339fbf3ae1	iommufd: Make iommufd_hw_pagetable_alloc() do iopt_table_add_domain() The HWPT is always linked to an IOAS and once a HWPT exists its domain should be fully mapped. This ended up being split up into device.c during a two phase creation that was a bit confusing. Move the iopt_table_add_domain() into iommufd_hw_pagetable_alloc() by having it call back to device.c to complete the domain attach in the required order. Calling iommufd_hw_pagetable_alloc() with immediate_attach = false will work on most drivers, but notably the SMMU drivers will fail because they can't decide what kind of domain to create until they are attached. This will be fixed when the domain_alloc function can take in a struct device. Link: https://lore.kernel.org/r/6-v3-ae9c2975a131+2e1e8-iommufd_hwpt_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-03-06 10:51:57 -04:00
Jason Gunthorpe	7e7ec8a569	iommufd: Move iommufd_device to iommufd_private.h hw_pagetable.c will need this in the next patches. Link: https://lore.kernel.org/r/5-v3-ae9c2975a131+2e1e8-iommufd_hwpt_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-03-06 10:51:57 -04:00
Jason Gunthorpe	25cde97d95	iommufd: Move ioas related HWPT destruction into iommufd_hw_pagetable_destroy() A HWPT is permanently associated with an IOAS when it is created, remove the strange situation where a refcount != 0 HWPT can have been disconnected from the IOAS by putting all the IOAS related destruction in the object destroy function. Initializing a HWPT is two stages, we have to allocate it, attach it to a device and then populate the domain. Once the domain is populated it is fully linked to the IOAS. Arrange things so that all the error unwinds flow through the iommufd_hw_pagetable_destroy() and allow it to handle all cases. Link: https://lore.kernel.org/r/4-v3-ae9c2975a131+2e1e8-iommufd_hwpt_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-03-06 10:51:57 -04:00
Jason Gunthorpe	342b9cab8e	iommufd: Consistently manage hwpt_item This should be added immediately after every iopt_table_add_domain(), and deleted after every iopt_table_remove_domain() under the ioas->mutex. Tidy things to be consistent. Link: https://lore.kernel.org/r/3-v3-ae9c2975a131+2e1e8-iommufd_hwpt_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-03-06 10:51:57 -04:00
Jason Gunthorpe	7214c1c85f	iommufd: Add iommufd_lock_obj() around the auto-domains hwpts A later patch will require this locking - currently under the ioas mutex the hwpt can not have a 0 reference and be on the list. Link: https://lore.kernel.org/r/2-v3-ae9c2975a131+2e1e8-iommufd_hwpt_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-03-06 10:51:56 -04:00
Jason Gunthorpe	085fcc7eb7	iommufd: Assert devices_lock for iommufd_hw_pagetable_has_group() The hwpt->devices list is locked by this, make it clearer. Link: https://lore.kernel.org/r/1-v3-ae9c2975a131+2e1e8-iommufd_hwpt_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-03-06 10:51:56 -04:00
Jason Gunthorpe	b4ff830eca	iommufd: Do not add the same hwpt to the ioas->hwpt_list twice The hwpt is added to the hwpt_list only during its creation, it is never added again. This hunk is some missed leftover from rework. Adding it twice will corrupt the linked list in some cases. It effects HWPT specific attachment, which is something the test suite cannot cover until we can create a legitimate struct device with a non-system iommu "driver" (ie we need the bus removed from the iommu code) Cc: stable@vger.kernel.org Fixes: `e8d5721003` ("iommufd: Add kAPI toward external drivers for physical devices") Link: https://lore.kernel.org/r/1-v1-4336b5cb2fe4+1d7-iommufd_hwpt_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reported-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-02-15 21:37:48 -04:00
Jason Gunthorpe	b3551ead61	iommufd: Make sure to zero vfio_iommu_type1_info before copying to user Missed a zero initialization here. Most of the struct is filled with a copy_from_user(), however minsz for that copy is smaller than the actual struct by 8 bytes, thus we don't fill the padding. Cc: stable@vger.kernel.org # 6.1+ Fixes: `d624d6652a` ("iommufd: vfio container FD ioctl compatibility") Link: https://lore.kernel.org/r/0-v1-a74499ece799+1a-iommufd_get_info_leak_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reported-by: syzbot+cb1e0978f6bf46b83a58@syzkaller.appspotmail.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-02-14 16:49:55 -04:00
Jason Gunthorpe	bed9e516f1	Merge branch 'vfio-no-iommu' into iommufd.git for-next Shared branch with VFIO for the no-iommu support. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-02-03 15:55:49 -04:00
Jason Gunthorpe	c9a397cee9	vfio: Support VFIO_NOIOMMU with iommufd Add a small amount of emulation to vfio_compat to accept the SET_IOMMU to VFIO_NOIOMMU_IOMMU and have vfio just ignore iommufd if it is working on a no-iommu enabled device. Move the enable_unsafe_noiommu_mode module out of container.c into vfio_main.c so that it is always available even if VFIO_CONTAINER=n. This passes Alex's mini-test: https://github.com/awilliam/tests/blob/master/vfio-noiommu-pci-device-open.c Link: https://lore.kernel.org/r/0-v3-480cd64a16f7+1ad0-iommufd_noiommu_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-02-03 15:45:23 -04:00
Jason Gunthorpe	fd9f2a9122	Merge branch 'iommu-memory-accounting' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/joro/iommu intoiommufd/for-next Jason Gunthorpe says: ==================== iommufd follows the same design as KVM and uses memory cgroups to limit the amount of kernel memory a iommufd file descriptor can pin down. The various internal data structures already use GFP_KERNEL_ACCOUNT to charge its own memory. However, one of the biggest consumers of kernel memory is the IOPTEs stored under the iommu_domain and these allocations are not tracked. This series is the first step in fixing it. The iommu driver contract already includes a 'gfp' argument to the map_pages op, allowing iommufd to specify GFP_KERNEL_ACCOUNT and then having the driver allocate the IOPTE tables with that flag will capture a significant amount of the allocations. Update the iommu_map() API to pass in the GFP argument, and fix all call sites. Replace iommu_map_atomic(). Audit the "enterprise" iommu drivers to make sure they do the right thing. Intel and S390 ignore the GFP argument and always use GFP_ATOMIC. This is problematic for iommufd anyhow, so fix it. AMD and ARM SMMUv2/3 are already correct. A follow up series will be needed to capture the allocations made when the iommu_domain itself is allocated, which will complete the job. ==================== * 'iommu-memory-accounting' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/joro/iommu: iommu/s390: Use GFP_KERNEL in sleepable contexts iommu/s390: Push the gfp parameter to the kmem_cache_alloc()'s iommu/intel: Use GFP_KERNEL in sleepable contexts iommu/intel: Support the gfp argument to the map_pages op iommu/intel: Add a gfp parameter to alloc_pgtable_page() iommufd: Use GFP_KERNEL_ACCOUNT for iommu_map() iommu/dma: Use the gfp parameter in __iommu_dma_alloc_noncontiguous() iommu: Add a gfp parameter to iommu_map_sg() iommu: Remove iommu_map_atomic() iommu: Add a gfp parameter to iommu_map() Link: https://lore.kernel.org/linux-iommu/0-v3-76b587fe28df+6e3-iommu_map_gfp_jgg@nvidia.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-01-30 13:54:35 -04:00
Jason Gunthorpe	e787a38e31	iommufd: Use GFP_KERNEL_ACCOUNT for iommu_map() iommufd follows the same design as KVM and uses memory cgroups to limit the amount of kernel memory a iommufd file descriptor can pin down. The various internal data structures already use GFP_KERNEL_ACCOUNT. However, one of the biggest consumers of kernel memory is the IOPTEs stored under the iommu_domain. Many drivers will allocate these at iommu_map() time and will trivially do the right thing if we pass in GFP_KERNEL_ACCOUNT. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/5-v3-76b587fe28df+6e3-iommu_map_gfp_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>	2023-01-25 11:52:04 +01:00
Jason Gunthorpe	1369459b2e	iommu: Add a gfp parameter to iommu_map() The internal mechanisms support this, but instead of exposting the gfp to the caller it wrappers it into iommu_map() and iommu_map_atomic() Fix this instead of adding more variants for GFP_KERNEL_ACCOUNT. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org> Link: https://lore.kernel.org/r/1-v3-76b587fe28df+6e3-iommu_map_gfp_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>	2023-01-25 11:52:00 +01:00
Yi Liu	84798f2849	iommufd: Add three missing structures in ucmd_buffer struct iommu_ioas_copy, struct iommu_option and struct iommu_vfio_ioas are missed in ucmd_buffer. Although they are smaller than the size of ucmd_buffer, it is safer to list them in ucmd_buffer explicitly. Fixes: `aad37e71d5` ("iommufd: IOCTLs for the io_pagetable") Fixes: `d624d6652a` ("iommufd: vfio container FD ioctl compatibility") Link: https://lore.kernel.org/r/20230120122040.280219-1-yi.l.liu@intel.com Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-01-23 14:29:04 -04:00
Jason Gunthorpe	25fc417f79	iommufd: Convert to msi_device_has_isolated_msi() Trivially use the new API. Link: https://lore.kernel.org/r/4-v3-3313bb5dd3a3+10f11-secure_msi_jgg@nvidia.com Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-01-11 16:27:23 -04:00
Linus Torvalds	08cdc21579	iommufd for 6.2 iommufd is the user API to control the IOMMU subsystem as it relates to managing IO page tables that point at user space memory. It takes over from drivers/vfio/vfio_iommu_type1.c (aka the VFIO container) which is the VFIO specific interface for a similar idea. We see a broad need for extended features, some being highly IOMMU device specific: - Binding iommu_domain's to PASID/SSID - Userspace IO page tables, for ARM, x86 and S390 - Kernel bypassed invalidation of user page tables - Re-use of the KVM page table in the IOMMU - Dirty page tracking in the IOMMU - Runtime Increase/Decrease of IOPTE size - PRI support with faults resolved in userspace Many of these HW features exist to support VM use cases - for instance the combination of PASID, PRI and Userspace IO Page Tables allows an implementation of DMA Shared Virtual Addressing (vSVA) within a guest. Dirty tracking enables VM live migration with SRIOV devices and PASID support allow creating "scalable IOV" devices, among other things. As these features are fundamental to a VM platform they need to be uniformly exposed to all the driver families that do DMA into VMs, which is currently VFIO and VDPA. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCY5ct7wAKCRCFwuHvBreF YZZ5AQDciXfcgXLt0UBEmWupNb0f/asT6tk717pdsKm8kAZMNAEAsIyLiKT5HqGl s7fAu+CQ1pr9+9NKGevD+frw8Solsw4= =jJkd -----END PGP SIGNATURE----- Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd Pull iommufd implementation from Jason Gunthorpe: "iommufd is the user API to control the IOMMU subsystem as it relates to managing IO page tables that point at user space memory. It takes over from drivers/vfio/vfio_iommu_type1.c (aka the VFIO container) which is the VFIO specific interface for a similar idea. We see a broad need for extended features, some being highly IOMMU device specific: - Binding iommu_domain's to PASID/SSID - Userspace IO page tables, for ARM, x86 and S390 - Kernel bypassed invalidation of user page tables - Re-use of the KVM page table in the IOMMU - Dirty page tracking in the IOMMU - Runtime Increase/Decrease of IOPTE size - PRI support with faults resolved in userspace Many of these HW features exist to support VM use cases - for instance the combination of PASID, PRI and Userspace IO Page Tables allows an implementation of DMA Shared Virtual Addressing (vSVA) within a guest. Dirty tracking enables VM live migration with SRIOV devices and PASID support allow creating "scalable IOV" devices, among other things. As these features are fundamental to a VM platform they need to be uniformly exposed to all the driver families that do DMA into VMs, which is currently VFIO and VDPA" For more background, see the extended explanations in Jason's pull request: https://lore.kernel.org/lkml/Y5dzTU8dlmXTbzoJ@nvidia.com/ * tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (62 commits) iommufd: Change the order of MSI setup iommufd: Improve a few unclear bits of code iommufd: Fix comment typos vfio: Move vfio group specific code into group.c vfio: Refactor dma APIs for emulated devices vfio: Wrap vfio group module init/clean code into helpers vfio: Refactor vfio_device open and close vfio: Make vfio_device_open() truly device specific vfio: Swap order of vfio_device_container_register() and open_device() vfio: Set device->group in helper function vfio: Create wrappers for group register/unregister vfio: Move the sanity check of the group to vfio_create_group() vfio: Simplify vfio_create_group() iommufd: Allow iommufd to supply /dev/vfio/vfio vfio: Make vfio_container optionally compiled vfio: Move container related MODULE_ALIAS statements into container.c vfio-iommufd: Support iommufd for emulated VFIO devices vfio-iommufd: Support iommufd for physical VFIO devices vfio-iommufd: Allow iommufd to be used in place of a container fd vfio: Use IOMMU_CAP_ENFORCE_CACHE_COHERENCY for vfio_file_enforced_coherent() ...	2022-12-14 09:15:43 -08:00
Jason Gunthorpe	d6c55c0a20	iommufd: Change the order of MSI setup Eric points out this is wrong for the rare case of someone using allow_unsafe_interrupts on ARM. We always have to setup the MSI window in the domain if the iommu driver asks for it. Move the iommu_get_msi_cookie() setup to the top of the function and always do it, regardless of the security mode. Add checks to iommufd_device_setup_msi() to ensure the driver is not doing something incomprehensible. No current driver will set both a HW and SW MSI window, or have more than one SW MSI window. Fixes: `e8d5721003` ("iommufd: Add kAPI toward external drivers for physical devices") Link: https://lore.kernel.org/r/3-v1-0362a1a1c034+98-iommufd_fixes1_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reported-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2022-12-09 15:24:30 -04:00
Jason Gunthorpe	a26fa39206	iommufd: Improve a few unclear bits of code Correct a few items noticed late in review: - We should assert that the math in batch_clear_carry() doesn't underflow - user->locked should be -1 not 0 sicne we just did mmput - npages should not have been recalculated, it already has that value No functional change. Fixes: `8d160cd4d5` ("iommufd: Algorithms for PFN storage") Link: https://lore.kernel.org/r/2-v1-0362a1a1c034+98-iommufd_fixes1_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Reported-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2022-12-09 15:20:37 -04:00
Jason Gunthorpe	c9b8a83a8f	iommufd: Fix comment typos Repair some typos in comments that were noticed late in the review cycle. Fixes: `f394576eb1` ("iommufd: PFN handling for iopt_pages") Link: https://lore.kernel.org/r/1-v1-0362a1a1c034+98-iommufd_fixes1_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Reported-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2022-12-09 15:20:37 -04:00
Jason Gunthorpe	01f70cbb26	iommufd: Allow iommufd to supply /dev/vfio/vfio If the VFIO container is compiled out, give a kconfig option for iommufd to provide the miscdev node with the same name and permissions as vfio uses. The compatibility node supports the same ioctls as VFIO and automatically enables the VFIO compatible pinned page accounting mode. Link: https://lore.kernel.org/r/10-v4-42cd2eb0e3eb+335a-vfio_iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Tested-by: Alex Williamson <alex.williamson@redhat.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yu He <yu.he@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2022-12-02 11:52:04 -04:00
Jason Gunthorpe	52f528583b	iommufd: Add additional invariant assertions These are on performance paths so we protect them using the CONFIG_IOMMUFD_TEST to not take a hit during normal operation. These are useful when running the test suite and syzkaller to find data structure inconsistencies early. Link: https://lore.kernel.org/r/18-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> # s390 Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2022-11-30 20:16:49 -04:00
Jason Gunthorpe	e26eed4f62	iommufd: Add some fault injection points This increases the coverage the fail_nth test gets, as well as via syzkaller. Link: https://lore.kernel.org/r/17-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> # s390 Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2022-11-30 20:16:49 -04:00
Jason Gunthorpe	f4b20bb34c	iommufd: Add kernel support for testing iommufd Provide a mock kernel module for the iommu_domain that allows it to run without any HW and the mocking provides a way to directly validate that the PFNs loaded into the iommu_domain are correct. This exposes the access kAPI toward userspace to allow userspace to explore the functionality of pages.c and io_pagetable.c The mock also simulates the rare case of PAGE_SIZE > iommu page size as the mock will operate at a 2K iommu page size. This allows exercising all of the calculations to support this mismatch. This is also intended to support syzkaller exploring the same space. However, it is an unusually invasive config option to enable all of this. The config option should not be enabled in a production kernel. Link: https://lore.kernel.org/r/16-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> # s390 Tested-by: Eric Auger <eric.auger@redhat.com> # aarch64 Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2022-11-30 20:16:49 -04:00
Jason Gunthorpe	d624d6652a	iommufd: vfio container FD ioctl compatibility iommufd can directly implement the /dev/vfio/vfio container IOCTLs by mapping them into io_pagetable operations. A userspace application can test against iommufd and confirm compatibility then simply make a small change to open /dev/iommu instead of /dev/vfio/vfio. For testing purposes /dev/vfio/vfio can be symlinked to /dev/iommu and then all applications will use the compatibility path with no code changes. A later series allows /dev/vfio/vfio to be directly provided by iommufd, which allows the rlimit mode to work the same as well. This series just provides the iommufd side of compatibility. Actually linking this to VFIO_SET_CONTAINER is a followup series, with a link in the cover letter. Internally the compatibility API uses a normal IOAS object that, like vfio, is automatically allocated when the first device is attached. Userspace can also query or set this IOAS object directly using the IOMMU_VFIO_IOAS ioctl. This allows mixing and matching new iommufd only features while still using the VFIO style map/unmap ioctls. While this is enough to operate qemu, it has a few differences: - Resource limits rely on memory cgroups to bound what userspace can do instead of the module parameter dma_entry_limit. - VFIO P2P is not implemented. The DMABUF patches for vfio are a start at a solution where iommufd would import a special DMABUF. This is to avoid further propogating the follow_pfn() security problem. - A full audit for pedantic compatibility details (eg errnos, etc) has not yet been done - powerpc SPAPR is left out, as it is not connected to the iommu_domain framework. It seems interest in SPAPR is minimal as it is currently non-working in v6.1-rc1. They will have to convert to the iommu subsystem framework to enjoy iommfd. The following are not going to be implemented and we expect to remove them from VFIO type1: - SW access 'dirty tracking'. As discussed in the cover letter this will be done in VFIO. - VFIO_TYPE1_NESTING_IOMMU https://lore.kernel.org/all/0-v1-0093c9b0e345+19-vfio_no_nesting_jgg@nvidia.com/ - VFIO_DMA_MAP_FLAG_VADDR https://lore.kernel.org/all/Yz777bJZjTyLrHEQ@nvidia.com/ Link: https://lore.kernel.org/r/15-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2022-11-30 20:16:49 -04:00
Jason Gunthorpe	8d40205f60	iommufd: Add kAPI toward external drivers for kernel access Kernel access is the mode that VFIO "mdevs" use. In this case there is no struct device and no IOMMU connection. iommufd acts as a record keeper for accesses and returns the actual struct pages back to the caller to use however they need. eg with kmap or the DMA API. Each caller must create a struct iommufd_access with iommufd_access_create(), similar to how iommufd_device_bind() works. Using this struct the caller can access blocks of IOVA using iommufd_access_pin_pages() or iommufd_access_rw(). Callers must provide a callback that immediately unpins any IOVA being used within a range. This happens if userspace unmaps the IOVA under the pin. The implementation forwards the access requests directly to the iopt infrastructure that manages the iopt_pages_access. Link: https://lore.kernel.org/r/14-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2022-11-30 20:16:49 -04:00
Jason Gunthorpe	e8d5721003	iommufd: Add kAPI toward external drivers for physical devices Add the four functions external drivers need to connect physical DMA to the IOMMUFD: iommufd_device_bind() / iommufd_device_unbind() Register the device with iommufd and establish security isolation. iommufd_device_attach() / iommufd_device_detach() Connect a bound device to a page table Binding a device creates a device object ID in the uAPI, however the generic API does not yet provide any IOCTLs to manipulate them. Link: https://lore.kernel.org/r/13-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2022-11-30 20:16:49 -04:00
Jason Gunthorpe	ea4acfac57	iommufd: Add a HW pagetable object The hw_pagetable object exposes the internal struct iommu_domain's to userspace. An iommu_domain is required when any DMA device attaches to an IOAS to control the io page table through the iommu driver. For compatibility with VFIO the hw_pagetable is automatically created when a DMA device is attached to the IOAS. If a compatible iommu_domain already exists then the hw_pagetable associated with it is used for the attachment. In the initial series there is no iommufd uAPI for the hw_pagetable object. The next patch provides driver facing APIs for IO page table attachment that allows drivers to accept either an IOAS or a hw_pagetable ID and for the driver to return the hw_pagetable ID that was auto-selected from an IOAS. The expectation is the driver will provide uAPI through its own FD for attaching its device to iommufd. This allows userspace to learn the mapping of devices to iommu_domains and to override the automatic attachment. The future HW specific interface will allow userspace to create hw_pagetable objects using iommu_domains with IOMMU driver specific parameters. This infrastructure will allow linking those domains to IOAS's and devices. Link: https://lore.kernel.org/r/12-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2022-11-30 20:16:49 -04:00
Jason Gunthorpe	aad37e71d5	iommufd: IOCTLs for the io_pagetable Connect the IOAS to its IOCTL interface. This exposes most of the functionality in the io_pagetable to userspace. This is intended to be the core of the generic interface that IOMMUFD will provide. Every IOMMU driver should be able to implement an iommu_domain that is compatible with this generic mechanism. It is also designed to be easy to use for simple non virtual machine monitor users, like DPDK: - Universal simple support for all IOMMUs (no PPC special path) - An IOVA allocator that considers the aperture and the allowed/reserved ranges - io_pagetable allows any number of iommu_domains to be connected to the IOAS - Automatic allocation and re-use of iommu_domains Along with room in the design to add non-generic features to cater to specific HW functionality. Link: https://lore.kernel.org/r/11-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2022-11-30 20:16:49 -04:00
Jason Gunthorpe	51fe6141f0	iommufd: Data structure to provide IOVA to PFN mapping This is the remainder of the IOAS data structure. Provide an object called an io_pagetable that is composed of iopt_areas pointing at iopt_pages, along with a list of iommu_domains that mirror the IOVA to PFN map. At the top this is a simple interval tree of iopt_areas indicating the map of IOVA to iopt_pages. An xarray keeps track of a list of domains. Based on the attached domains there is a minimum alignment for areas (which may be smaller than PAGE_SIZE), an interval tree of reserved IOVA that can't be mapped and an IOVA of allowed IOVA that can always be mappable. The concept of an 'access' refers to something like a VFIO mdev that is accessing the IOVA and using a 'struct page *' for CPU based access. Externally an API is provided that matches the requirements of the IOCTL interface for map/unmap and domain attachment. The API provides a 'copy' primitive to establish a new IOVA map in a different IOAS from an existing mapping by re-using the iopt_pages. This is the basic mechanism to provide single pinning. This is designed to support a pre-registration flow where userspace would setup an dummy IOAS with no domains, map in memory and then establish an access to pin all PFNs into the xarray. Copy can then be used to create new IOVA mappings in a different IOAS, with iommu_domains attached. Upon copy the PFNs will be read out of the xarray and mapped into the iommu_domains, avoiding any pin_user_pages() overheads. Link: https://lore.kernel.org/r/10-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2022-11-30 20:16:49 -04:00
Jason Gunthorpe	8d160cd4d5	iommufd: Algorithms for PFN storage The iopt_pages which represents a logical linear list of full PFNs held in different storage tiers. Each area points to a slice of exactly one iopt_pages, and each iopt_pages can have multiple areas and accesses. The three storage tiers are managed to meet these objectives: - If no iommu_domain or in-kerenel access exists then minimal memory should be consumed by iomufd - If a page has been pinned then an iopt_pages will not pin it again - If an in-kernel access exists then the xarray must provide the backing storage to avoid allocations on domain removals - Otherwise any iommu_domain will be used for storage In a common configuration with only an iommu_domain the iopt_pages does not allocate significant memory itself. The external interface for pages has several logical operations: iopt_area_fill_domain() will load the PFNs from storage into a single domain. This is used when attaching a new domain to an existing IOAS. iopt_area_fill_domains() will load the PFNs from storage into multiple domains. This is used when creating a new IOVA map in an existing IOAS iopt_pages_add_access() creates an iopt_pages_access that tracks an in-kernel access of PFNs. This is some external driver that might be accessing the IOVA using the CPU, or programming PFNs with the DMA API. ie a VFIO mdev. iopt_pages_rw_access() directly perform a memcpy on the PFNs, without the overhead of iopt_pages_add_access() iopt_pages_fill_xarray() will load PFNs into the xarray and return a 'struct page *' array. It is used by iopt_pages_access's to extract PFNs for in-kernel use. iopt_pages_fill_from_xarray() is a fast path when it is known the xarray is already filled. As an iopt_pages can be referred to in slices by many areas and accesses it uses interval trees to keep track of which storage tiers currently hold the PFNs. On a page-by-page basis any request for a PFN will be satisfied from one of the storage tiers and the PFN copied to target domain/array. Unfill actions are similar, on a page by page basis domains are unmapped, xarray entries freed or struct pages fully put back. Significant complexity is required to fully optimize all of these data motions. The implementation calculates the largest consecutive range of same-storage indexes and operates in blocks. The accumulation of PFNs always generates the largest contiguous PFN range possible to optimize and this gathering can cross storage tier boundaries. For cases like 'fill domains' care is taken to avoid duplicated work and PFNs are read once and pushed into all domains. The map/unmap interaction with the iommu_domain always works in contiguous PFN blocks. The implementation does not require or benefit from any split/merge optimization in the iommu_domain driver. This design suggests several possible improvements in the IOMMU API that would greatly help performance, particularly a way for the driver to map and read the pfns lists instead of working with one driver call per page to read, and one driver call per contiguous range to store. Link: https://lore.kernel.org/r/9-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2022-11-30 20:16:49 -04:00
Jason Gunthorpe	f394576eb1	iommufd: PFN handling for iopt_pages The top of the data structure provides an IO Address Space (IOAS) that is similar to a VFIO container. The IOAS allows map/unmap of memory into ranges of IOVA called iopt_areas. Multiple IOMMU domains (IO page tables) and in-kernel accesses (like VFIO mdevs) can be attached to the IOAS to access the PFNs that those IOVA areas cover. The IO Address Space (IOAS) datastructure is composed of: - struct io_pagetable holding the IOVA map - struct iopt_areas representing populated portions of IOVA - struct iopt_pages representing the storage of PFNs - struct iommu_domain representing each IO page table in the system IOMMU - struct iopt_pages_access representing in-kernel accesses of PFNs (ie VFIO mdevs) - struct xarray pinned_pfns holding a list of pages pinned by in-kernel accesses This patch introduces the lowest part of the datastructure - the movement of PFNs in a tiered storage scheme: 1) iopt_pages::pinned_pfns xarray 2) Multiple iommu_domains 3) The origin of the PFNs, i.e. the userspace pointer PFN have to be copied between all combinations of tiers, depending on the configuration. The interface is an iterator called a 'pfn_reader' which determines which tier each PFN is stored and loads it into a list of PFNs held in a struct pfn_batch. Each step of the iterator will fill up the pfn_batch, then the caller can use the pfn_batch to send the PFNs to the required destination. Repeating this loop will read all the PFNs in an IOVA range. The pfn_reader and pfn_batch also keep track of the pinned page accounting. While PFNs are always stored and accessed as full PAGE_SIZE units the iommu_domain tier can store with a sub-page offset/length to support IOMMUs with a smaller IOPTE size than PAGE_SIZE. Link: https://lore.kernel.org/r/8-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2022-11-30 20:16:49 -04:00
Jason Gunthorpe	2ff4bed7fe	iommufd: File descriptor, context, kconfig and makefiles This is the basic infrastructure of a new miscdevice to hold the iommufd IOCTL API. It provides: - A miscdevice to create file descriptors to run the IOCTL interface over - A table based ioctl dispatch and centralized extendable pre-validation step - An xarray mapping userspace ID's to kernel objects. The design has multiple inter-related objects held within in a single IOMMUFD fd - A simple usage count to build a graph of object relations and protect against hostile userspace racing ioctls The only IOCTL provided in this patch is the generic 'destroy any object by handle' operation. Link: https://lore.kernel.org/r/6-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2022-11-30 20:16:49 -04:00

1 2 3

141 Commits