linux

mirror of https://github.com/torvalds/linux.git synced 2024-12-26 12:52:30 +00:00

Author	SHA1	Message	Date
farah kassabri	ba24b5ec78	accel/habanalabs: split user interrupts pending list Currently the driver maintains a single list for both kinds of pending user interrupts: those that wait until the CQ reaches its target value, and those that request a timestamp record when the CQ reaches its target value. This delays handling of the waiters, which have higher priority than the timestamp records. To solve this, split the list into two, one for each case, each protected by its own spinlock. Waiters will be handled within the interrupt context first, then the timestamp records will be set. Freeing the timestamp related memory will be handled in a workqueue. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Tomer Tayar <ttayar@habana.ai> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:22 +03:00
farah kassabri	1157b5d6b3	accel/habanalabs: optimize timestamp registration handler Currently we use dynamic allocation inside the irq handler in order to allocate free node to be used for the free jobs. This operation is expensive, especially when we deal with large burst of events records that get released at the same time. The alternative is to have pre allocated pool of free nodes and just fetch nodes from this pool at irq handling time instead of allocating them. In case the pool becomes full, then the driver will fallback to dynamic allocations. As part of the optimization also update the unregister flow upon re-using a timestamp record, by making the operation much simpler and quicker. We already have the record in the registration flow and now we just seek to re-use with different interrupt. Therefore, no need to look for buffer according to the user handle. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Tomer Tayar <ttayar@habana.ai> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:22 +03:00
farah kassabri	0165994c21	accel/habanalabs: fix bug in timestamp interrupt handling There is a potential race when a user thread seeks to re-use a timestamp record with a new interrupt id while this record is still in the middle of interrupt handling and is about to be freed. Imagine the driver sets the record's in_use to 0 and only then fills the free_node information. This might lead to a scenario where the new registration thread detects the record as free to use and changes the cq buff address. That will cause the free_node to get the wrong buffer address to put the refcount to. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:22 +03:00
Tomer Tayar	d89d329a2b	accel/habanalabs: tiny refactor of hl_map_dmabuf() alloc_sgt_from_device_pages() includes relatively many parameters, and in a subsequent change another offset parameter is going to be added. Using structure fields directly when calling this function, and in hl_map_dmabuf() it is done twice, makes it a little bit difficult to understand the meaning of the parameters. To make it clearer, assign the required values into local variables with explicit names, and use the variables when calling the function. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:22 +03:00
Tomer Tayar	0b75cb5b24	accel/habanalabs: export dma-buf only if size/offset multiples of PAGE_SIZE It is currently allowed for a user to export dma-buf with size and offset that are not multiples of PAGE_SIZE. The exported memory is mapped for the importer device, and there it will be rounded to PAGE_SIZE, leading to actually exporting more than the user intended to. To make the user be aware of it, accept only size and offset which are multiple of PAGE_SIZE. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:22 +03:00
Tomer Tayar	efbca048c6	accel/habanalabs: use exported size from dma_buf and not from phys_pg_pack The 'exported_size' member in 'struct hl_vm_phys_pg_pack' is used to keep the exported dma-buf size, to be later used when the buffer is mapped. However it is possible that the same phys_pg_pack will be exported more than once, and independently of when the mapping takes place. Remove this member from the phys_pg_pack structure, and simply use the size in the dma-buf object as the exported size when mapping. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:21 +03:00
Tomer Tayar	dfdbc55a9c	accel/habanalabs: always pass exported size to alloc_sgt_from_device_pages() For Gaudi1 the exported dma-buf is always composed of a single page, and therefore the exported size is equal to this page's size. When calling alloc_sgt_from_device_pages(), we pass 0 as the exported size and internally calculate it as "number of pages * page size". This makes alloc_sgt_from_device_pages() less clear, because the exported size parameter is not understood as a restriction on the pages' size. Modify to always pass the exported size explicitly. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:21 +03:00
farah kassabri	051868d93c	accel/habanalabs: prevent sending heartbeat before events are enabled Now that the heartbeat mechanism is also used for the EQ health check, we shouldn't send heartbeat messages to FW before the driver allows events to be received from FW. If the driver sends two heartbeats before it enables events to be received from FW, the EQ health check will fail and reset the device. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:21 +03:00
farah kassabri	764bfd138f	accel/habanalabs/gaudi2: add eq health check using irq This is the second patch for applying the eq health check mechanism; it adds support for the interrupt flow for the gaudi2 asic. More info about the interrupt mechanism: set a dedicated msix for the eq error interrupt, and add an interrupt handler for it. When FW detects some issue with the EQ, like EQ_FULL, it will raise that interrupt and the driver should reset the device. The driver will inform the FW which msix index to use through the already existing handshake mechanism, which will send an msix info message to the fw. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:21 +03:00
farah kassabri	7c4130e6dd	accel/habanalabs/gaudi2: handle eq health heartbeat check Add a mechanism for fw eq health check. This will be done using two flows: using the heartbeat mechanism and raising a dedicated interrupt to indicate an eq failure like EQ full. This patch adds the implementation of the eq heartbeat for the gaudi2 asic. More info about the heartbeat mechanism: expand the heartbeat mechanism to monitor a new event that will be sent from FW upon receiving a heartbeat message. That way the driver can know whether the eq is working or not. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:21 +03:00
Moti Haimovski	72bff371b2	accel/habanalabs/gaudi2: print power-mode changes Print to kernel log any device power mode changes events reported by the FW. Signed-off-by: Moti Haimovski <mhaimovski@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:21 +03:00
Hen Alon	0648c4d080	accel/habanalabs: add tsc clock sampling to clock sync info Add tsc clock to clock sync info, to enable using this clock for sampling and sync it with device time. Signed-off-by: Hen Alon <halon@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:21 +03:00
Dafna Hirschfeld	e0f452802b	accel/habanalabs: fix inline doc typos Fix two typos Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:21 +03:00
Dafna Hirschfeld	ab574f6a81	accel/habanalabs: disable events ioctls on control device Because it is not used and also, for graceful reset to work those ioctls should run on the compute device. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:21 +03:00
David Meriin	2b76129c5a	accel/habanalabs: move cpucp interface to linux/habanalabs The CPUCP interface is moved to a shared folder outside of accel as a pre-requisite to upstream the NIC drivers that will also include this file. Signed-off-by: David Meriin <dmeriin@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:21 +03:00
Ofir Bitton	d261b0ab13	accel/habanalabs/gaudi2: include block id in ECC error reporting During ECC event handling, Memory wrapper id was mistakenly printed as block id. Fix the print and in addition fetch the actual block-id from firmware. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:21 +03:00
Benjamin Dotan	10d260f655	accel/habanalabs: improve etf configuration coresight ETF blocks have different size. As a result, sync packets need to be aligned based on fifo size. Signed-off-by: Benjamin Dotan <bdotan@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:20 +03:00
Justin Stitt	571bfeb48a	accel/habanalabs: refactor deprecated strncpy `strncpy` is deprecated for use on NUL-terminated destination strings [1]. A suitable replacement is `strscpy` [2] due to the fact that it guarantees NUL-termination on its destination buffer argument which is _not_ the case for `strncpy`! There is likely no bug happening in this case since HL_STR_MAX is strictly larger than all source strings. Nonetheless, prefer a safer and more robust interface. It should also be noted that `strscpy` will not pad like `strncpy`. If this NUL-padding behavior is _required_ we should use `strscpy_pad` instead of `strscpy`. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2] Link: https://github.com/KSPP/linux/issues/90 Cc: linux-hardening@vger.kernel.org Signed-off-by: Justin Stitt <justinstitt@google.com> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:20 +03:00
Christophe JAILLET	90f3de6162	accel/habanalabs/gaudi2: Fix incorrect string length computation in gaudi2_psoc_razwi_get_engines() snprintf() returns the "number of characters which would be generated for the given input", not the size really generated. In order to avoid too large values for 'str_size' (and potential negative values for "PSOC_RAZWI_ENG_STR_SIZE - str_size") use scnprintf() instead of snprintf(). Fixes: `c0e6df9160` ("accel/habanalabs: fix address decode RAZWI handling") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:20 +03:00
Justin Stitt	a45d5cf09d	accel/habanalabs: refactor deprecated strncpy to strscpy_pad `strncpy` is deprecated for use on NUL-terminated destination strings [1]. We see that `prop->cpucp_info.card_name` is supposed to be NUL-terminated based on its usage within `__hwmon_device_register()` (wherein it's called "name"): \| if (name && (!strlen(name) \|\| strpbrk(name, "-* \t\n"))) \| dev_warn(dev, \| "hwmon: '%s' is not a valid name attribute, please fix\n", \| name); A suitable replacement is `strscpy_pad` [2] due to the fact that it guarantees both NUL-termination and NUL-padding on its destination buffer. NUL-padding on `prop->cpucp_info.card_name` is not strictly necessary as `hdev->prop` is explicitly zero-initialized but should be used regardless as it gets copied out to userspace directly -- as per Kees' suggestion. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2] Link: https://github.com/KSPP/linux/issues/90 Cc: linux-hardening@vger.kernel.org Signed-off-by: Justin Stitt <justinstitt@google.com> Suggested-by: Kees Cook <keescook@chromium.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:20 +03:00
Benjamin Dotan	428f6882a6	accel/habanalabs: fix ETR/ETF flush logic When config_etr or config_etf are called we need to validate the parameters that are passed into them to make sure the requested operation is valid. Signed-off-by: Benjamin Dotan <bdotan@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:20 +03:00
Benjamin Dotan	cf1ed52d12	accel/habanalabs/gaudi2 : remove psoc_arc access Because firmware is blocking PSOC_ARC_DBG, we need to disable access to this block. Signed-off-by: Benjamin Dotan <bdotan@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:20 +03:00
Igor Grinberg	01ab1629ad	accel/habanalabs/gaudi2: prepare to remove cpu_rst_status The soft reset has transitioned to CPUCP packet instead of plain register write and is about to be removed from the struct cpu_dyn_regs. As a preparation for removing the cpu_rst_status field from struct cpu_dyn_regs, switch to use the plain macro - this keeps the backward compatibility. Signed-off-by: Igor Grinberg <igrinberg@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:20 +03:00
Tomer Tayar	57963ff8ad	accel/habanalabs: Move ioctls to the device specific ioctls range To use drm_ioctl(), move the ioctls to the device specific ioctls range at [DRM_COMMAND_BASE, DRM_COMMAND_END). Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:20 +03:00
Tomer Tayar	fe77368c0f	accel/habanalabs: register compute device as an accel device Register the compute device as an accel device, and remove the creation of the habanalabs compute char device. The IOCTLs in this patch are still handled by the current driver handler. Moving to DRM IOCTL handling requires moving the IOCTLs numbers to a specific range, so it will be handled in subsequent patches. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:20 +03:00
Ofir Bitton	a8ab1a81cc	accel/habanalabs: add info ioctl for engine error reports User gets notification for every engine error report, but he still lacks the exact engine information. Hence, we allow user to query for the exact engine reported an error. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:19 +03:00
Tomer Tayar	10926f6005	accel/habanalabs: set default device release watchdog T/O as 30 sec After being notified about certain errors, the user is expected to finish post-error actions and to release the device within some timeout, after which the device is reset. The default timeout value is 5 sec, which in some cases is not enough for a user application to collect debug data. Increase the default value to 30 sec. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:19 +03:00
Dani Liberman	8887279092	accel/habanalabs: handle f/w reserved dram space request It is possible for FW to request reserved space in dram. If the device supports this option, it will retrieve the size from the f/w and will reserve it. Currently we add the common code infrastructure to support it. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:19 +03:00
Oded Gabbay	fa46c7bb50	accel/habanalabs/gaudi2: fix missing check of kernel ctx If we are initializing the kernel context when we have a Gaudi2 device, we don't need to do any late initializing of that context with specific Gaudi2 code. Reviewed-by: Ofir Bitton <obitton@habana.ai> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:19 +03:00
Igor Grinberg	15c0bb1623	accel/habanalabs/gaudi2: prepare to remove soft_rst_irq The soft reset has transitioned to CPUCP packet instead of plain register write and is about to be removed from the struct cpu_dyn_regs. As a preparation for removing the gic_host_soft_rst_irq field from struct cpu_dyn_regs, switch to use the plain macro - this keeps the backward compatibility. Signed-off-by: Igor Grinberg <igrinberg@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:19 +03:00
Ofir Bitton	1e3a78270b	accel/habanalabs/gaudi2: unsecure tpc count registers As TPC kernels now must use those registers we unsecure them. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:19 +03:00
Tomer Tayar	5a8487ac54	accel/habanalabs/gaudi2: un-secure register for engine cores interrupt The F/W dynamically allocates one of the PSOC scratchpad registers for the engine cores, so they can raise events towards the F/W. To allow the engine cores to access this register, this register must be non-secured. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:19 +03:00
Juerg Haefliger	b03dc2b621	accel/habanalabs/gaudi: Add MODULE_FIRMWARE macros The module loads firmware so add MODULE_FIRMWARE macros to provide that information via modinfo. Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:19 +03:00
Ofir Bitton	d33c3d0541	accel/habanalabs: dump temperature threshold boot error Add dump of an error reported from f/w during boot time. This error indicates a failure with setting temperature threshold. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:19 +03:00
Oded Gabbay	37d72439a4	accel/habanalabs: reset device if scrubbing failed If scrubbing memory after user released device has failed it means the device is in a bad state and should be reset. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Ofir Bitton <obitton@habana.ai>	2023-10-09 12:37:19 +03:00
Oded Gabbay	89803af535	accel/habanalabs: remove pdev check on idle check Our simulator supports idle check so no need anymore to check if pdev exists. Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Ofir Bitton <obitton@habana.ai>	2023-10-09 12:37:18 +03:00
farah kassabri	2da9f8d805	accel/habanalabs: fix wait_for_interrupt abortion flow When the driver needs to abort waiters for interrupts, for cases such as critical events that occur and driver need to do hard reset, in such scenario the driver will complete the fence to wake up the waiting thread, and will set the fence error indication. The return value of the completion API will be greater than 0 since it will return the timeout, but as this indicates successful completion, the driver should mark it as aborted. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:18 +03:00
farah kassabri	eaa43a06b7	accel/habanalabs: Allow single timestamp registration request at a time Protect against concurrency of user requesting to register a timestamp offset (where the driver fills the timestamp when the command submission has finished executing) to a specific user interrupt ID. The protection is basically to allow only one timestamp registration request to be handled at a time. This is needed because the user can decide to re-use a timestamp offset (register an already registered offset, to a different interrupt ID). This means the request will cause the timestamp node to move from one interrupt list to another interrupt list. In such scenario, without proper protection, we could end up adding the same node twice to the interrupts wait lists. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:18 +03:00
Koby Elbaz	964b1f675d	accel/habanalabs: rename fd_list to hpriv_list Every time an FD is returned to the user, the driver adds a corresponding private structure to the list. Yet, it's still a list of private structures rather than of FDs. Remove, as well, an unnecessary comment. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:18 +03:00
Koby Elbaz	942f18c56d	accel/habanalabs: call put_pid after hpriv list is updated Because we might still be using related resources, decrementing PID's reference count should be done at later stages of the device release. A good place is right after the representing private structure is removed from LKD's list. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:18 +03:00
Koby Elbaz	2b541cf913	accel/habanalabs: print return code when process termination fails As part of driver teardown, we attempt to kill all user processes. It shouldn't fail, but if it does we want to print the error code that the kapi returned to us. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:18 +03:00
farah kassabri	bffd2f16ae	accel/habanalabs: fix standalone preboot descriptor request The preboot used to statically allocate memory for the comms descriptor on the device memory when driver requested the descriptor information. Now preboot moved to dynamic memory allocation where it wants to check the size the driver expects vs. what the f/w expects. Note there are no backward compatibility issues as older f/w versions simply ignore this value. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:18 +03:00
Dani Liberman	43d8acce60	accel/habanalabs: handle arc farm razwi Implement razwi handling for arc farm and add it to arc farm sei event handler. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:18 +03:00
Ofir Bitton	f17182d036	accel/habanalabs: stop fetching MME SBTE error cause Because in this case we have only a single possible cause, we can safely stop fetching the cause from firmware. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:18 +03:00
Koby Elbaz	e4a97d6b62	accel/habanalabs: set device status 'malfunction' while in rmmod hl_device_status() returns the status of an acquired device. If a device is going down (following an rmmod cmd), it should be marked as an unusable/malfunctioning device, and hence should not be acquired. However, since this was not the case so far (i.e., a device going down would inaccurately return 'in reset' status allowing the user to acquire the device) it introduced a bug where as part of a reset flow, the driver could not kill processes that have not run yet, and since those processes aren't blocked from reacquiring a device, we get eventually a new flow of a driver attempting to kill all processes in a list that can't be ever really empty. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:18 +03:00
Tomer Tayar	e7b2902a33	accel/habanalabs: print task name upon creation of a user context It is useful for debug to know which user process have acquired the device. Add this info to the relevant debug print, in addition to the already printed user context's ASID. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:18 +03:00
Tomer Tayar	7dccb064a7	accel/habanalabs: print task name and request code upon ioctl failure When an ioctl fails, it is useful to know what is the task command name and the full ioctl request code, in addition to the task pid and the ioctl number. Add the additional information to the relevant debug error prints. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:17 +03:00
Ofir Bitton	c6a4f256ae	accel/habanalabs: notify user about undefined opcode event In order for user to be aware of undefined opcode events, we must store all relevant information and notify user about the failure. The user will fetch the stored info via info ioctl. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:17 +03:00
Tomer Tayar	a35c997601	accel/habanalabs: update pending reset flags with new reset requests If hl_device_cond_reset() is called while a reset is already pending but hasn't started, the reset request will be dropped. If the flags of the new request are more severe, e.g. a hard reset while the pending reset is a compute reset, the eventual reset won't be suitable for the device status. To prevent such cases, update the pending reset flags with the new requests flags before the requests are dropped. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:17 +03:00
Tomer Tayar	5d89ce6f8c	accel/habanalabs: prevent immediate hard reset due to 2 adjacent H/W events When a H/W event is received while a user is registered to events, no immediate hard reset will happen, and instead the user will be notified and will have some time to handle it and eventually release the device, after which the reset will be done. If a user, as part of the handling and as part of the cleanup steps towards releasing the device, unregisters from receiving those events, and at that time an adjacent H/W event is received, it will be assumed that the user is not registered to events and thus an immediate hard reset is required. To prevent such an unwanted immediate reset, modify the driver to perform it if the user is not registered to events AND we don't already have a pending reset for a previous H/W event. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2023-10-09 12:37:17 +03:00
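
Commit 90f3de6162 in the list above replaces snprintf() with scnprintf() because snprintf() returns the length the output would have had, not the length actually stored. The following is a minimal userspace sketch of that difference, not the driver's code: `bounded_printf()` and `join_names()` are hypothetical helpers written for illustration, with `bounded_printf()` standing in for the kernel's scnprintf() by clamping the return value of the standard vsnprintf().

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical userspace stand-in for the kernel's scnprintf(): returns the
 * number of characters actually stored in buf, excluding the trailing NUL.
 */
static int bounded_printf(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int n;

	if (size == 0)
		return 0;

	va_start(args, fmt);
	n = vsnprintf(buf, size, fmt, args);	/* would-be length, may exceed size */
	va_end(args);

	if (n < 0)
		return 0;			/* encoding error: nothing usable stored */
	if ((size_t)n >= size)
		return (int)(size - 1);		/* output was truncated */
	return n;
}

/*
 * Append names into a fixed buffer, separated by " or ". Because each call
 * reports only what was stored, 'used' never exceeds size - 1, so the
 * remaining-space expression (size - used) cannot underflow.
 */
static size_t join_names(char *buf, size_t size,
			 const char *const *names, size_t count)
{
	size_t used = 0;
	size_t i;

	assert(buf != NULL && size > 0);

	for (i = 0; i < count; i++)
		used += (size_t)bounded_printf(buf + used, size - used, "%s%s",
					       i ? " or " : "", names[i]);
	return used;
}
```

With plain snprintf() in the loop, a truncated write would push `used` past `size`, and `size - used` would wrap around (or go negative with signed types), which is the hazard the commit message describes.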


220 Commits