If successful ida_simple_get() calls are not undone when needed, some
additional memory may be allocated and wasted.
Here, an ID between 0 and MAX_INT is required. If this ID is >=100, it is
not taken into account and is wasted. It should be released.
Instead of calling ida_simple_remove(), take advantage of the 'max'
parameter to require the ID not to be too big. Should it be too big, it
is not allocated and don't need to be freed.
While at it, use ida_alloc_xxx()/ida_free() instead to
ida_simple_get()/ida_simple_remove().
The latter is deprecated and more verbose.
Fixes: db1a0ae214 ("drm/nouveau/bl: Assign different names to interfaces")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Lyude Paul <lyude@redhat.com>
[Fixed formatting warning from checkpatch]
Signed-off-by: Lyude Paul <lyude@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/9ba85bca59df6813dc029e743a836451d5173221.1644386541.git.christophe.jaillet@wanadoo.fr
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmJ1X+YACgkQxWXV+ddt
WDs75g/8DsrDBe/Sbn/bDLRw30D0StC5xRhaVJLNcAhSfIyyoBo0EKifd2nXdNSn
rdc3pvRToPh2X1YQ/5FmtxuYW12N4pVOfWtXKdoFFXabMVJetDBSnS8KNzxBd9Ys
OkiKj2qb8H+1uNwWVLxfbFBNWWPyCQLe/2DDmePxOAszs4YUPvwxffyA8oclyHb5
wW0qOJAC76Lz6y5Wo/GJrtAdlvWwb3t8+IDUKgGPT1BEIOE/fna+MhvEwqvCdY9F
PPU5sWIXkAv3addgrcBHVX5HdAzRC0jAv2lHsttfu4dEOmTzw8dCTUh/IzSRa3fj
IVy7AIGR+ZR++d4tOhAEZBe1KAqntY3UVmXY19cKTHOLLWFjv4XkKw8KJzsDiHUt
rczedAFdgRRo9QIgwSsAU9Zi8DSG56BSenxsqFzqiL5BDDX1bUFXCegNYR42GfB/
8E89eYkBCXxP6XeA1+44EOalCdqLKuQOyEijTUxn0UDGdqHHB1Gd/IUQDU5fpCqo
kX6gdgMhNmjGJ/zETfOdFrYZaHYJUiiIRc6z8SCgM3JIPBgVw0FAa99Hs8Pl+eJn
idmpfnHn6Xvmq46FISVRWgolkuj7VzTEOM65rNgOY3889Vk9Qt2qjI/DvYPGDs9Y
9PQI1FY+2E+ZqbNY8dZpXDii+6y36aAmGR1B10x3545+C36CA/0=
=S4KK
-----END PGP SIGNATURE-----
Merge tag 'for-5.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Regression fixes in zone activation:
- move a loop invariant out of the loop to avoid checking space
status
- properly handle unlimited activation
Other fixes:
- for subpage, force the free space v2 mount to avoid a warning and
make it easy to switch a filesystem on different page size systems
- export sysfs status of exclusive operation 'balance paused', so the
user space tools can recognize it and allow adding a device with
paused balance
- fix assertion failure when logging directory key range item"
* tag 'for-5.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: sysfs: export the balance paused state of exclusive operation
btrfs: fix assertion failure when logging directory key range item
btrfs: zoned: activate block group properly on unlimited active zone device
btrfs: zoned: move non-changing condition check out of the loop
btrfs: force v2 space cache usage for subpage mount
Highlights include:
Stable fixes:
- Fix a socket leak when setting up an AF_LOCAL RPC client
- Ensure that knfsd connects to the gss-proxy daemon on setup
Bugfixes:
- Fix a refcount leak when migrating a task off an offlined transport
- Don't gratuitously invalidate inode attributes on delegation return
- Don't leak sockets in xs_local_connect()
- Ensure timely close of disconnected AF_LOCAL sockets
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmJ1f3wACgkQZwvnipYK
APKKIRAAmVUswcfRQ9wSz5wW6DCFU9hdsN9JD4pAcPvWAYGo8fqmn3I3qe/iaBCf
rrJF38SfQVygtthmAY4CBBwOiVxm2fvqanML2lta+ZUU15MqoH2px3kMYemRyulJ
9/2yP25AUSgkmwdEmm69hIXJkEJa3dsjg+LajQZ5X01DgKSfpObS5s9t/upM9kve
Wqz5QRr+aJnZuuYYJWxNmXZ4XQEkzHccg3aSswB6bEsEGNXKo8NnWryrSMnWTW1y
rQCb0e+gxpoVFgV3ngP1r9xT2l2ISbJIIhTPoj5hSjSVlFvQlIEyHtGA2vuIEZH9
hPJAnaSc7Xb+QER6XfZkTxjW+jtMl5OmMKkWUcUmHiYv2KIM8dUAd3ANnbDBCUvw
C5bGF907Qjqs5d2VdfsbisT9ikyn+xw6SFxcr9HYyH2T3dIsC1A8P9uUvn/afwUQ
EPfQIsIEDeufo6O8KLfF+gCO9kbk9rdaP8Bv3B2H94aRs1yYde9bJpa7QABncGbA
otWehkX/AbrIa4Zjp1ELzcVJxlIl+/AtxzCdGY2me1Ds388U/RKsyDWwXuGynLP6
98ycdtHWVyoJ48L5kZowuj8/3tEB998En5hh0HSuAd0DYkAuGxaSGb+iuwKi/M0H
+D1wZxef49r2ggQkEOsllTEjJKSHcq1+vCVASZ8ITEbcVUSiO90=
=LSoH
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.18-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client fixes from Trond Myklebust:
"Highlights include:
Stable fixes:
- Fix a socket leak when setting up an AF_LOCAL RPC client
- Ensure that knfsd connects to the gss-proxy daemon on setup
Bugfixes:
- Fix a refcount leak when migrating a task off an offlined transport
- Don't gratuitously invalidate inode attributes on delegation return
- Don't leak sockets in xs_local_connect()
- Ensure timely close of disconnected AF_LOCAL sockets"
* tag 'nfs-for-5.18-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
Revert "SUNRPC: attempt AF_LOCAL connect on setup"
SUNRPC: Ensure gss-proxy connects on setup
SUNRPC: Ensure timely close of disconnected AF_LOCAL sockets
SUNRPC: Don't leak sockets in xs_local_connect()
NFSv4: Don't invalidate inode attributes on delegation return
SUNRPC release the transport of a relocated task with an assigned transport
* Account for family 17h event renumberings in AMD PMU emulation
* Remove CPUID leaf 0xA on AMD processors
* Fix lockdep issue with locking all vCPUs
* Fix loss of A/D bits in SPTEs
* Fix syzkaller issue with invalid guest state
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmJ1Vf4UHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroNaUQgAgygZ2KsejlJCYGtEkAsjcpdzmPVL
8j42nWB673/PLZ6GrDXcFnRwQaBIT+0YrES5VHTkTI996d2T/yHII2L4G3DQtUGm
6L3qYqrjJlX2WjbYGvYzkJ6m4EzcstUfPYNO2Qzfvbl2y/wz64HlAhNdymwMX2UU
GPUVoo3EHeobJdZVKFMe7eI6r/uY1/uPdsKqNjnlWI73op+tc7mMRN5+SlQDgQvR
kmzw+Nk0J+PERQO+D+fm1vUdXDQ8hiI7LtTBIUX7rf47IqVlHNHC8frC94PX3W3E
l2sVS+LzRQRqCgFgQ2ay2gYkl078VL8z4A6vWpcWSmaToEYE7VcAnHqb0Q==
=6gt2
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"x86:
- Account for family 17h event renumberings in AMD PMU emulation
- Remove CPUID leaf 0xA on AMD processors
- Fix lockdep issue with locking all vCPUs
- Fix loss of A/D bits in SPTEs
- Fix syzkaller issue with invalid guest state"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: VMX: Exit to userspace if vCPU has injected exception and invalid state
KVM: SEV: Mark nested locking of vcpu->lock
kvm: x86/cpuid: Only provide CPUID leaf 0xA if host has architectural PMU
KVM: x86/svm: Account for family 17h event renumberings in amd_pmc_perf_hw_id
KVM: x86/mmu: Use atomic XCHG to write TDP MMU SPTEs with volatile bits
KVM: x86/mmu: Move shadow-present check out of spte_has_volatile_bits()
KVM: x86/mmu: Don't treat fully writable SPTEs as volatile (modulo A/D)
* A fix to relocate the DTB early in boot, in cases where the bootloader
doesn't put the DTB in a region that will end up mapped by the kernel.
This manifests as a crash early in boot on a handful of
configurations.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEAM520YNJYN/OiG3470yhUCzLq0EFAmJ1TjQTHHBhbG1lckBk
YWJiZWx0LmNvbQAKCRDvTKFQLMurQQLmD/4477/Ax+QlHXJ4stM1m4VaXybzJ1qb
gvjsV/xnOwwLcQU6683B6mm0LQ+LGXfJHwveYwMQs1dsoWEeOjmwpadO5qjfNv55
dbJh/9QtS4BYWHBySvumU2hv3Wyn4kO+7f/bLoxgIR0CupMBslTdP1oo8jFQeHBt
bL/wWUU/MVEg9FAB0t2cSkaJVcxvCkyRk5MYJUok2eHedTpwOwtIGTnSpv00EA8z
V1aDN3Y8MJuqRVSsbN/hGCKB1WrKkP5K7qxnvp5tH+68G3t30zc5NvVtrH+VZNiT
YTod4kf5Y3JlekboNz17O/WDP3BjpP5QBga9K7m64de9vSZjiEtM4Ze4i3A/C1xe
Z/LaizIy1H+1ehA3tWPQH6MwLFVUx4XNVWPQfDhAGtAyXTA5Epl028V9mvvUClfg
l63f9cWEEGsy2DVg9kU9MfzNdro5iJARL/pYUFQCyoEiUQIkl7E1Eh9XXqQwIozK
3rhvRL9DbELYTK65xKUXwBZcnuCBgUPNQvf7/AZ02qPMQzmy+NlWQLKYQvEYCH5U
IbYV0LCwpYAi2tdxgkzv7wkI86m5NLtRRKeD/eZLMEozlkkxVtq1fPeVXTDA2d8N
RdsF16H9CW9Bdf+COyhR4xn2IpxqSV/aVOsjYA0F87D5GUJk+AQp4w7Pm2NdH2fb
yw8HwqrnD8+66A==
=JBfr
-----END PGP SIGNATURE-----
Merge tag 'riscv-for-linus-5.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fix from Palmer Dabbelt:
- A fix to relocate the DTB early in boot, in cases where the
bootloader doesn't put the DTB in a region that will end up
mapped by the kernel.
This manifests as a crash early in boot on a handful of
configurations.
* tag 'riscv-for-linus-5.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
RISC-V: relocate DTB if it's outside memory region
Read stale PTP Tx timestamps from PHY on cleanup.
After running out of Tx timestamps request handlers, hardware (HW) stops
reporting finished requests. Function ice_ptp_tx_tstamp_cleanup() used
to only clean up stale handlers in driver and was leaving the hardware
registers not read. Not reading stale PTP Tx timestamps prevents next
interrupts from arriving and makes timestamping unusable.
Fixes: ea9b847cda ("ice: enable transmit timestamps for E810 devices")
Signed-off-by: Michal Michalik <michal.michalik@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The iAVF driver uses 3 virtchnl op codes to communicate with the PF
regarding the VF Tx queues:
* VIRTCHNL_OP_CONFIG_VSI_QUEUES configures the hardware and firmware
logic for the Tx queues
* VIRTCHNL_OP_ENABLE_QUEUES configures the queue interrupts
* VIRTCHNL_OP_DISABLE_QUEUES disables the queue interrupts and Tx rings.
There is a bug in the iAVF driver due to the race condition between VF
reset request and shutdown being executed in parallel. This leads to a
break in logic and VIRTCHNL_OP_DISABLE_QUEUES is not being sent.
If this occurs, the PF driver never cleans up the Tx queues. This results
in leaving behind stale Tx queue settings in the hardware and firmware.
The most obvious outcome is that upon the next
VIRTCHNL_OP_CONFIG_VSI_QUEUES, the PF will fail to program the Tx
scheduler node due to a lack of space.
We need to protect ICE driver against such situation.
To fix this, make sure we clear existing stale settings out when
handling VIRTCHNL_OP_CONFIG_VSI_QUEUES. This ensures we remove the
previous settings.
Calling ice_vf_vsi_dis_single_txq should be safe as it will do nothing if
the queue is not configured. The function already handles the case when the
Tx queue is not currently configured and exits with a 0 return in that
case.
Fixes: 7ad15440ac ("ice: Refactor VIRTCHNL_OP_CONFIG_VSI_QUEUES handling")
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Anatolii Gerasymenko <anatolii.gerasymenko@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Function ice_plug_aux_dev() assigns pf->adev field too early prior
aux device initialization and on other side ice_unplug_aux_dev()
starts aux device deinit and at the end assigns NULL to pf->adev.
This is wrong because pf->adev should always be non-NULL only when
aux device is fully initialized and ready. This wrong order causes
a crash when ice_send_event_to_aux() call occurs because that function
depends on non-NULL value of pf->adev and does not assume that
aux device is half-initialized or half-destroyed.
After order correction the race window is tiny but it is still there,
as Leon mentioned and manipulation with pf->adev needs to be protected
by mutex.
Fix (un-)plugging functions so pf->adev field is set after aux device
init and prior aux device destroy and protect pf->adev assignment by
new mutex. This mutex is also held during ice_send_event_to_aux()
call to ensure that aux device is valid during that call.
Note that device lock used ice_send_event_to_aux() needs to be kept
to avoid race with aux drv unload.
Reproducer:
cycle=1
while :;do
echo "#### Cycle: $cycle"
ip link set ens7f0 mtu 9000
ip link add bond0 type bond mode 1 miimon 100
ip link set bond0 up
ifenslave bond0 ens7f0
ip link set bond0 mtu 9000
ethtool -L ens7f0 combined 1
ip link del bond0
ip link set ens7f0 mtu 1500
sleep 1
let cycle++
done
In short when the device is added/removed to/from bond the aux device
is unplugged/plugged. When MTU of the device is changed an event is
sent to aux device asynchronously. This can race with (un)plugging
operation and because pf->adev is set too early (plug) or too late
(unplug) the function ice_send_event_to_aux() can touch uninitialized
or destroyed fields. In the case of crash below pf->adev->dev.mutex.
Crash:
[ 53.372066] bond0: (slave ens7f0): making interface the new active one
[ 53.378622] bond0: (slave ens7f0): Enslaving as an active interface with an u
p link
[ 53.386294] IPv6: ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
[ 53.549104] bond0: (slave ens7f1): Enslaving as a backup interface with an up
link
[ 54.118906] ice 0000:ca:00.0 ens7f0: Number of in use tx queues changed inval
idating tc mappings. Priority traffic classification disabled!
[ 54.233374] ice 0000:ca:00.1 ens7f1: Number of in use tx queues changed inval
idating tc mappings. Priority traffic classification disabled!
[ 54.248204] bond0: (slave ens7f0): Releasing backup interface
[ 54.253955] bond0: (slave ens7f1): making interface the new active one
[ 54.274875] bond0: (slave ens7f1): Releasing backup interface
[ 54.289153] bond0 (unregistering): Released all slaves
[ 55.383179] MII link monitoring set to 100 ms
[ 55.398696] bond0: (slave ens7f0): making interface the new active one
[ 55.405241] BUG: kernel NULL pointer dereference, address: 0000000000000080
[ 55.405289] bond0: (slave ens7f0): Enslaving as an active interface with an u
p link
[ 55.412198] #PF: supervisor write access in kernel mode
[ 55.412200] #PF: error_code(0x0002) - not-present page
[ 55.412201] PGD 25d2ad067 P4D 0
[ 55.412204] Oops: 0002 [#1] PREEMPT SMP NOPTI
[ 55.412207] CPU: 0 PID: 403 Comm: kworker/0:2 Kdump: loaded Tainted: G S
5.17.0-13579-g57f2d6540f03 #1
[ 55.429094] bond0: (slave ens7f1): Enslaving as a backup interface with an up
link
[ 55.430224] Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS 1.4.4 10/07/
2021
[ 55.430226] Workqueue: ice ice_service_task [ice]
[ 55.468169] RIP: 0010:mutex_unlock+0x10/0x20
[ 55.472439] Code: 0f b1 13 74 96 eb e0 4c 89 ee eb d8 e8 79 54 ff ff 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 65 48 8b 04 25 40 ef 01 00 31 d2 <f0> 48 0f b1 17 75 01 c3 e9 e3 fe ff ff 0f 1f 00 0f 1f 44 00 00 48
[ 55.491186] RSP: 0018:ff4454230d7d7e28 EFLAGS: 00010246
[ 55.496413] RAX: ff1a79b208b08000 RBX: ff1a79b2182e8880 RCX: 0000000000000001
[ 55.503545] RDX: 0000000000000000 RSI: ff4454230d7d7db0 RDI: 0000000000000080
[ 55.510678] RBP: ff1a79d1c7e48b68 R08: ff4454230d7d7db0 R09: 0000000000000041
[ 55.517812] R10: 00000000000000a5 R11: 00000000000006e6 R12: ff1a79d1c7e48bc0
[ 55.524945] R13: 0000000000000000 R14: ff1a79d0ffc305c0 R15: 0000000000000000
[ 55.532076] FS: 0000000000000000(0000) GS:ff1a79d0ffc00000(0000) knlGS:0000000000000000
[ 55.540163] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 55.545908] CR2: 0000000000000080 CR3: 00000003487ae003 CR4: 0000000000771ef0
[ 55.553041] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 55.560173] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 55.567305] PKRU: 55555554
[ 55.570018] Call Trace:
[ 55.572474] <TASK>
[ 55.574579] ice_service_task+0xaab/0xef0 [ice]
[ 55.579130] process_one_work+0x1c5/0x390
[ 55.583141] ? process_one_work+0x390/0x390
[ 55.587326] worker_thread+0x30/0x360
[ 55.590994] ? process_one_work+0x390/0x390
[ 55.595180] kthread+0xe6/0x110
[ 55.598325] ? kthread_complete_and_exit+0x20/0x20
[ 55.603116] ret_from_fork+0x1f/0x30
[ 55.606698] </TASK>
Fixes: f9f5301e7e ("ice: Register auxiliary device to provide RDMA")
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Dave Ertman <david.m.ertman@intel.com>
Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Exit to userspace with an emulation error if KVM encounters an injected
exception with invalid guest state, in addition to the existing check of
bailing if there's a pending exception (KVM doesn't support emulating
exceptions except when emulating real mode via vm86).
In theory, KVM should never get to such a situation as KVM is supposed to
exit to userspace before injecting an exception with invalid guest state.
But in practice, userspace can intervene and manually inject an exception
and/or stuff registers to force invalid guest state while a previously
injected exception is awaiting reinjection.
Fixes: fc4fad79fc ("KVM: VMX: Reject KVM_RUN if emulation is required with pending exception")
Reported-by: syzbot+cfafed3bb76d3e37581b@syzkaller.appspotmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220502221850.131873-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
svm_vm_migrate_from() uses sev_lock_vcpus_for_migration() to lock all
source and target vcpu->locks. Unfortunately there is an 8 subclass
limit, so a new subclass cannot be used for each vCPU. Instead maintain
ownership of the first vcpu's mutex.dep_map using a role specific
subclass: source vs target. Release the other vcpu's mutex.dep_maps.
Fixes: b56639318b ("KVM: SEV: Add support for SEV intra host migration")
Reported-by: John Sperbeck<jsperbeck@google.com>
Suggested-by: David Rientjes <rientjes@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Peter Gonda <pgonda@google.com>
Message-Id: <20220502165807.529624-1-pgonda@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A few recent regressions in rxe's multicast code, and some old driver
bugs:
- Error case unwind bug in rxe for rkeys
- Dot not call netdev functions under a spinlock in rxe multicast code
- Use the proper BH lock type in rxe multicast code
- Fix idrma deadlock and crash
- Add a missing flush to drain irdma QPs when in error
- Fix high userspace latency in irdma during destroy due to
synchronize_rcu()
- Rare race in siw MPA processing
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCYnUqeQAKCRCFwuHvBreF
Yf58AQCNUQZlmEiuBid6WxggXPW/MM5sxJdOqZeX+Ddbmm7swAEAidtoVBILozLC
ltd8+P8qNdccqOZDatgqYYSpXUfHIA4=
=Idcg
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"A few recent regressions in rxe's multicast code, and some old driver
bugs:
- Error case unwind bug in rxe for rkeys
- Dot not call netdev functions under a spinlock in rxe multicast
code
- Use the proper BH lock type in rxe multicast code
- Fix idrma deadlock and crash
- Add a missing flush to drain irdma QPs when in error
- Fix high userspace latency in irdma during destroy due to
synchronize_rcu()
- Rare race in siw MPA processing"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/rxe: Change mcg_lock to a _bh lock
RDMA/rxe: Do not call dev_mc_add/del() under a spinlock
RDMA/siw: Fix a condition race issue in MPA request processing
RDMA/irdma: Fix possible crash due to NULL netdev in notifier
RDMA/irdma: Reduce iWARP QP destroy time
RDMA/irdma: Flush iWARP QP if modified to ERR from RTR state
RDMA/rxe: Recheck the MR in when generating a READ reply
RDMA/irdma: Fix deadlock in irdma_cleanup_cm_core()
RDMA/rxe: Fix "Replace mr by rkey in responder resources"
fbdev:
- hotunplugging fix
amdgpu:
- Fix a xen dom0 regression on APUs
- Fix a potential array overflow if a receiver were to
send an erroneous audio channel count
msm:
- lockdep fix.
it6505:
- kconfig fix
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmJ0nIoACgkQDHTzWXnE
hr5BVxAAlTPqazcayV1T2cTEWeTjPD39D3t+WMP+hNy7aEgRa0LBHYdciRWx4gLx
4ddu0VutlCtlUC2bnhQNPfJ7hbcStCvcsu5tRaIk9JwhcngwjAWFCc0sWDQwU2OK
7OILBAFMd5t4fA+YxfySykI1pIQUGFB9QMGWakPV1xPKo2oyyJbPxpGcQcZXjp5Z
myKnfM2LaG+7gyCrgizh9218vei7XSVbwcKHW2rHA5002kthb0K1rH+Q7sSuRbiW
ozZgC7RrdGwpu/iz3oGDef8b9AJv3JQgL87FE9LqetAV9F7t3tc9y8WzM50PJh4u
Q/AKj9xRo/NxDwBKvn/YeL/7UgQyA+Ob0sQr/ObqMXcowvDWP0gCTxySkLEzS9tv
WuE6w3qLJ9GHHiMI8BUYtKGuFC5/E4sLqXhL2hUiFq7yQXWokL/ocAp7RYBh6ls9
2uBWWCEIv9MJ/vV23kMAg8sdqk4nvAVUtucp0z3Ee7w0Vt3v0YMtOhxXoD42Qg/w
eg/gMo39BgEKJsvFv5OQcRzGi5Sm9VKm9uAQN97dMEoe3tJPs7udCMehKwVebGBa
z6M1ch+qoIydFdVU1mPBqsmtHSPPlpfmRYVx6TY656RE/eanlVB76qUDBr5J2Ulw
yMjLPdyqRfTUwXatAqF9eBv9RRmb3xdX9iKkhEsAFf8jXKgJFbs=
=Ak9r
-----END PGP SIGNATURE-----
Merge tag 'drm-fixes-2022-05-06' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"A pretty quiet week, one fbdev, msm, kconfig, and two amdgpu fixes,
about what I'd expect for rc6.
fbdev:
- hotunplugging fix
amdgpu:
- Fix a xen dom0 regression on APUs
- Fix a potential array overflow if a receiver were to send an
erroneous audio channel count
msm:
- lockdep fix.
it6505:
- kconfig fix"
* tag 'drm-fixes-2022-05-06' of git://anongit.freedesktop.org/drm/drm:
drm/amd/display: Avoid reading audio pattern past AUDIO_CHANNELS_COUNT
drm/amdgpu: do not use passthrough mode in Xen dom0
drm/bridge: ite-it6505: add missing Kconfig option select
fbdev: Make fb_release() return -ENODEV if fbdev was unregistered
drm/msm/dp: remove fail safe mode related code
When one port's input state get inverted (eg. from low to hight) after
pca953x_irq_setup but before setting irq_mask (by some other driver such as
"gpio-keys"), the next inversion of this port (eg. from hight to low) will not
be triggered any more (because irq_stat is not updated at the first time). Issue
should be fixed after this commit.
Fixes: 89ea8bbe9c ("gpio: pca953x.c: add interrupt handling capability")
Signed-off-by: Puyou Lu <puyou.lu@gmail.com>
Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl>
[Why]
Z10 and S0i3 have some shared path. Previous code clean up ,
incorrectly removed these pointers, which breaks s0i3 restore
[How]
Do not clear the function pointers based on Z10 disable.
Reviewed-by: Nicholas Kazlauskas <Nicholas.Kazlauskas@amd.com>
Acked-by: Pavle Kotarac <Pavle.Kotarac@amd.com>
Signed-off-by: Eric Yang <Eric.Yang2@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Active State Power Management (ASPM) feature is enabled since kernel 5.14.
There are some AMD Volcanic Islands (VI) GFX cards, such as the WX3200 and
RX640, that do not work with ASPM-enabled Intel Alder Lake based systems.
Using these GFX cards as video/display output, Intel Alder Lake based
systems will freeze after suspend/resume.
The issue was originally reported on one system (Dell Precision 3660 with
BIOS version 0.14.81), but was later confirmed to affect at least 4
pre-production Alder Lake based systems.
Add an extra check to disable ASPM on Intel Alder Lake based systems with
the problematic AMD Volcanic Islands GFX cards.
Fixes: 0064b0ce85 ("drm/amd/pm: enable ASPM by default")
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1885
Signed-off-by: Richard Gong <richard.gong@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
When building the Surface Aggregator Module (SAM) core, registry, and
other SAM client drivers as builtin modules (=y), proper initialization
order is not guaranteed. Due to this, client driver registration
(triggered by device registration in the registry) races against bus
initialization in the core.
If any attempt is made at registering the device driver before the bus
has been initialized (i.e. if bus initialization fails this race) driver
registration will fail with a message similar to:
Driver surface_battery was unable to register with bus_type surface_aggregator because the bus was not initialized
Switch from module_init() to subsys_initcall() to resolve this issue.
Note that the serdev subsystem uses postcore_initcall() so we are still
able to safely register the serdev device driver for the core.
Fixes: c167b9c7e3 ("platform/surface: Add Surface Aggregator subsystem")
Reported-by: Blaž Hrastnik <blaz@mxxn.io>
Signed-off-by: Maximilian Luz <luzmaximilian@gmail.com>
Link: https://lore.kernel.org/r/20220429195738.535751-1-luzmaximilian@gmail.com
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
The new Surface Pro 8 uses GPEs for lid events as well. Add an entry for
that so that the lid can be used to wake the device. Note that this is a
device with a keyboard type-cover, where this acts as the "lid".
Signed-off-by: Maximilian Luz <luzmaximilian@gmail.com>
Link: https://lore.kernel.org/r/20220429180049.1282447-1-luzmaximilian@gmail.com
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
There was an issue with the dual fan probe whereby the probe was
failing as it assuming that second_fan support was not available.
Corrected the logic so the probe works correctly. Cleaned up so
quirks only used if 2nd fan not detected.
Tested on X1 Carbon 10 (2 fans), X1 Carbon 9 (2 fans) and T490 (1 fan)
Signed-off-by: Mark Pearson <markpearson@lenovo.com>
Link: https://lore.kernel.org/r/20220502191200.63470-1-markpearson@lenovo.com
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Lenovo laptops that contain NVME SSDs across a variety of generations have
trouble resuming from suspend to idle when the IOMMU translation layer is
active for the NVME storage device.
This generally manifests as a large resume delay or page faults. These
delays and page faults occur as a result of a Lenovo BIOS specific SMI
that runs during the D3->D0 transition on NVME devices.
This SMI occurs because of a flag that is set during resume by Lenovo
firmware:
```
OperationRegion (PM80, SystemMemory, 0xFED80380, 0x10)
Field (PM80, AnyAcc, NoLock, Preserve)
{
SI3R, 1
}
Method (_ON, 0, NotSerialized) // _ON_: Power On
{
TPST (0x60D0)
If ((DAS3 == 0x00))
{
If (SI3R)
{
TPST (0x60E0)
M020 (NBRI, 0x00, 0x00, 0x04, (NCMD | 0x06))
M020 (NBRI, 0x00, 0x00, 0x10, NBAR)
APMC = HDSI /* \HDSI */
SLPS = 0x01
SI3R = 0x00
TPST (0x60E1)
}
D0NV = 0x01
}
}
```
Create a quirk that will run early in the resume process to prevent this
SMI from running. As any of these machines are fixed, they can be peeled
back from this quirk or narrowed down to individual firmware versions.
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1910
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1689
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Tested-by: Mark Pearson <markpearson@lenvo.com>
Link: https://lore.kernel.org/r/20220429030501.1909-3-mario.limonciello@amd.com
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
DMI matching in thinkpad_acpi happens local to a function meaning
quirks can only match that function.
Future changes to thinkpad_acpi may need to quirk other code, so
change this to use a quirk infrastructure.
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Tested-by: Mark Pearson <markpearson@lenvo.com>
Link: https://lore.kernel.org/r/20220429030501.1909-2-mario.limonciello@amd.com
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
The list appears to be grouped by type (silicon, software, cloud) and
mostly alphabetical within each group, with a few exceptions.
Before adding to it, cleanup the list to be alphabetical within the
groups, and use tabs consistently throughout the list.
Signed-off-by: Darren Hart <darren@os.amperecomputing.com>
Link: https://lore.kernel.org/r/ec574b5d55584a3adda9bd31b7695193636ff136.1650995848.git.darren@os.amperecomputing.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Device drivers may decide to not load firmware when probed to avoid
slowing down the boot process should the firmware filesystem not be
available yet. In this case, the firmware loading request may be done
when a device file associated with the driver is first accessed. The
credentials of the userspace process accessing the device file may be
used to validate access to the firmware files requested by the driver.
Ensure that the kernel assumes the responsibility of reading the
firmware.
This was observed on Android for a graphic driver loading their firmware
when the device file (e.g. /dev/mali0) was first opened by userspace
(i.e. surfaceflinger). The security context of surfaceflinger was used
to validate the access to the firmware file (e.g.
/vendor/firmware/mali.bin).
Previously, Android configurations were not setting up the
firmware_class.path command line argument and were relying on the
userspace fallback mechanism. In this case, the security context of the
userspace daemon (i.e. ueventd) was consistently used to read firmware
files. More Android devices are now found to set firmware_class.path
which gives the kernel the opportunity to read the firmware directly
(via kernel_read_file_from_path_initns). In this scenario, the current
process credentials were used, even if unrelated to the loading of the
firmware file.
Signed-off-by: Thiébaud Weksteen <tweek@google.com>
Cc: <stable@vger.kernel.org> # 5.10
Reviewed-by: Paul Moore <paul@paul-moore.com>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20220502004952.3970800-1-tweek@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The driver is calling framebuffer_release() in its .remove callback, but
this will cause the struct fb_info to be freed too early. Since it could
be that a reference is still hold to it if user-space opened the fbdev.
This would lead to a use-after-free error if the framebuffer device was
unregistered but later a user-space process tries to close the fbdev fd.
To prevent this, move the framebuffer_release() call to fb_ops.fb_destroy
instead of doing it in the driver's .remove callback.
Strictly speaking, the code flow in the driver is still wrong because all
the hardware cleanupd (i.e: iounmap) should be done in .remove while the
software cleanup (i.e: releasing the framebuffer) should be done in the
.fb_destroy handler. But this at least makes to match the behavior before
commit 27599aacba ("fbdev: Hot-unplug firmware fb devices on forced removal").
Fixes: 27599aacba ("fbdev: Hot-unplug firmware fb devices on forced removal")
Suggested-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Javier Martinez Canillas <javierm@redhat.com>
Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20220505220631.366371-1-javierm@redhat.com
The driver is calling framebuffer_release() in its .remove callback, but
this will cause the struct fb_info to be freed too early. Since it could
be that a reference is still hold to it if user-space opened the fbdev.
This would lead to a use-after-free error if the framebuffer device was
unregistered but later a user-space process tries to close the fbdev fd.
To prevent this, move the framebuffer_release() call to fb_ops.fb_destroy
instead of doing it in the driver's .remove callback.
Strictly speaking, the code flow in the driver is still wrong because all
the hardware cleanupd (i.e: iounmap) should be done in .remove while the
software cleanup (i.e: releasing the framebuffer) should be done in the
.fb_destroy handler. But this at least makes to match the behavior before
commit 27599aacba ("fbdev: Hot-unplug firmware fb devices on forced removal").
Fixes: 27599aacba ("fbdev: Hot-unplug firmware fb devices on forced removal")
Suggested-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Javier Martinez Canillas <javierm@redhat.com>
Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20220505220540.366218-1-javierm@redhat.com
The driver is calling framebuffer_release() in its .remove callback, but
this will cause the struct fb_info to be freed too early. Since it could
be that a reference is still hold to it if user-space opened the fbdev.
This would lead to a use-after-free error if the framebuffer device was
unregistered but later a user-space process tries to close the fbdev fd.
To prevent this, move the framebuffer_release() call to fb_ops.fb_destroy
instead of doing it in the driver's .remove callback.
Strictly speaking, the code flow in the driver is still wrong because all
the hardware cleanupd (i.e: iounmap) should be done in .remove while the
software cleanup (i.e: releasing the framebuffer) should be done in the
.fb_destroy handler. But this at least makes to match the behavior before
commit 27599aacba ("fbdev: Hot-unplug firmware fb devices on forced removal").
Fixes: 27599aacba ("fbdev: Hot-unplug firmware fb devices on forced removal")
Suggested-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Javier Martinez Canillas <javierm@redhat.com>
Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20220505220456.366090-1-javierm@redhat.com
Most fbdev drivers have issues with the fb_info lifetime, because call to
framebuffer_release() from their driver's .remove callback, rather than
doing from fbops.fb_destroy callback.
Doing that will destroy the fb_info too early, while references to it may
still exist, leading to a use-after-free error.
To prevent this, check the fb_info reference counter when attempting to
kfree the data structure in framebuffer_release(). That will leak it but
at least will prevent the mentioned error.
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Signed-off-by: Javier Martinez Canillas <javierm@redhat.com>
Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20220505220413.365977-1-javierm@redhat.com
This reverts commit aafa025c76. That commit
attempted to fix a NULL pointer dereference, caused by the struct fb_info
associated with a framebuffer device to not longer be valid when the file
descriptor was closed.
The issue was exposed by commit 27599aacba ("fbdev: Hot-unplug firmware
fb devices on forced removal"), which added a new path that goes through
the struct device removal instead of directly unregistering the fb.
Most fbdev drivers have issues with the fb_info lifetime, because call to
framebuffer_release() from their driver's .remove callback, rather than
doing from fbops.fb_destroy callback. This meant that due to this switch,
the fb_info was now destroyed too early, while references still existed,
while before it was simply leaked.
The patch we're reverting here reinstated that leak, hence "fixed" the
regression. But the proper solution is to fix the drivers to not release
the fb_info too soon.
Suggested-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Javier Martinez Canillas <javierm@redhat.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20220504115917.758787-1-javierm@redhat.com
With CONFIG_FORTIFY_SOURCE enabled, string functions will also perform
dynamic checks for string size which can panic the kernel, like incase
of overflow detection.
In papr_scm, papr_scm_pmu_check_events function uses stat->stat_id with
string operations, to populate the nvdimm_events_map array. Since
stat_id variable is not NULL terminated, the kernel panics with
CONFIG_FORTIFY_SOURCE enabled at boot time.
Below are the logs of kernel panic:
detected buffer overflow in __fortify_strlen
------------[ cut here ]------------
kernel BUG at lib/string_helpers.c:980!
Oops: Exception in kernel mode, sig: 5 [#1]
NIP [c00000000077dad0] fortify_panic+0x28/0x38
LR [c00000000077dacc] fortify_panic+0x24/0x38
Call Trace:
[c0000022d77836e0] [c00000000077dacc] fortify_panic+0x24/0x38 (unreliable)
[c00800000deb2660] papr_scm_pmu_check_events.constprop.0+0x118/0x220 [papr_scm]
[c00800000deb2cb0] papr_scm_probe+0x288/0x62c [papr_scm]
[c0000000009b46a8] platform_probe+0x98/0x150
Fix this issue by using kmemdup_nul() to copy the content of
stat->stat_id directly to the nvdimm_events_map array.
mpe: stat->stat_id comes from the hypervisor, not userspace, so there is
no security exposure.
Fixes: 4c08d4bbc0 ("powerpc/papr_scm: Add perf interface support")
Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220505153451.35503-1-kjain@linux.ibm.com
Vladimir Oltean says:
====================
Ocelot VCAP fixes
Changes in v2:
fix the NPDs and UAFs caused by filter->trap_list in a more robust way
that actually does not introduce bugs of its own (1/5)
This series fixes issues found while running
tools/testing/selftests/net/forwarding/tc_actions.sh on the ocelot
switch:
- NULL pointer dereference when failing to offload a filter
- NULL pointer dereference after deleting a trap
- filters still having effect after being deleted
- dropped packets still being seen by software
- statistics counters showing double the amount of hits
- statistics counters showing inexistent hits
- invalid configurations not rejected
====================
Link: https://lore.kernel.org/r/20220504235503.4161890-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Given the following order of operations:
(1) we add filter A using tc-flower
(2) we send a packet that matches it
(3) we read the filter's statistics to find a hit count of 1
(4) we add a second filter B with a higher preference than A, and A
moves one position to the right to make room in the TCAM for it
(5) we send another packet, and this matches the second filter B
(6) we read the filter statistics again.
When this happens, the hit count of filter A is 2 and of filter B is 1,
despite a single packet having matched each filter.
Furthermore, in an alternate history, reading the filter stats a second
time between steps (3) and (4) makes the hit count of filter A remain at
1 after step (6), as expected.
The reason why this happens has to do with the filter->stats.pkts field,
which is written to hardware through the call path below:
vcap_entry_set
/ | \
/ | \
/ | \
/ | \
es0_entry_set is1_entry_set is2_entry_set
\ | /
\ | /
\ | /
vcap_data_set(data.counter, ...)
The primary role of filter->stats.pkts is to transport the filter hit
counters from the last readout all the way from vcap_entry_get() ->
ocelot_vcap_filter_stats_update() -> ocelot_cls_flower_stats().
The reason why vcap_entry_set() writes it to hardware is so that the
counters (saturating and having a limited bit width) are cleared
after each user space readout.
The writing of filter->stats.pkts to hardware during the TCAM entry
movement procedure is an unintentional consequence of the code design,
because the hit count isn't up to date at this point.
So at step (4), when filter A is moved by ocelot_vcap_filter_add() to
make room for filter B, the hardware hit count is 0 (no packet matched
on it in the meantime), but filter->stats.pkts is 1, because the last
readout saw the earlier packet. The movement procedure programs the old
hit count back to hardware, so this creates the impression to user space
that more packets have been matched than they really were.
The bug can be seen when running the gact_drop_and_ok_test() from the
tc_actions.sh selftest.
Fix the issue by reading back the hit count to tmp->stats.pkts before
migrating the VCAP filter. Sure, this is a best-effort technique, since
the packets that hit the rule between vcap_entry_get() and
vcap_entry_set() won't be counted, but at least it allows the counters
to be reliably used for selftests where the traffic is under control.
The vcap_entry_get() name is a bit unintuitive, but it only reads back
the counter portion of the TCAM entry, not the entire entry.
The index from which we retrieve the counter is also a bit unintuitive
(i - 1 during add, i + 1 during del), but this is the way in which TCAM
entry movement works. The "entry index" isn't a stored integer for a
TCAM filter, instead it is dynamically computed by
ocelot_vcap_block_get_filter_index() based on the entry's position in
the &block->rules list. That position (as well as block->count) is
automatically updated by ocelot_vcap_filter_add_to_block() on add, and
by ocelot_vcap_block_remove_filter() on del. So "i" is the new filter
index, and "i - 1" or "i + 1" respectively are the old addresses of that
TCAM entry (we only support installing/deleting one filter at a time).
Fixes: b596229448 ("net: mscc: ocelot: Add support for tcam")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Once the CPU port was added to the destination port mask of a packet, it
can never be cleared, so even packets marked as dropped by the MASK_MODE
of a VCAP IS2 filter will still reach it. This is why we need the
OCELOT_POLICER_DISCARD to "kill dropped packets dead" and make software
stop seeing them.
We disallow policer rules from being put on any other chain than the one
for the first lookup, but we don't do this for "drop" rules, although we
should. This change is merely ascertaining that the rules dont't
(completely) work and letting the user know.
The blamed commit is the one that introduced the multi-chain architecture
in ocelot. Prior to that, we should have always offloaded the filters to
VCAP IS2 lookup 0, where they did work.
Fixes: 1397a2eb52 ("net: mscc: ocelot: create TCAM skeleton from tc filter chains")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The VCAP IS2 TCAM is looked up twice per packet, and each filter can be
configured to only match during the first, second lookup, or both, or
none.
The blamed commit wrote the code for making VCAP IS2 filters match only
on the given lookup. But right below that code, there was another line
that explicitly made the lookup a "don't care", and this is overwriting
the lookup we've selected. So the code had no effect.
Some of the more noticeable effects of having filters match on both
lookups:
- in "tc -s filter show dev swp0 ingress", we see each packet matching a
VCAP IS2 filter counted twice. This throws off scripts such as
tools/testing/selftests/net/forwarding/tc_actions.sh and makes them
fail.
- a "tc-drop" action offloaded to VCAP IS2 needs a policer as well,
because once the CPU port becomes a member of the destination port
mask of a packet, nothing removes it, not even a PERMIT/DENY mask mode
with a port mask of 0. But VCAP IS2 rules with the POLICE_ENA bit in
the action vector can only appear in the first lookup. What happens
when a filter matches both lookups is that the action vector is
combined, and this makes the POLICE_ENA bit ineffective, since the
last lookup in which it has appeared is the second one. In other
words, "tc-drop" actions do not drop packets for the CPU port, dropped
packets are still seen by software unless there was an FDB entry that
directed those packets to some other place different from the CPU.
The last bit used to work, because in the initial commit b596229448
("net: mscc: ocelot: Add support for tcam"), we were writing the FIRST
field of the VCAP IS2 half key with a 1, not with a "don't care".
The change to "don't care" was made inadvertently by me in commit
c1c3993edb ("net: mscc: ocelot: generalize existing code for VCAP"),
which I just realized, and which needs a separate fix from this one,
for "stable" kernels that lack the commit blamed below.
Fixes: 226e9cd82a ("net: mscc: ocelot: only install TCAM entries into a specific lookup and PAG")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ocelot_vcap_filter_del() works by moving the next filters over the
current one, and then deleting the last filter by calling vcap_entry_set()
with a del_filter which was specially created by memsetting its memory
to zeroes. vcap_entry_set() then programs this to the TCAM and action
RAM via the cache registers.
The problem is that vcap_entry_set() is a dispatch function which looks
at del_filter->block_id. But since del_filter is zeroized memory, the
block_id is 0, or otherwise said, VCAP_ES0. So practically, what we do
is delete the entry at the same TCAM index from VCAP ES0 instead of IS1
or IS2.
The code was not always like this. vcap_entry_set() used to simply be
is2_entry_set(), and then, the logic used to work.
Restore the functionality by populating the block_id of the del_filter
based on the VCAP block of the filter that we're deleting. This makes
vcap_entry_set() know what to do.
Fixes: 1397a2eb52 ("net: mscc: ocelot: create TCAM skeleton from tc filter chains")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since the blamed commit, VCAP filters can appear on more than one list.
If their action is "trap", they are chained on ocelot->traps via
filter->trap_list. This is in addition to their normal placement on the
VCAP block->rules list head.
Therefore, when we free a VCAP filter, we must remove it from all lists
it is a member of, including ocelot->traps.
There are at least 2 bugs which are direct consequences of this design
decision.
First is the incorrect usage of list_empty(), meant to denote whether
"filter" is chained into ocelot->traps via filter->trap_list.
This does not do the correct thing, because list_empty() checks whether
"head->next == head", but in our case, head->next == head->prev == NULL.
So we dereference NULL pointers and die when we call list_del().
Second is the fact that not all places that should remove the filter
from ocelot->traps do so. One example is ocelot_vcap_block_remove_filter(),
which is where we have the main kfree(filter). By keeping freed filters
in ocelot->traps we end up in a use-after-free in
felix_update_trapping_destinations().
Attempting to fix all the buggy patterns is a whack-a-mole game which
makes the driver unmaintainable. Actually this is what the previous
patch version attempted to do:
https://patchwork.kernel.org/project/netdevbpf/patch/20220503115728.834457-3-vladimir.oltean@nxp.com/
but it introduced another set of bugs, because there are other places in
which create VCAP filters, not just ocelot_vcap_filter_create():
- ocelot_trap_add()
- felix_tag_8021q_vlan_add_rx()
- felix_tag_8021q_vlan_add_tx()
Relying on the convention that all those code paths must call
INIT_LIST_HEAD(&filter->trap_list) is not going to scale.
So let's do what should have been done in the first place and keep a
bool in struct ocelot_vcap_filter which denotes whether we are looking
at a trapping rule or not. Iterating now happens over the main VCAP IS2
block->rules. The advantage is that we no longer risk having stale
references to a freed filter, since it is only present in that list.
Fixes: e42bd4ed09 ("net: mscc: ocelot: keep traps in a list")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use kzalloc rather than duplicating its implementation, which
makes code simple and easy to understand.
Signed-off-by: Haowen Bai <baihaowen@meizu.com>
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Link: https://lore.kernel.org/r/20220505141733.1989450-6-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Read requests that return with NRF error are partially completed in
dasd_eckd_ese_read(). The function keeps track of the amount of
processed bytes and the driver will eventually return this information
back to the block layer for further processing via __dasd_cleanup_cqr()
when the request is in the final stage of processing (from the driver's
perspective).
For this, blk_update_request() is used which requires the number of
bytes to complete the request. As per documentation the nr_bytes
parameter is described as follows:
"number of bytes to complete for @req".
This was mistakenly interpreted as "number of bytes _left_ for @req"
leading to new requests with incorrect data length. The consequence are
inconsistent and completely wrong read requests as data from random
memory areas are read back.
Fix this by correctly specifying the amount of bytes that should be used
to complete the request.
Fixes: 5e6bdd37c5 ("s390/dasd: fix data corruption for thin provisioned devices")
Cc: stable@vger.kernel.org # 5.3+
Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Link: https://lore.kernel.org/r/20220505141733.1989450-5-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When reading unformatted tracks on ESE devices, the corresponding memory
areas are simply set to zero for each segment. This is done incorrectly
for blocksizes < 4096.
There are two problems. First, the increment of dst is done using the
counter of the loop (off), which is increased by blksize every
iteration. This leads to a much bigger increment for dst as actually
intended. Second, the increment of dst is done before the memory area
is set to 0, skipping a significant amount of bytes of memory.
This leads to illegal overwriting of memory and ultimately to a kernel
panic.
This is not a problem with 4k blocksize because
blk_queue_max_segment_size is set to PAGE_SIZE, always resulting in a
single iteration for the inner segment loop (bv.bv_len == blksize). The
incorrectly used 'off' value to increment dst is 0 and the correct
memory area is used.
In order to fix this for blksize < 4k, increment dst correctly using the
blksize and only do it at the end of the loop.
Fixes: 5e2b17e712 ("s390/dasd: Add dynamic formatting support for ESE volumes")
Cc: stable@vger.kernel.org # v5.3+
Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Link: https://lore.kernel.org/r/20220505141733.1989450-4-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For ESE devices we get an error for write operations on an unformatted
track. Afterwards the track will be formatted and the IO operation
restarted.
When using alias devices a track might be accessed by multiple requests
simultaneously and there is a race window that a track gets formatted
twice resulting in data loss.
Prevent this by remembering the amount of formatted tracks when starting
a request and comparing this number before actually formatting a track
on the fly. If the number has changed there is a chance that the current
track was finally formatted in between. As a result do not format the
track and restart the current IO to check.
The number of formatted tracks does not match the overall number of
formatted tracks on the device and it might wrap around but this is no
problem. It is only needed to recognize that a track has been formatted at
all in between.
Fixes: 5e2b17e712 ("s390/dasd: Add dynamic formatting support for ESE volumes")
Cc: stable@vger.kernel.org # 5.3+
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Link: https://lore.kernel.org/r/20220505141733.1989450-3-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For ESE devices we get an error when accessing an unformatted track.
The handling of this error will return zero data for read requests and
format the track on demand before writing to it. To do this the code needs
to distinguish between read and write requests. This is done with data from
the blocklayer request. A pointer to the blocklayer request is stored in
the CQR.
If there is an error on the device an ERP request is built to do error
recovery. While the ERP request is mostly a copy of the original CQR the
pointer to the blocklayer request is not copied to not accidentally pass
it back to the blocklayer without cleanup.
This leads to the error that during ESE handling after an ERP request was
built it is not possible to determine the IO direction. This leads to the
formatting of a track for read requests which might in turn lead to data
corruption.
Fixes: 5e2b17e712 ("s390/dasd: Add dynamic formatting support for ESE volumes")
Cc: stable@vger.kernel.org # 5.3+
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Link: https://lore.kernel.org/r/20220505141733.1989450-2-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The find_next_netdev_feature() macro gets the "remaining length",
not bit index.
Passing "bit - 1" for the following iteration is wrong as it skips
the adjacent bit. Pass "bit" instead.
Fixes: 3b89ea9c59 ("net: Fix for_each_netdev_feature on Big endian")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Link: https://lore.kernel.org/r/20220504080914.1918-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Nicolas Dichtel says:
====================
vrf: fix address binding with icmp socket
The first patch fixes the issue.
The second patch adds related tests in selftests.
====================
Link: https://lore.kernel.org/r/20220504090739.21821-1-nicolas.dichtel@6wind.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The 'ping' utility is able to manage two kind of sockets (raw or icmp),
depending on the sysctl ping_group_range. By default, ping_group_range is
set to '1 0', which forces ping to use an ip raw socket.
Let's replay the ping tests by allowing 'ping' to use the ip icmp socket.
After the previous patch, ipv4 tests results are the same with both kinds
of socket. For ipv6, there are a lot a new failures (the previous patch
fixes only two cases).
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When ping_group_range is updated, 'ping' uses the DGRAM ICMP socket,
instead of an IP raw socket. In this case, 'ping' is unable to bind its
socket to a local address owned by a vrflite.
Before the patch:
$ sysctl -w net.ipv4.ping_group_range='0 2147483647'
$ ip link add blue type vrf table 10
$ ip link add foo type dummy
$ ip link set foo master blue
$ ip link set foo up
$ ip addr add 192.168.1.1/24 dev foo
$ ip addr add 2001::1/64 dev foo
$ ip vrf exec blue ping -c1 -I 192.168.1.1 192.168.1.2
ping: bind: Cannot assign requested address
$ ip vrf exec blue ping6 -c1 -I 2001::1 2001::2
ping6: bind icmp socket: Cannot assign requested address
CC: stable@vger.kernel.org
Fixes: 1b69c6d0ae ("net: Introduce L3 Master device abstraction")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>