also fixes some corner case bugs.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl+EjbwRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1g5ihAAw1B4ZV+yG6BKMIx8oC+LSefSX0lk6nv0
pmPWIdvmlL1FHlyvALymU2UCneA7KYkGjdz6LYtXlvdGTpVj1M7mJAFBp8ugJj2k
oOelkohTsTHOU/Oae4+OX8E8Kt5PsFS21wSoTo+WYI2pgl1iGzVPPCreU+uMB3lP
IaenOvBlNeXUy0S1tDYLFQT3wtS+lC2Yu6mG5AklDxGvAxpO73U9KJnRwpx34aCL
Z24fW5gZkZg5KQa1ULjZLMUoCNEz24QP/V5PMmgbD3QRkxTeNliaGPvbBV5c2fJd
WOaI4QUEah2wTtRgXv3z1SMbhAFnIoo1fAZKs1DEfT+HUO+Aw6YDpH58TdaPCs4/
ES3Tk9lcRUITAeilbO4pGPUEq+9FPHZ+hZaevQTDcjHmE4VPG984UqFzm2Pc4NEb
gZ6R0zKU9I7w6GFyXn3KeaPwf79CVGuJ4Zd1Is86ab2ft7V3m4cyHT7BKBkF+skD
MQEAhroE7WL/+E1mvK/zej9I9PTbHtk50GdmKTzcnrNdFLehGz6hyMM/PnBKE2j5
Gsw+x/DjCXQSpwVTkJCBLf3BqMpbi7C5vzwbIPYcKuE3Ggo0f1n9rUUJwCNybo21
XhSWq8GFX35+5nBLfznXRbaU1uyvWwWvheWwR56LbTHgdPUlgVr/aa/KQ5bliz0a
JCUUcDnUZjw=
=+VGg
-----END PGP SIGNATURE-----
Merge tag 'x86-kaslr-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 kaslr updates from Ingo Molnar:
"This cleans up and simplifies the x86 KASLR code, and also fixes some
corner case bugs"
* tag 'x86-kaslr-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
x86/kaslr: Replace strlen() with strnlen()
x86/kaslr: Add a check that the random address is in range
x86/kaslr: Make local variables 64-bit
x86/kaslr: Replace 'unsigned long long' with 'u64'
x86/kaslr: Make minimum/image_size 'unsigned long'
x86/kaslr: Small cleanup of find_random_phys_addr()
x86/kaslr: Drop unnecessary alignment in find_random_virt_addr()
x86/kaslr: Drop redundant check in store_slot_info()
x86/kaslr: Make the type of number of slots/slot areas consistent
x86/kaslr: Drop test for command-line parameters before parsing
x86/kaslr: Simplify process_gb_huge_pages()
x86/kaslr: Short-circuit gb_huge_pages on x86-32
x86/kaslr: Fix off-by-one error in process_gb_huge_pages()
x86/kaslr: Drop some redundant checks from __process_mem_region()
x86/kaslr: Drop redundant variable in __process_mem_region()
x86/kaslr: Eliminate 'start_orig' local variable from __process_mem_region()
x86/kaslr: Drop redundant cur_entry from __process_mem_region()
x86/kaslr: Fix off-by-one error in __process_mem_region()
x86/kaslr: Initialize mem_limit to the real maximum address
x86/kaslr: Fix process_efi_entries comment
...
x86 Intel updates:
- Add Jasper Lake support
- Add support for TopDown metrics on Ice Lake
- Fix Ice Lake & Tiger Lake uncore support, add Snow Ridge support
- Add a PCI sub driver to support uncore PMUs where the PCI resources
have been claimed already - extending the range of supported systems.
x86 AMD updates:
- Restore 'perf stat -a' behaviour to program the uncore PMU
to count all CPU threads.
- Fix setting the proper count when sampling Large Increment
per Cycle events / 'paired' events.
- Fix IBS Fetch sampling on F17h and some other IBS fine tuning,
greatly reducing the number of interrupts when large sample
periods are specified.
- Extends Family 17h RAPL support to also work on compatible
F19h machines.
Core code updates:
- Fix race in perf_mmap_close()
- Add PERF_EV_CAP_SIBLING, to denote that sibling events should be
closed if the leader is removed.
- Smaller fixes and updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl+Ef40RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1h7NQ//ZdQ26Yg79ZaxBX1QSINJ9AgXDi6rXs75
qU9qNwr/6EF+633RZoPQGAE0Iy5v6h7iLFokcJzM9+kK/rE3ax44tSnPlcMa0+6N
SHXKCa5iL+hH7o2Spo2MZwCYseH79rloX3TSH7ajnN3X8PvwgWshF0lUE3WEWtCs
eHSojdCk43IuL9TpusuNOBM2FvgnheFYWiMbFHd0MTBUMxul30sLVCG8IIWCPA+q
TwG4RJS3X42VbL3SuAGFmOv4OmqNsfkvHvjpDs4NF07tRB9zjXzGrxmGhgSw0NAN
2KK25qbmrpKATIb4Eqsgk/yikX/SCrDEXrjhg3r8FnyPvRfctq1crZjjf672PI2E
bDda76dH6Lq9jv5fsyJjas5OsYdMKBCnA+tGQxXPGbmTXeEcYMRbDnwhYnevI/Q/
8pP+xstF0pmBA3tvpDPrQnYH72Qt7CLJSdcTB15NqZftU2tJxaAyJGx4gJy33jxQ
wu6BIEGHQ7onQYiIyTwsBHyz6xNsF/CRHwAPcGdYrRRbXB5K5nxHiXNb4awciTMx
2HF31/S4OqURNpfcpxOQo+1fb/cLqj3loGqE4jCTwkbS3lrHcAcfxyv9QNn77l1f
hdQ0jworbUNVLUYEUQz1bkZ06GD3LSSas2ZlY1NNdHo62mjyXMQmgirNcZmrFgWl
tl2gNFAU9x4=
=2fuY
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull performance events updates from Ingo Molnar:
"x86 Intel updates:
- Add Jasper Lake support
- Add support for TopDown metrics on Ice Lake
- Fix Ice Lake & Tiger Lake uncore support, add Snow Ridge support
- Add a PCI sub driver to support uncore PMUs where the PCI resources
have been claimed already - extending the range of supported
systems.
x86 AMD updates:
- Restore 'perf stat -a' behaviour to program the uncore PMU to count
all CPU threads.
- Fix setting the proper count when sampling Large Increment per
Cycle events / 'paired' events.
- Fix IBS Fetch sampling on F17h and some other IBS fine tuning,
greatly reducing the number of interrupts when large sample periods
are specified.
- Extends Family 17h RAPL support to also work on compatible F19h
machines.
Core code updates:
- Fix race in perf_mmap_close()
- Add PERF_EV_CAP_SIBLING, to denote that sibling events should be
closed if the leader is removed.
- Smaller fixes and updates"
* tag 'perf-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
perf/core: Fix race in the perf_mmap_close() function
perf/x86: Fix n_metric for cancelled txn
perf/x86: Fix n_pair for cancelled txn
x86/events/amd/iommu: Fix sizeof mismatch
perf/x86/intel: Check perf metrics feature for each CPU
perf/x86/intel: Fix Ice Lake event constraint table
perf/x86/intel/uncore: Fix the scale of the IMC free-running events
perf/x86/intel/uncore: Fix for iio mapping on Skylake Server
perf/x86/msr: Add Jasper Lake support
perf/x86/intel: Add Jasper Lake support
perf/x86/intel/uncore: Reduce the number of CBOX counters
perf/x86/intel/uncore: Update Ice Lake uncore units
perf/x86/intel/uncore: Split the Ice Lake and Tiger Lake MSR uncore support
perf/x86/intel/uncore: Support PCIe3 unit on Snow Ridge
perf/x86/intel/uncore: Generic support for the PCI sub driver
perf/x86/intel/uncore: Factor out uncore_pci_pmu_unregister()
perf/x86/intel/uncore: Factor out uncore_pci_pmu_register()
perf/x86/intel/uncore: Factor out uncore_pci_find_dev_pmu()
perf/x86/intel/uncore: Factor out uncore_pci_get_dev_die_info()
perf/amd/uncore: Inform the user how many counters each uncore PMU has
...
applied to indirect function calls. Remove a data load (indirection) by
modifying the text.
They give the flexibility of function pointers, but with better
performance. (This is especially important for cases where
retpolines would otherwise be used, as retpolines can be pretty
slow.)
API overview:
DECLARE_STATIC_CALL(name, func);
DEFINE_STATIC_CALL(name, func);
DEFINE_STATIC_CALL_NULL(name, typename);
static_call(name)(args...);
static_call_cond(name)(args...);
static_call_update(name, func);
x86 is supported via text patching, otherwise basic indirect calls are used,
with function pointers.
There's a second variant using inline code patching, inspired by jump-labels,
implemented on x86 as well.
The new APIs are utilized in the x86 perf code, a heavy user of function pointers,
where static calls speed up the PMU handler by 4.2% (!).
The generic implementation is not really excercised on other architectures,
outside of the trivial test_static_call_init() self-test.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl+EfAQRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iEAw//divHeVCJnHhV+YBbuI9ROUsERkzu8VhK
O1DEmW68Fvj7pszT8NZsMjtkt97ZtxDRK7aCJiiup0eItG9qCJ8lpCLb84ZbizHV
HhCbhBLrpxSvTrWlQnkgP1OkPAbtoryIjVlZzWhjye2MY8UEbVnZWyviBolbAAxH
Fk1Yi56fIMu19GO+9Ohzy9E2VDnVEH1iMx5YWoLD2H88Qbq/yEMP+U2tIj8hIVKT
Y/jdogihNXRIau6QB+YPfDPisdty+RHxfU7zct4Rv8cFF5ylglZB5fD34C3sUQF2
WqsaYz7zjUj9f02F8pw8hIaAT7InzArPhlNVITxf2oMfmdrNqBptnSCddZqCJLvv
oDGew21k50Zcbqkv9amclpxXH5tTpRvJeqit2pz/85GMeqBRuhzHUAkCpht5YA73
qJsHWS3z+qIxKi0tDbhDJswuwa51q5sgdUUwo1uCr3wT3DGDlqNhCAZBzX14dcty
0shDSbv13TCwqAcb7asPzEoPwE15cwa+x+viGEIL901pyZKyQYjs/abDU26It3BW
roWRkuVJZ9/QMdZJs1v7kaXw1L8YiKIDkBgke+xbfrDwEvvjudQkl2LUL66DB11j
RJU3GyxKClvdY06SSRh/H13fqZLNKh1JZ0nPEWSTJECDFN9zcDjrDrod/7PFOcpY
NAlawLoGG+s=
=JvpF
-----END PGP SIGNATURE-----
Merge tag 'core-static_call-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull static call support from Ingo Molnar:
"This introduces static_call(), which is the idea of static_branch()
applied to indirect function calls. Remove a data load (indirection)
by modifying the text.
They give the flexibility of function pointers, but with better
performance. (This is especially important for cases where retpolines
would otherwise be used, as retpolines can be pretty slow.)
API overview:
DECLARE_STATIC_CALL(name, func);
DEFINE_STATIC_CALL(name, func);
DEFINE_STATIC_CALL_NULL(name, typename);
static_call(name)(args...);
static_call_cond(name)(args...);
static_call_update(name, func);
x86 is supported via text patching, otherwise basic indirect calls are
used, with function pointers.
There's a second variant using inline code patching, inspired by
jump-labels, implemented on x86 as well.
The new APIs are utilized in the x86 perf code, a heavy user of
function pointers, where static calls speed up the PMU handler by
4.2% (!).
The generic implementation is not really excercised on other
architectures, outside of the trivial test_static_call_init()
self-test"
* tag 'core-static_call-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
static_call: Fix return type of static_call_init
tracepoint: Fix out of sync data passing by static caller
tracepoint: Fix overly long tracepoint names
x86/perf, static_call: Optimize x86_pmu methods
tracepoint: Optimize using static_call()
static_call: Allow early init
static_call: Add some validation
static_call: Handle tail-calls
static_call: Add static_call_cond()
x86/alternatives: Teach text_poke_bp() to emulate RET
static_call: Add simple self-test for static calls
x86/static_call: Add inline static call implementation for x86-64
x86/static_call: Add out-of-line static call implementation
static_call: Avoid kprobes on inline static_call()s
static_call: Add inline static call infrastructure
static_call: Add basic static call infrastructure
compiler.h: Make __ADDRESSABLE() symbol truly unique
jump_label,module: Fix module lifetime for __jump_label_mod_text_reserved()
module: Properly propagate MODULE_STATE_COMING failure
module: Fix up module_notifier return values
...
because the heuristics that various linkers & compilers use to handle them
(include these bits into the output image vs discarding them silently)
are both highly idiosyncratic and also version dependent.
Instead of this historically problematic mess, this tree by Kees Cook (et al)
adds build time asserts and build time warnings if there's any orphan section
in the kernel or if a section is not sized as expected.
And because we relied on so many silent assumptions in this area, fix a metric
ton of dependencies and some outright bugs related to this, before we can
finally enable the checks on the x86, ARM and ARM64 platforms.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl+Edv4RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hiKBAApdJEOaK7hMc3013DYNctklIxEPJL2mFJ
11YJRIh4pUJTF0TE+EHT/D+rSIuRsyuoSmOQBQ61/wVSnyG067GjjVJRqh/eYaJ1
fDhJi2FuHOjXl+CiN0KxzBjjp+V4NhF7jHT59tpQSvfZeg7FjteoxfztxaCp5ek3
S3wHB3CC4c4jE3lfjHem1E9/PwT4kwPYx1c3gAUdEqJdjkihjX9fWusfjLeqW6/d
Y5VkApi6bL9XiZUZj5l0dEIweLJJ86+PkKJqpo3spxxEak1LSn1MEix+lcJ8e1Kg
sb/bEEivDcmFlFWOJnn0QLquCR0Cx5bz1pwsL0tuf0yAd4+sXX5IMuGUysZlEdKM
BHL9h5HbevGF4BScwZwZH7lyEg7q67s5KnRu4hxy0Swfcj7y0oT/9lXqpbpZ2DqO
Hd+bRRQKIbqnTMp0hcit9LfpLp93vj0dBlaV5ocAJJlu62u9VnwGG5HQuZ5giLUr
kA1SLw63Y1wopFRxgFyER8les7eLsu0zxHeK44rRVlVnfI99OMTOgVNicmDFy3Fm
AfcnfJG0BqBEJGQz5es34uQQKKBwFPtC9NztopI62KiwOspYYZyrO1BNxdOc6DlS
mIHrmO89HMXuid5eolvLaFqUWirHoWO8TlycgZxUWVHc2txVPjAEU/axouU/dSSU
w/6GpzAa+7g=
=fXAw
-----END PGP SIGNATURE-----
Merge tag 'core-build-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull orphan section checking from Ingo Molnar:
"Orphan link sections were a long-standing source of obscure bugs,
because the heuristics that various linkers & compilers use to handle
them (include these bits into the output image vs discarding them
silently) are both highly idiosyncratic and also version dependent.
Instead of this historically problematic mess, this tree by Kees Cook
(et al) adds build time asserts and build time warnings if there's any
orphan section in the kernel or if a section is not sized as expected.
And because we relied on so many silent assumptions in this area, fix
a metric ton of dependencies and some outright bugs related to this,
before we can finally enable the checks on the x86, ARM and ARM64
platforms"
* tag 'core-build-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
x86/boot/compressed: Warn on orphan section placement
x86/build: Warn on orphan section placement
arm/boot: Warn on orphan section placement
arm/build: Warn on orphan section placement
arm64/build: Warn on orphan section placement
x86/boot/compressed: Add missing debugging sections to output
x86/boot/compressed: Remove, discard, or assert for unwanted sections
x86/boot/compressed: Reorganize zero-size section asserts
x86/build: Add asserts for unwanted sections
x86/build: Enforce an empty .got.plt section
x86/asm: Avoid generating unused kprobe sections
arm/boot: Handle all sections explicitly
arm/build: Assert for unwanted sections
arm/build: Add missing sections
arm/build: Explicitly keep .ARM.attributes sections
arm/build: Refactor linker script headers
arm64/build: Assert for unwanted sections
arm64/build: Add missing DWARF sections
arm64/build: Use common DISCARDS in linker script
arm64/build: Remove .eh_frame* sections due to unwind tables
...
- Preliminary RISC-V enablement - the bulk of it will arrive via the RISCV tree.
- Relax decompressed image placement rules for 32-bit ARM
- Add support for passing MOK certificate table contents via a config table
rather than a EFI variable.
- Add support for 18 bit DIMM row IDs in the CPER records.
- Work around broken Dell firmware that passes the entire Boot#### variable
contents as the command line
- Add definition of the EFI_MEMORY_CPU_CRYPTO memory attribute so we can
identify it in the memory map listings.
- Don't abort the boot on arm64 if the EFI RNG protocol is available but
returns with an error
- Replace slashes with exclamation marks in efivarfs file names
- Split efi-pstore from the deprecated efivars sysfs code, so we can
disable the latter on !x86.
- Misc fixes, cleanups and updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl+Ec9QRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1inTQ//TYj3kJq/7sWfUAxmAsWnUEC005YCNf0T
x3kJQv3zYX4Rl4eEwkff8S1PrqqvwUP5yUZYApp8HD9s9CYvzz5iG5xtf/jX+QaV
06JnTMnkoycx2NaOlbr1cmcIn4/cAhQVYbVCeVrlf7QL8enNTBr5IIQmo4mgP8Lc
mauSsO1XU8ZuMQM+JcZSxAkAPxlhz3dbR5GteP4o2K4ShQKpiTCOfOG1J3FvUYba
s1HGnhHFlkQr6m3pC+iG8dnAG0YtwHMH1eJVP7mbeKUsMXz944U8OVXDWxtn81pH
/Xt/aFZXnoqwlSXythAr6vFTuEEn40n+qoOK6jhtcGPUeiAFPJgiaeAXw3gO0YBe
Y8nEgdGfdNOMih94McRd4M6gB/N3vdqAGt+vjiZSCtzE+nTWRyIXSGCXuDVpkvL4
VpEXpPINnt1FZZ3T/7dPro4X7pXALhODE+pl36RCbfHVBZKRfLV1Mc1prAUGXPxW
E0MfaM9TxDnVhs3VPWlHmRgavee2MT1Tl/ES4CrRHEoz8ZCcu4MfROQyao8+Gobr
VR+jVk+xbyDrykEc6jdAK4sDFXpTambuV624LiKkh6Mc4yfHRhPGrmP5c5l7SnCd
aLp+scQ4T7sqkLuYlXpausXE3h4sm5uur5hNIRpdlvnwZBXpDEpkzI8x0C9OYr0Q
kvFrreQWPLQ=
=ZNI8
-----END PGP SIGNATURE-----
Merge tag 'efi-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI changes from Ingo Molnar:
- Preliminary RISC-V enablement - the bulk of it will arrive via the
RISCV tree.
- Relax decompressed image placement rules for 32-bit ARM
- Add support for passing MOK certificate table contents via a config
table rather than a EFI variable.
- Add support for 18 bit DIMM row IDs in the CPER records.
- Work around broken Dell firmware that passes the entire Boot####
variable contents as the command line
- Add definition of the EFI_MEMORY_CPU_CRYPTO memory attribute so we
can identify it in the memory map listings.
- Don't abort the boot on arm64 if the EFI RNG protocol is available
but returns with an error
- Replace slashes with exclamation marks in efivarfs file names
- Split efi-pstore from the deprecated efivars sysfs code, so we can
disable the latter on !x86.
- Misc fixes, cleanups and updates.
* tag 'efi-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
efi: mokvar: add missing include of asm/early_ioremap.h
efi: efivars: limit availability to X86 builds
efi: remove some false dependencies on CONFIG_EFI_VARS
efi: gsmi: fix false dependency on CONFIG_EFI_VARS
efi: efivars: un-export efivars_sysfs_init()
efi: pstore: move workqueue handling out of efivars
efi: pstore: disentangle from deprecated efivars module
efi: mokvar-table: fix some issues in new code
efi/arm64: libstub: Deal gracefully with EFI_RNG_PROTOCOL failure
efivarfs: Replace invalid slashes with exclamation marks in dentries.
efi: Delete deprecated parameter comments
efi/libstub: Fix missing-prototypes in string.c
efi: Add definition of EFI_MEMORY_CPU_CRYPTO and ability to report it
cper,edac,efi: Memory Error Record: bank group/address and chip id
edac,ghes,cper: Add Row Extension to Memory Error Record
efi/x86: Add a quirk to support command line arguments on Dell EFI firmware
efi/libstub: Add efi_warn and *_once logging helpers
integrity: Load certs from the EFI MOK config table
integrity: Move import of MokListRT certs to a separate routine
efi: Support for MOK variable config table
...
- Add deadlock detection for recursive read-locks. The rationale is outlined
in:
224ec489d3: ("lockdep/Documention: Recursive read lock detection reasoning")
The main deadlock pattern we want to detect is:
TASK A: TASK B:
read_lock(X);
write_lock(X);
read_lock_2(X);
- Add "latch sequence counters" (seqcount_latch_t):
A sequence counter variant where the counter even/odd value is used to
switch between two copies of protected data. This allows the read path,
typically NMIs, to safely interrupt the write side critical section.
We utilize this new variant for sched-clock, and to make x86 TSC handling safer.
- Other seqlock cleanups, fixes and enhancements
- KCSAN updates
- LKMM updates
- Misc updates, cleanups and fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl+EX6QRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1g3gxAAkg+Jy/tcdRxlxlEDOQPFy1mBqvFmulNA
pGFPkB6dzqmAWF/NfOZSl4g/h/mqGYsq2V+PfK5E8Sq8DQ/yCmnLhjgVOHNUUliv
x0WWfOysNgJdtdf69NLYJufIQhxhyI0dwFHHoHIsCdGdGqjh2DVevQFPFTBjdpOc
BUZYo+u3gCaCdB6A2nmlcWYbEw8eVEHgv3qLG6dq46J0KJOV0HfliqJoU3EZqH+s
977LvEIo+THfuYWMo/Jepwngbi0y36KeeukOAdwm9fK196htBHIUR+YPPrAe+FWD
z+UXP5IS5XIw9V1sGLmUaC2m+6gpdW19jKBtlzPkxHXmJmsgiZdLLeytEh3WYey7
nzfH+9Jd4NyyZKucLssYkOjf6P5BxGKCyJ9LXb7vlSthIhiDdFNx47oKtW4hxjOY
jubsI3BP5c3G1sIBIjTS53XmOhJg+Z52FxTpQ33JswXn1wGidcHZiuNHZuU5q28p
+tn8rGb2NGJFb4Sw/Vp0yTcqIpEXf+vweiQoaxm6tc9BWzcVzZntGnh0i3gFotx/
VgKafN4+pgXgo6bwHbN2WBK2FGyvcXFaptfaOMZL48En82hJ1DI6EnBEYN+vuERQ
JcCXg+iHeeVbxoou7q8NJxITkBmEL5xNBIugXRRqNSP3fXLxKjFuPYqT84/e7yZi
elGTReYcq6g=
=Iq51
-----END PGP SIGNATURE-----
Merge tag 'locking-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"These are the locking updates for v5.10:
- Add deadlock detection for recursive read-locks.
The rationale is outlined in commit 224ec489d3 ("lockdep/
Documention: Recursive read lock detection reasoning")
The main deadlock pattern we want to detect is:
TASK A: TASK B:
read_lock(X);
write_lock(X);
read_lock_2(X);
- Add "latch sequence counters" (seqcount_latch_t):
A sequence counter variant where the counter even/odd value is used
to switch between two copies of protected data. This allows the
read path, typically NMIs, to safely interrupt the write side
critical section.
We utilize this new variant for sched-clock, and to make x86 TSC
handling safer.
- Other seqlock cleanups, fixes and enhancements
- KCSAN updates
- LKMM updates
- Misc updates, cleanups and fixes"
* tag 'locking-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (67 commits)
lockdep: Revert "lockdep: Use raw_cpu_*() for per-cpu variables"
lockdep: Fix lockdep recursion
lockdep: Fix usage_traceoverflow
locking/atomics: Check atomic-arch-fallback.h too
locking/seqlock: Tweak DEFINE_SEQLOCK() kernel doc
lockdep: Optimize the memory usage of circular queue
seqlock: Unbreak lockdep
seqlock: PREEMPT_RT: Do not starve seqlock_t writers
seqlock: seqcount_LOCKNAME_t: Introduce PREEMPT_RT support
seqlock: seqcount_t: Implement all read APIs as statement expressions
seqlock: Use unique prefix for seqcount_t property accessors
seqlock: seqcount_LOCKNAME_t: Standardize naming convention
seqlock: seqcount latch APIs: Only allow seqcount_latch_t
rbtree_latch: Use seqcount_latch_t
x86/tsc: Use seqcount_latch_t
timekeeping: Use seqcount_latch_t
time/sched_clock: Use seqcount_latch_t
seqlock: Introduce seqcount_latch_t
mm/swap: Do not abuse the seqcount_t latching API
time/sched_clock: Use raw_read_seqcount_latch() during suspend
...
- Ensure BTF synchronization under all circumstances
- Distangle kernel and user mode #DB further
- Get ordering vs. the debug notifier correct to make KGDB work more
reliably.
- Cleanup historical gunk and make the code simpler to understand.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+ElmQTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoThRD/4zTPn3NXQEY5VFi2CqhGjrm1Z6mKZ5
dDfJDwQ5RByJhzehMy9guZ7QKZcUa7zikI/NHTz4FfcweIyaTY9ypJdo0hdY6DqW
NaUB2zo+AoPDH6/GnxN5pZS4Y3wB2nQRqr0xjYHmbHcv796LRhccZiRb2zVeIpNF
J9uc92p273ygNRq475TBIxmzxVOOq0K8SDupFN6lxZbibkSUFzYMX6mfT0e8UMG0
fRQshModZSbQ+VZ26kysdXL2xN3CgLpK/ZuRJSFEFQKLzcmRzPRAYm3m1eZAeKMr
dtwnScOWCOBDC+/lFuTA3ZOOV06lfGz0tsfmv0lTlg+NzH37weXUeiiBYdQqMobs
4Iy22pzVODVFB2K+Qqh4RM1ZaukRm9hBCqFq8kWedafNCq7sJEZlaeNkR5HLCdnW
A0I4SL+XHgsY3Hg3U41tFgrRMcOdKML6BWsmKB65G3S3baxvebZIOc/1CuFiIaQz
ggN/fhKNclKpGYp+PXgwZQs9cEDeJcI6W2epZdWKjCvi2IE4zJ38liNsg+GeIswD
kBQFSgx9bogB6BPfuK36AiKzQVzgottE2zhVuXz35J03Ixvu5WhmyNIIyCOzVjHD
R/GSBarEVsL6N7/j4RenRaAIgsxajKljFUXEVvdtr6YAIje089PcuUmBVnkKeIUr
HQe/lZcIflg48Q==
=nIzH
-----END PGP SIGNATURE-----
Merge tag 'x86-entry-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 entry code updates from Thomas Gleixner:
"More consolidation and correctness fixes for the debug exception:
- Ensure BTF synchronization under all circumstances
- Distangle kernel and user mode #DB further
- Get ordering vs. the debug notifier correct to make KGDB work more
reliably.
- Cleanup historical gunk and make the code simpler to understand"
* tag 'x86-entry-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/debug: Change thread.debugreg6 to thread.virtual_dr6
x86/debug: Support negative polarity DR6 bits
x86/debug: Simplify hw_breakpoint_handler()
x86/debug: Remove aout_dump_debugregs()
x86/debug: Remove the historical junk
x86/debug: Move cond_local_irq_enable() block into exc_debug_user()
x86/debug: Move historical SYSENTER junk into exc_debug_kernel()
x86/debug: Simplify #DB signal code
x86/debug: Remove handle_debug(.user) argument
x86/debug: Move kprobe_debug_handler() into exc_debug_kernel()
x86/debug: Sync BTF earlier
devices which require non-PCI based MSI handling.
- Cleanup historical leftovers all over the place
- Rework the code to utilize more core functionality
- Wrap XEN PCI/MSI interrupts into an irqdomain to make irqdomain
assignment to PCI devices possible.
- Assign irqdomains to PCI devices at initialization time which allows
to utilize the full functionality of hierarchical irqdomains.
- Remove arch_.*_msi_irq() functions from X86 and utilize the irqdomain
which is assigned to the device for interrupt management.
- Make the arch_.*_msi_irq() support conditional on a config switch and
let the last few users select it.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+EUxcTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoagLEACGp5U7a4mk24GsOZJDhrua1PHR/fhb
enn/5yOPpxDXdYmtFHIjV5qrNjDTV/WqDlI96KOi+oinG1Eoj0O/MA1AcSRhp6nf
jVdAuK1X0DHDUTEeTAP0JFwqd2j0KlIOphBrIMgeWIf1CRKlYiJaO+ioF9fKgwZ/
/HigOTSykGYMPggm3JXnWTWtJkKSGFxeADBvVHt5RpVmbWtrI4YoSBxKEMtvjyeM
5+GsqbCad1CnFYTN74N+QWVGmgGnUWGEzWsPYnJ9hW+yyjad1kWx3n6NcCWhssaC
E4vAXl6JuCPntL7jBFkbfUkQsgq12ThMZYWpCq8pShJA9O2tDKkxIGasHWrIt4cz
nYrESiv6hM7edjtOvBc086Gd0A2EyGOM879goHyaNVaTO4rI6jfZG7PlW1HHWibS
mf/bdTXBtULGNgEt7T8Qnb8sZ+D01WqzLrq/wm645jIrTzvNHUEpOhT1aH/g4TFQ
cNHD5PcM9OTmiBir9srNd47+1s2mpfwdMYHKBt2QgiXMO8fRgdtr6WLQE4vJjmG8
sA0yGGsgdTKeg2wW1ERF1pWL0Lt05Iaa42Skm0D3BwcOG2n5ltkBHzVllto9cTUh
kIldAOgxGE6QeCnnlrnbHz5mvzt/3Ih/PIKqPSUAC94Kx1yvVHRYuOvDExeO8DFB
P+f0TkrscZObSg==
=JlqV
-----END PGP SIGNATURE-----
Merge tag 'x86-irq-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 irq updates from Thomas Gleixner:
"Surgery of the MSI interrupt handling to prepare the support of
upcoming devices which require non-PCI based MSI handling:
- Cleanup historical leftovers all over the place
- Rework the code to utilize more core functionality
- Wrap XEN PCI/MSI interrupts into an irqdomain to make irqdomain
assignment to PCI devices possible.
- Assign irqdomains to PCI devices at initialization time which
allows to utilize the full functionality of hierarchical
irqdomains.
- Remove arch_.*_msi_irq() functions from X86 and utilize the
irqdomain which is assigned to the device for interrupt management.
- Make the arch_.*_msi_irq() support conditional on a config switch
and let the last few users select it"
* tag 'x86-irq-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
PCI: MSI: Fix Kconfig dependencies for PCI_MSI_ARCH_FALLBACKS
x86/apic/msi: Unbreak DMAR and HPET MSI
iommu/amd: Remove domain search for PCI/MSI
iommu/vt-d: Remove domain search for PCI/MSI[X]
x86/irq: Make most MSI ops XEN private
x86/irq: Cleanup the arch_*_msi_irqs() leftovers
PCI/MSI: Make arch_.*_msi_irq[s] fallbacks selectable
x86/pci: Set default irq domain in pcibios_add_device()
iommm/amd: Store irq domain in struct device
iommm/vt-d: Store irq domain in struct device
x86/xen: Wrap XEN MSI management into irqdomain
irqdomain/msi: Allow to override msi_domain_alloc/free_irqs()
x86/xen: Consolidate XEN-MSI init
x86/xen: Rework MSI teardown
x86/xen: Make xen_msi_init() static and rename it to xen_hvm_msi_init()
PCI/MSI: Provide pci_dev_has_special_msi_domain() helper
PCI_vmd_Mark_VMD_irqdomain_with_DOMAIN_BUS_VMD_MSI
irqdomain/msi: Provide DOMAIN_BUS_VMD_MSI
x86/irq: Initialize PCI/MSI domain at PCI init time
x86/pci: Reducde #ifdeffery in PCI init code
...
Mossberg.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl+EOVUACgkQEsHwGGHe
VUoOmw//QV8qrglfIeIXQH6+tQRth77m9fgG0HcUWeZV6F/PLHG0xu6fwp15ty3N
mzbtvAki1ts1MuUPI3NBV5wszrgsc3ktuvGfy0+c+Eb/qH4ZZoLWQNixOya934Jk
VRJLzhpjemj+uuwsw5/mFasepoCVKTV7XpuX2rvUtOklRVvxJTRiiDwaMG6e/DiL
gLJFKrceeZbpthdIlpb4+YUNiGfSt9RXIIbbs6hIbM/wPCmeROSIfkyF8r2aPLLs
6EWZ6I3h/Vb1ql1JkXj1cfR9PXkaiQuRjk61G+QQHe7S1E8JHFdAePROz6KOdttb
6/eKpCZiBFTrYdma74z0vgOTVh5/VeyrGQINVif4qxy5Fy04t4GaOadhDq4vUM8t
QMOtOKdjpyHrd0TQFo8zw5AI4xoIApkmkZsDc1o2J5om7dW/crh0fy+O9vuogIOC
lhp4sEddwbhaLMkdb0fvn0b/1YX3r7dGWdxvE50j15B1A7jb59gcYfqJEjuoWZgX
6EfJcGb0b7vKj+yit2vZSd15o3y7TK+NzEL85WQkL9kYFKRrPsCL1WKrHDhSeOLl
fdwa0Odj+VPYvMw4N1+wJ5B744ZQoO6HA7LNW+KfTmcJxmn/GT4gx5J46Qr0nWeO
dUWv1WZkCyQo3bPTpUJvPBeSzBO5qwHAPlueHjWcew0y65l1ll8=
=RNOP
-----END PGP SIGNATURE-----
Merge tag 'x86_core_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Borislav Petkov:
"A single fix making the error message when the opcode bytes at rIP
cannot be accessed during an oops, more precise"
* tag 'x86_core_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/dumpstack: Fix misleading instruction pointer error message
James Morse.
* Add support for controlling per-thread memory bandwidth throttling
delay values on hw which supports it, by Fenghua Yu.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl+ENo0ACgkQEsHwGGHe
VUpIAw/+JtO9mP/OxLUUQEkYGMlYWxiJKGxHdI0cnw6gN02TGakVPZS3RAhdrDPP
Oahfl8g2EiC2sXSo0QEMFfZyEc/eOWo17wL1B+wgPfIIxy6KfGe6WtkHMNlOkWOS
zKxUvR93PjSs7e1vS+AMGbqQVFcL4RTSZN5H/QDaBnkxd3O5uLEvUm4pOxPs9FtX
etnK3eM4Uk6qfH9Pa0XZowp2RU0okRsatu+VREkEBplEplA1tusw3u//SlGgi266
Jsy2Pa2S7D0PGaP2D2+eziNmff319AT1mLtZ/0PKjkeZtqq/Sz0MJ9TxkesyEQPH
iv7IWzp+Dfc8Ui5rDNDvOIY+uJxQPMC0qwpU6sZdAgpsCcI5/xiSqTbBz6mxZeql
vTINIs7Lg/FBfkUn52LxbWkl8QA6aLXYr3PwdcFJzyTYmQitYzdEKxn1i+teWKr2
16QHR2GnXIEfc87JuHJpwiToUYZg+5UlVPkFTLNk/2n0gSiJzWMGecuHdS9spToR
vtpt5vmcAJKUptJLwKId+oEHbMLrvDGjXLApD4x3ROeiKGY7Cf1OwNhAmn8QZ8K5
S7wv9hbPZvkByQSsaNgDzzFUuYTP7cR9ILbwkHDixlpLyESnPzAsip5H4rq8gxLn
OwRKFGRvGid72EaapEY3yMA++EfzPfnebUmiLakSfWLHquh+0XQ=
=u3qb
-----END PGP SIGNATURE-----
Merge tag 'x86_cache_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cache resource control updates from Borislav Petkov:
- Misc cleanups to the resctrl code in preparation for the ARM side
(James Morse)
- Add support for controlling per-thread memory bandwidth throttling
delay values on hw which supports it (Fenghua Yu)
* tag 'x86_cache_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/resctrl: Enable user to view thread or core throttling mode
x86/resctrl: Enumerate per-thread MBA controls
cacheinfo: Move resctrl's get_cache_id() to the cacheinfo header file
x86/resctrl: Add struct rdt_cache::arch_has_{sparse, empty}_bitmaps
x86/resctrl: Merge AMD/Intel parse_bw() calls
x86/resctrl: Add struct rdt_membw::arch_needs_linear to explain AMD/Intel MBA difference
x86/resctrl: Use is_closid_match() in more places
x86/resctrl: Include pid.h
x86/resctrl: Use container_of() in delayed_work handlers
x86/resctrl: Fix stale comment
x86/resctrl: Remove struct rdt_membw::max_delay
x86/resctrl: Remove unused struct mbm_state::chunks_bw
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl+ENW0ACgkQEsHwGGHe
VUrWjw/+O0S/Bf7RQ2OIDnaHGo5u9k+T+FiklYtTO4klYqtfNEt/DFWVOIThVXBQ
ma4I8Hspj+zUzlq2kqSeqJ2PiikTxRNDqkCUwZhqEFgbXS6/pt8VXXdPniKjeXge
ZE4lcD1RIyDFxzVlKvVaYt1KryZZVVSRqRIChejLrujN23fI6riWfa0W4Bq54J6m
fdiujuDJQ9oroak36dF5Ah6g4g8gL8hBLU9Oyzla9V+1O3GSZuDlwTgDsxZZkmC5
LN4spxwd9tOXOmWhbH7vFfRtQL79KUHkHbUuUvZzZsJ/zs85bxhMa+fUAfjWAEja
brMpD1GZKOcjUM7xzQ9HngMcKD8lWmlsTBTAO9drD89Z949ntjIA4uCY3d3RTJ1q
NoYCV8Xw+8Q8e+zjnMW0tph39LCUEeuccT7t09XP5IF5UEXi5T5S14WoCu5Shnt9
VTQ44NrAxpP7ZNWMpBTaxmr3aXABbdgnvDIxqrohqgQnCnPkWlBJ9FdKj8sQ3y9B
K010ihIb1pWnmTyKGIC3GOWNjwtCpqz9z3gya76tI7EzAejVS6yUqwMohjaWq6JZ
Tz/TtTSTUyczKiCCqoOf7P+5LKrhxjWS8IVBeMqMTeN7osCCIT69U+cox1Ih3DST
pBfy7R3+FXKLHVi/iQv8E+fl3//pTGppKv4MM/wab0E6L+KhqEo=
=NYxb
-----END PGP SIGNATURE-----
Merge tag 'x86_cleanups_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Borislav Petkov:
"Misc minor cleanups"
* tag 'x86_cleanups_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry: Fix typo in comments for syscall_enter_from_user_mode()
x86/resctrl: Fix spelling in user-visible warning messages
x86/entry/64: Do not include inst.h in calling.h
x86/mpparse: Remove duplicate io_apic.h include
* Move clearcpuid= parameter handling earlier in the boot, away from the
FPU init code and to a generic location, by Mike Hommey.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl+ENF4ACgkQEsHwGGHe
VUqZ/BAAvRNhqmvYznS7V1FGJka7noumkxSg/E3ecUnQ9mLXuJlp/k4HVDVq/sEa
8tOXF10snObOy9BlUhpWsT+jUyPz5kNe1h67x9wYOIWq2wO2rkw54Bi7/ZKar0Ik
E93TfiBf8xeas/96Cb6/JgwPzleeLMA6GOcd0cCmFjp++DBw1ydU2ouH2/llKhdT
76eff4ZDvPMwuIaDRSQHY9XHX9mr55TmkVXoxkIilt3YOVE5Ap/Pz+sI0AiD6fJu
DS82EubaCCtaMe2K+lFpq9v3l9sVzR0TTS9hf2uG+M8YpdQdzEU63WPZlXr5tVTH
VReDJEo23Vp3Yy+4u+Ph4CPCa24vjyG5bg1eAYVwyAperelLhsW81tnz6DMaYT3u
NBK5dL0qFZJkIM07rn5Cg/1lGARmidovvYcojwroq/bfNUK5Xxu7Dh+5IIW865Jr
RrqwZkeWR1xXj4G97ICeKZsr1tGRHiLqPow/BXVeFRQxu7qHAQbdyF6tGFJpB/Nh
QaVVeeUr2r+Qyhghd/vg8d3EQDbSYgfxx/8nLB2G4vyNpIIBGeNUo9OptgMnNgr1
44nkQ3qX8VXSz9scTGtW5M2FrO7OfwD5V2V737gsmAb63MHCSORxcFMhKahd63l1
teji0jtpZuOaguCwn+ZAjaDb8FqIOn26vybtd8jC4tzD9Qngt3w=
=lvdt
-----END PGP SIGNATURE-----
Merge tag 'x86_fpu_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fpu updates from Borislav Petkov:
- Allow clearcpuid= to accept multiple bits (Arvind Sankar)
- Move clearcpuid= parameter handling earlier in the boot, away from
the FPU init code and to a generic location (Mike Hommey)
* tag 'x86_fpu_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu: Handle FPU-related and clearcpuid command line arguments earlier
x86/fpu: Allow multiple bits in clearcpuid= parameter
respective selftests.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl+EMr8ACgkQEsHwGGHe
VUpoBQ/+Pe7MaMbS7d+IATzhedKW7PNKcQr2g734rnZkz0BgG+1sBkCQPRAKBZoa
SIvt3gURM3aEeD4Dp3am4nLElyhldZrlwKoGKGv6AYv2BpLPQM9PG0fHJGUyYBze
ekKMdPu5YK0hYqoWctrY8h+qbExNdkfAvM7bJMFJMBqypVicm0n5wlgfZthGz0DU
tkD34WZNE2GGAfsi/NNJ2H+hcZo8bQVrqW98bkgdzIA7+KI3cyZZ132VbKxb03tK
A69C7+J4B20q/traWFlb4mTcFy/a1Txrt3cJXIv/Xer74gDMqNYcciGgnTJdhryY
gzBmWNTxuQr1EC8DYJaxjlQbBp6VSAwhELlyc6UeRxLAViEpMxyPfBVMOYqraImc
sZ8QKGgI02PggInN9yo4qCbtUWAGMCHV7HGGW8stVBrh1lia7o6Dy9jhO+nmTnzV
EEe/vEoSsp/ydnkgFNjaRwjFLp+vDX2lAf513ZuZukpt+IGQ0nAO5phzgcZVAyH4
qzr9uXdM3j+NtlXZgLttNppWEvHxzIpkri3Ly46VUFYOqTuKYPmS8A7stEfqx3NO
T8g38+dDirFfKCoJz8NJBUUs+1KXer8QmJvogfHx5fsZ01Q2qz6AmOmsmMfe7Wqm
+C9mvZJOJnW/7NunGWGVsuAZJOPx6o2oivjVpxxOyl8AeQnE6fg=
=itDm
-----END PGP SIGNATURE-----
Merge tag 'x86_fsgsbase_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fsgsbase updates from Borislav Petkov:
"Misc minor cleanups and corrections to the fsgsbase code and
respective selftests"
* tag 'x86_fsgsbase_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
selftests/x86/fsgsbase: Test PTRACE_PEEKUSER for GSBASE with invalid LDT GS
selftests/x86/fsgsbase: Reap a forgotten child
x86/fsgsbase: Replace static_cpu_has() with boot_cpu_has()
x86/entry/64: Correct the comment over SAVE_AND_SET_GSBASE
don't spam the console log, by Chris Down.
* Document how the /proc/cpuinfo machinery works for future reference,
by Kyung Min Park, Ricardo Neri and Dave Hansen.
* Correct the current NMI's duration calculation, by Libing Zhou.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl+ELfcACgkQEsHwGGHe
VUoZ4g//V5dvkmniFs/ZWvFmHvCqdtovQgT0X7C3iHxwdGorrP/2Y7+Ocey4Nss0
TmpQXMnrvxSniMBCSh5Fodm9w1th+5IA6lJ/OETza3ANJyGHswQ9P5FfcY0QtGiE
FUPapV3HKY6G3HGDXDLokCPLeQduY1MeKSRzKAw7E+735+TPlZcmCiTqPBRWcBfw
PWNo9eOJuJMnl+7XbCqCBb7Z8o+hb0jqD42Ky6IL9sVMEVT+UtH50kkf0uJciMjl
23ga1LIPMYJ3vw0mPm945ixNpqxkDsYBQx5Q3PyGHAgqWq9TFytffLY3AScGrUpx
EuzOLWmN2zjAEY2Yw454a9fbKj1b2d4M48JmNyKdG0KxH1X9rtoeS4ECJrXAjd0e
BukrieymwSR6CHj4Yp8ZAq0XVtp7AUdeEOWCMTezbr5Fj+C4GWPetHXurM5C5Gj5
Xes0lkA+wFVzPL6R0gkUBsVXVxtZNmMVfn9HKj7bQP8Ar00g5osZ2yFHduXw8BzA
blYowJ2mt54Pb1ertOM3zxKgIqj0PV2vSz8lRPyXzd4xdkkg6WxzP1EtDj5qHARf
+1X2jBWWY1rNosQi2NlzZRO+GiBOPevyVfxZ1KgRkWRtP9hED8kdfVkhLqPBEP6W
DvnKSAAVQiQe+6HP/anNDKy3TzjX5B3/ZgbZDCt1x78swjLXJwE=
=1bX/
-----END PGP SIGNATURE-----
Merge tag 'x86_misc_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes fromm Borislav Petkov:
- Ratelimit the message about writes to unrecognized MSRs so that they
don't spam the console log (Chris Down)
- Document how the /proc/cpuinfo machinery works for future reference
(Kyung Min Park, Ricardo Neri and Dave Hansen)
- Correct the current NMI's duration calculation (Libing Zhou)
* tag 'x86_misc_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/nmi: Fix nmi_handle() duration miscalculation
Documentation/x86: Add documentation for /proc/cpuinfo feature flags
x86/msr: Make source of unrecognised MSR writes unambiguous
x86/msr: Prevent userspace MSR access from dominating the console
devices which doesn't need pinning of pages for DMA anymore. Add support
for the command submission to devices using new x86 instructions like
ENQCMD{,S} and MOVDIR64B. In addition, add support for process address
space identifiers (PASIDs) which are referenced by those command
submission instructions along with the handling of the PASID state on
context switch as another extended state. Work by Fenghua Yu, Ashok Raj,
Yu-cheng Yu and Dave Jiang.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl996DIACgkQEsHwGGHe
VUqM4A/+JDI3GxNyMyBpJR0nQ2vs23ru1o3OxvxhYtcacZ0cNwkaO7g3TLQxH+LZ
k1QtvEd4jqI6BXV4de+HdZFDcqzikJf0KHnUflLTx956/Eop5rtxzMWVo69ZmYs8
QrW0mLhyh8eq19cOHbQBb4M/HFc1DXBw+l7Ft3MeA1divOVESRB/uNxjA25K4PvV
y+pipyUxqKSNhmBFf2bV8OVZloJiEtg3H6XudP0g/rZgjYe3qWxa+2iv6D08yBNe
g7NpMDMql2uo1bcFON7se2oF34poAi49BfiIQb5G4m9pnPyvVEMOCijxCx2FHYyF
nukyxt8g3Uq+UJYoolLNoWijL1jgBWeTBg1uuwsQOqWSARJx8nr859z0GfGyk2RP
GNoYE4rrWBUMEqWk4xeiPPgRDzY0cgcGh0AeuWqNhgBfbbZeGL0t0m5kfytk5i1s
W0YfRbz+T8+iYbgVfE/Zpthc7rH7iLL7/m34JC13+pzhPVTT32ECLJov2Ac8Tt15
X+fOe6kmlDZa4GIhKRzUoR2aEyLpjufZ+ug50hznBQjGrQfcx7zFqRAU4sJx0Yyz
rxUOJNZZlyJpkyXzc12xUvShaZvTcYenHGpxXl8TU3iMbY2otxk1Xdza8pc1LGQ/
qneYgILgKa+hSBzKhXCPAAgSYtPlvQrRizArS8Y0k/9rYaKCfBU=
=K9X4
-----END PGP SIGNATURE-----
Merge tag 'x86_pasid_for_5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 PASID updates from Borislav Petkov:
"Initial support for sharing virtual addresses between the CPU and
devices which doesn't need pinning of pages for DMA anymore.
Add support for the command submission to devices using new x86
instructions like ENQCMD{,S} and MOVDIR64B. In addition, add support
for process address space identifiers (PASIDs) which are referenced by
those command submission instructions along with the handling of the
PASID state on context switch as another extended state.
Work by Fenghua Yu, Ashok Raj, Yu-cheng Yu and Dave Jiang"
* tag 'x86_pasid_for_5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/asm: Add an enqcmds() wrapper for the ENQCMDS instruction
x86/asm: Carve out a generic movdir64b() helper for general usage
x86/mmu: Allocate/free a PASID
x86/cpufeatures: Mark ENQCMD as disabled when configured out
mm: Add a pasid member to struct mm_struct
x86/msr-index: Define an IA32_PASID MSR
x86/fpu/xstate: Add supervisor PASID state for ENQCMD
x86/cpufeatures: Enumerate ENQCMD and ENQCMDS instructions
Documentation/x86: Add documentation for SVA (Shared Virtual Addressing)
iommu/vt-d: Change flags type to unsigned int in binding mm
drm, iommu: Change type of pasid to u32
the new UV5 class of systems, by Mike Travis.
* Use a flexible array for a dynamically sized struct uv_rtc_timer_head,
by Gustavo A. R. Silva.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl+EKuMACgkQEsHwGGHe
VUp5pA//WHU9Xp3LGAL6FWaKGdiN4KA4+MZyl892LtWUy0RqPSSOP/hovyLfwOo8
aXsaYdm2nlMXNBWNZd0CY+ZsEVV8XOJdbXiqi+Em1QINk3VLJM9rEN9QCALCPAkd
kFdRC6ajJrgTFASa0agBToJiaNndPacZ2ndI8Bx+WIsbBsYx4oln3pCKtmeZNrNk
bWnMTOK+sylILIScmhdgV/czMECAc4sMFF/W/rJdBpoA7duoa6yDycD7E3928o8H
JtMc2wwBLXZSSh3YI7/Z80n7xopQgq3/WRqOjhNHVqRa5rR4ZnleQkXYw1MG2W0H
Gk34aEcFA+M2xJZ2voL/YjngFEgMZG0HNg+AJJBX7K/+B3HvfTjWheTMMdaae7Wr
iW4YFFxaJ5FoB2qFz8sZpln8tbniYxiwxJq9aESjwVqlEPoQ13ZTohAVXBJq+CsZ
NufRJyY7bB5SoDiIucWNdAnJ4bUQDs+UouUqrSORtPqQg/+gJ17krZxQY+oST1I+
BDmAInBzcdScyjAKui7+csgSWW4lgHDUzqYGnFogcpvJc3s9HnqebTNqTwQQvrDM
e8VbuH2X3z5/xgbtE9W3ND3Rf+4CLTMbd6J8SKAceI9sQ1ZOZzLZlJU4TQCbXXoW
vSI5Gv9B3AlI4PGIphdt4M3Sm78fG11PMwuUQEO3WSLjOt5amrM=
=llxK
-----END PGP SIGNATURE-----
Merge tag 'x86_platform_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 platform updates from Borislav Petkov:
- Cleanup different aspects of the UV code and start adding support for
the new UV5 class of systems (Mike Travis)
- Use a flexible array for a dynamically sized struct uv_rtc_timer_head
(Gustavo A. R. Silva)
* tag 'x86_platform_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/platform/uv: Update Copyrights to conform to HPE standards
x86/platform/uv: Update for UV5 NMI MMR changes
x86/platform/uv: Update UV5 TSC checking
x86/platform/uv: Update node present counting
x86/platform/uv: Update UV5 MMR references in UV GRU
x86/platform/uv: Adjust GAM MMR references affected by UV5 updates
x86/platform/uv: Update MMIOH references based on new UV5 MMRs
x86/platform/uv: Add and decode Arch Type in UVsystab
x86/platform/uv: Add UV5 direct references
x86/platform/uv: Update UV MMRs for UV5
drivers/misc/sgi-xp: Adjust references in UV kernel modules
x86/platform/uv: Remove SCIR MMR references for UV systems
x86/platform/uv: Remove UV BAU TLB Shootdown Handler
x86/uv/time: Use a flexible array in struct uv_rtc_timer_head
obviates the need to flush cachelines before changing the PTE encryption
bit, by Krish Sadhukhan.
* Add Centaur initialization support for families >= 7, by Tony W
Wang-oc.
* Add a feature flag for, and expose TSX suspend load tracking feature
to KVM, by Cathy Zhang.
* Emulate SLDT and STR so that windows programs don't crash on UMIP
machines, by Brendan Shanks and Ricardo Neri.
* Use the new SERIALIZE insn on Intel hardware which supports it, by
Ricardo Neri.
* Misc cleanups and fixes.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl+EKA4ACgkQEsHwGGHe
VUq9/RAAtneeyhaM3Erjj+4gLi383ml7n8hlAixZ7x1T5mkV7B5rKtNQ99AYQTbp
lbA7qu5mfsEqKeu2WY13IMqHKgfgGXPPUumCTYjgQlC6UoSAMjgtGGCDodqBE8i5
iIq6gp59t4aMn1ltkzNMS0hLGX0i3xA6cBHPB6EVdJhQgaX4oVVs1dL8IbXumj3q
NX5qIcaTN/f/IC3EZuCDwZGKUpRF+PWTbD+k02jfhS2crvG/LHJgtisQJ1Eu/ccz
yKfiwCBYS7FcuDTxiJDchz1sXD25dgBmBG26voIukSIPbysAPF7O1MULGvKsztFV
W/6+VMC+KGUs0ACQHhFl5ALXA73zAJskjzNzRRuduYM0Mr2yAckVet2usicnt/Cp
lpmvOpeCjDTPuBQTs0cR9TWjXFeinUkQJOAEcqv9Wh1OKQShZUAJ1jpHwZiDCnhY
kzOhq9GAgKNXxcqcTQD8mIap2/GKIppIxAVb7vPxDQfUhUz/60o0eF3cMkeaa216
31Bnf4h+XtJPoJkDOhI8XrPLw6c3KRWP3i3IoBj+raLiylwzzrIczf/7CBgHoIsa
fEwuM0PUDVurY38VMRlj1dMFBSFw8U7JqKYyvXKwB3KFeyX7SGZDLmdlvhsRTq20
HJepCVldKZvjDq1zvRFyx/TsZQnoDHsIyv5lAC/EKE3S0/XRg2c=
=zXC1
-----END PGP SIGNATURE-----
Merge tag 'x86_cpu_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpu updates from Borislav Petkov:
- Add support for hardware-enforced cache coherency on AMD which
obviates the need to flush cachelines before changing the PTE
encryption bit (Krish Sadhukhan)
- Add Centaur initialization support for families >= 7 (Tony W Wang-oc)
- Add a feature flag for, and expose TSX suspend load tracking feature
to KVM (Cathy Zhang)
- Emulate SLDT and STR so that windows programs don't crash on UMIP
machines (Brendan Shanks and Ricardo Neri)
- Use the new SERIALIZE insn on Intel hardware which supports it
(Ricardo Neri)
- Misc cleanups and fixes
* tag 'x86_cpu_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
KVM: SVM: Don't flush cache if hardware enforces cache coherency across encryption domains
x86/mm/pat: Don't flush cache if hardware enforces cache coherency across encryption domnains
x86/cpu: Add hardware-enforced cache coherency as a CPUID feature
x86/cpu/centaur: Add Centaur family >=7 CPUs initialization support
x86/cpu/centaur: Replace two-condition switch-case with an if statement
x86/kvm: Expose TSX Suspend Load Tracking feature
x86/cpufeatures: Enumerate TSX suspend load address tracking instructions
x86/umip: Add emulation/spoofing for SLDT and STR instructions
x86/cpu: Fix typos and improve the comments in sync_core()
x86/cpu: Use XGETBV and XSETBV mnemonics in fpu/internal.h
x86/cpu: Use SERIALIZE in sync_core() when available
encounter an MCE in kernel space but while copying from user memory by
sending them a SIGBUS on return to user space and umapping the faulty
memory, by Tony Luck and Youquan Song.
* memcpy_mcsafe() rework by splitting the functionality into
copy_mc_to_user() and copy_mc_to_kernel(). This, as a result, enables
support for new hardware which can recover from a machine check
encountered during a fast string copy and makes that the default and
lets the older hardware which does not support that advance recovery,
opt in to use the old, fragile, slow variant, by Dan Williams.
* New AMD hw enablement, by Yazen Ghannam and Akshay Gupta.
* Do not use MSR-tracing accessors in #MC context and flag any fault
while accessing MCA architectural MSRs as an architectural violation
with the hope that such hw/fw misdesigns are caught early during the hw
eval phase and they don't make it into production.
* Misc fixes, improvements and cleanups, as always.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl+EIpUACgkQEsHwGGHe
VUouoBAAgwb+NkWZtIqGImV4f+LOyFjhTR/r/7ZyiijXdbhOIuAdc/jQM31mQxug
sX2jxaRYnf1n6SLA0ggX99gwr2deRQ/hsNf5Abw55GC+Z1dOxpGL0k59A3ELl1IR
H9KYmCAFQIHvzfk38qcdND73XHcgthQoXFBOG9wAPAdgDWnaiWt6lcLAq8OiJTmp
D8pInAYhcnL8YXwMGyQQ1KkFn9HwydoWDsK5Ff2shaw2/+dMQqd1zetenbVtjhLb
iNYGvV7Bi/RQ8PyMbzmtTWa4kwQJAHC2gptkGxty//2ADGVBbqUQdqF9TjIWCNy5
V6Ldv5zo0/1s7DOzji3htzqkSs/K1Ea6d2LtZjejkJipHKV5x068UC6Fu+PlfS2D
VZfcICeapU4G2F3Zvks2DlZ7dVTbHCvoI78Qi7bBgczPUVmk6iqah4xuQaiHyBJc
kTFDA4Nnf/026GpoWRiFry9vqdnHBZyLet5A6Y+SoWF0FbhYnCVPpq4MnussYoav
lUIi9ZZav6X2RZp9DDM1f9d5xubtKq0DKt93wvzqAhjK0T2DikckJ+riOYkI6N8t
fHCBNUkdfgyMzJUTBPAzYQ7RmjbjKWJi7xWP0oz6+GqOJkQfSTVC5/2yEffbb3ya
whYRS6iklbl7yshzaOeecXsZcAeK2oGPfoHg34WkHFgXdF5mNgA=
=u1Wg
-----END PGP SIGNATURE-----
Merge tag 'ras_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS updates from Borislav Petkov:
- Extend the recovery from MCE in kernel space also to processes which
encounter an MCE in kernel space but while copying from user memory
by sending them a SIGBUS on return to user space and umapping the
faulty memory, by Tony Luck and Youquan Song.
- memcpy_mcsafe() rework by splitting the functionality into
copy_mc_to_user() and copy_mc_to_kernel(). This, as a result, enables
support for new hardware which can recover from a machine check
encountered during a fast string copy and makes that the default and
lets the older hardware which does not support that advance recovery,
opt in to use the old, fragile, slow variant, by Dan Williams.
- New AMD hw enablement, by Yazen Ghannam and Akshay Gupta.
- Do not use MSR-tracing accessors in #MC context and flag any fault
while accessing MCA architectural MSRs as an architectural violation
with the hope that such hw/fw misdesigns are caught early during the
hw eval phase and they don't make it into production.
- Misc fixes, improvements and cleanups, as always.
* tag 'ras_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Allow for copy_mc_fragile symbol checksum to be generated
x86/mce: Decode a kernel instruction to determine if it is copying from user
x86/mce: Recover from poison found while copying from user space
x86/mce: Avoid tail copy when machine check terminated a copy from user
x86/mce: Add _ASM_EXTABLE_CPY for copy user access
x86/mce: Provide method to find out the type of an exception handler
x86/mce: Pass pointer to saved pt_regs to severity calculation routines
x86/copy_mc: Introduce copy_mc_enhanced_fast_string()
x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()
x86/mce: Drop AMD-specific "DEFERRED" case from Intel severity rule list
x86/mce: Add Skylake quirk for patrol scrub reported errors
RAS/CEC: Convert to DEFINE_SHOW_ATTRIBUTE()
x86/mce: Annotate mce_rd/wrmsrl() with noinstr
x86/mce/dev-mcelog: Do not update kflags on AMD systems
x86/mce: Stop mce_reign() from re-computing severity for every CPU
x86/mce: Make mce_rdmsrl() panic on an inaccessible MSR
x86/mce: Increase maximum number of banks to 64
x86/mce: Delay clearing IA32_MCG_STATUS to the end of do_machine_check()
x86/MCE/AMD, EDAC/mce_amd: Remove struct smca_hwid.xec_bitmap
RAS/CEC: Fix cec_init() prototype
- Userspace support for the Memory Tagging Extension introduced by Armv8.5.
Kernel support (via KASAN) is likely to follow in 5.11.
- Selftests for MTE, Pointer Authentication and FPSIMD/SVE context
switching.
- Fix and subsequent rewrite of our Spectre mitigations, including the
addition of support for PR_SPEC_DISABLE_NOEXEC.
- Support for the Armv8.3 Pointer Authentication enhancements.
- Support for ASID pinning, which is required when sharing page-tables with
the SMMU.
- MM updates, including treating flush_tlb_fix_spurious_fault() as a no-op.
- Perf/PMU driver updates, including addition of the ARM CMN PMU driver and
also support to handle CPU PMU IRQs as NMIs.
- Allow prefetchable PCI BARs to be exposed to userspace using normal
non-cacheable mappings.
- Implementation of ARCH_STACKWALK for unwinding.
- Improve reporting of unexpected kernel traps due to BPF JIT failure.
- Improve robustness of user-visible HWCAP strings and their corresponding
numerical constants.
- Removal of TEXT_OFFSET.
- Removal of some unused functions, parameters and prototypes.
- Removal of MPIDR-based topology detection in favour of firmware
description.
- Cleanups to handling of SVE and FPSIMD register state in preparation
for potential future optimisation of handling across syscalls.
- Cleanups to the SDEI driver in preparation for support in KVM.
- Miscellaneous cleanups and refactoring work.
-----BEGIN PGP SIGNATURE-----
iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAl+AUXMQHHdpbGxAa2Vy
bmVsLm9yZwAKCRC3rHDchMFjNFc1B/4q2Kabe+pPu7s1f58Q+OTaEfqcr3F1qh27
F1YpFZUYxg0GPfPsFrnbJpo5WKo7wdR9ceI9yF/GHjs7A/MSoQJis3pG6SlAd9c0
nMU5tCwhg9wfq6asJtl0/IPWem6cqqhdzC6m808DjeHuyi2CCJTt0vFWH3OeHEhG
cfmLfaSNXOXa/MjEkT8y1AXJ/8IpIpzkJeCRA1G5s18PXV9Kl5bafIo9iqyfKPLP
0rJljBmoWbzuCSMc81HmGUQI4+8KRp6HHhyZC/k0WEVgj3LiumT7am02bdjZlTnK
BeNDKQsv2Jk8pXP2SlrI3hIUTz0bM6I567FzJEokepvTUzZ+CVBi
=9J8H
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"There's quite a lot of code here, but much of it is due to the
addition of a new PMU driver as well as some arm64-specific selftests
which is an area where we've traditionally been lagging a bit.
In terms of exciting features, this includes support for the Memory
Tagging Extension which narrowly missed 5.9, hopefully allowing
userspace to run with use-after-free detection in production on CPUs
that support it. Work is ongoing to integrate the feature with KASAN
for 5.11.
Another change that I'm excited about (assuming they get the hardware
right) is preparing the ASID allocator for sharing the CPU page-table
with the SMMU. Those changes will also come in via Joerg with the
IOMMU pull.
We do stray outside of our usual directories in a few places, mostly
due to core changes required by MTE. Although much of this has been
Acked, there were a couple of places where we unfortunately didn't get
any review feedback.
Other than that, we ran into a handful of minor conflicts in -next,
but nothing that should post any issues.
Summary:
- Userspace support for the Memory Tagging Extension introduced by
Armv8.5. Kernel support (via KASAN) is likely to follow in 5.11.
- Selftests for MTE, Pointer Authentication and FPSIMD/SVE context
switching.
- Fix and subsequent rewrite of our Spectre mitigations, including
the addition of support for PR_SPEC_DISABLE_NOEXEC.
- Support for the Armv8.3 Pointer Authentication enhancements.
- Support for ASID pinning, which is required when sharing
page-tables with the SMMU.
- MM updates, including treating flush_tlb_fix_spurious_fault() as a
no-op.
- Perf/PMU driver updates, including addition of the ARM CMN PMU
driver and also support to handle CPU PMU IRQs as NMIs.
- Allow prefetchable PCI BARs to be exposed to userspace using normal
non-cacheable mappings.
- Implementation of ARCH_STACKWALK for unwinding.
- Improve reporting of unexpected kernel traps due to BPF JIT
failure.
- Improve robustness of user-visible HWCAP strings and their
corresponding numerical constants.
- Removal of TEXT_OFFSET.
- Removal of some unused functions, parameters and prototypes.
- Removal of MPIDR-based topology detection in favour of firmware
description.
- Cleanups to handling of SVE and FPSIMD register state in
preparation for potential future optimisation of handling across
syscalls.
- Cleanups to the SDEI driver in preparation for support in KVM.
- Miscellaneous cleanups and refactoring work"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (148 commits)
Revert "arm64: initialize per-cpu offsets earlier"
arm64: random: Remove no longer needed prototypes
arm64: initialize per-cpu offsets earlier
kselftest/arm64: Check mte tagged user address in kernel
kselftest/arm64: Verify KSM page merge for MTE pages
kselftest/arm64: Verify all different mmap MTE options
kselftest/arm64: Check forked child mte memory accessibility
kselftest/arm64: Verify mte tag inclusion via prctl
kselftest/arm64: Add utilities and a test to validate mte memory
perf: arm-cmn: Fix conversion specifiers for node type
perf: arm-cmn: Fix unsigned comparison to less than zero
arm64: dbm: Invalidate local TLB when setting TCR_EL1.HD
arm64: mm: Make flush_tlb_fix_spurious_fault() a no-op
arm64: Add support for PR_SPEC_DISABLE_NOEXEC prctl() option
arm64: Pull in task_stack_page() to Spectre-v4 mitigation code
KVM: arm64: Allow patching EL2 vectors even with KASLR is not enabled
arm64: Get rid of arm64_ssbd_state
KVM: arm64: Convert ARCH_WORKAROUND_2 to arm64_get_spectre_v4_state()
KVM: arm64: Get rid of kvm_arm_have_ssbd()
KVM: arm64: Simplify handling of ARCH_WORKAROUND_2
...
Only x86 provides try_cmpxchg() outside of the atomic_t interfaces,
provide generic fallbacks to create this interface from the widely
available cmpxchg() function.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/159870621515.1229682.15506193091065001742.stgit@devnote2
This seems like a dead artifact since TIF_IA32 is not even defined as a
TI flag for UM. Looking back in git history, it made sense in the old
days, but it is apparently not used since UM was split out of the x86
arch/. It is also going away from the x86 tree soon.
Also, I think the variable clean up it performs is not needed as 64-bit
UML doesn't run 32-bit binaries as far as I can tell, and 32-bit UML
has 32-bit ulong.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
musl toolchain and headers are a bit more strict. These fixes enable building
UML with musl as well as seem not to break on glibc.
Signed-off-by: Ignat Korchagin <ignat@cloudflare.com>
Tested-by: Brendan Higgins <brendanhiggins@google.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
- Fix a (hopefully final) IRQ state tracking bug vs. MCE handling
- Fix a documentation link
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl+Cu/URHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hw7A//VwcVny2/s2E6MN0MrIkbRrzF2wdqgFgK
Mwt/NRBnSwh9GiV53reunzPqZT0Cg2waACNz89wfMR5yi04h2MxV+P1nZayEL5rQ
0lpSDpwrMgtaDm5e4KEmLy220P4M1hpX1AdcFlg8Sx0Yf+2T2xXPUskYxfMdJGE8
LCLhu4B5phKU2gyfVTHdzRWG1RrLaqta1MihW4Z/8wEHZ3wlQooFb6/N6GU5P+Dt
CVaxc8ipXQ4h9n9mGyg2/rsp0UmeelMRmFPn5MD3bcQMzM1k5IzW00MTkpQgNTUW
4pvEy/ftX6LF6fzXc4Jvzzb4HVXCjT8vM+dzR0/WMU9dIK3cwo01+NXzrTzg4GTy
srxy7/Pc6mDJpTxvbS3d6tyR22KH3x1hgUkZ5KZruhzZVApo+CPM7aCTkx8cspUR
JpwOVMe58TmLQgIr420ikzqEpCGmAdju7AdiH5kmPwh6/zASkP+3hZOBENjkXGrn
y5TQxX2YAWRxGq5kZZp17c+vK+dLbCBVZERUUpJGyeONxYwIwg7wuBqETEv2XEdE
nupFtghxVNbHavGY0Alsbh8WAd86KKrp75zSh+5hAWO+KGZhvqhh42+zsKQJJXfQ
Fjag9zcTDXLrAnIV4iOZofgdvkzu2G0o60hhNlvrL0fqklX9hYfcP6Ato4rMj2Xa
H0AISIUsFDE=
=UbZR
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2020-10-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"Two fixes:
- Fix a (hopefully final) IRQ state tracking bug vs MCE handling
- Fix a documentation link"
* tag 'x86-urgent-2020-10-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Documentation/x86: Fix incorrect references to zero-page.txt
x86/mce: Use idtentry_nmi_enter/exit()
ld's --build-id defaults to "sha1" style, while lld defaults to "fast".
The build IDs are very different between the two, which may confuse
programs that reference them.
Signed-off-by: Bill Wendling <morbo@google.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Pull v5.10 RCU changes from Paul E. McKenney:
- Debugging for smp_call_function().
- Strict grace periods for KASAN. The point of this series is to find
RCU-usage bugs, so the corresponding new RCU_STRICT_GRACE_PERIOD
Kconfig option depends on both DEBUG_KERNEL and RCU_EXPERT, and is
further disabled by dfefault. Finally, the help text includes
a goodly list of scary caveats.
- New smp_call_function() torture test.
- Torture-test updates.
- Documentation updates.
- Miscellaneous fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In order to make adding configurable features into seccomp easier,
it's better to have the options at one single location, considering
especially that the bulk of seccomp code is arch-independent. An quick
look also show that many SECCOMP descriptions are outdated; they talk
about /proc rather than prctl.
As a result of moving the config option and keeping it default on,
architectures arm, arm64, csky, riscv, sh, and xtensa did not have SECCOMP
on by default prior to this and SECCOMP will be default in this change.
Architectures microblaze, mips, powerpc, s390, sh, and sparc have an
outdated depend on PROC_FS and this dependency is removed in this change.
Suggested-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/lkml/CAG48ez1YWz9cnp08UZgeieYRhHdqh-ch7aNwc4JRBnGyrmgfMg@mail.gmail.com/
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
[kees: added HAVE_ARCH_SECCOMP help text, tweaked wording]
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/9ede6ef35c847e58d61e476c6a39540520066613.1600951211.git.yifeifz2@illinois.edu
Add asm/mce.h to asm/asm-prototypes.h so that that asm symbol's checksum
can be generated in order to support CONFIG_MODVERSIONS with it and fix:
WARNING: modpost: EXPORT symbol "copy_mc_fragile" [vmlinux] version \
generation failed, symbol will not be versioned.
For reference see:
4efca4ed05 ("kbuild: modversions for EXPORT_SYMBOL() for asm")
334bb77387 ("x86/kbuild: enable modversions for symbols exported from asm")
Fixes: ec6347bb43 ("x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()")
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201007111447.GA23257@zn.tnic
Currently, the MOVDIR64B instruction is used to atomically submit
64-byte work descriptors to devices. Although it can encounter errors
like device queue full, command not accepted, device not ready, etc when
writing to a device MMIO, MOVDIR64B can not report back on errors from
the device itself. This means that MOVDIR64B users need to separately
interact with a device to see if a descriptor was successfully queued,
which slows down device interactions.
ENQCMD and ENQCMDS also atomically submit 64-byte work descriptors
to devices. But, they *can* report back errors directly from the
device, such as if the device was busy, or device not enabled or does
not support the command. This immediate feedback from the submission
instruction itself reduces the number of interactions with the device
and can greatly increase efficiency.
ENQCMD can be used at any privilege level, but can effectively only
submit work on behalf of the current process. ENQCMDS is a ring0-only
instruction and can explicitly specify a process context instead of
being tied to the current process or needing to reprogram the IA32_PASID
MSR.
Use ENQCMDS for work submission within the kernel because a Process
Address ID (PASID) is setup to translate the kernel virtual address
space. This PASID is provided to ENQCMDS from the descriptor structure
submitted to the device and not retrieved from IA32_PASID MSR, which is
setup for the current user address space.
See Intel Software Developer’s Manual for more information on the
instructions.
[ bp:
- Make operand constraints like movdir64b() because both insns are
basically doing the same thing, more or less.
- Fixup comments and cleanup. ]
Link: https://lkml.kernel.org/r/20200924180041.34056-3-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20201005151126.657029-3-dave.jiang@intel.com
Carve out the MOVDIR64B inline asm primitive into a generic helper so
that it can be used by other functions. Move it to special_insns.h and
have iosubmit_cmds512() call it.
[ bp: Massage commit message. ]
Suggested-by: Michael Matz <matz@suse.de>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201005151126.657029-2-dave.jiang@intel.com
All instructions copying data between kernel and user memory
are tagged with either _ASM_EXTABLE_UA or _ASM_EXTABLE_CPY
entries in the exception table. ex_fault_handler_type() returns
EX_HANDLER_UACCESS for both of these.
Recovery is only possible when the machine check was triggered
on a read from user memory. In this case the same strategy for
recovery applies as if the user had made the access in ring3. If
the fault was in kernel memory while copying to user there is no
current recovery plan.
For MOV and MOVZ instructions a full decode of the instruction
is done to find the source address. For MOVS instructions
the source address is in the %rsi register. The function
fault_in_kernel_space() determines whether the source address is
kernel or user, upgrade it from "static" so it can be used here.
Co-developed-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201006210910.21062-7-tony.luck@intel.com
Existing kernel code can only recover from a machine check on code that
is tagged in the exception table with a fault handling recovery path.
Add two new fields in the task structure to pass information from
machine check handler to the "task_work" that is queued to run before
the task returns to user mode:
+ mce_vaddr: will be initialized to the user virtual address of the fault
in the case where the fault occurred in the kernel copying data from
a user address. This is so that kill_me_maybe() can provide that
information to the user SIGBUS handler.
+ mce_kflags: copy of the struct mce.kflags needed by kill_me_maybe()
to determine if mce_vaddr is applicable to this error.
Add code to recover from a machine check while copying data from user
space to the kernel. Action for this case is the same as if the user
touched the poison directly; unmap the page and send a SIGBUS to the task.
Use a new helper function to share common code between the "fault
in user mode" case and the "fault while copying from user" case.
New code paths will be activated by the next patch which sets
MCE_IN_KERNEL_COPYIN.
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201006210910.21062-6-tony.luck@intel.com
In the page fault case it is ok to see if a few more unaligned bytes
can be copied from the source address. Worst case is that the page fault
will be triggered again.
Machine checks are more serious. Just give up at the point where the
main copy loop triggered the #MC and return from the copy code as if
the copy succeeded. The machine check handler will use task_work_add() to
make sure that the task is sent a SIGBUS.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201006210910.21062-5-tony.luck@intel.com
_ASM_EXTABLE_UA is a general exception entry to record the exception fixup
for all exception spots between kernel and user space access.
To enable recovery from machine checks while coping data from user
addresses it is necessary to be able to distinguish the places that are
looping copying data from those that copy a single byte/word/etc.
Add a new macro _ASM_EXTABLE_CPY and use it in place of _ASM_EXTABLE_UA
in the copy functions.
Record the exception reason number to regs->ax at
ex_handler_uaccess which is used to check MCE triggered.
The new fixup routine ex_handler_copy() is almost an exact copy of
ex_handler_uaccess() The difference is that it sets regs->ax to the trap
number. Following patches use this to avoid trying to copy remaining
bytes from the tail of the copy and possibly hitting the poison again.
New mce.kflags bit MCE_IN_KERNEL_COPYIN will be used by mce_severity()
calculation to indicate that a machine check is recoverable because the
kernel was copying from user space.
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201006210910.21062-4-tony.luck@intel.com
Avoid a proliferation of ex_has_*_handler() functions by having just
one function that returns the type of the handler (if any).
Drop the __visible attribute for this function. It is not called
from assembler so the attribute is not necessary.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201006210910.21062-3-tony.luck@intel.com
New recovery features require additional information about processor
state when a machine check occurred. Pass pt_regs down to the routines
that need it.
No functional change.
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201006210910.21062-2-tony.luck@intel.com
The UV NMI MMR addresses and fields moved between UV4 and UV5
necessitating a rewrite of the UV NMI handler. Adjust references
to accommodate those changes.
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://lkml.kernel.org/r/20201005203929.148656-13-mike.travis@hpe.com
Update check of BIOS TSC sync status to include both possible "invalid"
states provided by newer UV5 BIOS.
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://lkml.kernel.org/r/20201005203929.148656-12-mike.travis@hpe.com
The changes in the UV5 arch shrunk the NODE PRESENT table to just 2x64
entries (128 total) so are in to 64 bit MMRs instead of a depth of 64
bits in an array. Adjust references when counting up the nodes present.
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://lkml.kernel.org/r/20201005203929.148656-11-mike.travis@hpe.com
Make modifications to the GRU mappings to accommodate changes for UV5.
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://lkml.kernel.org/r/20201005203929.148656-10-mike.travis@hpe.com
Make modifications to the GAM MMR mappings to accommodate changes for UV5.
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://lkml.kernel.org/r/20201005203929.148656-9-mike.travis@hpe.com
Make modifications to the MMIOH mappings to accommodate changes for UV5.
[ Fix W=1 build warnings. ]
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://lkml.kernel.org/r/20201005203929.148656-8-mike.travis@hpe.com
When the UV BIOS starts the kernel it passes the UVsystab info struct to
the kernel which contains information elements more specific than ACPI,
and generally pertinent only to the MMRs. These are read only fields
so information is passed one way only. A new field starting with UV5 is
the UV architecture type so the ACPI OEM_ID field can be used for other
purposes going forward. The UV Arch Type selects the entirety of the
MMRs available, with their addresses and fields defined in uv_mmrs.h.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://lkml.kernel.org/r/20201005203929.148656-7-mike.travis@hpe.com
Add new references to UV5 (and UVY class) system MMR addresses and
fields primarily caused by the expansion from 46 to 52 bits of physical
memory address.
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://lkml.kernel.org/r/20201005203929.148656-6-mike.travis@hpe.com
Update UV MMRs in uv_mmrs.h for UV5 based on Verilog output from the
UV Hub hardware design files. This is the next UV architecture with
a new class (UVY) being defined for 52 bit physical address masks.
Uses a bitmask for UV arch identification so a single test can cover
multiple versions. Includes other adjustments to match the uv_mmrs.h
file to keep from encountering compile errors. New UV5 functionality
is added in the patches that follow.
[ Fix W=1 build warnings. ]
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://lkml.kernel.org/r/20201005203929.148656-5-mike.travis@hpe.com
UV class systems no longer use System Controller for monitoring of CPU
activity provided by this driver. Other methods have been developed for
BIOS and the management controller (BMC). Remove that supporting code.
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Link: https://lkml.kernel.org/r/20201005203929.148656-3-mike.travis@hpe.com
The Broadcast Assist Unit (BAU) TLB shootdown handler is being rewritten
to become the UV BAU APIC driver. It is designed to speed up sending
IPIs to selective CPUs within the system. Remove the current TLB
shutdown handler (tlb_uv.c) file and a couple of kernel hooks in the
interim.
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Link: https://lkml.kernel.org/r/20201005203929.148656-2-mike.travis@hpe.com
When a group that has TopDown members is failed to be scheduled, any
later TopDown groups will not return valid values.
Here is an example.
A background perf that occupies all the GP counters and the fixed
counter 1.
$perf stat -e "{cycles,cycles,cycles,cycles,cycles,cycles,cycles,
cycles,cycles}:D" -a
A user monitors a TopDown group. It works well, because the fixed
counter 3 and the PERF_METRICS are available.
$perf stat -x, --topdown -- ./workload
retiring,bad speculation,frontend bound,backend bound,
18.0,16.1,40.4,25.5,
Then the user tries to monitor a group that has TopDown members.
Because of the cycles event, the group is failed to be scheduled.
$perf stat -x, -e '{slots,topdown-retiring,topdown-be-bound,
topdown-fe-bound,topdown-bad-spec,cycles}'
-- ./workload
<not counted>,,slots,0,0.00,,
<not counted>,,topdown-retiring,0,0.00,,
<not counted>,,topdown-be-bound,0,0.00,,
<not counted>,,topdown-fe-bound,0,0.00,,
<not counted>,,topdown-bad-spec,0,0.00,,
<not counted>,,cycles,0,0.00,,
The user tries to monitor a TopDown group again. It doesn't work anymore.
$perf stat -x, --topdown -- ./workload
,,,,,
In a txn, cancel_txn() is to truncate the event_list for a canceled
group and update the number of events added in this transaction.
However, the number of TopDown events added in this transaction is not
updated. The kernel will probably fail to add new Topdown events.
Fixes: 7b2c05a15d ("perf/x86/intel: Generic support for hardware TopDown metrics")
Reported-by: Andi Kleen <ak@linux.intel.com>
Reported-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lkml.kernel.org/r/20201005082611.GH2628@hirez.programming.kicks-ass.net
Kan reported that n_metric gets corrupted for cancelled transactions;
a similar issue exists for n_pair for AMD's Large Increment thing.
The problem was confirmed and confirmed fixed by Kim using:
sudo perf stat -e "{cycles,cycles,cycles,cycles}:D" -a sleep 10 &
# should succeed:
sudo perf stat -e "{fp_ret_sse_avx_ops.all}:D" -a workload
# should fail:
sudo perf stat -e "{fp_ret_sse_avx_ops.all,fp_ret_sse_avx_ops.all,cycles}:D" -a workload
# previously failed, now succeeds with this patch:
sudo perf stat -e "{fp_ret_sse_avx_ops.all}:D" -a workload
Fixes: 5738891229 ("perf/x86/amd: Add support for Large Increment per Cycle Events")
Reported-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Kim Phillips <kim.phillips@amd.com>
Link: https://lkml.kernel.org/r/20201005082516.GG2628@hirez.programming.kicks-ass.net
The motivations to go rework memcpy_mcsafe() are that the benefit of
doing slow and careful copies is obviated on newer CPUs, and that the
current opt-in list of CPUs to instrument recovery is broken relative to
those CPUs. There is no need to keep an opt-in list up to date on an
ongoing basis if pmem/dax operations are instrumented for recovery by
default. With recovery enabled by default the old "mcsafe_key" opt-in to
careful copying can be made a "fragile" opt-out. Where the "fragile"
list takes steps to not consume poison across cachelines.
The discussion with Linus made clear that the current "_mcsafe" suffix
was imprecise to a fault. The operations that are needed by pmem/dax are
to copy from a source address that might throw #MC to a destination that
may write-fault, if it is a user page.
So copy_to_user_mcsafe() becomes copy_mc_to_user() to indicate
the separate precautions taken on source and destination.
copy_mc_to_kernel() is introduced as a non-SMAP version that does not
expect write-faults on the destination, but is still prepared to abort
with an error code upon taking #MC.
The original copy_mc_fragile() implementation had negative performance
implications since it did not use the fast-string instruction sequence
to perform copies. For this reason copy_mc_to_kernel() fell back to
plain memcpy() to preserve performance on platforms that did not indicate
the capability to recover from machine check exceptions. However, that
capability detection was not architectural and now that some platforms
can recover from fast-string consumption of memory errors the memcpy()
fallback now causes these more capable platforms to fail.
Introduce copy_mc_enhanced_fast_string() as the fast default
implementation of copy_mc_to_kernel() and finalize the transition of
copy_mc_fragile() to be a platform quirk to indicate 'copy-carefully'.
With this in place, copy_mc_to_kernel() is fast and recovery-ready by
default regardless of hardware capability.
Thanks to Vivek for identifying that copy_user_generic() is not suitable
as the copy_mc_to_user() backend since the #MC handler explicitly checks
ex_has_fault_handler(). Thanks to the 0day robot for catching a
performance bug in the x86/copy_mc_to_user implementation.
[ bp: Add the "why" for this change from the 0/2th message, massage. ]
Fixes: 92b0729c34 ("x86/mm, x86/mce: Add memcpy_mcsafe()")
Reported-by: Erwin Tsaur <erwin.tsaur@intel.com>
Reported-by: 0day robot <lkp@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Tested-by: Erwin Tsaur <erwin.tsaur@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/160195562556.2163339.18063423034951948973.stgit@dwillia2-desk3.amr.corp.intel.com
In reaction to a proposal to introduce a memcpy_mcsafe_fast()
implementation Linus points out that memcpy_mcsafe() is poorly named
relative to communicating the scope of the interface. Specifically what
addresses are valid to pass as source, destination, and what faults /
exceptions are handled.
Of particular concern is that even though x86 might be able to handle
the semantics of copy_mc_to_user() with its common copy_user_generic()
implementation other archs likely need / want an explicit path for this
case:
On Fri, May 1, 2020 at 11:28 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Thu, Apr 30, 2020 at 6:21 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > However now I see that copy_user_generic() works for the wrong reason.
> > It works because the exception on the source address due to poison
> > looks no different than a write fault on the user address to the
> > caller, it's still just a short copy. So it makes copy_to_user() work
> > for the wrong reason relative to the name.
>
> Right.
>
> And it won't work that way on other architectures. On x86, we have a
> generic function that can take faults on either side, and we use it
> for both cases (and for the "in_user" case too), but that's an
> artifact of the architecture oddity.
>
> In fact, it's probably wrong even on x86 - because it can hide bugs -
> but writing those things is painful enough that everybody prefers
> having just one function.
Replace a single top-level memcpy_mcsafe() with either
copy_mc_to_user(), or copy_mc_to_kernel().
Introduce an x86 copy_mc_fragile() name as the rename for the
low-level x86 implementation formerly named memcpy_mcsafe(). It is used
as the slow / careful backend that is supplanted by a fast
copy_mc_generic() in a follow-on patch.
One side-effect of this reorganization is that separating copy_mc_64.S
to its own file means that perf no longer needs to track dependencies
for its memcpy_64.S benchmarks.
[ bp: Massage a bit. ]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: <stable@vger.kernel.org>
Link: http://lore.kernel.org/r/CAHk-=wjSqtXAqfUJxFtWNwmguFASTgB0dz1dT3V-78Quiezqbg@mail.gmail.com
Link: https://lkml.kernel.org/r/160195561680.2163339.11574962055305783722.stgit@dwillia2-desk3.amr.corp.intel.com
Most of dma-debug.h is not required by anything outside of kernel/dma.
Move the four declarations needed by dma-mappin.h or dma-ops providers
into dma-mapping.h and dma-map-ops.h, and move the remainder of the
file to kernel/dma/debug.h.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Merge dma-contiguous.h into dma-map-ops.h, after removing the comment
describing the contiguous allocator into kernel/dma/contigous.c.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Split out all the bits that are purely for dma_map_ops implementations
and related code into a new <linux/dma-map-ops.h> header so that they
don't get pulled into all the drivers. That also means the architecture
specific <asm/dma-mapping.h> is not pulled in by <linux/dma-mapping.h>
any more, which leads to a missing includes that were pulled in by the
x86 or arm versions in a few not overly portable drivers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Rejecting non-native endian BTF overlapped with the addition
of support for it.
The rest were more simple overlapping changes, except the
renesas ravb binding update, which had to follow a file
move as well as a YAML conversion.
Signed-off-by: David S. Miller <davem@davemloft.net>
When running as Xen dom0 the kernel isn't responsible for selecting the
error handling mode, this should be handled by the hypervisor.
So disable setting FF mode when running as Xen pv guest. Not doing so
might result in boot splats like:
[ 7.509696] HEST: Enabling Firmware First mode for corrected errors.
[ 7.510382] mce: [Firmware Bug]: Ignoring request to disable invalid MCA bank 2.
[ 7.510383] mce: [Firmware Bug]: Ignoring request to disable invalid MCA bank 3.
[ 7.510384] mce: [Firmware Bug]: Ignoring request to disable invalid MCA bank 4.
[ 7.510384] mce: [Firmware Bug]: Ignoring request to disable invalid MCA bank 5.
[ 7.510385] mce: [Firmware Bug]: Ignoring request to disable invalid MCA bank 6.
[ 7.510386] mce: [Firmware Bug]: Ignoring request to disable invalid MCA bank 7.
[ 7.510386] mce: [Firmware Bug]: Ignoring request to disable invalid MCA bank 8.
Reason is that the HEST ACPI table contains the real number of MCA
banks, while the hypervisor is emulating only 2 banks for guests.
Cc: stable@vger.kernel.org
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/20200925140751.31381-1-jgross@suse.com
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
An incorrect sizeof is being used, struct attribute ** is not correct,
it should be struct attribute *. Note that since ** is the same size as
* this is not causing any issues. Improve this fix by using sizeof(*attrs)
as this allows us to not even reference the type of the pointer.
Addresses-Coverity: ("Sizeof not portable (SIZEOF_MISMATCH)")
Fixes: 5168654630 ("x86/events/amd/iommu: Fix sysfs perf attribute groups")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201001113900.58889-1-colin.king@canonical.com
It might be possible that different CPUs have different CPU metrics on a
platform. In this case, writing the GLOBAL_CTRL_EN_PERF_METRICS bit to
the GLOBAL_CTRL register of a CPU, which doesn't support the TopDown
perf metrics feature, causes MSR access error.
Current TopDown perf metrics feature is enumerated using the boot CPU's
PERF_CAPABILITIES MSR. The MSR only indicates the boot CPU supports this
feature.
Check the PERF_CAPABILITIES MSR for each CPU. If any CPU doesn't support
the perf metrics feature, disable the feature globally.
Fixes: 59a854e2f3 ("perf/x86/intel: Support TopDown metrics on Ice Lake")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201001211711.25708-1-kan.liang@linux.intel.com
The PFEC_MASK and PFEC_MATCH fields in the VMCS reverse the meaning of
the #PF intercept bit in the exception bitmap when they do not match.
This means that, if PFEC_MASK and/or PFEC_MATCH are set, the
hypervisor can get a vmexit for #PF exceptions even when the
corresponding bit is clear in the exception bitmap.
This is unexpected and is promptly detected by a WARN_ON_ONCE.
To fix it, reset PFEC_MASK and PFEC_MATCH when the #PF intercept
is disabled (as is common with enable_ept && !allow_smaller_maxphyaddr).
Reported-by: Qian Cai <cai@redhat.com>>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that import_iovec handles compat iovecs, the native syscalls
can be used for the compat case as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now that import_iovec handles compat iovecs, the native vmsplice syscall
can be used for the compat case as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now that import_iovec handles compat iovecs, the native readv and writev
syscalls can be used for the compat case as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In common with memoryless domains only register GI domains
if the proximity node is not online. If a domain is already
a memory containing domain, or a memoryless domain there is
nothing to do just because it also contains a Generic Initiator.
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add userspace support for the Memory Tagging Extension introduced by
Armv8.5.
(Catalin Marinas and others)
* for-next/mte: (30 commits)
arm64: mte: Fix typo in memory tagging ABI documentation
arm64: mte: Add Memory Tagging Extension documentation
arm64: mte: Kconfig entry
arm64: mte: Save tags when hibernating
arm64: mte: Enable swap of tagged pages
mm: Add arch hooks for saving/restoring tags
fs: Handle intra-page faults in copy_mount_options()
arm64: mte: ptrace: Add NT_ARM_TAGGED_ADDR_CTRL regset
arm64: mte: ptrace: Add PTRACE_{PEEK,POKE}MTETAGS support
arm64: mte: Allow {set,get}_tagged_addr_ctrl() on non-current tasks
arm64: mte: Restore the GCR_EL1 register after a suspend
arm64: mte: Allow user control of the generated random tags via prctl()
arm64: mte: Allow user control of the tag check mode via prctl()
mm: Allow arm64 mmap(PROT_MTE) on RAM-based files
arm64: mte: Validate the PROT_MTE request via arch_validate_flags()
mm: Introduce arch_validate_flags()
arm64: mte: Add PROT_MTE support to mmap() and mprotect()
mm: Introduce arch_calc_vm_flag_bits()
arm64: mte: Tags-aware aware memcmp_pages() implementation
arm64: Avoid unnecessary clear_user_page() indirection
...
Printing "Bad RIP value" if copy_code() fails can be misleading for
userspace pointers, since copy_code() can fail if the instruction
pointer is valid but the code is paged out. This is because copy_code()
calls copy_from_user_nmi() for userspace pointers, which disables page
fault handling.
This is reproducible in OOM situations, where it's plausible that the
code may be reclaimed in the time between entry into the kernel and when
this message is printed. This leaves a misleading log in dmesg that
suggests instruction pointer corruption has occurred, which may alarm
users.
Change the message to state the error condition more precisely.
[ bp: Massage a bit. ]
Signed-off-by: Mark Mossberg <mark.mossberg@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201002042915.403558-1-mark.mossberg@gmail.com
This patch removes a few ineffectual assignments from the function
crypto_poly1305_setdctxkey.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
When nmi_check_duration() is checking the time an NMI handler took to
execute, the whole_msecs value used should be read from the @duration
argument, not from the ->max_duration, the latter being used to store
the current maximal duration.
[ bp: Rewrite commit message. ]
Fixes: 248ed51048 ("x86/nmi: Remove irq_work from the long duration NMI handler")
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Libing Zhou <libing.zhou@nokia-sbell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Changbin Du <changbin.du@gmail.com>
Link: https://lkml.kernel.org/r/20200820025641.44075-1-libing.zhou@nokia-sbell.com
The CRn accessor functions use __force_order as a dummy operand to
prevent the compiler from reordering CRn reads/writes with respect to
each other.
The fact that the asm is volatile should be enough to prevent this:
volatile asm statements should be executed in program order. However GCC
4.9.x and 5.x have a bug that might result in reordering. This was fixed
in 8.1, 7.3 and 6.5. Versions prior to these, including 5.x and 4.9.x,
may reorder volatile asm statements with respect to each other.
There are some issues with __force_order as implemented:
- It is used only as an input operand for the write functions, and hence
doesn't do anything additional to prevent reordering writes.
- It allows memory accesses to be cached/reordered across write
functions, but CRn writes affect the semantics of memory accesses, so
this could be dangerous.
- __force_order is not actually defined in the kernel proper, but the
LLVM toolchain can in some cases require a definition: LLVM (as well
as GCC 4.9) requires it for PIE code, which is why the compressed
kernel has a definition, but also the clang integrated assembler may
consider the address of __force_order to be significant, resulting in
a reference that requires a definition.
Fix this by:
- Using a memory clobber for the write functions to additionally prevent
caching/reordering memory accesses across CRn writes.
- Using a dummy input operand with an arbitrary constant address for the
read functions, instead of a global variable. This will prevent reads
from being reordered across writes, while allowing memory loads to be
cached/reordered across CRn reads, which should be safe.
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82602
Link: https://lore.kernel.org/lkml/20200527135329.1172644-1-arnd@arndb.de/
Link: https://lkml.kernel.org/r/20200902232152.3709896-1-nivedita@alum.mit.edu
The recent fix for NMI vs. IRQ state tracking missed to apply the cure
to the MCE handler.
Fixes: ba1f2b2eaa ("x86/entry: Fix NMI vs IRQ state tracking")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/87mu17ism2.fsf@nanos.tec.linutronix.de
Way back in v3.19 Intel and AMD shared the same machine check severity
grading code. So it made sense to add a case for AMD DEFERRED errors in
commit
e3480271f5 ("x86, mce, severity: Extend the the mce_severity mechanism to handle UCNA/DEFERRED error")
But later in v4.2 AMD switched to a separate grading function in
commit
bf80bbd7dc ("x86/mce: Add an AMD severities-grading function")
Belatedly drop the DEFERRED case from the Intel rule list.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200930021313.31810-3-tony.luck@intel.com
The patrol scrubber in Skylake and Cascade Lake systems can be configured
to report uncorrected errors using a special signature in the machine
check bank and to signal using CMCI instead of machine check.
Update the severity calculation mechanism to allow specifying the model,
minimum stepping and range of machine check bank numbers.
Add a new rule to detect the special signature (on model 0x55, stepping
>=4 in any of the memory controller banks).
[ bp: Rewrite it.
aegl: Productize it. ]
Suggested-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Co-developed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200930021313.31810-2-tony.luck@intel.com
There is no particular reason for keeping the "sub 0, %rsp" insn within
the BPF's x64 JIT prologue.
When tail call code was skipping the whole prologue section these 7
bytes that represent the rsp subtraction could not be simply discarded
as the jump target address would be broken. An option to address that
would be to substitute it with nop7.
Right now tail call is skipping only first 11 bytes of target program's
prologue and "sub X, %rsp" is the first insn that is processed, so if
stack depth is zero then this insn could be omitted without the need for
nop7 swap.
Therefore, do not emit the "sub 0, %rsp" in prologue when program is not
making use of R10 register. Also, make the emission of "add X, %rsp"
conditional in tail call code logic and take into account the presence
of mentioned insn when calculating the jump offsets.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200929204653.4325-3-maciej.fijalkowski@intel.com
Back when all of the callee-saved registers where always pushed to stack
in x64 JIT prologue, tail call counter was placed at the bottom of the
BPF program's stack frame that had a following layout:
+-------------+
| ret addr |
+-------------+
| rbp | <- rbp
+-------------+
| |
| free space |
| from: |
| sub $x,%rsp |
| |
+-------------+
| rbx |
+-------------+
| r13 |
+-------------+
| r14 |
+-------------+
| r15 |
+-------------+
| tail call | <- rsp
| counter |
+-------------+
In order to restore the callee saved registers, epilogue needed to
explicitly toss away the tail call counter via "pop %rbx" insn, so that
%rsp would be back at the place where %r15 was stored.
Currently, the tail call counter is placed on stack *before* the callee
saved registers (brackets on rbx through r15 mean that they are now
pushed to stack only if they are used):
+-------------+
| ret addr |
+-------------+
| rbp | <- rbp
+-------------+
| |
| free space |
| from: |
| sub $x,%rsp |
| |
+-------------+
| tail call |
| counter |
+-------------+
( rbx )
+-------------+
( r13 )
+-------------+
( r14 )
+-------------+
( r15 ) <- rsp
+-------------+
For the record, the epilogue insns consist of (assuming all of the
callee saved registers are used by program):
pop %r15
pop %r14
pop %r13
pop %rbx
pop %rcx
leaveq
retq
"pop %rbx" for getting rid of tail call counter was not an option
anymore as it would overwrite the restored value of %rbx register, so it
was changed to use the %rcx register.
Since epilogue can start popping the callee saved registers right away
without any additional work, the "pop %rcx" could be dropped altogether
as "leave" insn will simply move the %rbp to %rsp. IOW, tail call
counter does not need the explicit handling.
Having in mind the explanation above and the actual reason for that,
let's piggy back on "leave" insn for discarding the tail call counter
from stack and remove the "pop %rcx" from epilogue.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200929204653.4325-2-maciej.fijalkowski@intel.com
PCI devices support two variants of the D3 power state: D3hot (main power
present) D3cold (main power removed). Previously struct pci_dev contained:
unsigned int d3_delay; /* D3->D0 transition time in ms */
unsigned int d3cold_delay; /* D3cold->D0 transition time in ms */
"d3_delay" refers specifically to the D3hot state. Rename it to
"d3hot_delay" to avoid ambiguity and align with the ACPI "_DSM for
Specifying Device Readiness Durations" in the PCI Firmware spec r3.2,
sec 4.6.9.
There is no change to the functionality.
Link: https://lore.kernel.org/r/20200730210848.1578826-1-kw@linux.com
Signed-off-by: Krzysztof Wilczyński <kw@linux.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
An error occues when sampling non-PEBS INST_RETIRED.PREC_DIST(0x01c0)
event.
perf record -e cpu/event=0xc0,umask=0x01/ -- sleep 1
Error:
The sys_perf_event_open() syscall returned with 22 (Invalid argument)
for event (cpu/event=0xc0,umask=0x01/).
/bin/dmesg | grep -i perf may provide additional information.
The idxmsk64 of the event is set to 0. The event never be successfully
scheduled.
The event should be limit to the fixed counter 0.
Fixes: 6017608936 ("perf/x86/intel: Add Icelake support")
Reported-by: Yi, Ammy <ammy.yi@intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200928134726.13090-1-kan.liang@linux.intel.com
The "MiB" result of the IMC free-running bandwidth events,
uncore_imc_free_running/read/ and uncore_imc_free_running/write/ are 16
times too small.
The "MiB" value equals the raw IMC free-running bandwidth counter value
times a "scale" which is inaccurate.
The IMC free-running bandwidth events should be incremented per 64B
cache line, not DWs (4 bytes). The "scale" should be 6.103515625e-5.
Fix the "scale" for both Snow Ridge and Ice Lake.
Fixes: 2b3b76b5ec ("perf/x86/intel/uncore: Add Ice Lake server uncore support")
Fixes: ee49532b38 ("perf/x86/intel/uncore: Add IMC uncore support for Snow Ridge")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200928133240.12977-1-kan.liang@linux.intel.com
Introduced early attributes /sys/devices/uncore_iio_<pmu_idx>/die* are
initialized by skx_iio_set_mapping(), however, for example, for multiple
segment platforms skx_iio_get_topology() returns -EPERM before a list of
attributes in skx_iio_mapping_group will have been initialized.
As a result the list is being NULL. Thus the warning
"sysfs: (bin_)attrs not set by subsystem for group: uncore_iio_*/" appears
and uncore_iio pmus are not available in sysfs. Clear IIO attr_update
to properly handle the cases when topology information cannot be
retrieved.
Fixes: bb42b3d397 ("perf/x86/intel/uncore: Expose an Uncore unit to IIO PMON mapping")
Reported-by: Kyle Meyer <kyle.meyer@hpe.com>
Suggested-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Alexander Antonov <alexander.antonov@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Alexei Budankov <alexey.budankov@linux.intel.com>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lkml.kernel.org/r/20200928102133.61041-1-alexander.antonov@linux.intel.com
The Jasper Lake processor is also a Tremont microarchitecture. From the
perspective of perf MSR, there is nothing changed compared with
Elkhart Lake.
Share the code path with Elkhart Lake.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1601296242-32763-2-git-send-email-kan.liang@linux.intel.com
The Jasper Lake processor is also a Tremont microarchitecture. From the
perspective of Intel PMU, there is nothing changed compared with
Elkhart Lake.
Share the perf code with Elkhart Lake.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1601296242-32763-1-git-send-email-kan.liang@linux.intel.com
An oops is triggered by the fuzzy test.
[ 327.853081] unchecked MSR access error: RDMSR from 0x70c at rIP:
0xffffffffc082c820 (uncore_msr_read_counter+0x10/0x50 [intel_uncore])
[ 327.853083] Call Trace:
[ 327.853085] <IRQ>
[ 327.853089] uncore_pmu_event_start+0x85/0x170 [intel_uncore]
[ 327.853093] uncore_pmu_event_add+0x1a4/0x410 [intel_uncore]
[ 327.853097] ? event_sched_in.isra.118+0xca/0x240
There are 2 GP counters for each CBOX, but the current code claims 4
counters. Accessing the invalid registers triggers the oops.
Fixes: 6e394376ee ("perf/x86/intel/uncore: Add Intel Icelake uncore support")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200925134905.8839-3-kan.liang@linux.intel.com
There are some updates for the Icelake model specific uncore performance
monitors. (The update can be found at 10th generation intel core
processors families specification update Revision 004, ICL068)
1) Counter 0 of ARB uncore unit is not available for software use
2) The global 'enable bit' (bit 29) and 'freeze bit' (bit 31) of
MSR_UNC_PERF_GLOBAL_CTRL cannot be used to control counter behavior.
Needs to use local enable in event select MSR.
Accessing the modified bit/registers will be ignored by HW. Users may
observe inaccurate results with the current code.
The changes of the MSR_UNC_PERF_GLOBAL_CTRL imply that groups cannot be
read atomically anymore. Although the error of the result for a group
becomes a bit bigger, it still far lower than not using a group. The
group support is still kept. Only Remove the *_box() related
implementation.
Since the counter 0 of ARB uncore unit is not available, update the MSR
address for the ARB uncore unit.
There is no change for IMC uncore unit, which only include free-running
counters.
Fixes: 6e394376ee ("perf/x86/intel/uncore: Add Intel Icelake uncore support")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200925134905.8839-2-kan.liang@linux.intel.com
Previously, the MSR uncore for the Ice Lake and Tiger Lake are
identical. The code path is shared. However, with recent update, the
global MSR_UNC_PERF_GLOBAL_CTRL register and ARB uncore unit are changed
for the Ice Lake. Split the Ice Lake and Tiger Lake MSR uncore support.
The changes only impact the MSR ops() and the ARB uncore unit. Other
codes can still be shared between the Ice Lake and the Tiger Lake.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200925134905.8839-1-kan.liang@linux.intel.com
7f47d8cc03 ("x86, tracing, perf: Add trace point for MSR accesses") added
tracing of msr read and write, but because of complexity in having
tracepoints in headers, and even more so for a core header like msr.h, not
to mention the bloat a tracepoint adds to inline functions, a helper
function is needed to be called from the header.
Use the new tracepoint_enabled() macro in tracepoint-defs.h to test if the
tracepoint is active before calling the helper function, instead of open
coding the same logic, which requires knowing the internals of a tracepoint.
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
KVM special-cases writes to MSR_IA32_TSC so that all CPUs have
the same base for the TSC. This logic is complicated, and we
do not want it to have any effect once the VM is started.
In particular, if any guest started to synchronize its TSCs
with writes to MSR_IA32_TSC rather than MSR_IA32_TSC_ADJUST,
the additional effect of kvm_write_tsc code would be uncharted
territory.
Therefore, this patch makes writes to MSR_IA32_TSC behave
essentially the same as writes to MSR_IA32_TSC_ADJUST when
they come from the guest. A new selftest (which passes
both before and after the patch) checks the current semantics
of writes to MSR_IA32_TSC and MSR_IA32_TSC_ADJUST originating
from both the host and the guest.
Upcoming work to remove the special side effects
of host-initiated writes to MSR_IA32_TSC and MSR_IA32_TSC_ADJUST
will be able to build onto this test, adjusting the host side
to use the new APIs and achieve the same effect.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Allow userspace to set up the memory map after KVM_SET_NESTED_STATE;
to do so, move the call to nested_svm_vmrun_msrpm inside the
KVM_REQ_GET_NESTED_STATE_PAGES handler (which is currently
not used by nSVM). This is similar to what VMX does already.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It's not desireable to have all MSRs always handled by KVM kernel space. Some
MSRs would be useful to handle in user space to either emulate behavior (like
uCode updates) or differentiate whether they are valid based on the CPU model.
To allow user space to specify which MSRs it wants to see handled by KVM,
this patch introduces a new ioctl to push filter rules with bitmaps into
KVM. Based on these bitmaps, KVM can then decide whether to reject MSR access.
With the addition of KVM_CAP_X86_USER_SPACE_MSR it can also deflect the
denied MSR events to user space to operate on.
If no filter is populated, MSR handling stays identical to before.
Signed-off-by: Alexander Graf <graf@amazon.com>
Message-Id: <20200925143422.21718-8-graf@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We will introduce the concept of MSRs that may not be handled in kernel
space soon. Some MSRs are directly passed through to the guest, effectively
making them handled by KVM from user space's point of view.
This patch introduces all logic required to ensure that MSRs that
user space wants trapped are not marked as direct access for guests.
Signed-off-by: Alexander Graf <graf@amazon.com>
Message-Id: <20200925143422.21718-7-graf@amazon.com>
[Replace "_idx" with "_slot". - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We will introduce the concept of MSRs that may not be handled in kernel
space soon. Some MSRs are directly passed through to the guest, effectively
making them handled by KVM from user space's point of view.
This patch introduces all logic required to ensure that MSRs that
user space wants trapped are not marked as direct access for guests.
Signed-off-by: Alexander Graf <graf@amazon.com>
Message-Id: <20200925143422.21718-6-graf@amazon.com>
[Make terminology a bit more similar to VMX. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Prepare vmx and svm for a subsequent change that ensures the MSR permission
bitmap is set to allow an MSR that userspace is tracking to force a vmx_vmexit
in the guest.
Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
[agraf: rebase, adapt SVM scheme to nested changes that came in between]
Signed-off-by: Alexander Graf <graf@amazon.com>
Message-Id: <20200925143422.21718-5-graf@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In the following commits we will add pieces of MSR filtering.
To ensure that code compiles even with the feature half-merged, let's add
a few stubs and struct definitions before the real patches start.
Signed-off-by: Alexander Graf <graf@amazon.com>
Message-Id: <20200925143422.21718-4-graf@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
MSRs are weird. Some of them are normal control registers, such as EFER.
Some however are registers that really are model specific, not very
interesting to virtualization workloads, and not performance critical.
Others again are really just windows into package configuration.
Out of these MSRs, only the first category is necessary to implement in
kernel space. Rarely accessed MSRs, MSRs that should be fine tunes against
certain CPU models and MSRs that contain information on the package level
are much better suited for user space to process. However, over time we have
accumulated a lot of MSRs that are not the first category, but still handled
by in-kernel KVM code.
This patch adds a generic interface to handle WRMSR and RDMSR from user
space. With this, any future MSR that is part of the latter categories can
be handled in user space.
Furthermore, it allows us to replace the existing "ignore_msrs" logic with
something that applies per-VM rather than on the full system. That way you
can run productive VMs in parallel to experimental ones where you don't care
about proper MSR handling.
Signed-off-by: Alexander Graf <graf@amazon.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20200925143422.21718-3-graf@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When we find an MSR that we can not handle, bubble up that error code as
MSR error return code. Follow up patches will use that to expose the fact
that an MSR is not handled by KVM to user space.
Suggested-by: Aaron Lewis <aaronlewis@google.com>
Signed-off-by: Alexander Graf <graf@amazon.com>
Message-Id: <20200925143422.21718-2-graf@amazon.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename "index" to "slot" in struct vmx_uret_msr to align with the
terminology used by common x86's kvm_user_return_msrs, and to avoid
conflating "MSR's ECX index" with "MSR's index into an array".
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-16-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename "vmx_msr_index" to "vmx_uret_msrs_list" to associate it with the
uret MSRs array, and to avoid conflating "MSR's ECX index" with "MSR's
index into an array". Similarly, don't use "slot" in the name as that
terminology is claimed by the common x86 "user_return_msrs" mechanism.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-15-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add "uret" to vmx_set_guest_msr() to explicitly associate it with the
guest_uret_msrs array, and to differentiate it from vmx_set_msr() as
well as VMX's load/store MSRs.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-14-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename "find_msr_entry" to scope it to VMX and to associate it with
guest_uret_msrs. Drop the "entry" so that the function name pairs with
the existing __vmx_find_uret_msr(), which intentionally uses a double
underscore prefix instead of appending "index" or "slot" as those names
are already claimed by other pieces of the user return MSR stack.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-13-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add vmx_setup_uret_msr() to wrap the lookup and manipulation of the uret
MSRs array during setup_msrs(). In addition to consolidating code, this
eliminates move_msr_up(), which while being a very literally description
of the function, isn't exacly helpful in understanding the net effect of
the code.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-12-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move checking for the existence of MSR_EFER in the uret MSR array into
update_transition_efer() so that the lookup and manipulation of the
array in setup_msrs() occur back-to-back. This paves the way toward
adding a helper to wrap the lookup and manipulation.
To avoid unnecessary overhead, defer the lookup until the uret array
would actually be modified in update_transition_efer(). EFER obviously
exists on CPUs that support the dedicated VMCS fields for switching
EFER, and EFER must exist for the guest and host EFER.NX value to
diverge, i.e. there is no danger of attempting to read/write EFER when
it doesn't exist.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-11-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Check for RDTSCP support prior to checking if MSR_TSC_AUX is in the uret
MSRs array so that the array lookup and manipulation are back-to-back.
This paves the way toward adding a helper to wrap the lookup and
manipulation.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-10-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename "__find_msr_index" to scope it to VMX, associate it with
guest_uret_msrs, and to avoid conflating "MSR's ECX index" with "MSR's
array index". Similarly, don't use "slot" in the name so as to avoid
colliding the common x86's half of "user_return_msrs" (the slot in
kvm_user_return_msrs is not the same slot in guest_uret_msrs).
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-9-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add "uret" to "guest_msrs_ready" to explicitly associate it with the
"guest_uret_msrs" array, and replace "ready" with "loaded" to more
precisely reflect what it tracks, e.g. "ready" could be interpreted as
meaning ready for processing (setup_msrs() has run), which is wrong.
"loaded" also aligns with the similar "guest_state_loaded" field.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-8-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add "uret" into the name of "save_nmsrs" to explicitly associate it with
the guest_uret_msrs array, and replace "save" with "active" (for lack of
a better word) to better describe what is being tracked. While "save"
is more or less accurate when viewed as a literal description of the
field, e.g. it holds the number of MSRs that were saved into the array
the last time setup_msrs() was invoked, it can easily be misinterpreted
by the reader, e.g. as meaning the number of MSRs that were saved from
hardware at some point in the past, or as the number of MSRs that need
to be saved at some point in the future, both of which are wrong.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-7-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename vcpu_vmx.nsmrs to vcpu_vmx.nr_uret_msrs to explicitly associate
it with the guest_uret_msrs array.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-6-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename struct "shared_msr_entry" to "vmx_uret_msr" to align with x86's
rename of "shared_msrs" to "user_return_msrs", and to call out that the
struct is specific to VMX, i.e. not part of the generic "shared_msrs"
framework. Abbreviate "user_return" as "uret" to keep line lengths
marginally sane and code more or less readable.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add "loadstore" to vmx_find_msr_index() to differentiate it from the so
called shared MSRs helpers (which will soon be renamed), and replace
"index" with "slot" to better convey that the helper returns slot in the
array, not the MSR index (the value that gets stuffed into ECX).
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-4-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add "MAX" to the LOADSTORE and so called SHARED MSR defines to make it
more clear that the define controls the array size, as opposed to the
actual number of valid entries that are in the array.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename the "shared_msrs" mechanism, which is used to defer restoring
MSRs that are only consumed when running in userspace, to a more banal
but less likely to be confusing "user_return_msrs".
The "shared" nomenclature is confusing as it's not obvious who is
sharing what, e.g. reasonable interpretations are that the guest value
is shared by vCPUs in a VM, or that the MSR value is shared/common to
guest and host, both of which are wrong.
"shared" is also misleading as the MSR value (in hardware) is not
guaranteed to be shared/reused between VMs (if that's indeed the correct
interpretation of the name), as the ability to share values between VMs
is simply a side effect (albiet a very nice side effect) of deferring
restoration of the host value until returning from userspace.
"user_return" avoids the above confusion by describing the mechanism
itself instead of its effects.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move initialization of 'struct kvm_mmu' fields into alloc_mmu_pages() to
consolidate code, and rename the helper to __kvm_mmu_create().
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923163314.8181-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Read vmcs.EXIT_QUALIFICATION and vmcs.VM_EXIT_INTR_INFO only if the
VM-Exit is being reflected to L1 now that they are no longer passed
directly to the kvm_nested_vmexit tracepoint.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923201349.16097-8-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the newly introduced TRACE_EVENT_KVM_EXIT to define the guts of
kvm_nested_vmexit so that it captures and prints the same information as
kvm_exit. This has the bonus side effect of fixing the interrupt info
and error code printing for the case where they're invalid, e.g. if the
exit was a failed VM-Entry. This also sets the stage for retrieving
EXIT_QUALIFICATION and VM_EXIT_INTR_INFO in nested_vmx_reflect_vmexit()
if and only if the VM-Exit is being routed to L1.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923201349.16097-7-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Macrofy the definition of kvm_exit so that the definition can be reused
verbatim by kvm_nested_vmexit.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923201349.16097-6-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Extend the kvm_exit tracepoint to align it with kvm_nested_vmexit in
terms of what information is captured. On SVM, add interrupt info and
error code, while on VMX it add IDT vectoring and error code. This
sets the stage for macrofying the kvm_exit tracepoint definition so that
it can be reused for kvm_nested_vmexit without loss of information.
Opportunistically stuff a zero for VM_EXIT_INTR_INFO if the VM-Enter
failed, as the field is guaranteed to be invalid. Note, it'd be
possible to further filter the interrupt/exception fields based on the
VM-Exit reason, but the helper is intended only for tracepoints, i.e.
an extra VMREAD or two is a non-issue, the failed VM-Enter case is just
low hanging fruit.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923201349.16097-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a helper, is_exception_with_error_code(), to provide the simple but
difficult to read code of checking for a valid exception with an error
code given a vmcs.VM_EXIT_INTR_INFO value. The helper will gain another
user, vmx_get_exit_info(), in a future patch.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923201349.16097-4-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use kvm_rip_read() to read the guest's RIP for the nested VM-Exit
tracepoint instead of having the caller pass in an argument. Params
that are passed into a tracepoint are evaluated even if the tracepoint
is disabled, i.e. passing in RIP for VMX incurs a VMREAD and retpoline
to retrieve a value that may never be used, e.g. if the exit is due to a
hardware interrupt.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923201349.16097-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add RIP to the kvm_entry tracepoint to help debug if the kvm_exit
tracepoint is disabled or if VM-Enter fails, in which case the kvm_exit
tracepoint won't be hit.
Read RIP from within the tracepoint itself to avoid a potential VMREAD
and retpoline if the guest's RIP isn't available.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923201349.16097-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
WARN if KVM attempts to switch to the currently loaded VMCS. Now that
nested_vmx_free_vcpu() doesn't blindly call vmx_switch_vmcs(), all paths
that lead to vmx_switch_vmcs() are implicitly guarded by guest vs. host
mode, e.g. KVM should never emulate VMX instructions when guest mode is
active, and nested_vmx_vmexit() should never be called when host mode is
active.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923184452.980-8-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove the explicit switch to vmcs01 and the call to free_nested() in
nested_vmx_free_vcpu(). free_nested(), which is called unconditionally
by vmx_leave_nested(), ensures vmcs01 is loaded prior to freeing vmcs02
and friends.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923184452.980-7-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a WARN in free_nested() to ensure vmcs01 is loaded prior to freeing
vmcs02 and friends, and explicitly switch to vmcs01 if it's not. KVM is
supposed to keep is_guest_mode() and loaded_vmcs==vmcs02 synchronized,
but bugs happen and freeing vmcs02 while it's in use will escalate a KVM
error to a use-after-free and potentially crash the kernel.
Do the WARN and switch even in the !vmxon case to help detect latent
bugs. free_nested() is not a hot path, and the check is cheap.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923184452.980-6-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move free_nested() down below vmx_switch_vmcs() so that a future patch
can do an "emergency" invocation of vmx_switch_vmcs() if vmcs01 is not
the loaded VMCS when freeing nested resources.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923184452.980-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Call guest_state_valid() directly instead of querying emulation_required
when checking if L1 is attempting VM-Enter with invalid guest state.
If emulate_invalid_guest_state is false, KVM will fixup segment regs to
avoid emulation and will never set emulation_required, i.e. KVM will
incorrectly miss the associated consistency checks because the nested
path stuffs segments directly into vmcs02.
Opportunsitically add Consistency Check tracing to make future debug
suck a little less.
Fixes: 2bb8cafea8 ("KVM: vVMX: signal failure for nested VMEntry if emulation_required")
Fixes: 3184a995f7 ("KVM: nVMX: fix vmentry failure code when L2 state would require emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923184452.980-4-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reload vmcs01 when bailing from nested_vmx_enter_non_root_mode() as KVM
expects vmcs01 to be loaded when is_guest_mode() is false.
Fixes: 671ddc700f ("KVM: nVMX: Don't leak L1 MMIO regions to L2")
Cc: stable@vger.kernel.org
Cc: Dan Cross <dcross@google.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Peter Shier <pshier@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923184452.980-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Explicitly reset the segment cache after stuffing guest segment regs in
prepare_vmcs02_rare(). Although the cache is reset when switching to
vmcs02, there is nothing that prevents KVM from re-populating the cache
prior to writing vmcs02 with vmcs12's values. E.g. if the vCPU is
preempted after switching to vmcs02 but before prepare_vmcs02_rare(),
kvm_arch_vcpu_put() will dereference GUEST_SS_AR_BYTES via .get_cpl()
and cache the stale vmcs02 value. While the current code base only
caches stale data in the preemption case, it's theoretically possible
future code could read a segment register during the nested flow itself,
i.e. this isn't technically illegal behavior in kvm_arch_vcpu_put(),
although it did introduce the bug.
This manifests as an unexpected nested VM-Enter failure when running
with unrestricted guest disabled if the above preemption case coincides
with L1 switching L2's CPL, e.g. when switching from a L2 vCPU at CPL3
to to a L2 vCPU at CPL0. stack_segment_valid() will see the new SS_SEL
but the old SS_AR_BYTES and incorrectly mark the guest state as invalid
due to SS.dpl != SS.rpl.
Don't bother updating the cache even though prepare_vmcs02_rare() writes
every segment. With unrestricted guest, guest segments are almost never
read, let alone L2 guest segments. On the other hand, populating the
cache requires a large number of memory writes, i.e. it's unlikely to be
a net win. Updating the cache would be a win when unrestricted guest is
not supported, as guest_state_valid() will immediately cache all segment
registers. But, nested virtualization without unrestricted guest is
dirt slow, saving some VMREADs won't change that, and every CPU
manufactured in the last decade supports unrestricted guest. In other
words, the extra (minor) complexity isn't worth the trouble.
Note, kvm_arch_vcpu_put() may see stale data when querying guest CPL
depending on when preemption occurs. This is "ok" in that the usage is
imperfect by nature, i.e. it's used heuristically to improve performance
but doesn't affect functionality. kvm_arch_vcpu_put() could be "fixed"
by also disabling preemption while loading segments, but that's
pointless and misleading as reading state from kvm_sched_{in,out}() is
guaranteed to see stale data in one form or another. E.g. even if all
the usage of regs_avail is fixed to call kvm_register_mark_available()
after the associated state is set, the individual state might still be
stale with respect to the overall vCPU state. I.e. making functional
decisions in an asynchronous hook is doomed from the get go. Thankfully
KVM doesn't do that.
Fixes: de63ad4cf4 ("KVM: X86: implement the logic for spinlock optimization")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923184452.980-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use bools to track write and user faults throughout the page fault paths
and down into mmu_set_spte(). The actual usage is purely boolean, but
that's not obvious without digging into all paths as the current code
uses a mix of bools (TDP and try_async_pf) and ints (shadow paging and
mmu_set_spte()).
No true functional change intended (although the pgprintk() will now
print 0/1 instead of 0/PFERR_WRITE_MASK).
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923183735.584-9-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the "ITLB multi-hit workaround enabled" check into the callers of
disallowed_hugepage_adjust() to make it more obvious that the helper is
specific to the workaround, and to be consistent with the accounting,
i.e. account_huge_nx_page() is called if and only if the workaround is
enabled.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923183735.584-8-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename 'hlevel', which presumably stands for 'host level', to simply
'level' in FNAME(fetch). The variable hasn't tracked the host level for
quite some time.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923183735.584-7-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Condition the accounting of a disallowed huge NX page on the original
requested level of the page being greater than the current iterator
level. This does two things: accounts the page if and only if a huge
page was actually disallowed, and accounts the shadow page if and only
if it was the level at which the huge page was disallowed. For the
latter case, the previous logic would account all shadow pages used to
create the translation for the forced small page, e.g. even PML4, which
can't be a huge page on current hardware, would be accounted as having
been a disallowed huge page when using 5-level EPT.
The overzealous accounting is purely a performance issue, i.e. the
recovery thread will spuriously zap shadow pages, but otherwise the bad
behavior is harmless.
Cc: Junaid Shahid <junaids@google.com>
Fixes: b8e8c8303f ("kvm: mmu: ITLB_MULTIHIT mitigation")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923183735.584-6-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Apply the "huge page disallowed" adjustment of the max level only after
capturing the original requested level. The requested level will be
used in a future patch to skip adding pages to the list of disallowed
huge pages if a huge page wasn't possible anyways, e.g. if the page
isn't mapped as a huge page in the host.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923183735.584-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Calculate huge_page_disallowed in __direct_map() and FNAME(fetch) in
preparation for reworking the calculation so that it preserves the
requested map level and eventually to avoid flagging a shadow page as
being disallowed for being used as a large/huge page when it couldn't
have been huge in the first place, e.g. because the backing page in the
host is not large.
Pass the error code into the helpers and use it to recalcuate exec and
write_fault instead adding yet more booleans to the parameters.
Opportunistically use huge_page_disallowed instead of lpage_disallowed
to match the nomenclature used within the mapping helpers (though even
they have existing inconsistencies).
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923183735.584-4-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Refactor the zap loop in kvm_recover_nx_lpages() to be a for loop that
iterates on to_zap and drop the !to_zap check that leads to the in-loop
calling of kvm_mmu_commit_zap_page(). The in-loop commit when to_zap
hits zero is superfluous now that there's an unconditional commit after
the loop to handle the case where lpage_disallowed_mmu_pages is emptied.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923183735.584-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Call kvm_mmu_commit_zap_page() after exiting the "prepare zap" loop in
kvm_recover_nx_lpages() to finish zapping pages in the unlikely event
that the loop exited due to lpage_disallowed_mmu_pages being empty.
Because the recovery thread drops mmu_lock() when rescheduling, it's
possible that lpage_disallowed_mmu_pages could be emptied by a different
thread without to_zap reaching zero despite to_zap being derived from
the number of disallowed lpages.
Fixes: 1aa9b9572b ("kvm: x86: mmu: Recovery of shattered NX large pages")
Cc: Junaid Shahid <junaids@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923183735.584-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename ops.h to vmx_ops.h to allow adding a tdx_ops.h in the future
without causing massive confusion.
Trust Domain Extensions (TDX) is built on VMX, but KVM cannot directly
access the VMCS(es) for a TDX guest, thus TDX will need its own "ops"
implementation for wrapping the low level operations.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923183112.3030-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Extract the posted interrupt code so that it can be reused for Trust
Domain Extensions (TDX), which requires posted interrupts and can use
KVM VMX's implementation almost verbatim. TDX is different enough from
raw VMX that it is highly desirable to implement the guts of TDX in a
separate file, i.e. reusing posted interrupt code by shoving TDX support
into vmx.c would be a mess.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923183112.3030-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Detect spurious page faults, e.g. page faults that occur when multiple
vCPUs simultaneously access a not-present page, and skip the SPTE write,
prefetch, and stats update for spurious faults.
Note, the performance benefits of skipping the write and prefetch are
likely negligible, and the false positive stats adjustment is probably
lost in the noise. The primary motivation is to play nice with TDX's
SEPT in the long term. SEAMCALLs (to program SEPT entries) are quite
costly, e.g. thousands of cycles, and a spurious SEPT update will result
in a SEAMCALL error (which KVM will ideally treat as fatal).
Reported-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923220425.18402-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce RET_PF_FIXED and RET_PF_SPURIOUS to provide unique return
values instead of overloading RET_PF_RETRY. In the short term, the
unique values add clarity to the code and RET_PF_SPURIOUS will be used
by set_spte() to avoid unnecessary work for spurious faults.
In the long term, TDX will use RET_PF_FIXED to deterministically map
memory during pre-boot. The page fault flow may bail early for benign
reasons, e.g. if the mmu_notifier fires for an unrelated address. With
only RET_PF_RETRY, it's impossible for the caller to distinguish between
"cool, page is mapped" and "darn, need to try again", and thus cannot
handle benign cases like the mmu_notifier retry.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923220425.18402-4-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Explicitly check for RET_PF_EMULATE instead of implicitly doing the same
by checking for !RET_PF_RETRY (RET_PF_INVALID is handled earlier). This
will adding new RET_PF_ types in future patches without breaking the
emulation path.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923220425.18402-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Exit to userspace with an error if the MMU is buggy and returns
RET_PF_INVALID when servicing a page fault. This will allow a future
patch to invert the emulation path, i.e. emulate only on RET_PF_EMULATE
instead of emulating on anything but RET_PF_RETRY. This technically
means that KVM will exit to userspace instead of emulating on
RET_PF_INVALID, but practically speaking it's a nop as the MMU never
returns RET_PF_INVALID.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923220425.18402-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Recursively zap all to-be-orphaned children, unsynced or otherwise, when
zapping a shadow page for a nested TDP MMU. KVM currently only zaps the
unsynced child pages, but not the synced ones. This can create problems
over time when running many nested guests because it leaves unlinked
pages which will not be freed until the page quota is hit. With the
default page quota of 20 shadow pages per 1000 guest pages, this looks
like a memory leak and can degrade MMU performance.
In a recent benchmark, substantial performance degradation was observed:
An L1 guest was booted with 64G memory.
2G nested Windows guests were booted, 10 at a time for 20
iterations. (200 total boots)
Windows was used in this benchmark because they touch all of their
memory on startup.
By the end of the benchmark, the nested guests were taking ~10% longer
to boot. With this patch there is no degradation in boot time.
Without this patch the benchmark ends with hundreds of thousands of
stale EPT02 pages cluttering up rmaps and the page hash map. As a
result, VM shutdown is also much slower: deleting memslot 0 was
observed to take over a minute. With this patch it takes just a
few miliseconds.
Cc: Peter Shier <pshier@google.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923221406.16297-3-sean.j.christopherson@intel.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the logic that controls whether or not FNAME(invlpg) needs to flush
fully into FNAME(invlpg) so that mmu_page_zap_pte() doesn't return a
value. This allows a future patch to redefine the return semantics for
mmu_page_zap_pte() so that it can recursively zap orphaned child shadow
pages for nested TDP MMUs.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923221406.16297-2-sean.j.christopherson@intel.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Hyper-V Synthetic timers require SynIC but we don't seem to check that
upon HV_X64_MSR_STIMER[X]_CONFIG/HV_X64_MSR_STIMER0_COUNT writes. Make
the behavior match synic_set_msr().
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200924145757.1035782-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
To make kvm_mmu_free_roots() a bit more readable, capture 'kvm' in a
local variable instead of doing vcpu->kvm over and over (and over).
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923191204.8410-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a helper function and several wrapping macros to consolidate the
copy-paste code in vmx_compute_secondary_exec_control() for adjusting
controls that are dependent on guest CPUID bits.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200925003011.21016-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename SECONDARY_EXEC_RDTSCP to SECONDARY_EXEC_ENABLE_RDTSCP in
preparation for consolidating the logic for adjusting secondary exec
controls based on the guest CPUID model.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923165048.20486-4-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If PCID is not exposed to the guest, clear INVPCID in the guest's CPUID
even if the VMCS INVPCID enable is not supported. This will allow
consolidating the secondary execution control adjustment code without
having to special case INVPCID.
Technically, this fixes a bug where !CPUID.PCID && CPUID.INVCPID would
result in unexpected guest behavior (#UD instead of #GP/#PF), but KVM
doesn't support exposing INVPCID if it's not supported in the VMCS, i.e.
such a config is broken/bogus no matter what.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923165048.20486-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename helpers for a few controls to conform to the more prevelant style
of cpu_has_vmx_<feature>(). Consistent names will allow adding macros
to consolidate the boilerplate code for adjusting secondary execution
controls.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923165048.20486-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The 'arch_haltpoll_disable' is used to disable guest halt poll.
Correct the comments.
Fixes: a1c4423b02 ("cpuidle-haltpoll: disable host side polling when kvm virtualized")
Signed-off-by: Li Qiang <liq3ea@163.com>
Message-Id: <20200924155800.4939-1-liq3ea@163.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use kvm_vcpu_is_illegal_gpa() to check for a legal GPA when validating a
PT output base instead of open coding a clever, but difficult to read,
variant. Code readability is far more important than shaving a few uops
in a slow path.
No functional change intended.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200924194250.19137-6-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename kvm_mmu_is_illegal_gpa() to kvm_vcpu_is_illegal_gpa() and move it
to cpuid.h so that's it's colocated with cpuid_maxphyaddr(). The helper
is not MMU specific and will gain a user that is completely unrelated to
the MMU in a future patch.
No functional change intended.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200924194250.19137-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Replace the subtly not-a-constant MSR_IA32_RTIT_OUTPUT_BASE_MASK with a
proper helper function to check whether or not the specified base is
valid. Blindly referencing the local 'vcpu' is especially nasty.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200924194250.19137-4-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Stop exporting cpuid_query_maxphyaddr() now that it's not being abused
by VMX.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200924194250.19137-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use cpuid_maxphyaddr() instead of cpuid_query_maxphyaddr() for the
RTIT base MSR check. There is no reason to recompute MAXPHYADDR as the
precomputed version is synchronized with CPUID updates, and
MSR_IA32_RTIT_OUTPUT_BASE is not written between stuffing CPUID and
refreshing vcpu->arch.maxphyaddr.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200924194250.19137-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The INVD instruction is emulated as a NOP, just skip the instruction
instead.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <addd41be2fbf50f5f4059e990a2a0cff182d2136.1600972918.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Very similar content is present in four comments in sev.c. Unfortunately
there are small differences that make it harder to place the comment
in sev_clflush_pages itself, but at least we can make it more concise.
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The LAUNCH_SECRET command performs encryption of the
launch secret memory contents. Mark pinned pages as
dirty, before unpinning them.
This matches the logic in sev_launch_update_data().
Signed-off-by: Cfir Cohen <cfir@google.com>
Message-Id: <20200808003746.66687-1-cfir@google.com>
Reviewed-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add tracepoints for the early consistency checks in nested_vmx_run().
The "VMLAUNCH vs. VMRESUME" check in particular is useful to trace, as
there is no architectural way to check VMCS.LAUNCH_STATE, and subtle
bugs such as VMCLEAR on the wrong HPA can lead to confusing errors in
the L1 VMM.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200812180615.22372-1-sean.j.christopherson@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 761e416934 created a wrong mask for the
CR3 MBZ bits. According to APM vol 2, only the upper 12 bits are MBZ.
Fixes: 761e416934 ("KVM: nSVM: Check that MBZ bits in CR3 and CR4 are not set on vmrun of nested guests", 2020-07-08)
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Message-Id: <20200829004824.4577-2-krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Per Intel's SDM, RDPID takes a #UD if it is unsupported, which is more or
less what KVM is emulating when MSR_TSC_AUX is not available. In fact,
there are no scenarios in which RDPID is supposed to #GP.
Fixes: fb6d4d340e ("KVM: x86: emulate RDPID")
Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Message-Id: <1598581422-76264-1-git-send-email-robert.hu@linux.intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On successful nested VM-Enter, check for pending interrupts and convert
the highest priority interrupt to a pending posted interrupt if it
matches L2's notification vector. If the vCPU receives a notification
interrupt before nested VM-Enter (assuming L1 disables IRQs before doing
VM-Enter), the pending interrupt (for L1) should be recognized and
processed as a posted interrupt when interrupts become unblocked after
VM-Enter to L2.
This fixes a bug where L1/L2 will get stuck in an infinite loop if L1 is
trying to inject an interrupt into L2 by setting the appropriate bit in
L2's PIR and sending a self-IPI prior to VM-Enter (as opposed to KVM's
method of manually moving the vector from PIR->vIRR/RVI). KVM will
observe the IPI while the vCPU is in L1 context and so won't immediately
morph it to a posted interrupt for L2. The pending interrupt will be
seen by vmx_check_nested_events(), cause KVM to force an immediate exit
after nested VM-Enter, and eventually be reflected to L1 as a VM-Exit.
After handling the VM-Exit, L1 will see that L2 has a pending interrupt
in PIR, send another IPI, and repeat until L2 is killed.
Note, posted interrupts require virtual interrupt deliveriy, and virtual
interrupt delivery requires exit-on-interrupt, ergo interrupts will be
unconditionally unmasked on VM-Enter if posted interrupts are enabled.
Fixes: 705699a139 ("KVM: nVMX: Enable nested posted interrupt processing")
Cc: stable@vger.kernel.org
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200812175129.12172-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add trace_kvm_cr_write and trace_kvm_cr_read for svm.
Signed-off-by: Haiwei Li <lihaiwei@tencent.com>
Message-Id: <f3031602-db3b-c4fe-b719-d402663b0a2b@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Analyze is_guest_mode() in svm_vcpu_run() instead of svm_exit_handlers_fastpath()
in conformity with VMX version.
Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1600066548-4343-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rework NMI VM-Exit handling to invoke the kernel handler by function
call instead of INTn. INTn microcode is relatively expensive, and
aligning the IRQ and NMI handling will make it easier to update KVM
should some newfangled method for invoking the handlers come along.
Suggested-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200915191505.10355-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the asm blob that invokes the appropriate IRQ handler after VM-Exit
into a proper subroutine. Unconditionally create a stack frame in the
subroutine so that, as objtool sees things, the function has standard
stack behavior. The dynamic stack adjustment makes using unwind hints
problematic.
Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200915191505.10355-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Replace the existing kvm_x86_ops.need_emulation_on_page_fault() with a
more generic is_emulatable(), and unconditionally call the new function
in x86_emulate_instruction().
KVM will use the generic hook to support multiple security related
technologies that prevent emulation in one way or another. Similar to
the existing AMD #NPF case where emulation of the current instruction is
not possible due to lack of information, AMD's SEV-ES and Intel's SGX
and TDX will introduce scenarios where emulation is impossible due to
the guest's register state being inaccessible. And again similar to the
existing #NPF case, emulation can be initiated by kvm_mmu_page_fault(),
i.e. outside of the control of vendor-specific code.
While the cause and architecturally visible behavior of the various
cases are different, e.g. SGX will inject a #UD, AMD #NPF is a clean
resume or complete shutdown, and SEV-ES and TDX "return" an error, the
impact on the common emulation code is identical: KVM must stop
emulation immediately and resume the guest.
Query is_emulatable() in handle_ud() as well so that the
force_emulation_prefix code doesn't incorrectly modify RIP before
calling emulate_instruction() in the absurdly unlikely scenario that
KVM encounters forced emulation in conjunction with "do not emulate".
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200915232702.15945-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use __GFP_ZERO while alloc_page().
Signed-off-by: Haiwei Li <lihaiwei@tencent.com>
Message-Id: <20200916083621.5512-1-lihaiwei.kernel@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, prepare_vmcs02_early() does not check if the "unrestricted guest"
VM-execution control in vmcs12 is turned off and leaves the corresponding
bit on in vmcs02. Due to this setting, vmentry checks which are supposed to
render the nested guest state as invalid when this VM-execution control is
not set, are passing in hardware.
This patch turns off the "unrestricted guest" VM-execution control in vmcs02
if vmcs12 has turned it off.
Suggested-by: Jim Mattson <jmattson@google.com>
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Message-Id: <20200921081027.23047-2-krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
MSR reads/writes should always access the L1 state, since the (nested)
hypervisor should intercept all the msrs it wants to adjust, and these
that it doesn't should be read by the guest as if the host had read it.
However IA32_TSC is an exception. Even when not intercepted, guest still
reads the value + TSC offset.
The write however does not take any TSC offset into account.
This is documented in Intel's SDM and seems also to happen on AMD as well.
This creates a problem when userspace wants to read the IA32_TSC value and then
write it. (e.g for migration)
In this case it reads L2 value but write is interpreted as an L1 value.
To fix this make the userspace initiated reads of IA32_TSC return L1 value
as well.
Huge thanks to Dave Gilbert for helping me understand this very confusing
semantic of MSR writes.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20200921103805.9102-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The following intercept bit has been added to support VMEXIT
for INVPCID instruction:
Code Name Cause
A2h VMEXIT_INVPCID INVPCID instruction
The following bit has been added to the VMCB layout control area
to control intercept of INVPCID:
Byte Offset Bit(s) Function
14h 2 intercept INVPCID
Enable the interceptions when the the guest is running with shadow
page table enabled and handle the tlbflush based on the invpcid
instruction type.
For the guests with nested page table (NPT) support, the INVPCID
feature works as running it natively. KVM does not need to do any
special handling in this case.
AMD documentation for INVPCID feature is available at "AMD64
Architecture Programmer’s Manual Volume 2: System Programming,
Pub. 24593 Rev. 3.34(or later)"
The documentation can be obtained at the links below:
Link: https://www.amd.com/system/files/TechDocs/24593.pdf
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <159985255929.11252.17346684135277453258.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
INVPCID instruction handling is mostly same across both VMX and
SVM. So, move the code to common x86.c.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <159985255212.11252.10322694343971983487.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Handling of kvm_read/write_guest_virt*() errors can be moved to common
code. The same code can be used by both VMX and SVM.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <159985254493.11252.6603092560732507607.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The new intercept bits have been added in vmcb control area to support
few more interceptions. Here are the some of them.
- INTERCEPT_INVLPGB,
- INTERCEPT_INVLPGB_ILLEGAL,
- INTERCEPT_INVPCID,
- INTERCEPT_MCOMMIT,
- INTERCEPT_TLBSYNC,
Add a new intercept word in vmcb_control_area to support these instructions.
Also update kvm_nested_vmrun trace function to support the new addition.
AMD documentation for these instructions is available at "AMD64
Architecture Programmer’s Manual Volume 2: System Programming, Pub. 24593
Rev. 3.34(or later)"
The documentation can be obtained at the links below:
Link: https://www.amd.com/system/files/TechDocs/24593.pdf
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <159985251547.11252.16994139329949066945.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Convert all the intercepts to one array of 32 bit vectors in
vmcb_control_area. This makes it easy for future intercept vector
additions. Also update trace functions.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <159985250813.11252.5736581193881040525.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Modify intercept_exceptions to generic intercepts in vmcb_control_area. Use
the generic vmcb_set_intercept, vmcb_clr_intercept and vmcb_is_intercept to
set/clear/test the intercept_exceptions bits.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <159985250037.11252.1361972528657052410.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Modify intercept_dr to generic intercepts in vmcb_control_area. Use
the generic vmcb_set_intercept, vmcb_clr_intercept and vmcb_is_intercept
to set/clear/test the intercept_dr bits.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <159985249255.11252.10000868032136333355.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Change intercept_cr to generic intercepts in vmcb_control_area.
Use the new vmcb_set_intercept, vmcb_clr_intercept and vmcb_is_intercept
where applicable.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <159985248506.11252.9081085950784508671.stgit@bmoger-ubuntu>
[Change constant names. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This is in preparation for the future intercept vector additions.
Add new functions vmcb_set_intercept, vmcb_clr_intercept and vmcb_is_intercept
using kernel APIs __set_bit, __clear_bit and test_bit espectively.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <159985247876.11252.16039238014239824460.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
host_intercept_exceptions is not used anywhere. Clean it up.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <159985252277.11252.8819848322175521354.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Kernel coding style suggests not to use labels like error1,error2
Suggested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20200827171145.374620-6-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Another small refactoring.
Suggested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20200827171145.374620-5-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Replace svm_vcpu_init_msrpm with svm_vcpu_alloc_msrpm, that also allocates
the msr bitmap and add svm_vcpu_free_msrpm to free it.
This will be used later to move the nested msr permission bitmap allocation
to nested.c
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20200827171145.374620-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This is to be more consistient with VMX, and to support
upcoming addition of vmcb02
Hopefully no functional changes.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20200827171145.374620-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The 'page' is to hold the vcpu's vmcb so name it as such to
avoid confusion.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20200827171145.374620-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
All the checks in lapic_timer_int_injected(), __kvm_wait_lapic_expire(), and
these function calls waste cpu cycles when the timer mode is not tscdeadline.
We can observe ~1.3% world switch time overhead by kvm-unit-tests/vmexit.flat
vmcall testing on AMD server. This patch reduces the world switch latency
caused by timer_advance_ns feature when the timer mode is not tscdeadline by
simpling move the check against apic->lapic_timer.expired_tscdeadline much
earlier.
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1599731444-3525-7-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The kick after setting KVM_REQ_PENDING_TIMER is used to handle the timer
fires on a different pCPU which vCPU is running on. This kick costs about
1000 clock cycles and we don't need this when injecting already-expired
timer or when using the VMX preemption timer because
kvm_lapic_expired_hv_timer() is called from the target vCPU.
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1599731444-3525-6-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Check apic_lvtt_tscdeadline() mode directly instead of apic_lvtt_oneshot()
and apic_lvtt_period() to guarantee the timer is in tsc-deadline mode when
wrmsr MSR_IA32_TSCDEADLINE.
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1599731444-3525-3-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Return 0 when getting the tscdeadline timer if the lapic is hw disabled.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1599731444-3525-2-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There is missing apic map recalculation after updating DFR, if it is
INIT RESET, in x2apic mode, local apic is software enabled before.
This patch fix it by introducing the function kvm_apic_set_dfr() to
be called in INIT RESET handling path.
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1597827327-25055-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The nested VMX controls MSRs can be initialized by the global capability
values stored in vmcs_config.
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20200828085622.8365-6-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM supports the nested VM_{EXIT, ENTRY}_LOAD_IA32_PERF_GLOBAL_CTRL and
VM_{ENTRY_LOAD, EXIT_CLEAR}_BNDCFGS, but they are not exposed by the
system ioctl KVM_GET_MSR. Add them to the setup of nested VMX controls MSR.
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20200828085622.8365-2-chenyi.qiang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The save and ctl pointers are passed uninitialized to kfree() when
svm_set_nested_state() follows the 'goto out_set_gif' path. While the
issue could've been fixed by initializing these on-stack varialbles to
NULL, it seems preferable to eliminate 'out_set_gif' label completely as
it is not actually a failure path and duplicating a single svm_set_gif()
call doesn't look too bad.
[ bp: Drop obscure Addresses-Coverity: tag. ]
Fixes: 6ccbd29ade ("KVM: SVM: nested: Don't allocate VMCB structures on stack")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reported-by: Joerg Roedel <jroedel@suse.de>
Reported-by: Colin King <colin.king@canonical.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Acked-by: Joerg Roedel <jroedel@suse.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lkml.kernel.org/r/20200914133725.650221-1-vkuznets@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In the architecture independent version of hyperv-tlfs.h, commit c55a844f46
removed the "X64" in the symbol names so they would make sense for both x86 and
ARM64. That commit added aliases with the "X64" in the x86 version of hyperv-tlfs.h
so that existing x86 code would continue to compile.
As a cleanup, update the x86 code to use the symbols without the "X64", then remove
the aliases. There's no functional change.
Signed-off-by: Joseph Salisbury <joseph.salisbury@microsoft.com>
Link: https://lore.kernel.org/r/1601130386-11111-1-git-send-email-jsalisbury@linux.microsoft.com
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Switching the DMAR and HPET MSI code to use the generic MSI domain ops
missed to add the flag which tells the core code to update the domain
operations with the defaults. As a consequence the core code crashes
when an interrupt in one of those domains is allocated.
Add the missing flags.
Fixes: 9006c133a4 ("x86/msi: Use generic MSI domain ops")
Reported-by: Qian Cai <cai@redhat.com>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/87wo0fli8b.fsf@nanos.tec.linutronix.de
- Unbreak the magic 'search the timer interrupt' logic in IO/APIC code
which got wreckaged when the core interrupt code made the state
tracking logic stricter. That caused the interrupt line to stay masked
after switching from IO/APIC to PIC delivery mode, which obviously
prevents interrupts from being delivered.
- Make run_on_irqstack_code() typesafe. The function argument is a void
pointer which is then casted to 'void (*fun)(void *). This breaks
Control Flow Integrity checking in clang. Use proper helper functions
for the three variants reuqired.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl9wqnATHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoWEHD/402XehvgOXySy/KuhMgtSAoJ/OElvo
dnZYugCEh/6mllRbAhH0hkWpjxdhmCEYjWjf5Qqj01KYbQRSzYIhitmPyAI/Z2Dk
CqhvRj4Hko7AQGmF97mg/VLdwn6sDIslfprgo3o0pqZcuyDd0QqUgkKLgQpvGDXe
4l4OtiOoo2yHnr9wJiiMSVgHkXgp5QpTYAhnWV5ea9FokwO6/OLwjZAHgBh7MqNY
6hnUQIHL/v4K3aXdxOED55Irf05Yk4OqiXKgNMHNLqd0xf+DBTqlfzvQ+omZpMXY
sF0SML08r2f+KnBGuXlXDeu7MDRJlBO9Jxeq/mLcyLqP1K2nI2Ue3B6dy7AocKX3
p1aUu0TUtRpDNdhMZSHFon1cQfPc8pFZSVkdq0ke2CYd+HleIrklvQzSn7gGrQqJ
ZvWT9zs/XONXqxcMcSS1+XCQdJsRwIqbvY0Vxa68OIhZcKrsP7Ewln/sj46ghzOq
kTbxXyG58anK2ssalJqR148tsDlZBrBlo3/Bm3sITIoAtb8m+jeuqi1fZnvmfkAC
IiSEqY+p7CnNz9pYEA5T4GHDHE2luBA6YryclvCEAvi60XMZ+LW+T5szs6+fOjS+
y4RK2dGgcE0ewlvm0jzWKLvKMdi7iv5c9ndtE1W4qxy/K/uXKvfHe45SpmcFAyqx
FVjhT+kF/HKF1A==
=BcVu
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2020-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"Two fixes for the x86 interrupt code:
- Unbreak the magic 'search the timer interrupt' logic in IO/APIC
code which got wreckaged when the core interrupt code made the
state tracking logic stricter.
That caused the interrupt line to stay masked after switching from
IO/APIC to PIC delivery mode, which obviously prevents interrupts
from being delivered.
- Make run_on_irqstack_code() typesafe. The function argument is a
void pointer which is then cast to 'void (*fun)(void *).
This breaks Control Flow Integrity checking in clang. Use proper
helper functions for the three variants reuqired"
* tag 'x86-urgent-2020-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/ioapic: Unbreak check_timer()
x86/irq: Make run_on_irqstack_cond() typesafe
In the architecture independent version of hyperv-tlfs.h, commit c55a844f46
removed the "X64" in the symbol names so they would make sense for both x86 and
ARM64. That commit added aliases with the "X64" in the x86 version of hyperv-tlfs.h
so that existing x86 code would continue to compile.
As a cleanup, update the x86 code to use the symbols without the "X64", then remove
the aliases. There's no functional change.
Signed-off-by: Joseph Salisbury <joseph.salisbury@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/1601130386-11111-1-git-send-email-jsalisbury@linux.microsoft.com
If we copy less than 8 bytes and if the destination crosses a cache
line, __copy_user_flushcache would invalidate only the first cache line.
This patch makes it invalidate the second cache line as well.
Fixes: 0aed55af88 ("x86, uaccess: introduce copy_from_iter_flushcache for pmem / cache-bypass operations")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Dan Williams <dan.j.wiilliams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/alpine.LRH.2.02.2009161451140.21915@file01.intranet.prod.int.rdu2.redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
with a better API in 5.10 or 5.11, for now this is a fix
that works with existing userspace but keeps the current
ugly API.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl9ufLMUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroMe9AgAgU3YQ2SktkqEOXjHMLqCH5Y3PKFI
S2anYpoKlH36Q6kzoqtkCj0GVagvdh5+Envz3I/tMdhv3Y/JgZaX1wHAe4cUl9BT
VyoiDBTWkhYRmpUbLYA8AtmgxQw1Hp8srH86rnvVGmLG6zdAa/rgUAKiQgT688Ej
CQvF5H7Zi3viPo2rInNSkgTIgewduqSWkwJ6+h4AQMmNJpbRaeZs45yMYyyu/FIi
hUazy7Rwk2vkWcuTd/sqH9b9y3VCYpN9juRaehEiK8qxXT3ydTU4Tub25BHmvXdr
dx5pShG4P3nAGnfV1qKAemyQcY7sjfMieqN1F3QcsRcxqZgySUm11o2JRw==
=sHsX
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull more kvm fixes from Paolo Bonzini:
"Five small fixes.
The nested migration bug will be fixed with a better API in 5.10 or
5.11, for now this is a fix that works with existing userspace but
keeps the current ugly API"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: SVM: Add a dedicated INVD intercept routine
KVM: x86: Reset MMU context if guest toggles CR4.SMAP or CR4.PKE
KVM: x86: fix MSR_IA32_TSC read for nested migration
selftests: kvm: Fix assert failure in single-step test
KVM: x86: VMX: Make smaller physical guest address space support user-configurable
The INVD instruction intercept performs emulation. Emulation can't be done
on an SEV guest because the guest memory is encrypted.
Provide a dedicated intercept routine for the INVD intercept. And since
the instruction is emulated as a NOP, just skip it instead.
Fixes: 1654efcbc4 ("KVM: SVM: Add KVM_SEV_INIT command")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <a0b9a19ffa7fef86a3cc700c7ea01cb2731e04e5.1600972918.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use ghcb_set_sw_scratch() to set the GHCB scratch field, which will also
set the corresponding bit in the GHCB valid_bitmap field to denote that
sw_scratch is actually valid.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Link: https://lkml.kernel.org/r/ba84deabdf44a7a880454fb351d189c6ad79d4ba.1601041106.git.thomas.lendacky@amd.com
Reset the MMU context during kvm_set_cr4() if SMAP or PKE is toggled.
Recent commits to (correctly) not reload PDPTRs when SMAP/PKE are
toggled inadvertantly skipped the MMU context reset due to the mask
of bits that triggers PDPTR loads also being used to trigger MMU context
resets.
Fixes: 427890aff8 ("kvm: x86: Toggling CR4.SMAP does not load PDPTEs in PAE mode")
Fixes: cb957adb4e ("kvm: x86: Toggling CR4.PKE does not load PDPTEs in PAE mode")
Cc: Jim Mattson <jmattson@google.com>
Cc: Peter Shier <pshier@google.com>
Cc: Oliver Upton <oupton@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923215352.17756-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This API is the equivalent of alloc_pages, except that the returned memory
is guaranteed to be DMA addressable by the passed in device. The
implementation will also be used to provide a more sensible replacement
for DMA_ATTR_NON_CONSISTENT flag.
Additionally dma_alloc_noncoherent is switched over to use dma_alloc_pages
as its backend.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> (MIPS part)
MSR reads/writes should always access the L1 state, since the (nested)
hypervisor should intercept all the msrs it wants to adjust, and these
that it doesn't should be read by the guest as if the host had read it.
However IA32_TSC is an exception. Even when not intercepted, guest still
reads the value + TSC offset.
The write however does not take any TSC offset into account.
This is documented in Intel's SDM and seems also to happen on AMD as well.
This creates a problem when userspace wants to read the IA32_TSC value and then
write it. (e.g for migration)
In this case it reads L2 value but write is interpreted as an L1 value.
To fix this make the userspace initiated reads of IA32_TSC return L1 value
as well.
Huge thanks to Dave Gilbert for helping me understand this very confusing
semantic of MSR writes.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20200921103805.9102-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The Snow Ridge integrated PCIe3 uncore unit can be used to collect
performance data, e.g. utilization, between PCIe devices, plugged into
the PCIe port, and the components (in M2IOSF) responsible for
translating and managing requests to/from the device. The performance
data is very useful for analyzing the performance of PCIe devices.
The device with the PCIe3 uncore PMON units is owned by the portdrv_pci
driver. Create a PCI sub driver for the PCIe3 uncore PMON units.
Here are some difference between PCIe3 uncore unit and other uncore
pci units.
- There may be several Root Ports on a system. But the uncore counters
only exist in the Root Port A. A user can configure the channel mask
to collect the data from other Root Ports.
- The event format of the PCIe3 uncore unit is the same as IIO unit of
SKX.
- The Control Register of PCIe3 uncore unit is 64 bits.
- The offset of each counters is 8, which is the same as M2M unit of
SNR.
- New MSR addresses for unit control, counter and counter config.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1600094060-82746-7-git-send-email-kan.liang@linux.intel.com
Some uncore counters may be located in the configuration space of a PCI
device, which already has a bonded driver. Currently, the uncore driver
cannot register a PCI uncore PMU for these counters, because, to
register a PCI uncore PMU, the uncore driver must be bond to the device.
However, one device can only have one bonded driver.
Add an uncore PCI sub driver to support such kind of devices.
The sub driver doesn't own the device. In initialization, the sub
driver searches the device via pci_get_device(), and register the
corresponding PMU for the device. In the meantime, the sub driver
registers a PCI bus notifier, which is used to notify the sub driver
once the device is removed. The sub driver can unregister the PMU
accordingly.
The sub driver only searches the devices defined in its id table. The
id table varies on different platforms, which will be implemented in the
following platform-specific patch.
Suggested-by: Bjorn Helgaas <helgaas@kernel.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1600094060-82746-6-git-send-email-kan.liang@linux.intel.com
The PMU unregistration in the uncore PCI sub driver is similar as the
normal PMU unregistration for a PCI device. The codes to unregister a
PCI PMU can be shared.
Factor out uncore_pci_pmu_unregister(), which will be used later.
Use uncore_pci_get_dev_die_info() to replace the codes which retrieve
the socket and die informaion.
The pci_set_drvdata() is not included in uncore_pci_pmu_unregister() as
well, because the uncore PCI sub driver will not touch the private
driver data pointer of the device.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1600094060-82746-5-git-send-email-kan.liang@linux.intel.com
The PMU registration in the uncore PCI sub driver is similar as the
normal PMU registration for a PCI device. The codes to register a PCI
PMU can be shared.
Factor out uncore_pci_pmu_register(), which will be used later.
The pci_set_drvdata() is not included in uncore_pci_pmu_register(). The
uncore PCI sub driver doesn't own the PCI device. It will not touch the
private driver data pointer for the device.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1600094060-82746-4-git-send-email-kan.liang@linux.intel.com
When an uncore PCI sub driver gets a remove notification, the
corresponding PMU has to be retrieved and unregistered. The codes, which
find the corresponding PMU by comparing the pci_device_id table, can be
shared.
Factor out uncore_pci_find_dev_pmu(), which will be used later.
There is no functional change.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1600094060-82746-3-git-send-email-kan.liang@linux.intel.com
The socket and die information is required to register/unregister a PMU
in the uncore PCI sub driver. The codes, which get the socket and die
information from a BUS number, can be shared.
Factor out uncore_pci_get_dev_die_info(), which will be used later.
There is no functional change.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1600094060-82746-2-git-send-email-kan.liang@linux.intel.com
Previously, the uncore driver would say "NB counters detected" on F17h
machines, which don't have NorthBridge (NB) counters. They have Data
Fabric (DF) counters. Just use the pmu.name to inform users which pmu
to use and its associated counter count.
F17h dmesg BEFORE:
amd_uncore: AMD NB counters detected
amd_uncore: AMD LLC counters detected
F17h dmesg AFTER:
amd_uncore: 4 amd_df counters detected
amd_uncore: 6 amd_l3 counters detected
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200921144330.6331-5-kim.phillips@amd.com
On Family 19h, the driver checks for a populated 2-bit threadmask in
order to establish that the user wants to measure individual slices,
individual cores (only one can be measured at a time), and lets
the user also directly specify enallcores and/or enallslices if
desired.
Example F19h invocation to measure L3 accesses (event 4, umask 0xff)
by the first thread (id 0 -> mask 0x1) of the first core (id 0) on the
first slice (id 0):
perf stat -a -e instructions,amd_l3/umask=0xff,event=0x4,coreid=0,threadmask=1,sliceid=0,enallcores=0,enallslices=0/ <workload>
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200921144330.6331-4-kim.phillips@amd.com
Continue to fully populate either one of threadmask or slicemask if the
user doesn't.
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200921144330.6331-3-kim.phillips@amd.com
Replace AMD_FORMAT_ATTR with the more apropos DEFINE_UNCORE_FORMAT_ATTR
stolen from arch/x86/events/intel/uncore.h. This way we can clearly
see the bit-variants of each of the attributes that want to have
the same name across families.
Also unroll AMD_ATTRIBUTE because we are going to separately add
new attributes that differ between DF and L3.
Also clean up the if-Family 17h-else logic in amd_uncore_init.
This is basically a rewrite of commit da6adaea2b
("perf/x86/amd/uncore: Update sysfs attributes for Family17h processors").
No functional changes.
Tested F17h+ /sys/bus/event_source/devices/amd_{l3,df}/format/*
content remains unchanged:
/sys/bus/event_source/devices/amd_l3/format/event:config:0-7
/sys/bus/event_source/devices/amd_l3/format/umask:config:8-15
/sys/bus/event_source/devices/amd_df/format/event:config:0-7,32-35,59-60
/sys/bus/event_source/devices/amd_df/format/umask:config:8-15
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200921144330.6331-2-kim.phillips@amd.com
Several people reported in the kernel bugzilla that between v4.12 and v4.13
the magic which works around broken hardware and BIOSes to find the proper
timer interrupt delivery mode stopped working for some older affected
platforms which need to fall back to ExtINT delivery mode.
The reason is that the core code changed to keep track of the masked and
disabled state of an interrupt line more accurately to avoid the expensive
hardware operations.
That broke an assumption in i8259_make_irq() which invokes
disable_irq_nosync();
irq_set_chip_and_handler();
enable_irq();
Up to v4.12 this worked because enable_irq() unconditionally unmasked the
interrupt line, but after the state tracking improvements this is not
longer the case because the IO/APIC uses lazy disabling. So the line state
is unmasked which means that enable_irq() does not call into the new irq
chip to unmask it.
In principle this is a shortcoming of the core code, but it's more than
unclear whether the core code should try to reset state. At least this
cannot be done unconditionally as that would break other existing use cases
where the chip type is changed, e.g. when changing the trigger type, but
the callers expect the state to be preserved.
As the way how check_timer() is switching the delivery modes is truly
unique, the obvious fix is to simply unmask the i8259 manually after
changing the mode to ExtINT delivery and switching the irq chip to the
legacy PIC.
Note, that the fixes tag is not really precise, but identifies the commit
which broke the assumptions in the IO/APIC and i8259 code and that's the
kernel version to which this needs to be backported.
Fixes: bf22ff45be ("genirq: Avoid unnecessary low level irq function calls")
Reported-by: p_c_chan@hotmail.com
Reported-by: ecm4@mail.com
Reported-by: perdigao1@yahoo.com
Reported-by: matzes@users.sourceforge.net
Reported-by: rvelascog@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: p_c_chan@hotmail.com
Tested-by: matzes@users.sourceforge.net
Cc: stable@vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=197769
Alexei Starovoitov says:
====================
pull-request: bpf-next 2020-09-23
The following pull-request contains BPF updates for your *net-next* tree.
We've added 95 non-merge commits during the last 22 day(s) which contain
a total of 124 files changed, 4211 insertions(+), 2040 deletions(-).
The main changes are:
1) Full multi function support in libbpf, from Andrii.
2) Refactoring of function argument checks, from Lorenz.
3) Make bpf_tail_call compatible with functions (subprograms), from Maciej.
4) Program metadata support, from YiFei.
5) bpf iterator optimizations, from Yonghong.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch exposes allow_smaller_maxphyaddr to the user as a module parameter.
Since smaller physical address spaces are only supported on VMX, the
parameter is only exposed in the kvm_intel module.
For now disable support by default, and let the user decide if they want
to enable it.
Modifications to VMX page fault and EPT violation handling will depend
on whether that parameter is enabled.
Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
Message-Id: <20200903141122.72908-1-mgamal@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
compat_sys_mount is identical to the regular sys_mount now, so remove it
and use the native version everywhere.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Two minor conflicts:
1) net/ipv4/route.c, adding a new local variable while
moving another local variable and removing it's
initial assignment.
2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes.
One pretty prints the port mode differently, whilst another
changes the driver to try and obtain the port mode from
the port node rather than the switch node.
Signed-off-by: David S. Miller <davem@davemloft.net>
Sami reported that run_on_irqstack_cond() requires the caller to cast
functions to mismatching types, which trips indirect call Control-Flow
Integrity (CFI) in Clang.
Instead of disabling CFI on that function, provide proper helpers for
the three call variants. The actual ASM code stays the same as that is
out of reach.
[ bp: Fix __run_on_irqstack() prototype to match. ]
Fixes: 931b941459 ("x86/entry: Provide helpers for executing on the irqstack")
Reported-by: Nathan Chancellor <natechancellor@gmail.com>
Reported-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Sami Tolvanen <samitolvanen@google.com>
Cc: <stable@vger.kernel.org>
Link: https://github.com/ClangBuiltLinux/linux/issues/1052
Link: https://lkml.kernel.org/r/87pn6eb5tv.fsf@nanos.tec.linutronix.de
FPU initialization handles them currently. However, in the case
of clearcpuid=, some other early initialization code may check for
features before the FPU initialization code is called. Handling the
argument earlier allows the command line to influence those early
initializations.
Signed-off-by: Mike Hommey <mh@glandium.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200921215638.37980-1-mh@glandium.org
- fix fault on page table writes during instruction fetch
s390:
- doc improvement
x86:
- The obvious patches are always the ones that turn out to be
completely broken. /me hangs his head in shame.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl9nyjsUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroNpcAf/bsW1B+Q8QJnyfU4RSiX28lG8Ki9F
9A0aVJPW4U/x7COZuhldrQGkbHDA5agavCevghMuOqWkz2gs6ihpAGgzfG+FVIm7
2yi4k9A90kPrMSBf8qaLgvybGNO6uxGpJmv54MjHpkLPUEz+J1MuB9D6eEqBZkWz
ncOSsGS2eeUFpqulA9DCN3O3PbaFeAXPNJnDNGqxrGjV7CriosRlbK02PVxTQzvT
nuGzDgaOmmRXntIQ7hrk9DJlHm7gH2jH8TK9gB2xm0yuVm2/nNlpkY6rP6NDUdLs
OrJOzxWOcSO8HRgBhlFhED/8heTqHCJS1vMUI3M6Z62p324TjRDjRC+4ow==
=tpZU
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"ARM:
- fix fault on page table writes during instruction fetch
s390:
- doc improvement
x86:
- The obvious patches are always the ones that turn out to be
completely broken. /me hangs his head in shame"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
Revert "KVM: Check the allocation of pv cpu mask"
KVM: arm64: Remove S1PTW check from kvm_vcpu_dabt_iswrite()
KVM: arm64: Assume write fault on S1PTW permission fault on instruction fetch
docs: kvm: add documentation for KVM_CAP_S390_DIAG318
* Disable relocation relaxation for the compressed kernel when not built
as -pie as in that case kernels built with clang and linked with LLD
fail to boot due to the linker optimizing some instructions in non-PIE
form; the gory details in the commit message, from Arvind Sankar.
* A fix for the "bad bp value" warning issued by the frame-pointer
unwinder, from Josh Poimboeuf.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl9nqkgACgkQEsHwGGHe
VUo+IQ/+LkX6IfkBmNmyq2+XB7tzCFltBd5F1zTsJXYESPakQRQg085joHvqRAsY
uhxGiLHv4IcyReccshV8TsPYBuocFv4ULPw1VG8xxEUmUvs325USMTZH0rV/JkuF
px/S77IrjIrDsoHv5di2nhjoIRYQbzuKdNKRYvaBPgprhE8odbViVU4swNLRa2VN
jJ4JXva27ZqIIq6EPRnE8vHAk60MlzVhyutFi6YpiM/xvLkVAyQI97r6Pbk4gXVi
SPMS3y4uAb6BJ7kwp1ieFwK499hyCnXyQyn7bcmX/kBV8Br7T3Fhzw5Qrt7SFk2e
AG03DIf5beOoxpXE7GoPsAafeB6GIk8mSEeSoIlY+dnb6GZgksUK9uW2MMuKahxS
uIHjCMvlSs7R6FPw24clqzdYhKOIMnXrCMarrS7lit61dZqkoCEdKJEAsJ9+5Mz/
A+VWfFv6IzNm7xkgtV6yxNisK53k6DL67rf3ULxPOPLERUP2TZDtDEyhoShne5ok
41TJXDs8Ag5pcR1gpsE/vZDKT1ohO7DoYZCR3kJh5u/HT6nXPmoKx4ppi8aiUXUC
bg8lUevFX80rgrEgQA3v53DurZPxBAWx8EFAEuV5DSfb32D7NDwlPwbYVQuS6Un7
rZrnEhIIMVdlhCfqdnmVBt5PcN1fjQ0G2cdK36DSS6c2j5ZYSJw=
=og1U
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v5.9_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- A defconfig fix (Daniel Díaz)
- Disable relocation relaxation for the compressed kernel when not
built as -pie as in that case kernels built with clang and linked
with LLD fail to boot due to the linker optimizing some instructions
in non-PIE form; the gory details in the commit message (Arvind
Sankar)
- A fix for the "bad bp value" warning issued by the frame-pointer
unwinder (Josh Poimboeuf)
* tag 'x86_urgent_for_v5.9_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/unwind/fp: Fix FP unwinding in ret_from_fork
x86/boot/compressed: Disable relocation relaxation
x86/defconfigs: Explicitly unset CONFIG_64BIT in i386_defconfig
The commit 0f99022210 ("KVM: Check the allocation of pv cpu mask") we
have in 5.9-rc5 has two issue:
1) Compilation fails for !CONFIG_SMP, see:
https://bugzilla.kernel.org/show_bug.cgi?id=209285
2) This commit completely disables PV TLB flush, see
https://lore.kernel.org/kvm/87y2lrnnyf.fsf@vitty.brq.redhat.com/
The allocation problem is likely a theoretical one, if we don't
have memory that early in boot process we're likely doomed anyway.
Let's solve it properly later.
This reverts commit 0f99022210.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In some hardware implementations, coherency between the encrypted and
unencrypted mappings of the same physical page in a VM is enforced. In
such a system, it is not required for software to flush the VM's page
from all CPU caches in the system prior to changing the value of the
C-bit for the page.
So check that bit before flushing the cache.
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lkml.kernel.org/r/20200917212038.5090-4-krish.sadhukhan@oracle.com
- Add support for the Lakefield chip to the RAPL power capping
driver (Ricardo Neri).
- Modify the ACPI processor idle driver to prevent it from triggering
RCU-lockdep complaints which has started to happen after recent
changes in that area (Peter Zijlstra).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl9k1mwSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxuAcQAIY8XeVHRUD37n6LwRbyia3XNDXo0WxD
NWtRaSK6cRBCQVyHhILXwivJSLAsk0iVAMpuTzifb6AN9smnA7ZwhZhWEzwcKzXO
fBNuUPcyqiJqQYekUdSxMDmSyXVYCmRKHcdOfI6R0Yl028prx87U/h4ukWjyRLOZ
C/Ix7wTWxWvwcU0oo1FTjBZyFxTmMHzJXT9E9U4O69yR8yIWS67LCM30yAo8OQjx
hxyg4z5hJG+Z4z1SOJmH9luZ9egBjopU6PvZBhbGP1+DNdstKyeaOHnbDv+iNz5g
sofbQ28dWC1PykRy8YAnTz1VrjfYeiQUHTS3aci/v6pqbLMxrqKEpZ5K+815rpw3
KKPaD4WSXN/bn03yv1OxUJG/2vwrY9sVnqwVSgOqm7c0aTfmD9f0Y8VO0xk1bgUy
cwR3HzKb3Ig5xxQ7ultiEBV8oUKAu+SiG4gUGDmVxUR110N5jEZH9kqBw6nuAvNW
BYUVMyH0KRVb2J2r3y8kRla9gLv/MvExDoKZc3NQ6VmmKDXVBwGRPn3LpTVZHlLj
L3m2qQpTD+H+/ydGjxML7YNgW+07u7rjAxI6Qr2d2OgXyTFPXmKzI/wqWwuYI686
IUTRSxVv5O+WA2rNjuFBGLFE4GnkkBROrUWfpgse5ZeTX61ZG9fX4mN325yba9Hk
/MSuLChk+gZu
=gBKv
-----END PGP SIGNATURE-----
Merge tag 'pm-5.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"These add a new CPU ID to the RAPL power capping driver and prevent
the ACPI processor idle driver from triggering RCU-lockdep complaints.
Specifics:
- Add support for the Lakefield chip to the RAPL power capping driver
(Ricardo Neri).
- Modify the ACPI processor idle driver to prevent it from triggering
RCU-lockdep complaints which has started to happen after recent
changes in that area (Peter Zijlstra)"
* tag 'pm-5.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: processor: Take over RCU-idle for C3-BM idle
cpuidle: Allow cpuidle drivers to take over RCU-idle
ACPI: processor: Use CPUIDLE_FLAG_TLB_FLUSHED
ACPI: processor: Use CPUIDLE_FLAG_TIMER_STOP
powercap: RAPL: Add support for Lakefield
Currently the callback passed to arch_stack_walk() has an argument called
reliable passed to it to indicate if the stack entry is reliable, a comment
says that this is used by some printk() consumers. However in the current
kernel none of the arch_stack_walk() implementations ever set this flag to
true and the only callback implementation we have is in the generic
stacktrace code which ignores the flag. It therefore appears that this
flag is redundant so we can simplify and clarify things by removing it.
Signed-off-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Link: https://lore.kernel.org/r/20200914153409.25097-2-broonie@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
They do get called from the #MC handler which is already marked
"noinstr".
Commit
e2def7d49d ("x86/mce: Make mce_rdmsrl() panic on an inaccessible MSR")
already got rid of the instrumentation in the MSR accessors, fix the
annotation now too, in order to get rid of:
vmlinux.o: warning: objtool: do_machine_check()+0x4a: call to mce_rdmsrl() leaves .noinstr.text section
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200915194020.28807-1-bp@alien8.de
In some hardware implementations, coherency between the encrypted and
unencrypted mappings of the same physical page is enforced. In such a
system, it is not required for software to flush the page from all CPU
caches in the system prior to changing the value of the C-bit for the
page. So check that bit before flushing the cache.
[ bp: Massage commit message. ]
Suggested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200917212038.5090-3-krish.sadhukhan@oracle.com
In some hardware implementations, coherency between the encrypted and
unencrypted mappings of the same physical page is enforced. In such a system,
it is not required for software to flush the page from all CPU caches in the
system prior to changing the value of the C-bit for a page. This hardware-
enforced cache coherency is indicated by EAX[10] in CPUID leaf 0x8000001f.
[ bp: Use one of the free slots in word 3. ]
Suggested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200917212038.5090-2-krish.sadhukhan@oracle.com
There have been some reports of "bad bp value" warnings printed by the
frame pointer unwinder:
WARNING: kernel stack regs at 000000005bac7112 in sh:1014 has bad 'bp' value 0000000000000000
This warning happens when unwinding from an interrupt in
ret_from_fork(). If entry code gets interrupted, the state of the
frame pointer (rbp) may be undefined, which can confuse the unwinder,
resulting in warnings like the above.
There's an in_entry_code() check which normally silences such
warnings for entry code. But in this case, ret_from_fork() is getting
interrupted. It recently got moved out of .entry.text, so the
in_entry_code() check no longer works.
It could be moved back into .entry.text, but that would break the
noinstr validation because of the call to schedule_tail().
Instead, initialize each new task's RBP to point to the task's entry
regs via an encoded frame pointer. That will allow the unwinder to
reach the end of the stack gracefully.
Fixes: b9f6976bfb ("x86/entry/64: Move non entry code into .text section")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reported-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/f366bbf5a8d02e2318ee312f738112d0af74d16f.1600103007.git.jpoimboe@redhat.com
This commit serves two things:
1) it optimizes BPF prologue/epilogue generation
2) it makes possible to have tailcalls within BPF subprogram
Both points are related to each other since without 1), 2) could not be
achieved.
In [1], Alexei says:
"The prologue will look like:
nop5
xor eax,eax // two new bytes if bpf_tail_call() is used in this
// function
push rbp
mov rbp, rsp
sub rsp, rounded_stack_depth
push rax // zero init tail_call counter
variable number of push rbx,r13,r14,r15
Then bpf_tail_call will pop variable number rbx,..
and final 'pop rax'
Then 'add rsp, size_of_current_stack_frame'
jmp to next function and skip over 'nop5; xor eax,eax; push rpb; mov
rbp, rsp'
This way new function will set its own stack size and will init tail
call
counter with whatever value the parent had.
If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
Instead it would need to have 'nop2' in there."
Implement that suggestion.
Since the layout of stack is changed, tail call counter handling can not
rely anymore on popping it to rbx just like it have been handled for
constant prologue case and later overwrite of rbx with actual value of
rbx pushed to stack. Therefore, let's use one of the register (%rcx) that
is considered to be volatile/caller-saved and pop the value of tail call
counter in there in the epilogue.
Drop the BUILD_BUG_ON in emit_prologue and in
emit_bpf_tail_call_indirect where instruction layout is not constant
anymore.
Introduce new poke target, 'tailcall_bypass' to poke descriptor that is
dedicated for skipping the register pops and stack unwind that are
generated right before the actual jump to target program.
For case when the target program is not present, BPF program will skip
the pop instructions and nop5 dedicated for jmpq $target. An example of
such state when only R6 of callee saved registers is used by program:
ffffffffc0513aa1: e9 0e 00 00 00 jmpq 0xffffffffc0513ab4
ffffffffc0513aa6: 5b pop %rbx
ffffffffc0513aa7: 58 pop %rax
ffffffffc0513aa8: 48 81 c4 00 00 00 00 add $0x0,%rsp
ffffffffc0513aaf: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
ffffffffc0513ab4: 48 89 df mov %rbx,%rdi
When target program is inserted, the jump that was there to skip
pops/nop5 will become the nop5, so CPU will go over pops and do the
actual tailcall.
One might ask why there simply can not be pushes after the nop5?
In the following example snippet:
ffffffffc037030c: 48 89 fb mov %rdi,%rbx
(...)
ffffffffc0370332: 5b pop %rbx
ffffffffc0370333: 58 pop %rax
ffffffffc0370334: 48 81 c4 00 00 00 00 add $0x0,%rsp
ffffffffc037033b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
ffffffffc0370340: 48 81 ec 00 00 00 00 sub $0x0,%rsp
ffffffffc0370347: 50 push %rax
ffffffffc0370348: 53 push %rbx
ffffffffc0370349: 48 89 df mov %rbx,%rdi
ffffffffc037034c: e8 f7 21 00 00 callq 0xffffffffc0372548
There is the bpf2bpf call (at ffffffffc037034c) right after the tailcall
and jump target is not present. ctx is in %rbx register and BPF
subprogram that we will call into on ffffffffc037034c is relying on it,
e.g. it will pick ctx from there. Such code layout is therefore broken
as we would overwrite the content of %rbx with the value that was pushed
on the prologue. That is the reason for the 'bypass' approach.
Special care needs to be taken during the install/update/remove of
tailcall target. In case when target program is not present, the CPU
must not execute the pop instructions that precede the tailcall.
To address that, the following states can be defined:
A nop, unwind, nop
B nop, unwind, tail
C skip, unwind, nop
D skip, unwind, tail
A is forbidden (lead to incorrectness). The state transitions between
tailcall install/update/remove will work as follows:
First install tail call f: C->D->B(f)
* poke the tailcall, after that get rid of the skip
Update tail call f to f': B(f)->B(f')
* poke the tailcall (poke->tailcall_target) and do NOT touch the
poke->tailcall_bypass
Remove tail call: B(f')->C(f')
* poke->tailcall_bypass is poked back to jump, then we wait the RCU
grace period so that other programs will finish its execution and
after that we are safe to remove the poke->tailcall_target
Install new tail call (f''): C(f')->D(f'')->B(f'').
* same as first step
This way CPU can never be exposed to "unwind, tail" state.
Last but not least, when tailcalls get mixed with bpf2bpf calls, it
would be possible to encounter the endless loop due to clearing the
tailcall counter if for example we would use the tailcall3-like from BPF
selftests program that would be subprogram-based, meaning the tailcall
would be present within the BPF subprogram.
This test, broken down to particular steps, would do:
entry -> set tailcall counter to 0, bump it by 1, tailcall to func0
func0 -> call subprog_tail
(we are NOT skipping the first 11 bytes of prologue and this subprogram
has a tailcall, therefore we clear the counter...)
subprog -> do the same thing as entry
and then loop forever.
To address this, the idea is to go through the call chain of bpf2bpf progs
and look for a tailcall presence throughout whole chain. If we saw a single
tail call then each node in this call chain needs to be marked as a subprog
that can reach the tailcall. We would later feed the JIT with this info
and:
- set eax to 0 only when tailcall is reachable and this is the entry prog
- if tailcall is reachable but there's no tailcall in insns of currently
JITed prog then push rax anyway, so that it will be possible to
propagate further down the call chain
- finally if tailcall is reachable, then we need to precede the 'call'
insn with mov rax, [rbp - (stack_depth + 8)]
Tail call related cases from test_verifier kselftest are also working
fine. Sample BPF programs that utilize tail calls (sockex3, tracex5)
work properly as well.
[1]: https://lore.kernel.org/bpf/20200517043227.2gpq22ifoq37ogst@ast-mbp.dhcp.thefacebook.com/
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reflect the actual purpose of poke->ip and rename it to
poke->tailcall_target so that it will not the be confused with another
poke target that will be introduced in next commit.
While at it, do the same thing with poke->ip_stable - rename it to
poke->tailcall_target_stable.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, %rax is used to store the jump target when BPF program is
emitting the retpoline instructions that are handling the indirect
tailcall.
There is a plan to use %rax for different purpose, which is storing the
tail call counter. In order to preserve this value across the tailcalls,
adjust the BPF indirect tailcalls so that the target program will reside
in %rcx and teach the retpoline instructions about new location of jump
target.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
A PASID is allocated for an "mm" the first time any thread binds to an
SVA-capable device and is freed from the "mm" when the SVA is unbound
by the last thread. It's possible for the "mm" to have different PASID
values in different binding/unbinding SVA cycles.
The mm's PASID (non-zero for valid PASID or 0 for invalid PASID) is
propagated to a per-thread PASID MSR for all threads within the mm
through IPI, context switch, or inherited. This is done to ensure that a
running thread has the right PASID in the MSR matching the mm's PASID.
[ bp: s/SVM/SVA/g; massage. ]
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/1600187413-163670-10-git-send-email-fenghua.yu@intel.com
Currently, the ENQCMD feature depends on CONFIG_IOMMU_SUPPORT. Add
X86_FEATURE_ENQCMD to the disabled features mask so that it gets
disabled when the IOMMU config option above is not enabled.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/1600187413-163670-9-git-send-email-fenghua.yu@intel.com
The IA32_PASID MSR (0xd93) contains the Process Address Space Identifier
(PASID), a 20-bit value. Bit 31 must be set to indicate the value
programmed in the MSR is valid. Hardware uses the PASID to identify a
process address space and direct responses to the right address space.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/1600187413-163670-7-git-send-email-fenghua.yu@intel.com
The ENQCMD instruction reads a PASID from the IA32_PASID MSR. The
MSR is stored in the task's supervisor XSAVE* PASID state and is
context-switched by XSAVES/XRSTORS.
[ bp: Add (in-)definite articles and massage. ]
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/1600187413-163670-6-git-send-email-fenghua.yu@intel.com
Work submission instruction comes in two flavors. ENQCMD can be called
both in ring 3 and ring 0 and always uses the contents of a PASID MSR
when shipping the command to the device. ENQCMDS allows a kernel driver
to submit commands on behalf of a user process. The driver supplies the
PASID value in ENQCMDS. There isn't any usage of ENQCMD in the kernel as
of now.
The CPU feature flag is shown as "enqcmd" in /proc/cpuinfo.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/1600187413-163670-5-git-send-email-fenghua.yu@intel.com
Fold the misaligned u64 workarounds into the main quotactl flow instead
of implementing a separate compat syscall handler.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a helper to check if the calling syscall needs a fixup for
non-natural 64-bit type alignment in the compat ABI. This will only
return true for i386 syscalls on x86_64.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
lift the compat_s64 and compat_u64 definitions into common code using the
COMPAT_FOR_U64_ALIGNMENT symbol for the x86 special case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The new field 'dma_range_map' in struct device is used to facilitate the
use of single or multiple offsets between mapping regions of cpu addrs and
dma addrs. It subsumes the role of "dev->dma_pfn_offset" which was only
capable of holding a single uniform offset and had no region bounds
checking.
The function of_dma_get_range() has been modified so that it takes a single
argument -- the device node -- and returns a map, NULL, or an error code.
The map is an array that holds the information regarding the DMA regions.
Each range entry contains the address offset, the cpu_start address, the
dma_start address, and the size of the region.
of_dma_configure() is the typical manner to set range offsets but there are
a number of ad hoc assignments to "dev->dma_pfn_offset" in the kernel
driver code. These cases now invoke the function
dma_direct_set_offset(dev, cpu_addr, dma_addr, size).
Signed-off-by: Jim Quinlan <james.quinlan@broadcom.com>
[hch: various interface cleanups]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Tested-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Make acpi_processor_idle() use the generic TLB flushing code.
This again removes RCU usage after rcu_idle_enter().
(XXX make every C3 invalidate TLBs, not just C3-BM)
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Because of system-specific EFI firmware limitations, EFI volatile
variables may not be capable of holding the required contents of
the Machine Owner Key (MOK) certificate store when the certificate
list grows above some size. Therefore, an EFI boot loader may pass
the MOK certs via a EFI configuration table created specifically for
this purpose to avoid this firmware limitation.
An EFI configuration table is a much more primitive mechanism
compared to EFI variables and is well suited for one-way passage
of static information from a pre-OS environment to the kernel.
This patch adds initial kernel support to recognize, parse,
and validate the EFI MOK configuration table, where named
entries contain the same data that would otherwise be provided
in similarly named EFI variables.
Additionally, this patch creates a sysfs binary file for each
EFI MOK configuration table entry found. These files are read-only
to root and are provided for use by user space utilities such as
mokutil.
A subsequent patch will load MOK certs into the trusted platform
key ring using this infrastructure.
Signed-off-by: Lenny Szubowicz <lszubowi@redhat.com>
Link: https://lore.kernel.org/r/20200905013107.10457-2-lszubowi@redhat.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Get rid of all the gunk and remove the 'select PCI_MSI_ARCH_FALLBACK' from
the x86 Kconfig so the weak functions in the PCI core are replaced by stubs
which emit a warning, which ensures that any fail to set the irq domain
pointer results in a warning when the device is used.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200826112334.086003720@linutronix.de
The arch_.*_msi_irq[s] fallbacks are compiled in whether an architecture
requires them or not. Architectures which are fully utilizing hierarchical
irq domains should never call into that code.
It's not only architectures which depend on that by implementing one or
more of the weak functions, there is also a bunch of drivers which relies
on the weak functions which invoke msi_controller::setup_irq[s] and
msi_controller::teardown_irq.
Make the architectures and drivers which rely on them select them in Kconfig
and if not selected replace them by stub functions which emit a warning and
fail the PCI/MSI interrupt allocation.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200826112333.992429909@linutronix.de
Now that interrupt remapping sets the irqdomain pointer when a PCI device
is added it's possible to store the default irq domain in the device struct
in pcibios_add_device().
If the bus to which a device is connected has an irq domain associated then
this domain is used otherwise the default domain (PCI/MSI native or XEN
PCI/MSI) is used. Using the bus domain ensures that special MSI bus domains
like VMD work.
This makes XEN and the non-remapped native case work solely based on the
irq domain pointer in struct device for PCI/MSI and allows to remove the
arch fallback and make most of the x86_msi ops private to XEN in the next
steps.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200826112333.900423047@linutronix.de
To allow utilizing the irq domain pointer in struct device it is necessary
to make XEN/MSI irq domain compatible.
While the right solution would be to truly convert XEN to irq domains, this
is an exercise which is not possible for mere mortals with limited XENology.
Provide a plain irqdomain wrapper around XEN. While this is blatant
violation of the irqdomain design, it's the only solution for a XEN igorant
person to make progress on the issue which triggered this change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20200826112333.622352798@linutronix.de
X86 cannot store the irq domain pointer in struct device without breaking
XEN because the irq domain pointer takes precedence over arch_*_msi_irqs()
fallbacks.
To achieve this XEN MSI interrupt management needs to be wrapped into an
irq domain.
Move the x86_msi ops setup into a single function to prepare for this.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20200826112333.420224092@linutronix.de
X86 cannot store the irq domain pointer in struct device without breaking
XEN because the irq domain pointer takes precedence over arch_*_msi_irqs()
fallbacks.
XENs MSI teardown relies on default_teardown_msi_irqs() which invokes
arch_teardown_msi_irq(). default_teardown_msi_irqs() is a trivial iterator
over the msi entries associated to a device.
Implement this loop in xen_teardown_msi_irqs() to prepare for removal of
the fallbacks for X86.
This is a preparatory step to wrap XEN MSI alloc/free into a irq domain
which in turn allows to store the irq domain pointer in struct device and
to use the irq domain functions directly.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20200826112333.326841410@linutronix.de
The only user is in the same file and the name is too generic because this
function is only ever used for HVM domains.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Juergen Gross<jgross@suse.com>
Link: https://lore.kernel.org/r/20200826112333.234097629@linutronix.de
No point in initializing the default PCI/MSI interrupt domain early and no
point to create it when XEN PV/HVM/DOM0 are active.
Move the initialization to pci_arch_init() and convert it to init ops so
that XEN can override it as XEN has it's own PCI/MSI management. The XEN
override comes in a later step.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200826112332.859209894@linutronix.de
Adding a function call before the first #ifdef in arch_pci_init() triggers
a 'mixed declarations and code' warning if PCI_DIRECT is enabled.
Use stub functions and move the #ifdeffery to the header file where it is
not in the way.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200826112332.767707340@linutronix.de
pci_msi_get_hwirq() and pci_msi_set_desc are not longer special. Enable the
generic MSI domain ops in the core and PCI MSI code unconditionally and get
rid of the x86 specific implementations in the X86 MSI code and in the
hyperv PCI driver.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200826112332.564274859@linutronix.de
Convert the interrupt remap drivers to retrieve the pci device from the msi
descriptor and use info::hwirq.
This is the first step to prepare x86 for using the generic MSI domain ops.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Joerg Roedel <jroedel@suse.de>
Link: https://lore.kernel.org/r/20200826112332.466405395@linutronix.de
Retrieve the PCI device from the msi descriptor instead of doing so at the
call sites.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200826112332.352583299@linutronix.de
Move the UV specific fields into their own struct for readability sake. Get
rid of the #ifdeffery as it does not matter at all whether the alloc info
is a couple of bytes longer or not.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200826112332.255792469@linutronix.de
Move the IOAPIC specific fields into their own struct and reuse the common
devid. Get rid of the #ifdeffery as it does not matter at all whether the
alloc info is a couple of bytes longer or not.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Joerg Roedel <jroedel@suse.de>
Link: https://lore.kernel.org/r/20200826112332.054367732@linutronix.de
struct irq_alloc_info is a horrible zoo of unnamed structs in a union. Many
of the struct fields can be generic and don't have to be type specific like
hpet_id, ioapic_id...
Provide a generic set of members to prepare for the consolidation. The goal
is to make irq_alloc_info have the same basic member as the generic
msi_alloc_info so generic MSI domain ops can be reused and yet more mess
can be avoided when (non-PCI) device MSI support comes along.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200826112331.849577844@linutronix.de
Now that the iommu implementations handle the X86_*_GET_PARENT_DOMAIN
types, consolidate the two getter functions.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Joerg Roedel <jroedel@suse.de>
Link: https://lore.kernel.org/r/20200826112331.741909337@linutronix.de
irq_remapping_ir_irq_domain() is used to retrieve the remapping parent
domain for an allocation type. irq_remapping_irq_domain() is for retrieving
the actual device domain for allocating interrupts for a device.
The two functions are similar and can be unified by using explicit modes
for parent irq domain retrieval.
Add X86_IRQ_ALLOC_TYPE_IOAPIC/HPET_GET_PARENT and use it in the iommu
implementations. Drop the parent domain retrieval for PCI_MSI/X as that is
unused.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Joerg Roedel <jroedel@suse.de>
Link: https://lore.kernel.org/r/20200826112331.436350257@linutronix.de
Setting the irq_set_vcpu_affinity() callback to
irq_chip_set_vcpu_affinity_parent() is a pointless exercise because the
function which utilizes it searchs the domain hierarchy to find a parent
domain which has such a callback.
Remove the useless indirection.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200826112331.250130127@linutronix.de
Composing the MSI message at the MSI chip level is wrong because the
underlying parent domain is the one which knows how the message should be
composed for the direct vector delivery or the interrupt remapping table
entry.
The interrupt remapping aware PCI/MSI chip does that already. Make the
direct delivery chip do the same and move the composition of the direct
delivery MSI message to the vector domain irq chip.
This prepares for the upcoming device MSI support to avoid having
architecture specific knowledge in the device MSI domain irq chips.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200826112331.157603198@linutronix.de
The documentation of irq_chip_compose_msi_msg() claims that with
hierarchical irq domains the first chip in the hierarchy which has an
irq_compose_msi_msg() callback is chosen. But the code just keeps
iterating after it finds a chip with a compose callback.
The x86 HPET MSI implementation relies on that behaviour, but that does not
make it more correct.
The message should always be composed at the domain which manages the
underlying resource (e.g. APIC or remap table) because that domain knows
about the required layout of the message.
On X86 the following hierarchies exist:
1) vector -------- PCI/MSI
2) vector -- IR -- PCI/MSI
The vector domain has a different message format than the IR (remapping)
domain. So obviously the PCI/MSI domain can't compose the message without
having knowledge about the parent domain, which is exactly the opposite of
what hierarchical domains want to achieve.
X86 actually has two different PCI/MSI chips where #1 has a compose
callback and #2 does not. #2 delegates the composition to the remap domain
where it belongs, but #1 does it at the PCI/MSI level.
For the upcoming device MSI support it's necessary to change this and just
let the first domain which can compose the message take care of it. That
way the top level chip does not have to worry about it and the device MSI
code does not need special knowledge about topologies. It just sets the
compose callback to NULL and lets the hierarchy pick the first chip which
has one.
Due to that the attempt to move the compose callback from the direct
delivery PCI/MSI domain to the vector domain made the system fail to boot
with interrupt remapping enabled because in the remapping case
irq_chip_compose_msi_msg() keeps iterating and choses the compose callback
of the vector domain which obviously creates the wrong format for the remap
table.
Break out of the loop when the first irq chip with a compose callback is
found and fixup the HPET code temporarily. That workaround will be removed
once the direct delivery compose callback is moved to the place where it
belongs in the vector domain.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20200826112331.047917603@linutronix.de
The mcelog utility is not commonly used on AMD systems. Therefore,
errors logged only by the dev_mce_log() notifier will be missed. This
may occur if the EDAC modules are not loaded, in which case it's
preferable to print the error record by the default notifier.
However, the mce->kflags set by dev_mce_log() notifier makes the
default notifier skip over the errors assuming they are processed by
dev_mce_log().
Do not update kflags in the dev_mce_log() notifier on AMD systems.
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200903234531.162484-3-Smita.KoralahalliChannabasappa@amd.com
Back in commit:
20d51a426f ("x86/mce: Reuse one of the u16 padding fields in 'struct mce'")
a field was added to "struct mce" to save the computed error severity.
Make use of this in mce_reign() to avoid re-computing the severity
for every CPU.
In the case where the machine panics, one call to mce_severity() is
still needed in order to provide the correct message giving the reason
for the panic.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200908175519.14223-2-tony.luck@intel.com
Yunhai Zhang recently fixed a VGA software scrollback bug in commit
ebfdfeeae8 ("vgacon: Fix for missing check in scrollback handling"),
but that then made people look more closely at some of this code, and
there were more problems on the vgacon side, but also the fbcon software
scrollback.
We don't really have anybody who maintains this code - probably because
nobody actually _uses_ it any more. Sure, people still use both VGA and
the framebuffer consoles, but they are no longer the main user
interfaces to the kernel, and haven't been for decades, so these kinds
of extra features end up bitrotting and not really being used.
So rather than try to maintain a likely unused set of code, I'll just
aggressively remove it, and see if anybody even notices. Maybe there
are people who haven't jumped on the whole GUI badnwagon yet, and think
it's just a fad. And maybe those people use the scrollback code.
If that turns out to be the case, we can resurrect this again, once
we've found the sucker^Wmaintainer for it who actually uses it.
Reported-by: NopNop Nop <nopitydays@gmail.com>
Tested-by: Willy Tarreau <w@1wt.eu>
Cc: 张云海 <zhangyunhai@nsfocus.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Willy Tarreau <w@1wt.eu>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The x86-64 psABI [0] specifies special relocation types
(R_X86_64_[REX_]GOTPCRELX) for indirection through the Global Offset
Table, semantically equivalent to R_X86_64_GOTPCREL, which the linker
can take advantage of for optimization (relaxation) at link time. This
is supported by LLD and binutils versions 2.26 onwards.
The compressed kernel is position-independent code, however, when using
LLD or binutils versions before 2.27, it must be linked without the -pie
option. In this case, the linker may optimize certain instructions into
a non-position-independent form, by converting foo@GOTPCREL(%rip) to $foo.
This potential issue has been present with LLD and binutils-2.26 for a
long time, but it has never manifested itself before now:
- LLD and binutils-2.26 only relax
movq foo@GOTPCREL(%rip), %reg
to
leaq foo(%rip), %reg
which is still position-independent, rather than
mov $foo, %reg
which is permitted by the psABI when -pie is not enabled.
- GCC happens to only generate GOTPCREL relocations on mov instructions.
- CLang does generate GOTPCREL relocations on non-mov instructions, but
when building the compressed kernel, it uses its integrated assembler
(due to the redefinition of KBUILD_CFLAGS dropping -no-integrated-as),
which has so far defaulted to not generating the GOTPCRELX
relocations.
Nick Desaulniers reports [1,2]:
"A recent change [3] to a default value of configuration variable
(ENABLE_X86_RELAX_RELOCATIONS OFF -> ON) in LLVM now causes Clang's
integrated assembler to emit R_X86_64_GOTPCRELX/R_X86_64_REX_GOTPCRELX
relocations. LLD will relax instructions with these relocations based
on whether the image is being linked as position independent or not.
When not, then LLD will relax these instructions to use absolute
addressing mode (R_RELAX_GOT_PC_NOPIC). This causes kernels built with
Clang and linked with LLD to fail to boot."
Patch series [4] is a solution to allow the compressed kernel to be
linked with -pie unconditionally, but even if merged is unlikely to be
backported. As a simple solution that can be applied to stable as well,
prevent the assembler from generating the relaxed relocation types using
the -mrelax-relocations=no option. For ease of backporting, do this
unconditionally.
[0] https://gitlab.com/x86-psABIs/x86-64-ABI/-/blob/master/x86-64-ABI/linker-optimization.tex#L65
[1] https://lore.kernel.org/lkml/20200807194100.3570838-1-ndesaulniers@google.com/
[2] https://github.com/ClangBuiltLinux/linux/issues/1121
[3] https://reviews.llvm.org/rGc41a18cf61790fc898dcda1055c3efbf442c14c0
[4] https://lore.kernel.org/lkml/20200731202738.2577854-1-nivedita@alum.mit.edu/
Reported-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200812004308.1448603-1-nivedita@alum.mit.edu
- Multiple stolen time fixes, with a new capability to match x86
- Fix for hugetlbfs mappings when PUD and PMD are the same level
- Fix for hugetlbfs mappings when PTE mappings are enforced
(dirty logging, for example)
- Fix tracing output of 64bit values
x86:
- nSVM state restore fixes
- Async page fault fixes
- Lots of small fixes everywhere
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl9dM5kUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroM+Iwf+LbISO7ccpPMK1kKtOeug/jZv+xQA
sVaBGRzYo+k2e0XtV8E8IV4N30FBtYSwXsbBKkMAoy2FpmMebgDWDQ7xspb6RJMS
/y8t1iqPwdOaLIkUkgc7UihSTlZm05Es3f3q6uZ9+oaM4Fe+V7xWzTUX4Oy89JO7
KcQsTD7pMqS4bfZGADK781ITR/WPgCi0aYx5s6dcwcZAQXhb1K1UKEjB8OGKnjUh
jliReJtxRA16rjF+S5aJ7L07Ce/ksrfwkI4NXJ4GxW+lyOfVNdSBJUBaZt1m7G2M
1We5+i5EjKCjuxmgtUUUfVdazpj1yl+gBGT7KKkLte9T9WZdXyDnixAbvg==
=OFb3
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"A bit on the bigger side, mostly due to me being on vacation, then
busy, then on parental leave, but there's nothing worrisome.
ARM:
- Multiple stolen time fixes, with a new capability to match x86
- Fix for hugetlbfs mappings when PUD and PMD are the same level
- Fix for hugetlbfs mappings when PTE mappings are enforced (dirty
logging, for example)
- Fix tracing output of 64bit values
x86:
- nSVM state restore fixes
- Async page fault fixes
- Lots of small fixes everywhere"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits)
KVM: emulator: more strict rsm checks.
KVM: nSVM: more strict SMM checks when returning to nested guest
SVM: nSVM: setup nested msr permission bitmap on nested state load
SVM: nSVM: correctly restore GIF on vmexit from nesting after migration
x86/kvm: don't forget to ACK async PF IRQ
x86/kvm: properly use DEFINE_IDTENTRY_SYSVEC() macro
KVM: VMX: Don't freeze guest when event delivery causes an APIC-access exit
KVM: SVM: avoid emulation with stale next_rip
KVM: x86: always allow writing '0' to MSR_KVM_ASYNC_PF_EN
KVM: SVM: Periodically schedule when unregistering regions on destroy
KVM: MIPS: Change the definition of kvm type
kvm x86/mmu: use KVM_REQ_MMU_SYNC to sync when needed
KVM: nVMX: Fix the update value of nested load IA32_PERF_GLOBAL_CTRL control
KVM: fix memory leak in kvm_io_bus_unregister_dev()
KVM: Check the allocation of pv cpu mask
KVM: nVMX: Update VMCS02 when L2 PAE PDPTE updates detected
KVM: arm64: Update page shift if stage 2 block mapping not supported
KVM: arm64: Fix address truncation in traces
KVM: arm64: Do not try to map PUDs when they are folded into PMD
arm64/x86: KVM: Introduce steal-time cap
...
Don't ignore return values in rsm_load_state_64/32 to avoid
loading invalid state from SMM state area if it was tampered with
by the guest.
This is primarly intended to avoid letting guest set bits in EFER
(like EFER.SVME when nesting is disabled) by manipulating SMM save area.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20200827171145.374620-8-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* check that guest is 64 bit guest, otherwise the SVM related fields
in the smm state area are not defined
* If the SMM area indicates that SMM interrupted a running guest,
check that EFER.SVME which is also saved in this area is set, otherwise
the guest might have tampered with SMM save area, and so indicate
emulation failure which should triple fault the guest.
* Check that that guest CPUID supports SVM (due to the same issue as above)
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20200827162720.278690-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This code was missing and was forcing the L2 run with L1's msr
permission bitmap
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20200827162720.278690-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently code in svm_set_nested_state copies the current vmcb control
area to L1 control area (hsave->control), under assumption that
it mostly reflects the defaults that kvm choose, and later qemu
overrides these defaults with L2 state using standard KVM interfaces,
like KVM_SET_REGS.
However nested GIF (which is AMD specific thing) is by default is true,
and it is copied to hsave area as such.
This alone is not a big deal since on VMexit, GIF is always set to false,
regardless of what it was on VM entry. However in nested_svm_vmexit we
were first were setting GIF to false, but then we overwrite the control
fields with value from the hsave area. (including the nested GIF field
itself if GIF virtualization is enabled).
Now on normal vm entry this is not a problem, since GIF is usually false
prior to normal vm entry, and this is the value that copied to hsave,
and then restored, but this is not always the case when the nested state
is loaded as explained above.
To fix this issue, move svm_set_gif after we restore the L1 control
state in nested_svm_vmexit, so that even with wrong GIF in the
saved L1 control area, we still clear GIF as the spec says.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20200827162720.278690-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Merge commit 26d05b368a ("Merge branch 'kvm-async-pf-int' into HEAD")
tried to adapt the new interrupt based async PF mechanism to the newly
introduced IDTENTRY magic but unfortunately it missed the fact that
DEFINE_IDTENTRY_SYSVEC() doesn't call ack_APIC_irq() on its own and
all DEFINE_IDTENTRY_SYSVEC() users have to call it manually.
As the result all multi-CPU KVM guest hang on boot when
KVM_FEATURE_ASYNC_PF_INT is present. The breakage went unnoticed because no
KVM userspace (e.g. QEMU) currently set it (and thus async PF mechanism
is currently disabled) but we're about to change that.
Fixes: 26d05b368a ("Merge branch 'kvm-async-pf-int' into HEAD")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200908135350.355053-3-vkuznets@redhat.com>
Tested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
According to SDM 27.2.4, Event delivery causes an APIC-access VM exit.
Don't report internal error and freeze guest when event delivery causes
an APIC-access exit, it is handleable and the event will be re-injected
during the next vmentry.
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1597827327-25055-2-git-send-email-wanpengli@tencent.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
svm->next_rip is reset in svm_vcpu_run() only after calling
svm_exit_handlers_fastpath(), which will cause SVM's
skip_emulated_instruction() to write a stale RIP.
We can move svm_exit_handlers_fastpath towards the end of
svm_vcpu_run(). To align VMX with SVM, keep svm_complete_interrupts()
close as well.
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Paul K. <kronenpj@kronenpj.dyndns.org>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
[Also move vmcb_mark_all_clean before any possible write to the VMCB.
- Paolo]
Even without in-kernel LAPIC we should allow writing '0' to
MSR_KVM_ASYNC_PF_EN as we're not enabling the mechanism. In
particular, QEMU with 'kernel-irqchip=off' fails to start
a guest with
qemu-system-x86_64: error: failed to set MSR 0x4b564d02 to 0x0
Fixes: 9d3c447c72 ("KVM: X86: Fix async pf caused null-ptr-deref")
Reported-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200911093147.484565-1-vkuznets@redhat.com>
[Actually commit the version proposed by Sean Christopherson. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There may be many encrypted regions that need to be unregistered when a
SEV VM is destroyed. This can lead to soft lockups. For example, on a
host running 4.15:
watchdog: BUG: soft lockup - CPU#206 stuck for 11s! [t_virtual_machi:194348]
CPU: 206 PID: 194348 Comm: t_virtual_machi
RIP: 0010:free_unref_page_list+0x105/0x170
...
Call Trace:
[<0>] release_pages+0x159/0x3d0
[<0>] sev_unpin_memory+0x2c/0x50 [kvm_amd]
[<0>] __unregister_enc_region_locked+0x2f/0x70 [kvm_amd]
[<0>] svm_vm_destroy+0xa9/0x200 [kvm_amd]
[<0>] kvm_arch_destroy_vm+0x47/0x200
[<0>] kvm_put_kvm+0x1a8/0x2f0
[<0>] kvm_vm_release+0x25/0x30
[<0>] do_exit+0x335/0xc10
[<0>] do_group_exit+0x3f/0xa0
[<0>] get_signal+0x1bc/0x670
[<0>] do_signal+0x31/0x130
Although the CLFLUSH is no longer issued on every encrypted region to be
unregistered, there are no other changes that can prevent soft lockups for
very large SEV VMs in the latest kernel.
Periodically schedule if necessary. This still holds kvm->lock across the
resched, but since this only happens when the VM is destroyed this is
assumed to be acceptable.
Signed-off-by: David Rientjes <rientjes@google.com>
Message-Id: <alpine.DEB.2.23.453.2008251255240.2987727@chino.kir.corp.google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When kvm_mmu_get_page() gets a page with unsynced children, the spt
pagetable is unsynchronized with the guest pagetable. But the
guest might not issue a "flush" operation on it when the pagetable
entry is changed from zero or other cases. The hypervisor has the
responsibility to synchronize the pagetables.
KVM behaved as above for many years, But commit 8c8560b833
("KVM: x86/mmu: Use KVM_REQ_TLB_FLUSH_CURRENT for MMU specific flushes")
inadvertently included a line of code to change it without giving any
reason in the changelog. It is clear that the commit's intention was to
change KVM_REQ_TLB_FLUSH -> KVM_REQ_TLB_FLUSH_CURRENT, so we don't
needlessly flush other contexts; however, one of the hunks changed
a nearby KVM_REQ_MMU_SYNC instead. This patch changes it back.
Link: https://lore.kernel.org/lkml/20200320212833.3507-26-sean.j.christopherson@intel.com/
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20200902135421.31158-1-jiangshanlai@gmail.com>
fixes: 8c8560b833 ("KVM: x86/mmu: Use KVM_REQ_TLB_FLUSH_CURRENT for MMU specific flushes")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A minor fix for the update of VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL field
in exit_ctls_high.
Fixes: 03a8871add ("KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL
VM-{Entry,Exit} control")
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20200828085622.8365-5-chenyi.qiang@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
check the allocation of per-cpu __pv_cpu_mask. Initialize ops only when
successful.
Signed-off-by: Haiwei Li <lihaiwei@tencent.com>
Message-Id: <d59f05df-e6d3-3d31-a036-cc25a2b2f33f@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When L2 uses PAE, L0 intercepts of L2 writes to CR0/CR3/CR4 call
load_pdptrs to read the possibly updated PDPTEs from the guest
physical address referenced by CR3. It loads them into
vcpu->arch.walk_mmu->pdptrs and sets VCPU_EXREG_PDPTR in
vcpu->arch.regs_dirty.
At the subsequent assumed reentry into L2, the mmu will call
vmx_load_mmu_pgd which calls ept_load_pdptrs. ept_load_pdptrs sees
VCPU_EXREG_PDPTR set in vcpu->arch.regs_dirty and loads
VMCS02.GUEST_PDPTRn from vcpu->arch.walk_mmu->pdptrs[]. This all works
if the L2 CRn write intercept always resumes L2.
The resume path calls vmx_check_nested_events which checks for
exceptions, MTF, and expired VMX preemption timers. If
vmx_check_nested_events finds any of these conditions pending it will
reflect the corresponding exit into L1. Live migration at this point
would also cause a missed immediate reentry into L2.
After L1 exits, vmx_vcpu_run calls vmx_register_cache_reset which
clears VCPU_EXREG_PDPTR in vcpu->arch.regs_dirty. When L2 next
resumes, ept_load_pdptrs finds VCPU_EXREG_PDPTR clear in
vcpu->arch.regs_dirty and does not load VMCS02.GUEST_PDPTRn from
vcpu->arch.walk_mmu->pdptrs[]. prepare_vmcs02 will then load
VMCS02.GUEST_PDPTRn from vmcs12->pdptr0/1/2/3 which contain the stale
values stored at last L2 exit. A repro of this bug showed L2 entering
triple fault immediately due to the bad VMCS02.GUEST_PDPTRn values.
When L2 is in PAE paging mode add a call to ept_load_pdptrs before
leaving L2. This will update VMCS02.GUEST_PDPTRn if they are dirty in
vcpu->arch.walk_mmu->pdptrs[].
Tested:
kvm-unit-tests with new directed test: vmx_mtf_pdpte_test.
Verified that test fails without the fix.
Also ran Google internal VMM with an Ubuntu 16.04 4.4.0-83 guest running a
custom hypervisor with a 32-bit Windows XP L2 guest using PAE. Prior to fix
would repro readily. Ran 14 simultaneous L2s for 140 iterations with no
failures.
Signed-off-by: Peter Shier <pshier@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20200820230545.2411347-1-pshier@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Multiple stolen time fixes, with a new capability to match x86
- Fix for hugetlbfs mappings when PUD and PMD are the same level
- Fix for hugetlbfs mappings when PTE mappings are enforced
(dirty logging, for example)
- Fix tracing output of 64bit values
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAl9SGP4PHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpDWSoP/iZdsgkcMXM1qGRpMtn0FDVQiL2zd8QOEki+
/ldkeLvcUC9aqVOWdCM1fTweIvyPM0KSRxfbQRVGRGXym9bUMzx1laSl0CLXgi9q
fa/lbmZOLG5PovQLe8eNon6aXybWWwuxfh/dhpBWLg5VGb0fXFutH3Cs2MGbX/Ec
6qMvbwO09SGSTrXSQepbhFkAmViBAgUH2kiRhKReDfvRc4OwsbxdlmA5r9Ek6R5L
x7tH8mYwOJVP1OEZWYRX+9GG1n8hCIKKSuRrhMbZtErTSNdf+YhldC6DuJF0zWVE
nsoKzIEl15kIl0akkC6oA3MLNX1sfRh9C2J85Rt+odN4fY4MidrfcqRgWE2X3cuu
CKDsr0Lb6aTtfZASHm7QQRsM4hujWoArBq6ZvUNjpfNXOPe4ovxX9TkPHp6OezuK
v3PRzXQxUtmreK+02ZzalL6IBwAQrmLKxXM2P2Nuh4gDMgFC/BrHMxc1QVSnmb/m
flMuKtvm+fkwKySQvX22FZrzhPfCMAuxCh28WdDSW2pnmZ8H0M3922y45xw3QTpg
SltMtIcpO6ipzMsrVvO/hI/GvByFNN6jcLVGUV1Wx8mNdcf2kPeebA4hINKt5UDh
gpwkz4zb2Bgp/YdgiG8NzBjpk2FMO0IPiAnouPrXenizbesAlKR2V3uFqa70PW/A
BBkHnakS
=VBrL
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-fixes-5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for Linux 5.9, take #1
- Multiple stolen time fixes, with a new capability to match x86
- Fix for hugetlbfs mappings when PUD and PMD are the same level
- Fix for hugetlbfs mappings when PTE mappings are enforced
(dirty logging, for example)
- Fix tracing output of 64bit values
Just merge these helpers into the main dma_direct_{alloc,free} routines,
as the additional checks are always false for the two callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
If an exception needs to be handled while reading an MSR - which is in
most of the cases caused by a #GP on a non-existent MSR - then this
is most likely the incarnation of a BIOS or a hardware bug. Such bug
violates the architectural guarantee that MCA banks are present with all
MSRs belonging to them.
The proper fix belongs in the hardware/firmware - not in the kernel.
Handling an #MC exception which is raised while an NMI is being handled
would cause the nasty NMI nesting issue because of the shortcoming of
IRET of reenabling NMIs when executed. And the machine is in an #MC
context already so <Deity> be at its side.
Tracing MSR accesses while in #MC is another no-no due to tracing being
inherently a bad idea in atomic context:
vmlinux.o: warning: objtool: do_machine_check()+0x4a: call to mce_rdmsrl() leaves .noinstr.text section
so remove all that "additional" functionality from mce_rdmsrl() and
provide it with a special exception handler which panics the machine
when that MSR is not accessible.
The exception handler prints a human-readable message explaining what
the panic reason is but, what is more, it panics while in the #GP
handler and latter won't have executed an IRET, thus opening the NMI
nesting issue in the case when the #MC has happened while handling
an NMI. (#MC itself won't be reenabled until MCG_STATUS hasn't been
cleared).
Suggested-by: Andy Lutomirski <luto@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
[ Add missing prototypes for ex_handler_* ]
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200906212130.GA28456@zn.tnic
x86_64 zero extends 32bit operations, so for 64bit operands,
XORL r32,r32 is functionally equal to XORQ r64,r64, but avoids
a REX prefix byte when legacy registers are used.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
x86_64 zero extends 32bit operations, so for 64bit operands,
XORL r32,r32 is functionally equal to XORL r64,r64, but avoids
a REX prefix byte when legacy registers are used.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Make sure the machine supports RDRAND, otherwise there is no trusted
source of randomness in the system.
To also check this in the pre-decompression stage, make has_cpuflag()
not depend on CONFIG_RANDOMIZE_BASE anymore.
Signed-off-by: Martin Radev <martin.b.radev@gmail.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200907131613.12703-73-joro@8bytes.org
Calling down to EFI runtime services can result in the firmware
performing VMGEXIT calls. The firmware is likely to use the GHCB of the
OS (e.g., for setting EFI variables), so each GHCB in the system needs
to be identity-mapped in the EFI page tables, as unencrypted, to avoid
page faults.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[ jroedel@suse.de: Moved GHCB mapping loop to sev-es.c ]
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lkml.kernel.org/r/20200907131613.12703-72-joro@8bytes.org
Commit
0c2a3913d6 ("x86/fpu: Parse clearcpuid= as early XSAVE argument")
changed clearcpuid parsing from __setup() to cmdline_find_option().
While the __setup() function would have been called for each clearcpuid=
parameter on the command line, cmdline_find_option() will only return
the last one, so the change effectively made it impossible to disable
more than one bit.
Allow a comma-separated list of bit numbers as the argument for
clearcpuid to allow multiple bits to be disabled again. Log the bits
being disabled for informational purposes.
Also fix the check on the return value of cmdline_find_option(). It
returns -1 when the option is not found, so testing as a boolean is
incorrect.
Fixes: 0c2a3913d6 ("x86/fpu: Parse clearcpuid= as early XSAVE argument")
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907213919.2423441-1-nivedita@alum.mit.edu
Unwind hints are useful to provide objtool with information about stack
states in non-standard functions/code.
While the type of information being provided might be very arch
specific, the mechanism to provide the information can be useful for
other architectures.
Move the relevant unwint hint definitions for all architectures to
see.
[ jpoimboe: REGS_IRET -> REGS_PARTIAL ]
Signed-off-by: Julien Thierry <jthierry@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Header frame.h is getting more code annotations to help objtool analyze
object files.
Rename the file to objtool.h.
[ jpoimboe: add objtool.h to MAINTAINERS ]
Signed-off-by: Julien Thierry <jthierry@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Compiler is not happy about one function prototype:
CC kernel/dma/swiotlb.o
kernel/dma/swiotlb.c:275:1: warning: no previous prototype for ‘swiotlb_late_init_with_default_size’ [-Wmissing-prototypes]
275 | swiotlb_late_init_with_default_size(size_t default_size)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Since it's used outside of the module, move its declaration to the header
from the user.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Stephane Eranian found a bug in that IBS' current Fetch counter was not
being reset when the driver would write the new value to clear it along
with the enable bit set, and found that adding an MSR write that would
first disable IBS Fetch would make IBS Fetch reset its current count.
Indeed, the PPR for AMD Family 17h Model 31h B0 55803 Rev 0.54 - Sep 12,
2019 states "The periodic fetch counter is set to IbsFetchCnt [...] when
IbsFetchEn is changed from 0 to 1."
Explicitly set IbsFetchEn to 0 and then to 1 when re-enabling IBS Fetch,
so the driver properly resets the internal counter to 0 and IBS
Fetch starts counting again.
A family 15h machine tested does not have this problem, and the extra
wrmsr is also not needed on Family 19h, so only do the extra wrmsr on
families 16h through 18h.
Reported-by: Stephane Eranian <stephane.eranian@google.com>
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
[peterz: optimized]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Family 19h RAPL support did not change from Family 17h; extend
the existing Fam17h support to work on Family 19h too.
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200908214740.18097-8-kim.phillips@amd.com
IBS hardware with the OpCntExt feature gets a 7-bit wider internal
counter. Both the maximum and current count bitfields in the
IBS_OP_CTL register are extended to support reading and writing it.
No changes are necessary to the driver for handling the extra
contiguous current count bits (IbsOpCurCnt), as the driver already
passes through 32 bits of that field. However, the driver has to do
some extra bit manipulation when converting from a period to the
non-contiguous (although conveniently aligned) extra bits in the
IbsOpMaxCnt bitfield.
This decreases IBS Op interrupt overhead when the period is over
1,048,560 (0xffff0), which would previously activate the driver's
software counter. That threshold is now 134,217,712 (0x7fffff0).
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200908214740.18097-7-kim.phillips@amd.com
Neither IbsBrTarget nor OPDATA4 are populated in IBS Fetch mode.
Don't accumulate them into raw sample user data in that case.
Also, in Fetch mode, add saving the IBS Fetch Control Extended MSR.
Technically, there is an ABI change here with respect to the IBS raw
sample data format, but I don't see any perf driver version information
being included in perf.data file headers, but, existing users can detect
whether the size of the sample record has reduced by 8 bytes to
determine whether the IBS driver has this fix.
Fixes: 904cb3677f ("perf/x86/amd/ibs: Update IBS MSRs and feature definitions")
Reported-by: Stephane Eranian <stephane.eranian@google.com>
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200908214740.18097-6-kim.phillips@amd.com
get_ibs_op_count() adds hardware's current count (IbsOpCurCnt) bits
to its count regardless of hardware's valid status.
According to the PPR for AMD Family 17h Model 31h B0 55803 Rev 0.54,
if the counter rolls over, valid status is set, and the lower 7 bits
of IbsOpCurCnt are randomized by hardware.
Don't include those bits in the driver's event count.
Fixes: 8b1e13638d ("perf/x86-ibs: Fix usage of IBS op current count")
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Commit 5738891229 ("perf/x86/amd: Add support for Large Increment
per Cycle Events") mistakenly zeroes the upper 16 bits of the count
in set_period(). That's fine for counting with perf stat, but not
sampling with perf record when only Large Increment events are being
sampled. To enable sampling, we sign extend the upper 16 bits of the
merged counter pair as described in the Family 17h PPRs:
"Software wanting to preload a value to a merged counter pair writes the
high-order 16-bit value to the low-order 16 bits of the odd counter and
then writes the low-order 48-bit value to the even counter. Reading the
even counter of the merged counter pair returns the full 64-bit value."
Fixes: 5738891229 ("perf/x86/amd: Add support for Large Increment per Cycle Events")
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Commit 2f217d58a8 ("perf/x86/amd/uncore: Set the thread mask for
F17h L3 PMCs") inadvertently changed the uncore driver's behaviour
wrt perf tool invocations with or without a CPU list, specified with
-C / --cpu=.
Change the behaviour of the driver to assume the former all-cpu (-a)
case, which is the more commonly desired default. This fixes
'-a -A' invocations without explicit cpu lists (-C) to not count
L3 events only on behalf of the first thread of the first core
in the L3 domain.
BEFORE:
Activity performed by the first thread of the last core (CPU#43) in
CPU#40's L3 domain is not reported by CPU#40:
sudo perf stat -a -A -e l3_request_g1.caching_l3_cache_accesses taskset -c 43 perf bench mem memcpy -s 32mb -l 100 -f default
...
CPU36 21,835 l3_request_g1.caching_l3_cache_accesses
CPU40 87,066 l3_request_g1.caching_l3_cache_accesses
CPU44 17,360 l3_request_g1.caching_l3_cache_accesses
...
AFTER:
The L3 domain activity is now reported by CPU#40:
sudo perf stat -a -A -e l3_request_g1.caching_l3_cache_accesses taskset -c 43 perf bench mem memcpy -s 32mb -l 100 -f default
...
CPU36 354,891 l3_request_g1.caching_l3_cache_accesses
CPU40 1,780,870 l3_request_g1.caching_l3_cache_accesses
CPU44 315,062 l3_request_g1.caching_l3_cache_accesses
...
Fixes: 2f217d58a8 ("perf/x86/amd/uncore: Set the thread mask for F17h L3 PMCs")
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200908214740.18097-2-kim.phillips@amd.com
A warning as below may be triggered when sampling with large PEBS.
[ 410.411250] perf: interrupt took too long (72145 > 71975), lowering
kernel.perf_event_max_sample_rate to 2000
[ 410.724923] ------------[ cut here ]------------
[ 410.729822] WARNING: CPU: 0 PID: 16397 at arch/x86/events/core.c:1422
x86_pmu_stop+0x95/0xa0
[ 410.933811] x86_pmu_del+0x50/0x150
[ 410.937304] event_sched_out.isra.0+0xbc/0x210
[ 410.941751] group_sched_out.part.0+0x53/0xd0
[ 410.946111] ctx_sched_out+0x193/0x270
[ 410.949862] __perf_event_task_sched_out+0x32c/0x890
[ 410.954827] ? set_next_entity+0x98/0x2d0
[ 410.958841] __schedule+0x592/0x9c0
[ 410.962332] schedule+0x5f/0xd0
[ 410.965477] exit_to_usermode_loop+0x73/0x120
[ 410.969837] prepare_exit_to_usermode+0xcd/0xf0
[ 410.974369] ret_from_intr+0x2a/0x3a
[ 410.977946] RIP: 0033:0x40123c
[ 411.079661] ---[ end trace bc83adaea7bb664a ]---
In the non-overflow context, e.g., context switch, with large PEBS, perf
may stop an event twice. An example is below.
//max_samples_per_tick is adjusted to 2
//NMI is triggered
intel_pmu_handle_irq()
handle_pmi_common()
drain_pebs()
__intel_pmu_pebs_event()
perf_event_overflow()
__perf_event_account_interrupt()
hwc->interrupts = 1
return 0
//A context switch happens right after the NMI.
//In the same tick, the perf_throttled_seq is not changed.
perf_event_task_sched_out()
perf_pmu_sched_task()
intel_pmu_drain_pebs_buffer()
__intel_pmu_pebs_event()
perf_event_overflow()
__perf_event_account_interrupt()
++hwc->interrupts >= max_samples_per_tick
return 1
x86_pmu_stop(); # First stop
perf_event_context_sched_out()
task_ctx_sched_out()
ctx_sched_out()
event_sched_out()
x86_pmu_del()
x86_pmu_stop(); # Second stop and trigger the warning
Perf should only invoke the perf_event_overflow() in the overflow
context.
Current drain_pebs() is called from:
- handle_pmi_common() -- overflow context
- intel_pmu_pebs_sched_task() -- non-overflow context
- intel_pmu_pebs_disable() -- non-overflow context
- intel_pmu_auto_reload_read() -- possible overflow context
With PERF_SAMPLE_READ + PERF_FORMAT_GROUP, the function may be
invoked in the NMI handler. But, before calling the function, the
PEBS buffer has already been drained. The __intel_pmu_pebs_event()
will not be called in the possible overflow context.
To fix the issue, an indicator is required to distinguish between the
overflow context aka handle_pmi_common() and other cases.
The dummy regs pointer can be used as the indicator.
In the non-overflow context, perf should treat the last record the same
as other PEBS records, and doesn't invoke the generic overflow handler.
Fixes: 21509084f9 ("perf/x86/intel: Handle multiple records in the PEBS buffer")
Reported-by: Like Xu <like.xu@linux.intel.com>
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Like Xu <like.xu@linux.intel.com>
Link: https://lkml.kernel.org/r/20200902210649.2743-1-kan.liang@linux.intel.com
Latch sequence counters have unique read and write APIs, and thus
seqcount_latch_t was recently introduced at seqlock.h.
Use that new data type instead of plain seqcount_t. This adds the
necessary type-safety and ensures that only latching-safe seqcount APIs
are to be used.
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
[peterz: unwreck cyc2ns_read_begin()]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200827114044.11173-7-a.darwish@linutronix.de
When running under SEV-ES, the kernel has to tell the hypervisor when to
open the NMI window again after an NMI was injected. This is done with
an NMI-complete message to the hypervisor.
Add code to the kernel's NMI handler to send this message right at the
beginning of do_nmi(). This always allows nesting NMIs.
[ bp: Mark __sev_es_nmi_complete() noinstr:
vmlinux.o: warning: objtool: exc_nmi()+0x17: call to __sev_es_nmi_complete()
leaves .noinstr.text section
While at it, use __pa_nodebug() for the same reason due to
CONFIG_DEBUG_VIRTUAL=y:
vmlinux.o: warning: objtool: __sev_es_nmi_complete()+0xd9: call to __phys_addr()
leaves .noinstr.text section ]
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-71-joro@8bytes.org
Add a play_dead handler when running under SEV-ES. This is needed
because the hypervisor can't deliver an SIPI request to restart the AP.
Instead, the kernel has to issue a VMGEXIT to halt the VCPU until the
hypervisor wakes it up again.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-70-joro@8bytes.org
The APs are not ready to handle exceptions when verify_cpu() is called
in secondary_startup_64().
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200907131613.12703-69-joro@8bytes.org
The IDT on 64-bit contains vectors which use paranoid_entry() and/or IST
stacks. To make these vectors work, the TSS and the getcpu GDT entry need
to be set up before the IDT is loaded.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-68-joro@8bytes.org
As part of the GHCB specification, the booting of APs under SEV-ES
requires an AP jump table when transitioning from one layer of code to
another (e.g. when going from UEFI to the OS). As a result, each layer
that parks an AP must provide the physical address of an AP jump table
to the next layer via the hypervisor.
Upon booting of the kernel, read the AP jump table address from the
hypervisor. Under SEV-ES, APs are started using the INIT-SIPI-SIPI
sequence. Before issuing the first SIPI request for an AP, the start
CS and IP is programmed into the AP jump table. Upon issuing the SIPI
request, the AP will awaken and jump to that start CS:IP address.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[ jroedel@suse.de: - Adapted to different code base
- Moved AP table setup from SIPI sending path to
real-mode setup code
- Fix sparse warnings ]
Co-developed-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-67-joro@8bytes.org
The code at the trampoline entry point is executed in real-mode. In
real-mode, #VC exceptions can't be handled so anything that might cause
such an exception must be avoided.
In the standard trampoline entry code this is the WBINVD instruction and
the call to verify_cpu(), which are both not needed anyway when running
as an SEV-ES guest.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-66-joro@8bytes.org
Implement the callbacks to copy the processor state required by KVM to
the GHCB.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[ jroedel@suse.de: - Split out of a larger patch
- Adapt to different callback functions ]
Co-developed-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-64-joro@8bytes.org
Add two new paravirt callbacks to provide hypervisor-specific processor
state in the GHCB and to copy state from the hypervisor back to the
processor.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-63-joro@8bytes.org
Handle #VC exceptions caused by #DB exceptions in the guest. Those
must be handled outside of instrumentation_begin()/end() so that the
handler will not be raised recursively.
Handle them by calling the kernel's debug exception handler.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-62-joro@8bytes.org
Implement a handler for #VC exceptions caused by #AC exceptions. The
#AC exception is just forwarded to do_alignment_check() and not pushed
down to the hypervisor, as requested by the SEV-ES GHCB Standardization
Specification.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-61-joro@8bytes.org
Implement a handler for #VC exceptions caused by VMMCALL instructions.
This is only a starting point, VMMCALL emulation under SEV-ES needs
further hypervisor-specific changes to provide additional state.
[ bp: Drop "this patch". ]
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[ jroedel@suse.de: Adapt to #VC handling infrastructure ]
Co-developed-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-60-joro@8bytes.org
Implement a handler for #VC exceptions caused by INVD instructions.
Since Linux should never use INVD, just mark it as unsupported.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[ jroedel@suse.de: Adapt to #VC handling infrastructure ]
Co-developed-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-57-joro@8bytes.org
Implement a handler for #VC exceptions caused by RDTSC and RDTSCP
instructions. Also make it available in the pre-decompression stage
because the KASLR code uses RDTSC/RDTSCP to gather entropy and some
hypervisors intercept these instructions.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[ jroedel@suse.de: - Adapt to #VC handling infrastructure
- Make it available early ]
Co-developed-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-55-joro@8bytes.org
Add code to handle #VC exceptions on DR7 register reads and writes.
This is needed early because show_regs() reads DR7 to print it out.
Under SEV-ES, there is currently no support for saving/restoring the
DRx registers but software expects to be able to write to the DR7
register. For now, cache the value written to DR7 and return it on
read attempts, but do not touch the real hardware DR7.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[ jroedel@suse.de: - Adapt to #VC handling framework
- Support early usage ]
Co-developed-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-53-joro@8bytes.org
Add handling for emulation of the MOVS instruction on MMIO regions, as
done by the memcpy_toio() and memcpy_fromio() functions.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-51-joro@8bytes.org
Add a handler for #VC exceptions caused by MMIO intercepts. These
intercepts come along as nested page faults on pages with reserved
bits set.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[ jroedel@suse.de: Adapt to VC handling framework ]
Co-developed-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-50-joro@8bytes.org
When a #VC exception is triggered by user-space, the instruction decoder
needs to read the instruction bytes from user addresses. Enhance
vc_decode_insn() to safely fetch kernel and user instructions.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-49-joro@8bytes.org
The #VC handler needs special entry code because:
1. It runs on an IST stack
2. It needs to be able to handle nested #VC exceptions
To make this work, the entry code is implemented to pretend it doesn't
use an IST stack. When entered from user-mode or early SYSCALL entry
path it switches to the task stack. If entered from kernel-mode it tries
to switch back to the previous stack in the IRET frame.
The stack found in the IRET frame is validated first, and if it is not
safe to use it for the #VC handler, the code will switch to a
fall-back stack (the #VC2 IST stack). From there, it can cause nested
exceptions again.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-46-joro@8bytes.org
The get_stack_info() functionality is needed in the entry code for the
#VC exception handler. Provide a version of it in the .text.noinstr
section which can be called safely from there.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-45-joro@8bytes.org
When an NMI hits in the #VC handler entry code before it has switched to
another stack, any subsequent #VC exception in the NMI code-path will
overwrite the interrupted #VC handler's stack.
Make sure this doesn't happen by explicitly adjusting the #VC IST entry
in the NMI handler for the time it can cause #VC exceptions.
[ bp: Touchups, spelling fixes. ]
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-44-joro@8bytes.org
Allocate and map an IST stack and an additional fall-back stack for
the #VC handler. The memory for the stacks is allocated only when
SEV-ES is active.
The #VC handler needs to use an IST stack because a #VC exception can be
raised from kernel space with unsafe stack, e.g. in the SYSCALL entry
path.
Since the #VC exception can be nested, the #VC handler switches back to
the interrupted stack when entered from kernel space. If switching back
is not possible, the fall-back stack is used.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-43-joro@8bytes.org
The runtime handler needs one GHCB per-CPU. Set them up and map them
unencrypted.
[ bp: Touchups and simplification. ]
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-42-joro@8bytes.org
Add the infrastructure to handle #VC exceptions when the kernel runs on
virtual addresses and has mapped a GHCB. This handler will be used until
the runtime #VC handler takes over.
Since the handler runs very early, disable instrumentation for sev-es.c.
[ bp: Make vc_ghcb_invalidate() __always_inline so that it can be
inlined in noinstr functions like __sev_es_nmi_complete(). ]
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200908123816.GB3764@8bytes.org
Setup an early handler for #VC exceptions. There is no GHCB mapped
yet, so just re-use the vc_no_ghcb_handler(). It can only handle
CPUID exit-codes, but that should be enough to get the kernel through
verify_cpu() and __startup_64() until it runs on virtual addresses.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
[ boot failure Error: kernel_ident_mapping_init() failed. ]
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lkml.kernel.org/r/20200908123517.GA3764@8bytes.org
Setup sev-es.c and include the code from the pre-decompression stage
to also build it into the image of the running kernel. Temporarily add
__maybe_unused annotations to avoid build warnings until the functions
get used.
[ bp: Use the non-tracing rd/wrmsr variants because:
vmlinux.o: warning: objtool: __sev_es_nmi_complete()+0x11f: \
call to do_trace_write_msr() leaves .noinstr.text section
as __sev_es_nmi_complete() is noinstr due to being called from the
NMI handler exc_nmi(). ]
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-39-joro@8bytes.org
A recent refresh of the defconfigs got rid of the following
(unset) config:
# CONFIG_64BIT is not set
Innocuous as it seems, when the config file is saved again the
behavior is changed so that CONFIG_64BIT=y.
Currently,
$ make i386_defconfig
$ grep CONFIG_64BIT .config
CONFIG_64BIT=y
whereas previously (and with this patch):
$ make i386_defconfig
$ grep CONFIG_64BIT .config
# CONFIG_64BIT is not set
( This was found with weird compiler errors on OpenEmbedded
builds, as the compiler was unable to cope with 64-bits data
types. )
Fixes: 1d0e12fd3a ("x86/defconfigs: Refresh defconfig files")
Reported-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Daniel Díaz <daniel.diaz@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Stop providing the possibility to override the address space using
set_fs() now that there is no need for that any more. To properly
handle the TASK_SIZE_MAX checking for 4 vs 5-level page tables on
x86 a new alternative is introduced, which just like the one in
entry_64.S has to use the hardcoded virtual address bits to escape
the fact that TASK_SIZE_MAX isn't actually a constant when 5-level
page tables are enabled.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
For 64-bit the only thing missing was a strategic _AC, and for 32-bit we
need to use __PAGE_OFFSET instead of PAGE_OFFSET in the TASK_SIZE
definition to escape the explicit unsigned long cast. This just works
because __PAGE_OFFSET is defined using _AC itself and thus never needs
the cast anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
At least for 64-bit this moves them closer to some of the defines
they are based on, and it prepares for using the TASK_SIZE_MAX
definition from assembly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a CONFIG_SET_FS option that is selected by architecturess that
implement set_fs, which is all of them initially. If the option is not
set stubs for routines related to overriding the address space are
provided so that architectures can start to opt out of providing set_fs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Refactor the message printed to the kernel log which indicates whether
SEV or SME, etc is active. This will scale better in the future when
more memory encryption features might be added. Also add SEV-ES to the
list of features.
[ bp: Massage. ]
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200907131613.12703-38-joro@8bytes.org
Add a sev_es_active() function for checking whether SEV-ES is enabled.
Also cache the value of MSR_AMD64_SEV at boot to speed up the feature
checking in the running code.
[ bp: Remove "!!" in sev_active() too. ]
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200907131613.12703-37-joro@8bytes.org
Move the assembly coded dispatch between page-faults and all other
exceptions to C code to make it easier to maintain and extend.
Also change the return-type of early_make_pgtable() to bool and make it
static.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-36-joro@8bytes.org
Move these two functions from kernel/idt.c to include/asm/desc.h:
* init_idt_data()
* idt_init_desc()
These functions are needed to setup IDT entries very early and need to
be called from head64.c. To be usable this early, these functions need
to be compiled without instrumentation and the stack-protector feature.
These features need to be kept enabled for kernel/idt.c, so head64.c
must use its own versions.
[ bp: Take Kees' suggested patch title and add his Rev-by. ]
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200907131613.12703-35-joro@8bytes.org
Add a separate bringup IDT for the CPU bringup code that will be used
until the kernel switches to the idt_table. There are two reasons for a
separate IDT:
1) When the idt_table is set up and the secondary CPUs are
booted, it contains entries (e.g. IST entries) which
require certain CPU state to be set up. This includes a
working TSS (for IST), MSR_GS_BASE (for stack protector) or
CR4.FSGSBASE (for paranoid_entry) path. By using a
dedicated IDT for early boot this state need not to be set
up early.
2) The idt_table is static to idt.c, so any function
using/modifying must be in idt.c too. That means that all
compiler driven instrumentation like tracing or KASAN is
also active in this code. But during early CPU bringup the
environment is not set up for this instrumentation to work
correctly.
To avoid all of these hassles and make early exception handling robust,
use a dedicated bringup IDT.
The IDT is loaded two times, first on the boot CPU while the kernel is
still running on direct mapped addresses, and again later after the
switch to kernel addresses has happened. The second IDT load happens on
the boot and secondary CPUs.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-34-joro@8bytes.org
Make sure there is a stack once the kernel runs from virtual addresses.
At this stage any secondary CPU which boots will have lost its stack
because the kernel switched to a new page-table which does not map the
real-mode stack anymore.
This is needed for handling early #VC exceptions caused by instructions
like CPUID.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200907131613.12703-33-joro@8bytes.org
Make sure segments are properly set up before setting up an IDT and
doing anything that might cause a #VC exception. This is later needed
for early exception handling.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200907131613.12703-32-joro@8bytes.org
Load the GDT right after switching to virtual addresses to make sure
there is a defined GDT for exception handling.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200907131613.12703-31-joro@8bytes.org
Handling exceptions during boot requires a working GDT. The kernel GDT
can't be used on the direct mapping, so load a startup GDT and setup
segments.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-30-joro@8bytes.org
The code to setup idt_data is needed for early exception handling, but
set_intr_gate() can't be used that early because it has pv-ops in its
code path which don't work that early.
Split out the idt_data initialization part from set_intr_gate() so
that it can be used separately.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200907131613.12703-29-joro@8bytes.org
The xgetbv() function is needed in the pre-decompression boot code,
but asm/fpu/internal.h can't be included there directly. Doing so
opens the door to include-hell due to various include-magic in
boot/compressed/misc.h.
Avoid that by moving xgetbv()/xsetbv() to a separate header file and
include it instead.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-27-joro@8bytes.org
Force a page-fault on any further accesses to the GHCB page when they
shouldn't happen anymore. This will catch any bugs where a #VC exception
is raised even though none is expected anymore.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-25-joro@8bytes.org
Install an exception handler for #VC exception that uses a GHCB. Also
add the infrastructure for handling different exit-codes by decoding
the instruction that caused the exception and error handling.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-24-joro@8bytes.org
The functions are needed to map the GHCB for SEV-ES guests. The GHCB
is used for communication with the hypervisor, so its content must not
be encrypted. After the GHCB is not needed anymore it must be mapped
encrypted again so that the running kernel image can safely re-use the
memory.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-23-joro@8bytes.org
Call set_sev_encryption_mask() while still on the stage 1 #VC-handler
because the stage 2 handler needs the kernel's own page tables to be
set up, to which calling set_sev_encryption_mask() is a prerequisite.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-21-joro@8bytes.org
Add the first handler for #VC exceptions. At stage 1 there is no GHCB
yet because the kernel might still be running on the EFI page table.
The stage 1 handler is limited to the MSR-based protocol to talk to the
hypervisor and can only support CPUID exit-codes, but that is enough to
get to stage 2.
[ bp: Zap superfluous newlines after rd/wrmsr instruction mnemonics. ]
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-20-joro@8bytes.org
Changing the function to take start and end as parameters instead of
start and size simplifies the callers which don't need to calculate the
size if they already have start and end.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200907131613.12703-19-joro@8bytes.org
With the page-fault handler in place, he identity mapping can be built
on-demand. So remove the code which manually creates the mappings and
unexport/remove the functions used for it.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200907131613.12703-18-joro@8bytes.org
When booted through startup_64(), the kernel keeps running on the EFI
page table until the KASLR code sets up its own page table. Without
KASLR, the pre-decompression boot code never switches off the EFI page
table. Change that by unconditionally switching to a kernel-controlled
page table after relocation.
This makes sure the kernel can make changes to the mapping when
necessary, for example map pages unencrypted in SEV and SEV-ES guests.
Also, remove the debug_putstr() calls in initialize_identity_maps()
because the function now runs before console_init() is called.
[ bp: Massage commit message. ]
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200907131613.12703-17-joro@8bytes.org
Install a page-fault handler to add an identity mapping to addresses
not yet mapped. Also do some checking whether the error code is sane.
This makes non SEV-ES machines use the exception handling
infrastructure in the pre-decompressions boot code too, making it less
likely to break in the future.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200907131613.12703-16-joro@8bytes.org
The file contains only code related to identity-mapped page tables.
Rename the file and compile it always in.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200907131613.12703-15-joro@8bytes.org
Add code needed to setup an IDT in the early pre-decompression
boot-code. The IDT is loaded first in startup_64, which is after
EfiExitBootServices() has been called, and later reloaded when the
kernel image has been relocated to the end of the decompression area.
This allows to setup different IDT handlers before and after the
relocation.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-14-joro@8bytes.org
The x86-64 ABI defines a red-zone on the stack:
The 128-byte area beyond the location pointed to by %rsp is considered
to be reserved and shall not be modified by signal or interrupt
handlers. Therefore, functions may use this area for temporary data
that is not needed across function calls. In particular, leaf
functions may use this area for their entire stack frame, rather than
adjusting the stack pointer in the prologue and epilogue. This area is
known as the red zone.
This is not compatible with exception handling, because the IRET frame
written by the hardware at the stack pointer and the functions to handle
the exception will overwrite the temporary variables of the interrupted
function, causing undefined behavior. So disable red-zones for the
pre-decompression boot code.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200907131613.12703-13-joro@8bytes.org
Add a function to check whether an instruction has a REP prefix.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lkml.kernel.org/r/20200907131613.12703-12-joro@8bytes.org
Use the shorthand to make it more readable.
No functional changes.
Signed-off-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-5-joro@8bytes.org
Add a function to the instruction decoder which returns the pt_regs
offset of the register specified in the reg field of the modrm byte.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lkml.kernel.org/r/20200907131613.12703-11-joro@8bytes.org
Building a correct GHCB for the hypervisor requires setting valid bits
in the GHCB. Simplify that process by providing accessor functions to
set values and to update the valid bitmap and to check the valid bitmap
in KVM.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-4-joro@8bytes.org
Factor out the code used to decode an instruction with the correct
address and operand sizes to a helper function.
No functional changes.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-10-joro@8bytes.org
Extend the vmcb_safe_area with SEV-ES fields and add a new
'struct ghcb' which will be used for guest-hypervisor communication.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-3-joro@8bytes.org
Factor out the code to fetch the instruction from user-space to a helper
function.
No functional changes.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-9-joro@8bytes.org
Do not allocate a vmcb_control_area and a vmcb_save_area on the stack,
as these structures will become larger with future extenstions of
SVM and thus the svm_set_nested_state() function will become a too large
stack frame.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-2-joro@8bytes.org
The inat-tables.c file has some arrays in it that contain pointers to
other arrays. These pointers need to be relocated when the kernel
image is moved to a different location.
The pre-decompression boot-code has no support for applying ELF
relocations, so initialize these arrays at runtime in the
pre-decompression code to make sure all pointers are correctly
initialized.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lkml.kernel.org/r/20200907131613.12703-8-joro@8bytes.org
Move the definition of the x86 page-fault error code bits to a new
header file asm/trap_pf.h. This makes it easier to include them into
pre-decompression boot code. No functional changes.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-7-joro@8bytes.org
Add CPU feature detection for Secure Encrypted Virtualization with
Encrypted State. This feature enhances SEV by also encrypting the
guest register state, making it in-accessible to the hypervisor.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-6-joro@8bytes.org
We got slightly different patches removing a double word
in a comment in net/ipv4/raw.c - picked the version from net.
Simple conflict in drivers/net/ethernet/ibm/ibmvnic.c. Use cached
values instead of VNIC login response buffer (following what
commit 507ebe6444 ("ibmvnic: Fix use-after-free of VNIC login
response buffer") did).
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
...because future AMD systems will support up to 64 MCA banks per CPU.
MAX_NR_BANKS is used to allocate a number of data structures, and it is
used as a ceiling for values read from MCG_CAP[Count]. Therefore, this
change will have no functional effect on existing systems with 32 or
fewer MCA banks per CPU.
However, this will increase the size of the following structures:
Global bitmaps:
- core.c / mce_banks_ce_disabled
- core.c / all_banks
- core.c / valid_banks
- core.c / toclear
- Total: 32 new bits * 4 bitmaps = 16 new bytes
Per-CPU bitmaps:
- core.c / mce_poll_banks
- intel.c / mce_banks_owned
- Total: 32 new bits * 2 bitmaps = 8 new bytes
The bitmaps are arrays of longs. So this change will only affect 32-bit
execution, since there will be one additional long used. There will be
no additional memory use on 64-bit execution, because the size of long
is 64 bits.
Global structs:
- amd.c / struct smca_bank smca_banks[]: 16 bytes per bank
- core.c / struct mce_bank_dev mce_bank_devs[]: 56 bytes per bank
- Total: 32 new banks * (16 + 56) bytes = 2304 new bytes
Per-CPU structs:
- core.c / struct mce_bank mce_banks_array[]: 16 bytes per bank
- Total: 32 new banks * 16 bytes = 512 new bytes
32-bit
Total global size increase: 2320 bytes
Total per-CPU size increase: 520 bytes
64-bit
Total global size increase: 2304 bytes
Total per-CPU size increase: 512 bytes
This additional memory should still fit within the existing .data
section of the kernel binary. However, in the case where it doesn't
fit, an additional page (4kB) of memory will be added to the binary to
accommodate the extra data which will be the maximum size increase of
vmlinux.
Signed-off-by: Akshay Gupta <Akshay.Gupta@amd.com>
[ Adjust commit message and code comment. ]
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200828192412.320052-1-Yazen.Ghannam@amd.com
Andy reported that the syscall treacing for 32bit fast syscall fails:
# ./tools/testing/selftests/x86/ptrace_syscall_32
...
[RUN] SYSEMU
[FAIL] Initial args are wrong (nr=224, args=10 11 12 13 14 4289172732)
...
[RUN] SYSCALL
[FAIL] Initial args are wrong (nr=29, args=0 0 0 0 0 4289172732)
The eason is that the conversion to generic entry code moved the retrieval
of the sixth argument (EBP) after the point where the syscall entry work
runs, i.e. ptrace, seccomp, audit...
Unbreak it by providing a split up version of syscall_enter_from_user_mode().
- syscall_enter_from_user_mode_prepare() establishes state and enables
interrupts
- syscall_enter_from_user_mode_work() runs the entry work
Replace the call to syscall_enter_from_user_mode() in the 32bit fast
syscall C-entry with the split functions and stick the EBP retrieval
between them.
Fixes: 27d6b4d14f ("x86/entry: Use generic syscall entry function")
Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/87k0xdjbtt.fsf@nanos.tec.linutronix.de
Current usage of thread.debugreg6 is convoluted at best. It starts life as
a copy of the hardware DR6 value, but then various bits are cleared and
set.
Replace this with a new variable thread.virtual_dr6 that is initialized to
0 when DR6 is read and only gains bits, at the same time the actual (on
stack) dr6 value which is read from the hardware only gets bits cleared.
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20200902133201.415372940@infradead.org
DR6 has a whole bunch of bits that have negative polarity; they were
architecturally reserved and defined to be 1 and are now getting used.
Since they're 1 by default, 0 becomes the signal value.
Handle this by xor'ing the read DR6 value by the reserved mask, this
will flip them around such that 1 is the signal value (positive
polarity).
Current Linux doesn't yet support any of these bits, but there's two
defined:
- DR6[11] Bus Lock Debug Exception (ISEr39)
- DR6[16] Restricted Transactional Memory (SDM)
Update ptrace_{set,get}_debugreg() to provide/consume the value in
architectural polarity. Although afaict ptrace_set_debugreg(6) is
pointless, the value is not consumed anywhere.
Change hw_breakpoint_restore() to alway write the DR6_RESERVED value
to DR6, again, no consumer for that write.
Suggested-by: Andrew Cooper <Andrew.Cooper3@citrix.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20200902133201.354220797@infradead.org
This is called with interrupts disabled, there's no point in using
get_cpu() and per_cpu().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20200902133201.292906672@infradead.org
Unused remnants for the bit-bucket.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20200902133201.233022474@infradead.org
Remove the historical junk and replace it with a WARN and a comment.
The problem is that even though the kernel only uses TF single-step in
kprobes and KGDB, both of which consume the event before this, QEMU/KVM has
bugs in this area that can trigger this state so it has to be dealt with.
Suggested-by: Brian Gerst <brgerst@gmail.com>
Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20200902133201.170216274@infradead.org
The cond_local_irq_enable() block, dealing with vm86 and sending
signals is only relevant for #DB-from-user, move it there.
This then reduces handle_debug() to only the notifier call, so rename
it to notify_debug().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20200902133201.094265982@infradead.org
The historical SYSENTER junk is explicitly for from-kernel, so move it
to the #DB-from-kernel handler.
It is ordered after the notifier, which is important for KGDB which uses TF
single-step and needs to consume the event before that point.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20200902133201.031099736@infradead.org
There's no point in calculating si_code if it's not going to be used.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20200902133200.967434217@infradead.org
The handle_debug(.user) argument is used to terminate the #DB handler early
for the INT1-from-kernel case, since the kernel doesn't use INT1.
Remove the argument and handle this explicitly in #DB-from-kernel.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20200902133200.907020598@infradead.org
Kprobes are on kernel text, and thus only matter for #DB-from-kernel.
Kprobes are ordered before the generic notifier, preserve that order.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20200902133200.847465360@infradead.org
Move the BTF sync near the DR6 load, as this will be the only common
code guaranteed to run on every #DB.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20200902133200.786888252@infradead.org
Trying to clear DR7 around a #DB from usermode malfunctions if the tasks
schedules when delivering SIGTRAP.
Rather than trying to define a special no-recursion region, just allow a
single level of recursion. The same mechanism is used for NMI, and it
hasn't caused any problems yet.
Fixes: 9f58fdde95 ("x86/db: Split out dr6/7 handling")
Reported-by: Kyle Huey <me@kylehuey.com>
Debugged-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/8b9bd05f187231df008d48cf818a6a311cbd5c98.1597882384.git.luto@kernel.org
Link: https://lore.kernel.org/r/20200902133200.726584153@infradead.org
The WARN added in commit 3c73b81a91 ("x86/entry, selftests: Further
improve user entry sanity checks") unconditionally triggers on a IVB
machine because it does not support SMAP.
For !SMAP hardware the CLAC/STAC instructions are patched out and thus if
userspace sets AC, it is still have set after entry.
Fixes: 3c73b81a91 ("x86/entry, selftests: Further improve user entry sanity checks")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200902133200.666781610@infradead.org
On i386, the order of parameters passed on regs is eax,edx,and ecx
(as per regparm(3) calling conventions).
Change the mapping in regs_get_kernel_argument(), so that arg1=ax
arg2=dx, and arg3=cx.
Running the selftests testcase kprobes_args_use.tc shows the result
as passed.
Fixes: 3c88ee194c ("x86: ptrace: Add function argument access API")
Signed-off-by: Vamshi K Sthambamkadi <vamshi.k.sthambamkadi@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200828113242.GA1424@cosmos
inst.h was included in calling.h solely to instantiate the RDPID macro.
The usage of RDPID was removed in
6a3ea3e68b ("x86/entry/64: Do not use RDPID in paranoid entry to accomodate KVM")
so remove the include.
Fixes: 6a3ea3e68b ("x86/entry/64: Do not use RDPID in paranoid entry to accomodate KVM")
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200827171735.93825-1-ubizjak@gmail.com
XORL %0,%0 is equivalent to XORQ %0,%0 as both will zero the entire
register. Use XORL %0,%0 for all operand sizes to avoid REX prefix byte
when legacy registers are used and to avoid size prefix byte when 16bit
registers are used.
Zeroing the full register is OK in this use case.
As a result, the size of the .fixup section decreases by 20 bytes.
[ bp: Massage commit message. ]
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Link: https://lkml.kernel.org/r/20200827180904.96399-1-ubizjak@gmail.com
We found that callers of dma_get_seg_boundary mostly do an ALIGN
with page mask and then do a page shift to get number of pages:
ALIGN(boundary + 1, 1 << shift) >> shift
However, the boundary might be as large as ULONG_MAX, which means
that a device has no specific boundary limit. So either "+ 1" or
passing it to ALIGN() would potentially overflow.
According to kernel defines:
#define ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask))
#define ALIGN(x, a) ALIGN_MASK(x, (typeof(x))(a) - 1)
We can simplify the logic here into a helper function doing:
ALIGN(boundary + 1, 1 << shift) >> shift
= ALIGN_MASK(b + 1, (1 << s) - 1) >> s
= {[b + 1 + (1 << s) - 1] & ~[(1 << s) - 1]} >> s
= [b + 1 + (1 << s) - 1] >> s
= [b + (1 << s)] >> s
= (b >> s) + 1
This patch introduces and applies dma_get_seg_boundary_nr_pages()
as an overflow-free helper for the dma_get_seg_boundary() callers
to get numbers of pages. It also takes care of the NULL dev case
for non-DMA API callers.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicolin Chen <nicoleotsuka@gmail.com>
Acked-by: Niklas Schnelle <schnelle@linux.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Signed-off-by: Christoph Hellwig <hch@lst.de>
One can not simply remove vmalloc faulting on x86-32. Upstream
commit: 7f0a002b5a ("x86/mm: remove vmalloc faulting")
removed it on x86 alltogether because previously the
arch_sync_kernel_mappings() interface was introduced. This interface
added synchronization of vmalloc/ioremap page-table updates to all
page-tables in the system at creation time and was thought to make
vmalloc faulting obsolete.
But that assumption was incredibly naive.
It turned out that there is a race window between the time the vmalloc
or ioremap code establishes a mapping and the time it synchronizes
this change to other page-tables in the system.
During this race window another CPU or thread can establish a vmalloc
mapping which uses the same intermediate page-table entries (e.g. PMD
or PUD) and does no synchronization in the end, because it found all
necessary mappings already present in the kernel reference page-table.
But when these intermediate page-table entries are not yet
synchronized, the other CPU or thread will continue with a vmalloc
address that is not yet mapped in the page-table it currently uses,
causing an unhandled page fault and oops like below:
BUG: unable to handle page fault for address: fe80c000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
*pde = 33183067 *pte = a8648163
Oops: 0002 [#1] SMP
CPU: 1 PID: 13514 Comm: cve-2017-17053 Tainted: G
...
Call Trace:
ldt_dup_context+0x66/0x80
dup_mm+0x2b3/0x480
copy_process+0x133b/0x15c0
_do_fork+0x94/0x3e0
__ia32_sys_clone+0x67/0x80
__do_fast_syscall_32+0x3f/0x70
do_fast_syscall_32+0x29/0x60
do_SYSENTER_32+0x15/0x20
entry_SYSENTER_32+0x9f/0xf2
EIP: 0xb7eef549
So the arch_sync_kernel_mappings() interface is racy, but removing it
would mean to re-introduce the vmalloc_sync_all() interface, which is
even more awful. Keep arch_sync_kernel_mappings() in place and catch
the race condition in the page-fault handler instead.
Do a partial revert of above commit to get vmalloc faulting on x86-32
back in place.
Fixes: 7f0a002b5a ("x86/mm: remove vmalloc faulting")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200902155904.17544-1-joro@8bytes.org
When CONFIG_RETPOLINE is disabled, Clang uses a jump table for the
switch statement in cmdline_find_option (jump tables are disabled when
CONFIG_RETPOLINE is enabled). This function is called very early in boot
from sme_enable() if CONFIG_AMD_MEM_ENCRYPT is enabled. At this time,
the kernel is still executing out of the identity mapping, but the jump
table will contain virtual addresses.
Fix this by disabling jump tables for cmdline.c when AMD_MEM_ENCRYPT is
enabled.
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200903023056.3914690-1-nivedita@alum.mit.edu
We don't want to depend on the linker's orphan section placement
heuristics as these can vary between linkers, and may change between
versions. All sections need to be explicitly handled in the linker
script.
Now that all sections are explicitly handled, enable orphan section
warnings.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/20200902025347.2504702-6-keescook@chromium.org
We don't want to depend on the linker's orphan section placement
heuristics as these can vary between linkers, and may change between
versions. All sections need to be explicitly handled in the linker script.
Now that all sections are explicitly handled, enable orphan section
warnings.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/20200902025347.2504702-5-keescook@chromium.org
Daniel Borkmann says:
====================
pull-request: bpf-next 2020-09-01
The following pull-request contains BPF updates for your *net-next* tree.
There are two small conflicts when pulling, resolve as follows:
1) Merge conflict in tools/lib/bpf/libbpf.c between 88a8212028 ("libbpf: Factor
out common ELF operations and improve logging") in bpf-next and 1e891e513e
("libbpf: Fix map index used in error message") in net-next. Resolve by taking
the hunk in bpf-next:
[...]
scn = elf_sec_by_idx(obj, obj->efile.btf_maps_shndx);
data = elf_sec_data(obj, scn);
if (!scn || !data) {
pr_warn("elf: failed to get %s map definitions for %s\n",
MAPS_ELF_SEC, obj->path);
return -EINVAL;
}
[...]
2) Merge conflict in drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c between
9647c57b11 ("xsk: i40e: ice: ixgbe: mlx5: Test for dma_need_sync earlier for
better performance") in bpf-next and e20f0dbf20 ("net/mlx5e: RX, Add a prefetch
command for small L1_CACHE_BYTES") in net-next. Resolve the two locations by retaining
net_prefetch() and taking xsk_buff_dma_sync_for_cpu() from bpf-next. Should look like:
[...]
xdp_set_data_meta_invalid(xdp);
xsk_buff_dma_sync_for_cpu(xdp, rq->xsk_pool);
net_prefetch(xdp->data);
[...]
We've added 133 non-merge commits during the last 14 day(s) which contain
a total of 246 files changed, 13832 insertions(+), 3105 deletions(-).
The main changes are:
1) Initial support for sleepable BPF programs along with bpf_copy_from_user() helper
for tracing to reliably access user memory, from Alexei Starovoitov.
2) Add BPF infra for writing and parsing TCP header options, from Martin KaFai Lau.
3) bpf_d_path() helper for returning full path for given 'struct path', from Jiri Olsa.
4) AF_XDP support for shared umems between devices and queues, from Magnus Karlsson.
5) Initial prep work for full BPF-to-BPF call support in libbpf, from Andrii Nakryiko.
6) Generalize bpf_sk_storage map & add local storage for inodes, from KP Singh.
7) Implement sockmap/hash updates from BPF context, from Lorenz Bauer.
8) BPF xor verification for scalar types & add BPF link iterator, from Yonghong Song.
9) Use target's prog type for BPF_PROG_TYPE_EXT prog verification, from Udip Pant.
10) Rework BPF tracing samples to use libbpf loader, from Daniel T. Lee.
11) Fix xdpsock sample to really cycle through all buffers, from Weqaar Janjua.
12) Improve type safety for tun/veth XDP frame handling, from Maciej Żenczykowski.
13) Various smaller cleanups and improvements all over the place.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix build error when CONFIG_ACPI is not set/enabled by adding the header
file <asm/acpi.h> which contains a stub for the function in the build
error.
../arch/x86/pci/intel_mid_pci.c: In function ‘intel_mid_pci_init’:
../arch/x86/pci/intel_mid_pci.c:303:2: error: implicit declaration of function ‘acpi_noirq_set’; did you mean ‘acpi_irq_get’? [-Werror=implicit-function-declaration]
acpi_noirq_set();
Fixes: a912a7584e ("x86/platform/intel-mid: Move PCI initialization to arch_init()")
Link: https://lore.kernel.org/r/ea903917-e51b-4cc9-2680-bc1e36efa026@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Reviewed-by: Jesse Barnes <jsbarnes@google.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org # v4.16+
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Jesse Barnes <jsbarnes@google.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
In preparation for warning on orphan sections, stop the linker from
generating the .eh_frame* sections, discard unwanted non-zero-sized
generated sections, and enforce other expected-to-be-zero-sized sections
(since discarding them might hide problems with them suddenly gaining
unexpected entries).
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200821194310.3089815-28-keescook@chromium.org
In preparation for warning on orphan sections, enforce other
expected-to-be-zero-sized sections (since discarding them might hide
problems with them suddenly gaining unexpected entries).
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200821194310.3089815-25-keescook@chromium.org
The .got.plt section should always be zero (or filled only with the
linker-generated lazy dispatch entry). Enforce this with an assert and
mark the section as INFO. This is more sensitive than just blindly
discarding the section.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200821194310.3089815-24-keescook@chromium.org
When !CONFIG_KPROBES, do not generate kprobe sections. This makes
sure there are no unexpected sections encountered by the linker scripts.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200821194310.3089815-23-keescook@chromium.org
Replace many of the indirect calls with static_call().
The average PMI time, as measured by perf_sample_event_took()*:
PRE: 3283.03 [ns]
POST: 3145.12 [ns]
Which is a ~138 [ns] win per PMI, or a ~4.2% decrease.
[*] on an IVB-EP, using: 'perf record -a -e cycles -- make O=defconfig-build/ -j80'
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20200818135805.338001015@infradead.org
In order to use static_call() to wire up x86_pmu, we need to
initialize earlier, specifically before memory allocation works; copy
some of the tricks from jump_label to enable this.
Primarily we overload key->next to store a sites pointer when there
are no modules, this avoids having to use kmalloc() to initialize the
sites and allows us to run much earlier.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20200818135805.220737930@infradead.org
Verify the text we're about to change is as we expect it to be.
Requested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200818135805.161974981@infradead.org
GCC can turn our static_call(name)(args...) into a tail call, in which
case we get a JMP.d32 into the trampoline (which then does a further
tail-call).
Teach objtool to recognise and mark these in .static_call_sites and
adjust the code patching to deal with this.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20200818135805.101186767@infradead.org
Extend the static_call infrastructure to optimize the following common
pattern:
if (func_ptr)
func_ptr(args...)
For the trampoline (which is in effect a tail-call), we patch the
JMP.d32 into a RET, which then directly consumes the trampoline call.
For the in-line sites we replace the CALL with a NOP5.
NOTE: this is 'obviously' limited to functions with a 'void' return type.
NOTE: DEFINE_STATIC_COND_CALL() only requires a typename, as opposed
to a full function.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20200818135805.042977182@infradead.org
Future patches will need to poke a RET instruction, provide the
infrastructure required for this.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20200818135804.982214828@infradead.org
Add the inline static call implementation for x86-64. The generated code
is identical to the out-of-line case, except we move the trampoline into
it's own section.
Objtool uses the trampoline naming convention to detect all the call
sites. It then annotates those call sites in the .static_call_sites
section.
During boot (and module init), the call sites are patched to call
directly into the destination function. The temporary trampoline is
then no longer used.
[peterz: merged trampolines, put trampoline in section]
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20200818135804.864271425@infradead.org
Add the x86 out-of-line static call implementation. For each key, a
permanent trampoline is created which is the destination for all static
calls for the given key. The trampoline has a direct jump which gets
patched by static_call_update() when the destination function changes.
[peterz: fixed trampoline, rewrote patching code]
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20200818135804.804315175@infradead.org
Similar to how we disallow kprobes on any other dynamic text
(ftrace/jump_label) also disallow kprobes on inline static_call()s.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200818135804.744920586@infradead.org
The .comment section doesn't belong in STABS_DEBUG. Split it out into a
new macro named ELF_DETAILS. This will gain other non-debug sections
that need to be accounted for when linking with --orphan-handling=warn.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: linux-arch@vger.kernel.org
Link: https://lore.kernel.org/r/20200821194310.3089815-5-keescook@chromium.org
TSX suspend load tracking instruction is supported by the Intel uarch
Sapphire Rapids. It aims to give a way to choose which memory accesses
do not need to be tracked in the TSX read set. It's availability is
indicated as CPUID.(EAX=7,ECX=0):EDX[bit 16].
Expose TSX Suspend Load Address Tracking feature in KVM CPUID, so KVM
could pass this information to guests and they can make use of this
feature accordingly.
Signed-off-by: Cathy Zhang <cathy.zhang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/1598316478-23337-3-git-send-email-cathy.zhang@intel.com
- Move disabling of the local APIC after invoking fixup_irqs() to ensure
that interrupts which are incoming are noted in the IRR and not ignored.
- Unbreak affinity setting. The rework of the entry code reused the
regular exception entry code for device interrupts. The vector number is
pushed into the errorcode slot on the stack which is then lifted into an
argument and set to -1 because that's regs->orig_ax which is used in
quite some places to check whether the entry came from a syscall. But it
was overlooked that orig_ax is used in the affinity cleanup code to
validate whether the interrupt has arrived on the new target. It turned
out that this vector check is pointless because interrupts are never
moved from one vector to another on the same CPU. That check is a
historical leftover from the time where x86 supported multi-CPU
affinities, but not longer needed with the now strict single CPU
affinity. Famous last words ...
- Add a missing check for an empty cpumask into the matrix allocator. The
affinity change added a warning to catch the case where an interrupt is
moved on the same CPU to a different vector. This triggers because a
condition with an empty cpumask returns an assignment from the allocator
as the allocator uses for_each_cpu() without checking the cpumask for
being empty. The historical inconsistent for_each_cpu() behaviour of
ignoring the cpumask and unconditionally claiming that CPU0 is in the
mask striked again. Sigh.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl9L6WYTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoRV5D/9dRq/4pn5g1esnzm4GhIr2To3Qp6cl
s7VswTdN8FmWBqVz79ZVYqj663UpL3pPY1np01ctrxRQLeDVfWcI2BMR5irnny8h
otORhFysuDUl+yfuomWVbzfQQNJ+VeQVeWKD3cIhD1I3sXqDX5Wpa8n086hYKQXx
eutVC3+JdzJZFm68xarlLW7h2f1au1eZZFgVnyY+J5KO9Dwm63a4RITdDVk7KV4t
uKEDza5P9SY+kE9LAGNq8BAEObf9FeMXw0mRM7atRKVsJQQGVk6bgiuaRr01w1+W
hQCPx/3g6PHFnGgx/KQgHf1jgrZFhXOyIDo6ZeFy+SJGIZRB3n8o5Kjns2l8Pa+K
2qy1TRoZIsGkwGCi/BM6viLzBikbh/gnGYy/8KTEJdKs8P3ZKHUZVSAB1dpapOWX
4n+rKoVPnvxgRSeZZo+tgLkvUdh+/9Huyr9vHiYjtbbB8tFvjlkOmrZ6sirHByDy
jg6TjOJVb1CC/PoW4M7JNfmeKvHQnTACwH6djdVGDLPJspuUsYkPI0Uk0CX21SA3
45Tuylvl9jT6+vq95Av2RbAiipmSpZ/O1NHV8Paf466SKmhUgG3lv5PHh3xTm1U2
Be/RbJ75x4Muuw42ttU1LcpcLPcOZRQNEREoNd5UysgYYgWRekBvU+ZRQNW4g2nw
3JDgJgm0iBUN9w==
=zIi4
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2020-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"Three interrupt related fixes for X86:
- Move disabling of the local APIC after invoking fixup_irqs() to
ensure that interrupts which are incoming are noted in the IRR and
not ignored.
- Unbreak affinity setting.
The rework of the entry code reused the regular exception entry
code for device interrupts. The vector number is pushed into the
errorcode slot on the stack which is then lifted into an argument
and set to -1 because that's regs->orig_ax which is used in quite
some places to check whether the entry came from a syscall.
But it was overlooked that orig_ax is used in the affinity cleanup
code to validate whether the interrupt has arrived on the new
target. It turned out that this vector check is pointless because
interrupts are never moved from one vector to another on the same
CPU. That check is a historical leftover from the time where x86
supported multi-CPU affinities, but not longer needed with the now
strict single CPU affinity. Famous last words ...
- Add a missing check for an empty cpumask into the matrix allocator.
The affinity change added a warning to catch the case where an
interrupt is moved on the same CPU to a different vector. This
triggers because a condition with an empty cpumask returns an
assignment from the allocator as the allocator uses for_each_cpu()
without checking the cpumask for being empty. The historical
inconsistent for_each_cpu() behaviour of ignoring the cpumask and
unconditionally claiming that CPU0 is in the mask struck again.
Sigh.
plus a new entry into the MAINTAINER file for the HPE/UV platform"
* tag 'x86-urgent-2020-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/matrix: Deal with the sillyness of for_each_cpu() on UP
x86/irq: Unbreak interrupt affinity setting
x86/hotplug: Silence APIC only after all interrupts are migrated
MAINTAINERS: Add entry for HPE Superdome Flex (UV) maintainers
- Prevent recursion by using raw_cpu_* operations
- Fixup the interrupt state in the cpu idle code to be consistent
- Push rcu_idle_enter/exit() invocations deeper into the idle path so
that the lock operations are inside the RCU watching sections
- Move trace_cpu_idle() into generic code so it's called before RCU goes
idle.
- Handle raw_local_irq* vs. local_irq* operations correctly
- Move the tracepoints out from under the lockdep recursion handling
which turned out to be fragile and inconsistent.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl9L5qETHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoV/NEADG+h02tj2I4gP7IQ3nVodEzS1+odPI
orabY5ggH0kn4YIhPB4UtOd5zKZjr3FJs9wEhyhQpV6ZhvFfgaIKiYqfg+Q81aMO
/BXrfh6jBD2Hu7gaPBnVdkKeh1ehl+w0PhTeJhPBHEEvbGeLUYWwyPNlaKz//VQl
XCWl7e7o/Uw2UyJ469SCx3z+M2DMNqwdMys/zcqvTLiBdLNCwp4TW5ACzEA0rfHh
Pepu3eIKnMURyt82QanrOATvT2io9pOOaUh59zeKi2WM8ikwKd/Eho2kXYng6GvM
GzX4Kn13MsNobZXf9BhqEGICdRkaJqLsXlmBNmbJdSTCn5W2lLZqu2wCEp5VZHCc
XwMbey8ek+BRskJMqAV4oq2GA8Om9KEYWOOdixyOG0UJCiW5qDowuDYBXTLV7FWj
XhzLGuHpUF9eKLKokJ7ideLaDcpzwYjHr58pFLQrqPwmjVKWguLeYMg5BhhTiEuV
wNfiLIGdMNsCpYKhnce3o9paV8+hy1ZveWhNy+/4HaDLoEwI2T62i8R7xxbrcWMg
sgdAiQG+kVLwSJ13bN+Cz79uLYTIbqGaZHtOXmeIT3jSxBjx5RlXfzocwTHSYrNk
GuLYHd7+QaemN49Rrf4bPR16Db7ifL32QkUtLBTBLcnos9jM+fcl+BWyqYRxhgDv
xzDS+vfK8DvRiA==
=Hgt6
-----END PGP SIGNATURE-----
Merge tag 'locking-urgent-2020-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Thomas Gleixner:
"A set of fixes for lockdep, tracing and RCU:
- Prevent recursion by using raw_cpu_* operations
- Fixup the interrupt state in the cpu idle code to be consistent
- Push rcu_idle_enter/exit() invocations deeper into the idle path so
that the lock operations are inside the RCU watching sections
- Move trace_cpu_idle() into generic code so it's called before RCU
goes idle.
- Handle raw_local_irq* vs. local_irq* operations correctly
- Move the tracepoints out from under the lockdep recursion handling
which turned out to be fragile and inconsistent"
* tag 'locking-urgent-2020-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
lockdep,trace: Expose tracepoints
lockdep: Only trace IRQ edges
mips: Implement arch_irqs_disabled()
arm64: Implement arch_irqs_disabled()
nds32: Implement arch_irqs_disabled()
locking/lockdep: Cleanup
x86/entry: Remove unused THUNKs
cpuidle: Move trace_cpu_idle() into generic code
cpuidle: Make CPUIDLE_FLAG_TLB_FLUSHED generic
sched,idle,rcu: Push rcu_idle deeper into the idle path
cpuidle: Fixup IRQ state
lockdep: Use raw_cpu_*() for per-cpu variables
Intel TSX suspend load tracking instructions aim to give a way to choose
which memory accesses do not need to be tracked in the TSX read set. Add
TSX suspend load tracking CPUID feature flag TSXLDTRK for enumeration.
A processor supports Intel TSX suspend load address tracking if
CPUID.0x07.0x0:EDX[16] is present. Two instructions XSUSLDTRK, XRESLDTRK
are available when this feature is present.
The CPU feature flag is shown as "tsxldtrk" in /proc/cpuinfo.
Signed-off-by: Kyung Min Park <kyung.min.park@intel.com>
Signed-off-by: Cathy Zhang <cathy.zhang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/1598316478-23337-2-git-send-email-cathy.zhang@intel.com
Introduce sleepable BPF programs that can request such property for themselves
via BPF_F_SLEEPABLE flag at program load time. In such case they will be able
to use helpers like bpf_copy_from_user() that might sleep. At present only
fentry/fexit/fmod_ret and lsm programs can request to be sleepable and only
when they are attached to kernel functions that are known to allow sleeping.
The non-sleepable programs are relying on implicit rcu_read_lock() and
migrate_disable() to protect life time of programs, maps that they use and
per-cpu kernel structures used to pass info between bpf programs and the
kernel. The sleepable programs cannot be enclosed into rcu_read_lock().
migrate_disable() maps to preempt_disable() in non-RT kernels, so the progs
should not be enclosed in migrate_disable() as well. Therefore
rcu_read_lock_trace is used to protect the life time of sleepable progs.
There are many networking and tracing program types. In many cases the
'struct bpf_prog *' pointer itself is rcu protected within some other kernel
data structure and the kernel code is using rcu_dereference() to load that
program pointer and call BPF_PROG_RUN() on it. All these cases are not touched.
Instead sleepable bpf programs are allowed with bpf trampoline only. The
program pointers are hard-coded into generated assembly of bpf trampoline and
synchronize_rcu_tasks_trace() is used to protect the life time of the program.
The same trampoline can hold both sleepable and non-sleepable progs.
When rcu_read_lock_trace is held it means that some sleepable bpf program is
running from bpf trampoline. Those programs can use bpf arrays and preallocated
hash/lru maps. These map types are waiting on programs to complete via
synchronize_rcu_tasks_trace();
Updates to trampoline now has to do synchronize_rcu_tasks_trace() and
synchronize_rcu_tasks() to wait for sleepable progs to finish and for
trampoline assembly to finish.
This is the first step of introducing sleepable progs. Eventually dynamically
allocated hash maps can be allowed and networking program types can become
sleepable too.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20200827220114.69225-3-alexei.starovoitov@gmail.com
Remove asm/io_apic.h which is included more than once.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Link: https://lkml.kernel.org/r/20200819112910.7629-1-wanghai38@huawei.com
Several people reported that 5.8 broke the interrupt affinity setting
mechanism.
The consolidation of the entry code reused the regular exception entry code
for device interrupts and changed the way how the vector number is conveyed
from ptregs->orig_ax to a function argument.
The low level entry uses the hardware error code slot to push the vector
number onto the stack which is retrieved from there into a function
argument and the slot on stack is set to -1.
The reason for setting it to -1 is that the error code slot is at the
position where pt_regs::orig_ax is. A positive value in pt_regs::orig_ax
indicates that the entry came via a syscall. If it's not set to a negative
value then a signal delivery on return to userspace would try to restart a
syscall. But there are other places which rely on pt_regs::orig_ax being a
valid indicator for syscall entry.
But setting pt_regs::orig_ax to -1 has a nasty side effect vs. the
interrupt affinity setting mechanism, which was overlooked when this change
was made.
Moving interrupts on x86 happens in several steps. A new vector on a
different CPU is allocated and the relevant interrupt source is
reprogrammed to that. But that's racy and there might be an interrupt
already in flight to the old vector. So the old vector is preserved until
the first interrupt arrives on the new vector and the new target CPU. Once
that happens the old vector is cleaned up, but this cleanup still depends
on the vector number being stored in pt_regs::orig_ax, which is now -1.
That -1 makes the check for cleanup: pt_regs::orig_ax == new_vector
always false. As a consequence the interrupt is moved once, but then it
cannot be moved anymore because the cleanup of the old vector never
happens.
There would be several ways to convey the vector information to that place
in the guts of the interrupt handling, but on deeper inspection it turned
out that this check is pointless and a leftover from the old affinity model
of X86 which supported multi-CPU affinities. Under this model it was
possible that an interrupt had an old and a new vector on the same CPU, so
the vector match was required.
Under the new model the effective affinity of an interrupt is always a
single CPU from the requested affinity mask. If the affinity mask changes
then either the interrupt stays on the CPU and on the same vector when that
CPU is still in the new affinity mask or it is moved to a different CPU, but
it is never moved to a different vector on the same CPU.
Ergo the cleanup check for the matching vector number is not required and
can be removed which makes the dependency on pt_regs:orig_ax go away.
The remaining check for new_cpu == smp_processsor_id() is completely
sufficient. If it matches then the interrupt was successfully migrated and
the cleanup can proceed.
For paranoia sake add a warning into the vector assignment code to
validate that the assumption of never moving to a different vector on
the same CPU holds.
Fixes: 633260fa14 ("x86/irq: Convey vector as argument and not in ptregs")
Reported-by: Alex bykov <alex.bykov@scylladb.com>
Reported-by: Avi Kivity <avi@scylladb.com>
Reported-by: Alexander Graf <graf@amazon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Alexander Graf <graf@amazon.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87wo1ltaxz.fsf@nanos.tec.linutronix.de
There is a race when taking a CPU offline. Current code looks like this:
native_cpu_disable()
{
...
apic_soft_disable();
/*
* Any existing set bits for pending interrupt to
* this CPU are preserved and will be sent via IPI
* to another CPU by fixup_irqs().
*/
cpu_disable_common();
{
....
/*
* Race window happens here. Once local APIC has been
* disabled any new interrupts from the device to
* the old CPU are lost
*/
fixup_irqs(); // Too late to capture anything in IRR.
...
}
}
The fix is to disable the APIC *after* cpu_disable_common().
Testing was done with a USB NIC that provided a source of frequent
interrupts. A script migrated interrupts to a specific CPU and
then took that CPU offline.
Fixes: 60dcaad573 ("x86/hotplug: Silence APIC and NMI when CPU is dead")
Reported-by: Evan Green <evgreen@chromium.org>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Tested-by: Evan Green <evgreen@chromium.org>
Reviewed-by: Evan Green <evgreen@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/lkml/875zdarr4h.fsf@nanos.tec.linutronix.de/
Link: https://lore.kernel.org/r/1598501530-45821-1-git-send-email-ashok.raj@intel.com
A long time ago, Linux cleared IA32_MCG_STATUS at the very end of machine
check processing.
Then, some fancy recovery and IST manipulation was added in:
d4812e169d ("x86, mce: Get rid of TIF_MCE_NOTIFY and associated mce tricks")
and clearing IA32_MCG_STATUS was pulled earlier in the function.
Next change moved the actual recovery out of do_machine_check() and
just used task_work_add() to schedule it later (before returning to the
user):
5567d11c21 ("x86/mce: Send #MC singal from task work")
Most recently the fancy IST footwork was removed as no longer needed:
b052df3da8 ("x86/entry: Get rid of ist_begin/end_non_atomic()")
At this point there is no reason remaining to clear IA32_MCG_STATUS early.
It can move back to the very end of the function.
Also move sync_core(). The comments for this function say that it should
only be called when instructions have been changed/re-mapped. Recovery
for an instruction fetch may change the physical address. But that
doesn't happen until the scheduled work runs (which could be on another
CPU).
[ bp: Massage commit message. ]
Reported-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200824221237.5397-1-tony.luck@intel.com
Early Intel hardware implementations of Memory Bandwidth Allocation (MBA)
could only control bandwidth at the processor core level. This meant that
when two processes with different bandwidth allocations ran simultaneously
on the same core the hardware had to resolve this difference. It did so by
applying the higher throttling value (lower bandwidth) to both processes.
Newer implementations can apply different throttling values to each
thread on a core.
Introduce a new resctrl file, "thread_throttle_mode", on Intel systems
that shows to the user how throttling values are allocated, per-core or
per-thread.
On systems that support per-core throttling, the file will display "max".
On newer systems that support per-thread throttling, the file will display
"per-thread".
AMD confirmed in [1] that AMD bandwidth allocation is already at thread
level but that the AMD implementation does not use a memory delay
throttle mode. So to avoid confusion the thread throttling mode would be
UNDEFINED on AMD systems and the "thread_throttle_mode" file will not be
visible.
Originally-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/1598296281-127595-3-git-send-email-fenghua.yu@intel.com
Link: [1] https://lore.kernel.org/lkml/18d277fd-6523-319c-d560-66b63ff606b8@amd.com
Some systems support per-thread Memory Bandwidth Allocation (MBA) which
applies a throttling delay value to each hardware thread instead of to
a core. Per-thread MBA is enumerated by CPUID.
No feature flag is shown in /proc/cpuinfo. User applications need to
check a resctrl throttling mode info file to know if the feature is
supported.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/1598296281-127595-2-git-send-email-fenghua.yu@intel.com
Unused remnants
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Marco Elver <elver@google.com>
Link: https://lkml.kernel.org/r/20200821085348.487040689@infradead.org
Remove trace_cpu_idle() from the arch_cpu_idle() implementations and
put it in the generic code, right before disabling RCU. Gets rid of
more trace_*_rcuidle() users.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Marco Elver <elver@google.com>
Link: https://lkml.kernel.org/r/20200821085348.428433395@infradead.org
This allows moving the leave_mm() call into generic code before
rcu_idle_enter(). Gets rid of more trace_*_rcuidle() users.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Marco Elver <elver@google.com>
Link: https://lkml.kernel.org/r/20200821085348.369441600@infradead.org
Use hlist_for_each_entry_srcu() instead of hlist_for_each_entry_rcu()
as it also checkes if the right lock is held.
Using hlist_for_each_entry_rcu() with a condition argument will not
report the cases where a SRCU protected list is traversed using
rcu_read_lock(). Hence, use hlist_for_each_entry_srcu().
Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: <kvm@vger.kernel.org>
ptrace and prctl() are not really fast paths to warrant the use of
static_cpu_has() and cause alternatives patching for no good reason.
Replace with boot_cpu_has() which is simple and fast enough.
No functional changes.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200818103715.32736-1-bp@alien8.de
path and unconditionally uses LSL to retrieve the CPU number. RDPID depends
on MSR_TSX_AUX. KVM has an optmization to avoid expensive MRS read/writes
on VMENTER/EXIT. It caches the MSR values and restores them either when
leaving the run loop, on preemption or when going out to user
space. MSR_TSX_AUX is part of that lazy MSR set, so after writing the guest
value and before the lazy restore any exception using the paranoid entry
will read the guest value and use it as CPU number to retrieve the GSBASE
value for the current CPU when FSGSBASE is enabled. As RDPID is only used
in that particular entry path, there is no reason to burden VMENTER/EXIT
with two extra MSR writes. Remove the RDPID optimization, which is not even
backed by numbers from the paranoid entry path instead.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl9CJqgTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoaAvD/sHkSx1V0hupUh9budLhSPhUyWvXiar
AmAvWl8dKEFG2UOhOT57zD6DgFN3uu5rqjxTG7ha9XEspsVaP5jOji4evus37IAe
z2EB9J5c3ih4VSdaM+8ODTMls5rjQrvJjeDV0ETTQ9Xb+FOT0vNAub6D5PVms6J3
FObDKiIpdB3s5INAWPknIYves0EJP5BP3+gOktovMStfcy8tm8N9S/yA4cNL4nbi
IZ0h1H6xcGrQ79dv+2/vC0cdqlbm6y2KWfNKpSTGwPMdRl0PpHrovsOVPKF+6pio
Ad230t7xnWMlrHghbSvDyyJ67/N6AA6CaqHECWtgsDuzbqcD2MoQ2l97atoZInh7
83n8ZWFaw10T4ksw9SWqAex+ZJh6/rD4vcQYUncPN66/kOVM186ezICc+QsPV99s
ukw29xge4uHz91Hy0Bo8SP+w1bvntKJn6XyJuTFgDt8bmFRIeajSxyOGw7hTs+ZD
TONw9dMeteWZhZRIXYDjlYc83xFYGkX6hmxLrDJ4jg8UGojaca83s7oZtxgZxWzu
L9wfCRJIEA33ihvqtbTEOHbJvl6eyDt8b/kBGGHbAbqweQ3mWEH3WDQ5cyhlLrfA
tnDToX1DvsxfVg94saprNt249qHNZlFIj8EaGfjxxEngd8xAgfU0vxJpI8sBtdsw
SMIwyLLbLz573A==
=AEel
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2020-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Thomas Gleixner:
"A single fix for x86 which removes the RDPID usage from the paranoid
entry path and unconditionally uses LSL to retrieve the CPU number.
RDPID depends on MSR_TSX_AUX. KVM has an optmization to avoid
expensive MRS read/writes on VMENTER/EXIT. It caches the MSR values
and restores them either when leaving the run loop, on preemption or
when going out to user space. MSR_TSX_AUX is part of that lazy MSR
set, so after writing the guest value and before the lazy restore any
exception using the paranoid entry will read the guest value and use
it as CPU number to retrieve the GSBASE value for the current CPU when
FSGSBASE is enabled. As RDPID is only used in that particular entry
path, there is no reason to burden VMENTER/EXIT with two extra MSR
writes. Remove the RDPID optimization, which is not even backed by
numbers from the paranoid entry path instead"
* tag 'x86-urgent-2020-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry/64: Do not use RDPID in paranoid entry to accomodate KVM
broken down bandwith counters.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl9CJSkTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoUmoD/9xdmi6Pi8cqsN61zoqzDMygyd45rYH
x9oe1QOYHPBp7fIEv0EeG/HcZpBG23Oyj7XnAGZlUuSzHzRYSK3ceQHqo0VEP7qv
ZrPzBJCI+wx/zVBFN62IbSZmZF3omBjo+TKT1bTKmDRov+qovAEcD040Vt45SkW2
NKiD/u2iJemLrF9g7KKgNCv+alqGAeH6YzZpWgFycP550psIF3WgIY2eMjSY9+XD
i7DrEg/OWY5vvCYbHBxrbhvqkXlhyMzw0jqWJtRHTIPRjMuDCM9yef5KiCExPdFX
e+rE+O2gp/9xMEcAvi4SK3s1QUe0wjq+e7stNiOKrBhNA0CNhvJxAmIBO2y4Bi5+
BKJelWt0vDDxCiQgW9mctq0rvu8KM6C3w6AgKWiQNbirVLKHNdx4AvF5SNWGVRhf
ZGMVaFlpwuuTZfahbdmwi88j6968h8izjkQeQ2gZMXEOgv8P/Df+C9IcHH/HxEgI
OwW6htpU3LeOSbEu9201AJqjqAtpXxYzodBWuGWLF7+BF6NACciITQPApkYvWGPN
bKol/kbX52ZRzBEEUHWfrIred7KPH/2UZltPz6Fo5wwuuI2zKBYecIN/vWppzITH
uj5sq2UVAdRQy8/9wKys1bu0uVhlILVa/KTYuf+9cuCWGELfRTN5iHaKviUEEFWB
aVfI8o2ThyQyvQ==
=vF1u
-----END PGP SIGNATURE-----
Merge tag 'perf-urgent-2020-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 perf fix from Thomas Gleixner:
"A single update for perf on x86 which has support for the broken down
bandwith counters"
* tag 'perf-urgent-2020-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/uncore: Add BW counters for GT, IA and IO breakdown
- Enforce NX on RO data in mixed EFI mode
- Destroy workqueue in an error handling path to prevent UAF
- Stop argument parser at '--' which is the delimiter for init
- Treat a NULL command line pointer as empty instead of dereferncing it
unconditionally.
- Handle an unterminated command line correctly
- Cleanup the 32bit code leftovers and remove obsolete documentation
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl9CJMATHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoXDXD/4ypNf6J1B7VX/eSO8IRW5LepYaatUb
nOYESLCFF9RcWV4ff1ujs1Wow8rmTV0U+uCb6auSAMvmFu53VA5uJ9U8lHLjx/cZ
2qFpz6PScXr+cv4oPZ/8T4i7zuT887He1jtwsc+ABRHDUu9yd+qt0IZxqhLAqGQW
dBal0WdZIfMftR1IBZvdbp8hNqHU/sEnVYJOovf2PpKNe8Z+IR2eeQVUoryM7V1b
VNL+KtfpsYAbG5Gr4QVB18tHLnfEEJZHh8Dt7TlTOMF8hFJ6lAE7MutydWqHDMVv
DlXtbs07C1aFIlpJuRdiwKHvNh4/9cUG0XgPqXW44NAXKK2yynV33oidlu9xjLNA
uuFZ7Ni3XyHYgI4PyNX81tC4tG/krIRi5282KRh1OEUc7zgJvHRXo8cCy4HLgpEM
VU4R7bI3LLBJgAdEHh/4EXNnryXOdk2ATUMcLgWGH54ZvnKRCFo1soNggzmiZb+2
WXVRHbSS3nsc0udmXmoUSqqWkud29o9r3KnuC/2qomCXMwtVSTEamTiok+86ICfc
BiNZRu3DeoZqovyssBuSEeNGazrLsg8cGR0lhlpkAf0nu12y+quawdOj13SB8QcH
XGrsCEmXD4jE9cpHt/+qIipyRzFt8RcamJCMH9m5ZNhx1GS6ddunxZAZFD2Fbg/9
+CSclYiPCfH0cw==
=+nbs
-----END PGP SIGNATURE-----
Merge tag 'efi-urgent-2020-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI fixes from Thomas Gleixner:
- Enforce NX on RO data in mixed EFI mode
- Destroy workqueue in an error handling path to prevent UAF
- Stop argument parser at '--' which is the delimiter for init
- Treat a NULL command line pointer as empty instead of dereferncing it
unconditionally.
- Handle an unterminated command line correctly
- Cleanup the 32bit code leftovers and remove obsolete documentation
* tag 'efi-urgent-2020-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Documentation: efi: remove description of efi=old_map
efi/x86: Move 32-bit code into efi_32.c
efi/libstub: Handle unterminated cmdline
efi/libstub: Handle NULL cmdline
efi/libstub: Stop parsing arguments at "--"
efi: add missed destroy_workqueue when efisubsys_init fails
efi/x86: Mark kernel rodata non-executable for mixed mode
* selftests fix for new binutils
* MMU notifier fix for arm64
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl9ARnoUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroP2YAf/dgLrPm4y4jxm7Aiz3/txqrHEwogT
ZtvnzqUPb6+vkFrkop8QMOPw7A8NCfkn3/6sWbyUN5ObgOG1pxKyPraeN3ZdsDoR
KGwv6P0dKgI8B4UuGEMe9GazXv+oOv8+bSUJnE+HZiUHzJKlX4HJbxDwUhvSSatY
qYCZb/Uzqundh79TYULa7oI1/3F15A2J1zQPe4QgkToH9tsVB8PVfkH5uPJPp64M
DTm5+qgwwsBULFaAuuo3FTs9f3pWJxn8GOuico1Sm+RnR53mhbUJggUfFzP0rwzZ
Emevunje5r1rluFs+JWeNtflGH0gI4CLak7jvlOOBjrNb5XJgUSbzLXxkA==
=Jwic
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
- PAE and PKU bugfixes for x86
- selftests fix for new binutils
- MMU notifier fix for arm64
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: arm64: Only reschedule if MMU_NOTIFIER_RANGE_BLOCKABLE is not set
KVM: Pass MMU notifier range flags to kvm_unmap_hva_range()
kvm: x86: Toggling CR4.PKE does not load PDPTEs in PAE mode
kvm: x86: Toggling CR4.SMAP does not load PDPTEs in PAE mode
KVM: x86: fix access code passed to gva_to_gpa
selftests: kvm: Use a shorter encoding to clear RAX
In many cases, task_struct.comm isn't enough to distinguish the
offender, since for interpreted languages it's likely just going to be
"python3" or whatever. Add the pid to make it unambiguous.
[ bp: Make the printk string a single line for easier grepping. ]
Signed-off-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/6f6fbd0ee6c99bc5e47910db700a6642159db01b.1598011595.git.chris@chrisdown.name
Applications which manipulate MSRs from userspace often do so
infrequently, and all at once. As such, the default printk ratelimit
architecture supplied by pr_err_ratelimited() doesn't do enough to prevent
kmsg becoming completely overwhelmed with their messages and pushing
other salient information out of the circular buffer.
In one case, I saw over 80% of kmsg being filled with these messages,
and the default kmsg buffer being completely filled less than 5 minutes
after boot(!).
Make things much less aggressive, while still achieving the original
goal of fiter_write(). Operators will still get warnings that MSRs are
being manipulated from userspace, but they won't have other also
potentially useful messages pushed out of the kmsg buffer.
Of course, one can boot with `allow_writes=1` to avoid these messages at
all, but that then has the downfall that one doesn't get _any_
notification at all about these problems in the first place, and so is
much less likely to forget to fix it.
One might rather it was less binary: it was still logged, just less
often, so that application developers _do_ have the incentive to improve
their current methods, without the kernel having to push other useful
stuff out of the kmsg buffer.
This one example isn't the point, of course: I'm sure there are plenty
of other non-ideal-but-pragmatic cases where people are writing to MSRs
from userspace right now, and it will take time for those people to find
other solutions.
Overall, keep the intent of the original patch, while mitigating its
sometimes heavy effects on kmsg composition.
[ bp: Massage a bit. ]
Signed-off-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/563994ef132ce6cffd28fc659254ca37d032b5ef.1598011595.git.chris@chrisdown.name
The 'flags' field of 'struct mmu_notifier_range' is used to indicate
whether invalidate_range_{start,end}() are permitted to block. In the
case of kvm_mmu_notifier_invalidate_range_start(), this field is not
forwarded on to the architecture-specific implementation of
kvm_unmap_hva_range() and therefore the backend cannot sensibly decide
whether or not to block.
Add an extra 'flags' parameter to kvm_unmap_hva_range() so that
architectures are aware as to whether or not they are permitted to block.
Cc: <stable@vger.kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20200811102725.7121-2-will@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCX0ALjQAKCRCAXGG7T9hj
voZ2AQCXYVDclEXoNwkD6sS0RuVSc4T6ypEzeGM6tP4Z/VInaAD/eJ8zP+aJx9wL
oeTPsOEJzAax6Xj/c+tQ7maCQdxFpQQ=
=K3yj
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.9-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"One build fix and a minor fix for suppressing a useless warning when
booting a Xen dom0 via UEFI"
* tag 'for-linus-5.9-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
Fix build error when CONFIG_ACPI is not set/enabled:
efi: avoid error message when booting under Xen
KVM has an optmization to avoid expensive MRS read/writes on
VMENTER/EXIT. It caches the MSR values and restores them either when
leaving the run loop, on preemption or when going out to user space.
The affected MSRs are not required for kernel context operations. This
changed with the recently introduced mechanism to handle FSGSBASE in the
paranoid entry code which has to retrieve the kernel GSBASE value by
accessing per CPU memory. The mechanism needs to retrieve the CPU number
and uses either LSL or RDPID if the processor supports it.
Unfortunately RDPID uses MSR_TSC_AUX which is in the list of cached and
lazily restored MSRs, which means between the point where the guest value
is written and the point of restore, MSR_TSC_AUX contains a random number.
If an NMI or any other exception which uses the paranoid entry path happens
in such a context, then RDPID returns the random guest MSR_TSC_AUX value.
As a consequence this reads from the wrong memory location to retrieve the
kernel GSBASE value. Kernel GS is used to for all regular this_cpu_*()
operations. If the GSBASE in the exception handler points to the per CPU
memory of a different CPU then this has the obvious consequences of data
corruption and crashes.
As the paranoid entry path is the only place which accesses MSR_TSX_AUX
(via RDPID) and the fallback via LSL is not significantly slower, remove
the RDPID alternative from the entry path and always use LSL.
The alternative would be to write MSR_TSC_AUX on every VMENTER and VMEXIT
which would be inflicting massive overhead on that code path.
[ tglx: Rewrote changelog ]
Fixes: eaad981291 ("x86/entry/64: Introduce the FIND_PERCPU_BASE macro")
Reported-by: Tom Lendacky <thomas.lendacky@amd.com>
Debugged-by: Tom Lendacky <thomas.lendacky@amd.com>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200821105229.18938-1-pbonzini@redhat.com
arm64 requires a vcpu fd (KVM_HAS_DEVICE_ATTR vcpu ioctl) to probe
support for steal-time. However this is unnecessary, as only a KVM
fd is required, and it complicates userspace (userspace may prefer
delaying vcpu creation until after feature probing). Introduce a cap
that can be checked instead. While x86 can already probe steal-time
support with a kvm fd (KVM_GET_SUPPORTED_CPUID), we add the cap there
too for consistency.
Signed-off-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Steven Price <steven.price@arm.com>
Link: https://lore.kernel.org/r/20200804170604.42662-7-drjones@redhat.com
Current minimum required version of binutils is 2.23,
which supports CRC32 instruction mnemonic.
Replace the byte-wise specification of CRC32 with this proper mnemonic.
The compiler is now able to pass memory operand to the instruction,
so there is no need for a temporary register anymore.
Some examples of the improvement:
12a: 48 8b 08 mov (%rax),%rcx
12d: f2 48 0f 38 f1 f1 crc32q %rcx,%rsi
133: 48 83 c0 08 add $0x8,%rax
137: 48 39 d0 cmp %rdx,%rax
13a: 75 ee jne 12a <crc32c_intel_update+0x1a>
to:
125: f2 48 0f 38 f1 06 crc32q (%rsi),%rax
12b: 48 83 c6 08 add $0x8,%rsi
12f: 48 39 d6 cmp %rdx,%rsi
132: 75 f1 jne 125 <crc32c_intel_update+0x15>
and:
146: 0f b6 08 movzbl (%rax),%ecx
149: f2 0f 38 f0 f1 crc32b %cl,%esi
14e: 48 83 c0 01 add $0x1,%rax
152: 48 39 d0 cmp %rdx,%rax
155: 75 ef jne 146 <crc32c_intel_update+0x36>
to:
13b: f2 0f 38 f0 02 crc32b (%rdx),%eax
140: 48 83 c2 01 add $0x1,%rdx
144: 48 39 ca cmp %rcx,%rdx
147: 75 f2 jne 13b <crc32c_intel_update+0x2b>
As the compiler has some more freedom w.r.t. register allocation,
there is also a couple of reg-reg moves removed.
There are no hidden states for CRC32 insn, so there is no need to mark
assembly as volatile.
v2: Introduce CRC32_INST define.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: "David S. Miller" <davem@davemloft.net>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: Borislav Petkov <bp@alien8.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
... and fold handling of misaligned case into it.
Implementation note: we stash the "will we need to rol8 the sum in the end"
flag into the MSB of %rcx (the lower 32 bits are used for length); the rest
is pretty straightforward.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
All callers of these primitives will
* discard anything we might've copied in case of error
* ignore the csum value in case of error
* always pass 0xffffffff as the initial sum, so the
resulting csum value (in case of success, that is) will never be 0.
That suggest the following calling conventions:
* don't pass err_ptr - just return 0 on error.
* don't bother with zeroing destination, etc. in case of error
* don't pass the initial sum - just use 0xffffffff.
This commit does the minimal conversion in the instances of csum_and_copy_...();
the changes of actual asm code behind them are done later in the series.
Note that this asm code is often shared with csum_partial_copy_nocheck();
the difference is that csum_partial_copy_nocheck() passes 0 for initial
sum while csum_and_copy_..._user() pass 0xffffffff. Fortunately, we are
free to pass 0xffffffff in all cases and subsequent patches will use that
freedom without any special comments.
A part that could be split off: parisc and uml/i386 claimed to have
csum_and_copy_to_user() instances of their own, but those were identical
to the generic one, so we simply drop them. Not sure if it's worth
a separate commit...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It's always 0. Note that we theoretically could use ~0U as well -
result will be the same modulo 0xffff, _if_ the damn thing did the
right thing for any value of initial sum; later we'll make use of
that when convenient.
However, unlike csum_and_copy_..._user(), there are instances that
did not work for arbitrary initial sums; c6x is one such.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
quite a few architectures have the same csum_partial_copy_nocheck() -
simply memcpy() the data and then return the csum of the copy.
hexagon, parisc, ia64, s390, um: explicitly spelled out that way.
arc, arm64, csky, h8300, m68k/nommu, microblaze, mips/GENERIC_CSUM, nds32,
nios2, openrisc, riscv, unicore32: end up picking the same thing spelled
out in lib/checksum.h (with varying amounts of perversions along the way).
everybody else (alpha, arm, c6x, m68k/mmu, mips/!GENERIC_CSUM, powerpc,
sh, sparc, x86, xtensa) have non-generic variants. For all except c6x
the declaration is in their asm/checksum.h. c6x uses the wrapper
from asm-generic/checksum.h that would normally lead to the lib/checksum.h
instance, but in case of c6x we end up using an asm function from arch/c6x
instead.
Screw that mess - have architectures with private instances define
_HAVE_ARCH_CSUM_AND_COPY in their asm/checksum.h and have the default
one right in net/checksum.h conditional on _HAVE_ARCH_CSUM_AND_COPY
*not* defined.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add emulation/spoofing of SLDT and STR for both 32- and 64-bit
processes.
Wine users have found a small number of Windows apps using SLDT that
were crashing when run on UMIP-enabled systems.
Originally-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Reported-by: Andreas Rammhold <andi@notmuch.email>
Signed-off-by: Brendan Shanks <bshanks@codeweavers.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Link: https://lkml.kernel.org/r/20200710224525.21966-1-bshanks@codeweavers.com
The old _do_fork() helper is removed in favor of the new kernel_clone() helper.
The latter adheres to naming conventions for kernel internal syscall helpers.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Link: https://lore.kernel.org/r/20200819104655.436656-8-christian.brauner@ubuntu.com
Now that the old memmap code has been removed, some code that was left
behind in arch/x86/platform/efi/efi.c is only used for 32-bit builds,
which means it can live in efi_32.c as well. So move it over.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
When remapping the kernel rodata section RO in the EFI pagetables, the
protection flags that were used for the text section are being reused,
but the rodata section should not be marked executable.
Cc: <stable@vger.kernel.org>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Link: https://lore.kernel.org/r/20200717194526.3452089-1-nivedita@alum.mit.edu
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
The Extended Error Code Bitmap (xec_bitmap) for a Scalable MCA bank type
was intended to be used by the kernel to filter out invalid error codes
on a system. However, this is unnecessary after a few product releases
because the hardware will only report valid error codes. Thus, there's
no need for it with future systems.
Remove the xec_bitmap field and all references to it.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200720145353.43924-1-Yazen.Ghannam@amd.com
We have some really ancient debug printouts in the x86 boot image build code:
Setup is 14108 bytes (padded to 14336 bytes).
System is 8802 kB
CRC 27e909d4
None of these ever helped debug any sort of breakage that I know of, and they
clutter the build output.
Remove them - if anyone needs the see the various interim stages of this to
debug an obscure bug, they can add these printfs and more.
We still keep this one:
Kernel: arch/x86/boot/bzImage is ready (#19)
As a sentimental leftover, plus the '#19' build count tag is mildly useful.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: x86@kernel.org
../arch/x86/pci/xen.c: In function ‘pci_xen_init’:
../arch/x86/pci/xen.c:410:2: error: implicit declaration of function ‘acpi_noirq_set’; did you mean ‘acpi_irq_get’? [-Werror=implicit-function-declaration]
acpi_noirq_set();
Fixes: 88e9ca161c ("xen/pci: Use acpi_noirq_set() helper to avoid #ifdef")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel@lists.xenproject.org
Cc: linux-pci@vger.kernel.org
Signed-off-by: Juergen Gross <jgross@suse.com>
The header file algapi.h includes skbuff.h unnecessarily since
all we need is a forward declaration for struct sk_buff. This
patch removes that inclusion.
Unfortunately skbuff.h pulls in a lot of things and drivers over
the years have come to rely on it so this patch adds a lot of
missing inclusions that result from this.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Since commits
c041b5ad86 ("x86, boot: Create a separate string.h file to provide standard string functions")
fb4cac573e ("x86, boot: Move memcmp() into string.h and string.c")
the decompressor stub has been using the compiler's builtin memcpy,
memset and memcmp functions, _except_ where it would likely have the
largest impact, in the decompression code itself.
Remove the #undef's of memcpy and memset in misc.c so that the
decompressor code also uses the compiler builtins.
The rationale given in the comment doesn't really apply: just because
some functions use the out-of-line version is no reason to not use the
builtin version in the rest.
Replace the comment with an explanation of why memzero and memmove are
being #define'd.
Drop the suggestion to #undef in boot/string.h as well: the out-of-line
versions are not really optimized versions, they're generic code that's
good enough for the preboot environment. The compiler will likely
generate better code for constant-size memcpy/memset/memcmp if it is
allowed to.
Most decompressors' performance is unchanged, with the exception of LZ4
and 64-bit ZSTD.
Before After ARCH
LZ4 73ms 10ms 32
LZ4 120ms 10ms 64
ZSTD 90ms 74ms 64
Measurements on QEMU on 2.2GHz Broadwell Xeon, using defconfig kernels.
Decompressor code size has small differences, with the largest being
that 64-bit ZSTD decreases just over 2k. The largest code size increase
was on 64-bit XZ, of about 400 bytes.
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Suggested-by: Nick Terrell <nickrterrell@gmail.com>
Tested-by: Nick Terrell <nickrterrell@gmail.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
resctrl/core.c defines get_cache_id() for use in its cpu-hotplug
callbacks. This gets the id attribute of the cache at the corresponding
level of a CPU.
Later rework means this private function needs to be shared. Move
it to the header file.
The name conflicts with a different definition in intel_cacheinfo.c,
name it get_cpu_cacheinfo_id() to show its relation with
get_cpu_cacheinfo().
Now this is visible on other architectures, check the id attribute
has actually been set.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20200708163929.2783-11-james.morse@arm.com
Intel CPUs expect the cache bitmap provided by user-space to have on a
single span of 1s, whereas AMD can support bitmaps like 0xf00f. Arm's
MPAM support also allows sparse bitmaps.
Similarly, Intel CPUs check at least one bit set, whereas AMD CPUs are
quite happy with an empty bitmap. Arm's MPAM allows an empty bitmap.
To move resctrl out to /fs/, platform differences like this need to be
explained.
Add two resource properties arch_has_{empty,sparse}_bitmaps. Test these
around the relevant parts of cbm_validate().
Merging the validate calls causes AMD to gain the min_cbm_bits test
needed for Haswell, but as it always sets this value to 1, it will never
match.
[ bp: Massage commit message. ]
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20200708163929.2783-10-james.morse@arm.com
- Fix typos.
- Move the compiler barrier comment to the top, because it's valid for the
whole function, not just the legacy branch.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200818053130.GA3161093@gmail.com
Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Now after arch_needs_linear has been added, the parse_bw() calls are
almost the same between AMD and Intel.
The difference is '!is_mba_sc()', which is not checked on AMD. This
will always be true on AMD CPUs as mba_sc cannot be enabled as
is_mba_linear() is false.
Removing this duplication means user-space visible behaviour and
error messages are not validated or generated in different places.
Reviewed-by : Babu Moger <babu.moger@amd.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20200708163929.2783-9-james.morse@arm.com
The configuration values user-space provides to the resctrl filesystem
are ABI. To make this work on another architecture, all the ABI bits
should be moved out of /arch/x86 and under /fs.
To do this, the differences between AMD and Intel CPUs needs to be
explained to resctrl via resource properties, instead of function
pointers that let the arch code accept subtly different values on
different platforms/architectures.
For MBA, Intel CPUs reject configuration attempts for non-linear
resources, whereas AMD ignore this field as its MBA resource is never
linear. To merge the parse/validate functions, this difference needs to
be explained.
Add struct rdt_membw::arch_needs_linear to indicate the arch code needs
the linear property to be true to configure this resource. AMD can set
this and delay_linear to false. Intel can set arch_needs_linear to
true to keep the existing "No support for non-linear MB domains" error
message for affected platforms.
[ bp: convert "we" etc to passive voice. ]
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20200708163929.2783-8-james.morse@arm.com
rdtgroup_tasks_assigned() and show_rdt_tasks() loop over threads testing
for a CTRL/MON group match by closid/rmid with the provided rdtgrp.
Further down the file are helpers to do this, move these further up and
make use of them here.
These helpers additionally check for alloc/mon capable. This is harmless
as rdtgroup_mkdir() tests these capable flags before allowing the config
directories to be created.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20200708163929.2783-7-james.morse@arm.com
mbm_handle_overflow() and cqm_handle_limbo() are both provided with
the domain's work_struct when called, but use get_domain_from_cpu()
to find the domain, along with the appropriate error handling.
container_of() saves some list walking and bitmap testing, use that
instead.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20200708163929.2783-5-james.morse@arm.com
The comment in rdtgroup_init() refers to the non existent function
rdt_mount(), which has now been renamed rdt_get_tree(). Fix the
comment.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20200708163929.2783-4-james.morse@arm.com
max_delay is used by x86's __get_mem_config_intel() as a local variable.
Remove it, replacing it with a local variable.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20200708163929.2783-3-james.morse@arm.com
Starts from Ice Lake, the TopDown metrics are directly available as
fixed counters and do not require generic counters. Also, the TopDown
metrics can be collected per thread. Extend the RDPMC usage to support
per-thread TopDown metrics.
The RDPMC index of the PERF_METRICS will be output if RDPMC users ask
for the RDPMC index of the metrics events.
To support per thread RDPMC TopDown, the metrics and slots counters have
to be saved/restored during the context switching.
The last_period and period_left are not used in the counting mode. Use
the fields for saved_metric and saved_slots.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200723171117.9918-12-kan.liang@linux.intel.com
Ice Lake supports the hardware TopDown metrics feature, which can free
up the scarce GP counters.
Update the event constraints for the metrics events. The metric counters
do not exist, which are mapped to a dummy offset. The sharing between
multiple users of the same metric without multiplexing is not allowed.
Implement set_topdown_event_period for Ice Lake. The values in
PERF_METRICS MSR are derived from the fixed counter 3. Both registers
should start from zero.
Implement update_topdown_event for Ice Lake. The metric is reported by
multiplying the metric (fraction) with slots. To maintain accurate
measurements, both registers are cleared for each update. The fixed
counter 3 should always be cleared before the PERF_METRICS.
Implement td_attr for the new metrics events and the new slots fixed
counter. Make them visible to the perf user tools.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200723171117.9918-11-kan.liang@linux.intel.com
The RDPMC base offset of fixed counters is hard-code. Use a meaningful
name to replace the magic number to improve the readability of the code.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200723171117.9918-10-kan.liang@linux.intel.com
Intro
=====
The TopDown Microarchitecture Analysis (TMA) Method is a structured
analysis methodology to identify critical performance bottlenecks in
out-of-order processors. Current perf has supported the method.
The method works well, but there is one problem. To collect the TopDown
events, several GP counters have to be used. If a user wants to collect
other events at the same time, the multiplexing probably be triggered,
which impacts the accuracy.
To free up the scarce GP counters, the hardware TopDown metrics feature
is introduced from Ice Lake. The hardware implements an additional
"metrics" register and a new Fixed Counter 3 that measures pipeline
"slots". The TopDown events can be calculated from them instead.
Events
======
The level 1 TopDown has four metrics. There is no event-code assigned to
the TopDown metrics. Four metric events are exported as separate perf
events, which map to the internal "metrics" counter register. Those
events do not exist in hardware, but can be allocated by the scheduler.
For the event mapping, a special 0x00 event code is used, which is
reserved for fake events. The metric events start from umask 0x10.
When setting up the metric events, they point to the Fixed Counter 3.
They have to be specially handled.
- Add the update_topdown_event() callback to read the additional metrics
MSR and generate the metrics.
- Add the set_topdown_event_period() callback to initialize metrics MSR
and the fixed counter 3.
- Add a variable n_metric_event to track the number of the accepted
metrics events. The sharing between multiple users of the same metric
without multiplexing is not allowed.
- Only enable/disable the fixed counter 3 when there are no other active
TopDown events, which avoid the unnecessary writing of the fixed
control register.
- Disable the PMU when reading the metrics event. The metrics MSR and
the fixed counter 3 are read separately. The values may be modified by
an NMI.
All four metric events don't support sampling. Since they will be
handled specially for event update, a flag PERF_X86_EVENT_TOPDOWN is
introduced to indicate this case.
The slots event can support both sampling and counting.
For counting, the flag is also applied.
For sampling, it will be handled normally as other normal events.
Groups
======
The slots event is required in a Topdown group.
To avoid reading the METRICS register multiple times, the metrics and
slots value can only be updated by slots event in a group.
All active slots and metrics events will be updated one time.
Therefore, the slots event must be before any metric events in a Topdown
group.
NMI
======
The METRICS related register may be overflow. The bit 48 of the STATUS
register will be set. If so, PERF_METRICS and Fixed counter 3 are
required to be reset. The patch also update all active slots and
metrics events in the NMI handler.
The update_topdown_event() has to read two registers separately. The
values may be modified by an NMI. PMU has to be disabled before calling
the function.
RDPMC
======
RDPMC is temporarily disabled. A later patch will enable it.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200723171117.9918-9-kan.liang@linux.intel.com
Currently, the if-else is used in the intel_pmu_disable/enable_event to
check the type of an event. It works well, but with more and more types
added later, e.g., perf metrics, compared to the switch statement, the
if-else may impair the readability of the code.
There is no harm to use the switch statement to replace the if-else
here. Also, some optimizing compilers may compile a switch statement
into a jump-table which is more efficient than if-else for a large
number of cases. The performance gain may not be observed for now,
because the number of cases is only 5, but the benefits may be observed
with more and more types added in the future.
Use switch to replace the if-else in the intel_pmu_disable/enable_event.
If the idx is invalid, print a warning.
For the case INTEL_PMC_IDX_FIXED_BTS in intel_pmu_disable_event, don't
need to check the event->attr.precise_ip. Use return for the case.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200723171117.9918-7-kan.liang@linux.intel.com
Bit 15 of the PERF_CAPABILITIES MSR indicates that the perf METRICS
feature is supported. The perf METRICS is not a PEBS feature.
Rename pebs_metrics_available perf_metrics.
The bit is not used in the current code. It will be used in a later
patch.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200723171117.9918-6-kan.liang@linux.intel.com
The bit 48 in the PERF_GLOBAL_STATUS is used to indicate the overflow
status of the PERF_METRICS counters.
Move the BTS index to the bit 47.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200723171117.9918-5-kan.liang@linux.intel.com
The fourth fixed counter, TOPDOWN.SLOTS, is introduced in Ice Lake to
measure the level 1 TopDown events.
Add MSR address and macros for the new fixed counter, which will be used
in a later patch.
Add comments to explain the event encoding rules for the fixed counters.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200723171117.9918-4-kan.liang@linux.intel.com
Magic numbers are used in the current NMI handler for the global status
bit. Use a meaningful name to replace the magic numbers to improve the
readability of the code.
Remove a Tab for all GLOBAL_STATUS_* and INTEL_PMC_IDX_FIXED_BTS macros
to reduce the length of the line.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200723171117.9918-3-kan.liang@linux.intel.com
The RDPMC index is always re-calculated for the RDPMC userspace support,
which is unnecessary.
The RDPMC index value is stored in the variable event_base_rdpmc for
the kernel usage, which can be used for RDPMC userspace support as well.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200723171117.9918-2-kan.liang@linux.intel.com
Current minimum required version of binutils is 2.23, which supports
XGETBV and XSETBV instruction mnemonics.
Replace the byte-wise specification of XGETBV and XSETBV with these
proper mnemonics.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200707174722.58651-1-ubizjak@gmail.com
See the SDM, volume 3, section 4.4.1:
If PAE paging would be in use following an execution of MOV to CR0 or
MOV to CR4 (see Section 4.1.1) and the instruction is modifying any of
CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP; then
the PDPTEs are loaded from the address in CR3.
Fixes: b9baba8614 ("KVM, pkeys: expose CPUID/CR4 to guest")
Cc: Huaitong Han <huaitong.han@intel.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Message-Id: <20200817181655.3716509-1-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
See the SDM, volume 3, section 4.4.1:
If PAE paging would be in use following an execution of MOV to CR0 or
MOV to CR4 (see Section 4.1.1) and the instruction is modifying any of
CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP; then
the PDPTEs are loaded from the address in CR3.
Fixes: 0be0226f07 ("KVM: MMU: fix SMAP virtualization")
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Message-Id: <20200817181655.3716509-2-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The SERIALIZE instruction gives software a way to force the processor to
complete all modifications to flags, registers and memory from previous
instructions and drain all buffered writes to memory before the next
instruction is fetched and executed. Thus, it serves the purpose of
sync_core(). Use it when available.
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200807032833.17484-1-ricardo.neri-calderon@linux.intel.com
Linux only has support to read total DDR reads and writes. Here we
add support to enable bandwidth breakdown-GT, IA and IO. Breakdown
of BW is important to debug and optimize memory access. This can also
be used for telemetry and improving the system software.The offsets for
GT, IA and IO are added and these free running counters can be accessed
via MMIO space.
The BW breakdown can be measured using the following cmd:
perf stat -e uncore_imc/gt_requests/,uncore_imc/ia_requests/,uncore_imc/io_requests/
30.57 MiB uncore_imc/gt_requests/
1346.13 MiB uncore_imc/ia_requests/
190.97 MiB uncore_imc/io_requests/
5.984572733 seconds time elapsed
BW/s = <gt,ia,io>_requests/time elapsed
Signed-off-by: Vaibhav Shankar <vaibhav.shankar@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200814022234.23605-1-vaibhav.shankar@intel.com
- Fix mitigation state sysfs output
- Fix an FPU xstate/sxave code assumption bug triggered by Architectural LBR support
- Fix Lightning Mountain SoC TSC frequency enumeration bug
- Fix kexec debug output
- Fix kexec memory range assumption bug
- Fix a boundary condition in the crash kernel code
- Optimize porgatory.ro generation a bit
- Enable ACRN guests to use X2APIC mode
- Reduce a __text_poke() IRQs-off critical section for the benefit of PREEMPT_RT
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl83ybgRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iJnQ/+OAkE5hiQ+F1ikQ4rKyjaT6FjvynReNUA
ysQjcCypGB4x+slR8o3k5yrzYJ9WbDfOz7a0uekZtNHvJ80+3yheV5Yvf+Uz3EYM
Jj/OubCNMNnvS5cJMNXs196SGd/ELLWBbCjwUWPsiWJ0ZMTgKmpZz1LgB1QZjhyw
fbAc1WgTLVO+emE5FwBrmFzvgBxn5EtiFoLhegFtACHadNcJLiKpXpiK3NKkEirO
owF1/Qg6mn6MowKDBDkWgmwi0HVYbraqu0hXRrCq9o105CVwgwUdORTwjK3rnUNs
et10Zz2UmSpjXJOhKZdZLFCtYOmrADmS4pnoXF6W6cLLFvkq4b2ducnlFBtNKqMh
ljPkIT04sF99gIKijEYWsru+MgS4qO1VNHtJxkr/ZCUjqahsa1nN9F0lP0QOXjwf
hbK4h1NrML3UiCGAe2hjIh9zY2c8s2Q90PyCvZkKNKquSQ1E011hzcEE2RIoBBYB
mc1d6lgfCFWVkbgRA5sx1CVtgnAvHk2wu9w/8N9XTGjPgiQJRr3I8cNUZw59gaMH
43auWyvpVAA4vdfbKJrPVrTLhTTnQYv0A966l7/i0d8MkGN4u09sAiB3ZevZMEK9
45b7IXWluCi0ikBAmCvQ+qEzhg7pApCziVKuaZ/4j+qPLTDAutGwz7YuaXyOKrUX
Aj/uCev6D6c=
=fvpv
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"Misc fixes and small updates all around the place:
- Fix mitigation state sysfs output
- Fix an FPU xstate/sxave code assumption bug triggered by
Architectural LBR support
- Fix Lightning Mountain SoC TSC frequency enumeration bug
- Fix kexec debug output
- Fix kexec memory range assumption bug
- Fix a boundary condition in the crash kernel code
- Optimize porgatory.ro generation a bit
- Enable ACRN guests to use X2APIC mode
- Reduce a __text_poke() IRQs-off critical section for the benefit of
PREEMPT_RT"
* tag 'x86-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/alternatives: Acquire pte lock with interrupts enabled
x86/bugs/multihit: Fix mitigation reporting when VMX is not in use
x86/fpu/xstate: Fix an xstate size check warning with architectural LBRs
x86/purgatory: Don't generate debug info for purgatory.ro
x86/tsr: Fix tsc frequency enumeration bug on Lightning Mountain SoC
kexec_file: Correctly output debugging information for the PT_LOAD ELF header
kexec: Improve & fix crash_exclude_mem_range() to handle overlapping ranges
x86/crash: Correct the address boundary of function parameters
x86/acrn: Remove redundant chars from ACRN signature
x86/acrn: Allow ACRN guest to use X2APIC mode
plus a RAPL HW-enablement for Intel SPR platforms.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl83xBQRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gb9RAArM0jJemRPHv1a/xLhrRo/cKURrOWpNl0
OQtgppEv9axkavYL34eyoax4LTDCFXxE+NDClSC16abFEVPNriGODNE+CMbFgMbW
AyDfP+AsDdNExwl+JWR+J37KIpEIzWLqtjzEjVxZqsuov3C+EaLU4gv947UFohxM
QE93d8q3znBSdMjeC/aZyL8iX4aCB0oMjrP7BMXo9a61/oseKLnvE8Zu/ESFDe1S
TYZ+VlCxyaZOUBkEyd8+h/CBL8kOvQ2ObBEBxmyQQdGuRZ20BcJRodk3g+mOdnHJ
zeohRcXvIHskHTEVeQv+Eh4EitFT3bEFrbk0LwMhKubIhFTKIB42sAzyeC6iUGc/
O5+Qe+bn3kYMynMHNo1yfh0s0S3cU3cfBnC1I2A/NyAn49H0UPr+rjynuKHtCA1+
S36Q9BydZegU/jyhbbDs+h/cdOiKY2F3MPEAZg3u/7EM+NIrmvuQoA7+C33fmLA+
tZzpeDpqNKz65JgYDQ2sZdghyVp41KTogeTm6Xu5O3sLhCnATiyqL2z2LCoWj+yZ
KuZ+zHtV8ajRwt1bhq7qFUIyQLsHHUlz5z7TiUC7qqB48LpxO7LiTZ7CxUDY432N
Xz8QPD/D71HAWmbkAXUih+JXG0nQSdlF6Xpwquczqc/8odJ46xdQ+i5wIgBOcudP
A+kEXRqz5rA=
=NsxB
-----END PGP SIGNATURE-----
Merge tag 'perf-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"Misc fixes, an expansion of perf syscall access to CAP_PERFMON
privileged tools, plus a RAPL HW-enablement for Intel SPR platforms"
* tag 'perf-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/rapl: Add support for Intel SPR platform
perf/x86/rapl: Support multiple RAPL unit quirks
perf/x86/rapl: Fix missing psys sysfs attributes
hw_breakpoint: Remove unused __register_perf_hw_breakpoint() declaration
kprobes: Remove show_registers() function prototype
perf/core: Take over CAP_SYS_PTRACE creds to CAP_PERFMON capability
The comment explaining why 4-level systems only need to allocate on
the P4D level caused some confustion. Update it to better explain why
on 4-level systems the allocation on PUD level is necessary.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200814151947.26229-3-joro@8bytes.org
Remove the code to sync the vmalloc and ioremap ranges for x86-64. The
page-table pages are all pre-allocated so that synchronization is
no longer necessary.
This is a patch that already went into the kernel as:
commit 8bb9bf242d ("x86/mm/64: Do not sync vmalloc/ioremap mappings")
But it had to be reverted later because it unveiled a bug from:
commit 6eb82f9940 ("x86/mm: Pre-allocate P4D/PUD pages for vmalloc area")
The bug in that commit causes the P4D/PUD pages not to be correctly
allocated, making the synchronization still necessary. That issue got
fixed meanwhile upstream:
commit 995909a4e2 ("x86/mm/64: Do not dereference non-present PGD entries")
With that fix it is safe again to remove the page-table synchronization
for vmalloc/ioremap ranges on x86-64.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200814151947.26229-2-joro@8bytes.org
pte_clear() et al are based on two paravirt steps today: one step to
create a page table entry with all zeroes, and one step to write this
entry value.
Drop the first step as it is completely useless.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200815100641.26362-7-jgross@suse.com
On x86 set_pte_at() is now always falling back to set_pte(). So instead
of having this fallback after the paravirt maze just drop the
set_pte_at paravirt operation and let set_pte_at() use the set_pte()
function directly.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200815100641.26362-6-jgross@suse.com
With 32-bit Xen PV support gone, the following commit is not needed
anymore:
a4c0e91d1d ("x86/entry/32: Fix XEN_PV build dependency")
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200815100641.26362-5-jgross@suse.com
There are some code parts using CONFIG_PARAVIRT for Xen pvops related
issues instead of the more stringent CONFIG_PARAVIRT_XXL.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200815100641.26362-4-jgross@suse.com
The last 32-bit user of stuff under CONFIG_PARAVIRT_XXL is gone.
Remove 32-bit specific parts.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200815100641.26362-2-jgross@suse.com
Since commit 61a47c1ad3 ("sysctl: Remove the sysctl system call"),
sys_sysctl is actually unavailable: any input can only return an error.
We have been warning about people using the sysctl system call for years
and believe there are no more users. Even if there are users of this
interface if they have not complained or fixed their code by now they
probably are not going to, so there is no point in warning them any
longer.
So completely remove sys_sysctl on all architectures.
[nixiaoming@huawei.com: s390: fix build error for sys_call_table_emu]
Link: http://lkml.kernel.org/r/20200618141426.16884-1-nixiaoming@huawei.com
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Will Deacon <will@kernel.org> [arm/arm64]
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Bin Meng <bin.meng@windriver.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: chenzefeng <chenzefeng2@huawei.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Howells <dhowells@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Diego Elio Pettenò <flameeyes@flameeyes.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kars de Jong <jongk@linux-m68k.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Olof Johansson <olof@lixom.net>
Cc: Paul Burton <paulburton@kernel.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Sven Schnelle <svens@stackframe.org>
Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Zhou Yanjie <zhouyanjie@wanyeetech.com>
Link: http://lkml.kernel.org/r/20200616030734.87257-1-nixiaoming@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Preparatory work to allow S390 to switch over to the generic VDSO
implementation.
S390 requires that the VDSO data pointer is handed in to the counter
read function when time namespace support is enabled. Adding the pointer
is a NOOP for all other architectures because the compiler is supposed
to optimize that out when it is unused in the architecture specific
inline. The change also solved a similar problem for MIPS which
fortunately has time namespaces not yet enabled.
S390 needs to update clock related VDSO data independent of the
timekeeping updates. This was solved so far with yet another sequence
counter in the S390 implementation. A better solution is to utilize the
already existing VDSO sequence count for this. The core code now exposes
helper functions which allow to serialize against the timekeeper code
and against concurrent readers.
S390 needs extra data for their clock readout function. The initial
common VDSO data structure did not provide a way to add that. It now has
an embedded architecture specific struct embedded which defaults to an
empty struct.
Doing this now avoids tree dependencies and conflicts post rc1 and
allows all other architectures which work on generic VDSO support to
work from a common upstream base.
- A trivial comment fix.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl82tGYTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoRkKD/9YEYlYPQ4omRNVNIJRnalBH6OB/GOk
jTJ4RCvNP2ew6XtgEz5Yg1VqxrmJP4MLNCnMr7mQulfezUmslK0uJMlqZC4dgYth
PUhliLyFi5PK+CKaY+2NFlZMAoE53YlJ2FVPq114FUW4ASVbucDPXpmhO22cc2Iu
0RD3z9/+vQmA8lUqI6wPIFTC+euN+2kbkeZjt7BlkBAdiRBga5UnarFzetq0nWyc
kcprQ2qZfGLYzRY6dRuvNLz27Ta7SAlVGOGUDpWr9MISLDFQzHwhVATDNFW3hLGT
Fr5xNqStUVxxTzYkfCj/Podez0aR3por8bm9SoWxZn7oeLdLgTsDwn2pY0J0PjyB
wWz9lmqT1vzrHEfQH1YhHvycowl6azue9rT2ERWwZTdbADEwu6Zr8ufv2XHcMu0J
dyzSYa81cQrTeAwwdNjODs+QCTX+0G6u86AU2Xg+YgqkAywcAMvzcff/9D62hfv2
5BSz+0OeitQCnSvHILUPw4XT/2rNZfhlcmc4tkzoBFewzDsMEqWT19p+GgqcRNiU
5Jl4kGnaeHjP0e5Vn/ZJurKaF3YEJwgjkohDORloaqo0AXiYo1ANhDlKvSRu5hnU
GDIWOVu8ATXwkjMFcLQz7O5/J1MqJCkleIjSCDjLDhhMbLY/nR9L3QS9jbqiVVRN
nTZlSMF6HeQmew==
=y8Z5
-----END PGP SIGNATURE-----
Merge tag 'timers-urgent-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timekeeping updates from Thomas Gleixner:
"A set of timekeeping/VDSO updates:
- Preparatory work to allow S390 to switch over to the generic VDSO
implementation.
S390 requires that the VDSO data pointer is handed in to the
counter read function when time namespace support is enabled.
Adding the pointer is a NOOP for all other architectures because
the compiler is supposed to optimize that out when it is unused in
the architecture specific inline. The change also solved a similar
problem for MIPS which fortunately has time namespaces not yet
enabled.
S390 needs to update clock related VDSO data independent of the
timekeeping updates. This was solved so far with yet another
sequence counter in the S390 implementation. A better solution is
to utilize the already existing VDSO sequence count for this. The
core code now exposes helper functions which allow to serialize
against the timekeeper code and against concurrent readers.
S390 needs extra data for their clock readout function. The initial
common VDSO data structure did not provide a way to add that. It
now has an embedded architecture specific struct embedded which
defaults to an empty struct.
Doing this now avoids tree dependencies and conflicts post rc1 and
allows all other architectures which work on generic VDSO support
to work from a common upstream base.
- A trivial comment fix"
* tag 'timers-urgent-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time: Delete repeated words in comments
lib/vdso: Allow to add architecture-specific vdso data
timekeeping/vsyscall: Provide vdso_update_begin/end()
vdso/treewide: Add vdso_data pointer argument to __arch_get_hw_counter()
posix CPU timers into task work context. The tick interrupt is reduced to a
quick check which queues the work which is doing the heavy lifting before
returning to user space or going back to guest mode. Moving this out is
deferring the signal delivery slightly but posix CPU timers are inaccurate
by nature as they depend on the tick so there is no real damage. The
relevant test cases all passed.
This lifts the last offender for RT out of the hard interrupt context tick
handler, but it also has the general benefit that the actual heavy work is
accounted to the task/process and not to the tick interrupt itself.
Further optimizations are possible to break long sighand lock hold and
interrupt disabled (on !RT kernels) times when a massive amount of posix
CPU timers (which are unpriviledged) is armed for a task/process.
This is currently only enabled for x86 because the architecture has to
ensure that task work is handled in KVM before entering a guest, which was
just established for x86 with the new common entry/exit code which got
merged post 5.8 and is not the case for other KVM architectures.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl82sRkTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoUs2D/9IZuALnVXtnvsOQh5uMRpxr/I6tpQm
KJSRkcSSne9rIV3dQlswDdaT7bGibd7pbKQOnlA0vc37vDwaJHEzmTOJGpHpHnMA
fHH2QP3LL2oZ1d7DG6eNJESCmaFBcaYXNbKtluOWQzHQhd9P8yHb4N+kzfxHK0Fr
uNd+cd6T658xPsNOLaLP3MG2Yz0rVt2F5c1v8n78NfibeKckYhPov8cwVrf2WGWr
XFHKorx4lXZ+vFwKEeZ7qQtqvAsLDixgMkFfY2GGSPhd1AMAaIUICZgsdEj2gg7H
YK+lwA0uoqPaXshOCmdkCLkfPA7BRmAySWE7jUPbIvRqM94Uapk9+4CqjgiH1Qs+
T8CWbcZk8tZACFrouhZkhrnjUTev/vE7oirsjn26DRY68/Ec7llpCOjvVA7HZWqN
vJ/BN35IufA7WEkf2TWNv5mg1zIlHI0O17zDifFq4g2VKFDVvQB0QYWlvug/eAu9
zYNX3WwA/IP8C9EOHZt54e6AKH8F3dT04oLFUkmRIcVKv1SEbdFufVfV7RavPEwK
P21JNXPDdd0aLUO7ksqyQN7pyR3puGXSCb5NAPtZY6UWSMN4G/3SVry3mJa/0BJd
mn+uYGpo9vmceh90vAHBoGIena/pez/PyRLWgGeT9jMjk95rNY0sEhaLEAOF9AR5
ck+3K2rY0S3wwQ==
=Reot
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull more timer updates from Thomas Gleixner:
"A set of posix CPU timer changes which allows to defer the heavy work
of posix CPU timers into task work context. The tick interrupt is
reduced to a quick check which queues the work which is doing the
heavy lifting before returning to user space or going back to guest
mode. Moving this out is deferring the signal delivery slightly but
posix CPU timers are inaccurate by nature as they depend on the tick
so there is no real damage. The relevant test cases all passed.
This lifts the last offender for RT out of the hard interrupt context
tick handler, but it also has the general benefit that the actual
heavy work is accounted to the task/process and not to the tick
interrupt itself.
Further optimizations are possible to break long sighand lock hold and
interrupt disabled (on !RT kernels) times when a massive amount of
posix CPU timers (which are unpriviledged) is armed for a
task/process.
This is currently only enabled for x86 because the architecture has to
ensure that task work is handled in KVM before entering a guest, which
was just established for x86 with the new common entry/exit code which
got merged post 5.8 and is not the case for other KVM architectures"
* tag 'timers-core-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Select POSIX_CPU_TIMERS_TASK_WORK
posix-cpu-timers: Provide mechanisms to defer timer handling to task_work
posix-cpu-timers: Split run_posix_cpu_timers()
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCXzaSXAAKCRCAXGG7T9hj
vuSEAP4qOIv7Hr1wMJfTsN7ZoNNr/K6ph8ADcjFm9RGikn8MawD8CU/OfcFKJFwl
UVwM1HPnRG6pvCI9bmHS4WYrIBYBVw0=
=Bi6R
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.9-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull more xen updates from Juergen Gross:
- Remove support for running as 32-bit Xen PV-guest.
32-bit PV guests are rarely used, are lacking security fixes for
Meltdown, and can be easily replaced by PVH mode. Another series for
doing more cleanup will follow soon (removal of 32-bit-only pvops
functionality).
- Fixes and additional features for the Xen display frontend driver.
* tag 'for-linus-5.9-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
drm/xen-front: Pass dumb buffer data offset to the backend
xen: Sync up with the canonical protocol definition in Xen
drm/xen-front: Add YUYV to supported formats
drm/xen-front: Fix misused IS_ERR_OR_NULL checks
xen/gntdev: Fix dmabuf import with non-zero sgt offset
x86/xen: drop tests for highmem in pv code
x86/xen: eliminate xen-asm_64.S
x86/xen: remove 32-bit Xen PV guest support
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAl82Y6cTHHdlaS5saXVA
a2VybmVsLm9yZwAKCRB2FHBfkEGgXlcxCACP21ZI7RZvQBcFtTj5MWa0uwoofFqF
JDG0MvZ5zFKIJFX0pwlZIUrZY5aVJ1NwCDgCI0EXbZEazTaNCD2knFPqrLe3WUFY
mSDF9df7oW9UvTe9L4g3rYAdqsrkbgqhBypm9Vpbcazg/Ki6QVCgAhIo1lbq62+m
J2/0kLO1lVY6opr6vyobaWbm/Y4b0fbrx7N6KwUDhZUYGLGKaOc+WvsZinNl4XW6
VPiEVQUApvVxwG43rLNXjPe83DtassJ2GevSS1whXnZ+K0bViWhyYicbqEl9iV1i
nlNIkEMX5A1rdwV1zEAGyY/zWi+fi2+IdKGGEbtyUsely1vHtZuaDCiQ
=DE2Y
-----END PGP SIGNATURE-----
Merge tag 'hyperv-fixes-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyper-v fixes from Wei Liu:
- fix oops reporting on Hyper-V
- make objtool happy
* tag 'hyperv-fixes-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
x86/hyperv: Make hv_setup_sched_clock inline
Drivers: hv: vmbus: Only notify Hyper-V for die events that are oops
syzbot found its way in 86_fsgsbase_read_task() and triggered this oops:
KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
CPU: 0 PID: 6866 Comm: syz-executor262 Not tainted 5.8.0-syzkaller #0
RIP: 0010:x86_fsgsbase_read_task+0x16d/0x310 arch/x86/kernel/process_64.c:393
Call Trace:
putreg32+0x3ab/0x530 arch/x86/kernel/ptrace.c:876
genregs32_set arch/x86/kernel/ptrace.c:1026 [inline]
genregs32_set+0xa4/0x100 arch/x86/kernel/ptrace.c:1006
copy_regset_from_user include/linux/regset.h:326 [inline]
ia32_arch_ptrace arch/x86/kernel/ptrace.c:1061 [inline]
compat_arch_ptrace+0x36c/0xd90 arch/x86/kernel/ptrace.c:1198
__do_compat_sys_ptrace kernel/ptrace.c:1420 [inline]
__se_compat_sys_ptrace kernel/ptrace.c:1389 [inline]
__ia32_compat_sys_ptrace+0x220/0x2f0 kernel/ptrace.c:1389
do_syscall_32_irqs_on arch/x86/entry/common.c:84 [inline]
__do_fast_syscall_32+0x57/0x80 arch/x86/entry/common.c:126
do_fast_syscall_32+0x2f/0x70 arch/x86/entry/common.c:149
entry_SYSENTER_compat_after_hwframe+0x4d/0x5c
This can happen if ptrace() or sigreturn() pokes an LDT selector into FS
or GS for a task with no LDT and something tries to read the base before
a return to usermode notices the bad selector and fixes it.
The fix is to make sure ldt pointer is not NULL.
Fixes: 07e1d88ada ("x86/fsgsbase/64: Fix ptrace() to read the FS/GS base accurately")
Co-developed-by: Jann Horn <jannh@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Chang S. Bae <chang.seok.bae@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Markus T Metzger <markus.t.metzger@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Shankar <ravi.v.shankar@intel.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a linker script check that there are no run-time relocations, and
remove the old one that tries to check via looking for specially-named
sections in the object files.
Drop the tests for -fPIE compiler option and -pie linker option, as they
are available in all supported gcc and binutils versions (as well as
clang and lld).
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Fangrui Song <maskray@google.com>
Reviewed-by: Sedat Dilek <sedat.dilek@gmail.com>
Link: https://lore.kernel.org/r/20200731230820.1742553-8-keescook@chromium.org
The BFD linker generates run-time relocations for z_input_len and
z_output_len, even though they are absolute symbols.
This is fixed for binutils-2.35 [1]. Work around this for earlier
versions by defining two variables input_len and output_len in addition
to the symbols, and use them via position-independent references.
This eliminates the last two run-time relocations in the head code and
allows us to drop the -z noreloc-overflow flag to the linker.
Move the -pie and --no-dynamic-linker LDFLAGS to LDFLAGS_vmlinux instead
of KBUILD_LDFLAGS. There shouldn't be anything else getting linked, but
this is the more logical location for these flags, and modversions might
call the linker if an EXPORT_SYMBOL is left over accidentally in one of
the decompressors.
[1] https://sourceware.org/bugzilla/show_bug.cgi?id=25754
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Fangrui Song <maskray@google.com>
Link: https://lore.kernel.org/r/20200731230820.1742553-7-keescook@chromium.org
The assembly code in head_{32,64}.S, while meant to be
position-independent, generates run-time relocations because it uses
instructions such as:
leal gdt(%edx), %eax
which make the assembler and linker think that the code is using %edx as
an index into gdt, and hence gdt needs to be relocated to its run-time
address.
On 32-bit, with lld Dmitry Golovin reports that this results in a
link-time error with default options (i.e. unless -z notext is
explicitly passed):
LD arch/x86/boot/compressed/vmlinux
ld.lld: error: can't create dynamic relocation R_386_32 against local
symbol in readonly segment; recompile object files with -fPIC or pass
'-Wl,-z,notext' to allow text relocations in the output
With the BFD linker, this generates a warning during the build, if
--warn-shared-textrel is enabled, which at least Gentoo enables by
default:
LD arch/x86/boot/compressed/vmlinux
ld: arch/x86/boot/compressed/head_32.o: warning: relocation in read-only section `.head.text'
ld: warning: creating a DT_TEXTREL in object
On 64-bit, it is not possible to link the kernel as -pie with lld, and
it is only possible with a BFD linker that supports -z noreloc-overflow,
i.e. versions >2.26. This is because these instructions cannot really be
relocated: the displacement field is only 32-bits wide, and thus cannot
be relocated for a 64-bit load address. The -z noreloc-overflow option
simply overrides the linker error, and results in R_X86_64_RELATIVE
relocations that apply a 64-bit relocation to a 32-bit field anyway.
This happens to work because nothing will process these run-time
relocations.
Start fixing this by removing relocations from .head.text:
- On 32-bit, use a base register that holds the address of the GOT and
reference symbol addresses using @GOTOFF, i.e.
leal gdt@GOTOFF(%edx), %eax
- On 64-bit, most of the code can (and already does) use %rip-relative
addressing, however the .code32 bits can't, and the 64-bit code also
needs to reference symbol addresses as they will be after moving the
compressed kernel to the end of the decompression buffer.
For these cases, reference the symbols as an offset to startup_32 to
avoid creating relocations, i.e.:
leal (gdt-startup_32)(%bp), %eax
This only works in .head.text as the subtraction cannot be represented
as a PC-relative relocation unless startup_32 is in the same section
as the code. Move efi32_pe_entry into .head.text so that it can use
the same method to avoid relocations.
Reported-by: Dmitry Golovin <dima@golovin.in>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Fangrui Song <maskray@google.com>
Link: https://lore.kernel.org/r/20200731230820.1742553-6-keescook@chromium.org
GCC puts the main function into .text.startup when compiled with -Os (or
-O2). This results in arch/x86/boot/main.c having a .text.startup
section which is currently not included explicitly in the linker script
setup.ld in the same directory.
The BFD linker places this orphan section immediately after .text, so
this still works. However, LLD git, since [1], is choosing to place it
immediately after the .bstext section instead (this is the first code
section). This plays havoc with the section layout that setup.elf
requires to create the setup header, for eg on 64-bit:
LD arch/x86/boot/setup.elf
ld.lld: error: section .text.startup file range overlaps with .header
>>> .text.startup range is [0x200040, 0x2001FE]
>>> .header range is [0x2001EF, 0x20026B]
ld.lld: error: section .header file range overlaps with .bsdata
>>> .header range is [0x2001EF, 0x20026B]
>>> .bsdata range is [0x2001FF, 0x200398]
ld.lld: error: section .bsdata file range overlaps with .entrytext
>>> .bsdata range is [0x2001FF, 0x200398]
>>> .entrytext range is [0x20026C, 0x2002D3]
ld.lld: error: section .text.startup virtual address range overlaps
with .header
>>> .text.startup range is [0x40, 0x1FE]
>>> .header range is [0x1EF, 0x26B]
ld.lld: error: section .header virtual address range overlaps with
.bsdata
>>> .header range is [0x1EF, 0x26B]
>>> .bsdata range is [0x1FF, 0x398]
ld.lld: error: section .bsdata virtual address range overlaps with
.entrytext
>>> .bsdata range is [0x1FF, 0x398]
>>> .entrytext range is [0x26C, 0x2D3]
ld.lld: error: section .text.startup load address range overlaps with
.header
>>> .text.startup range is [0x40, 0x1FE]
>>> .header range is [0x1EF, 0x26B]
ld.lld: error: section .header load address range overlaps with
.bsdata
>>> .header range is [0x1EF, 0x26B]
>>> .bsdata range is [0x1FF, 0x398]
ld.lld: error: section .bsdata load address range overlaps with
.entrytext
>>> .bsdata range is [0x1FF, 0x398]
>>> .entrytext range is [0x26C, 0x2D3]
Add .text.* to the .text output section to fix this, and also prevent
any future surprises if the compiler decides to create other such
sections.
[1] https://reviews.llvm.org/D75225
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Fangrui Song <maskray@google.com>
Link: https://lore.kernel.org/r/20200731230820.1742553-5-keescook@chromium.org
In a previous patch, we have eliminated GOT entries from the decompressor
binary and added an assertion that the .got section is empty. This means
that the GOT fixup routines that exist in both the 32-bit and 64-bit
startup routines have become dead code, and can be removed.
While at it, drop the KEEP() from the linker script, as it has no effect
on the contents of output sections that are created by the linker itself.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Arvind Sankar <nivedita@alum.mit.edu>
Link: https://lore.kernel.org/r/20200731230820.1742553-4-keescook@chromium.org
Eliminate all GOT entries in the decompressor binary, by forcing hidden
visibility for all symbol references, which informs the compiler that
such references will be resolved at link time without the need for
allocating GOT entries.
To ensure that no GOT entries will creep back in, add an assertion to
the decompressor linker script that will fire if the .got section has
a non-zero size.
[Arvind: move hidden.h to include/linux instead of making a copy]
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Arvind Sankar <nivedita@alum.mit.edu>
Link: https://lore.kernel.org/r/20200731230820.1742553-3-keescook@chromium.org
The .got.plt section contains the part of the GOT which is used by PLT
entries, and which gets updated lazily by the dynamic loader when
function calls are dispatched through those PLT entries.
On fully linked binaries such as the kernel proper or the decompressor,
this never happens, and so in practice, the .got.plt section consists
only of the first 3 magic entries that are meant to point at the _DYNAMIC
section and at the fixup routine in the loader. However, since we don't
use a dynamic loader, those entries are never populated or used.
This means that treating those entries like ordinary GOT entries, and
updating their values based on the actual placement of the executable in
memory is completely pointless, and we can just ignore the .got.plt
section entirely, provided that it has no additional entries beyond
the first 3 ones.
So add an assertion in the linker script to ensure that this assumption
holds, and move the contents out of the [_got, _egot) memory range that
is modified by the GOT fixup routines.
While at it, drop the KEEP(), since it has no effect on the contents
of output sections that are created by the linker itself.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Arvind Sankar <nivedita@alum.mit.edu>
Link: https://lore.kernel.org/r/20200731230820.1742553-2-keescook@chromium.org
Intel SPR platform uses fixed 16 bit energy unit for DRAM RAPL domain,
and fixed 0 bit energy unit for Psys RAPL domain.
After this, on SPR platform the energy counters appear in perf list.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Acked-by: Len Brown <len.brown@intel.com>
Link: https://lore.kernel.org/r/20200811153149.12242-4-rui.zhang@intel.com
There will be more platforms with different fixed energy units.
Enhance the code to support different RAPL unit quirks for different
platforms.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Link: https://lore.kernel.org/r/20200811153149.12242-3-rui.zhang@intel.com
This fixes a problem introduced by commit:
5fb5273a90 ("perf/x86/rapl: Use new MSR detection interface")
that perf event sysfs attributes for psys RAPL domain are missing.
Fixes: 5fb5273a90 ("perf/x86/rapl: Use new MSR detection interface")
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Link: https://lore.kernel.org/r/20200811153149.12242-2-rui.zhang@intel.com
pte lock is never acquired in-IRQ context so it does not require interrupts
to be disabled. The lock is a regular spinlock which cannot be acquired
with interrupts disabled on RT.
RT complains about pte_lock() in __text_poke() because it's invoked after
disabling interrupts.
__text_poke() has to disable interrupts as use_temporary_mm() expects
interrupts to be off because it invokes switch_mm_irqs_off() and uses
per-CPU (current active mm) data.
Move the PTE lock handling outside the interrupt disabled region.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by; Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20200813105026.bvugytmsso6muljw@linutronix.de
* Improvements and bugfixes for secure VM support, giving reduced startup
time and memory hotplug support.
* Locking fixes in nested KVM code
* Increase number of guests supported by HV KVM to 4094
* Preliminary POWER10 support
ARM:
* Split the VHE and nVHE hypervisor code bases, build the EL2 code
separately, allowing for the VHE code to now be built with instrumentation
* Level-based TLB invalidation support
* Restructure of the vcpu register storage to accomodate the NV code
* Pointer Authentication available for guests on nVHE hosts
* Simplification of the system register table parsing
* MMU cleanups and fixes
* A number of post-32bit cleanups and other fixes
MIPS:
* compilation fixes
x86:
* bugfixes
* support for the SERIALIZE instruction
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl8yfuQUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroNweQgAiEycRbpifAueihK3ScKwYcCFhbHg
n6KLiFCY3sJRg+ORNb9EuFPJgGygV8DPKbEMvKaGDhNpX3rOpSIrpi5QQ5Hx+WOj
WHg+aX8Eyy1ys7V84UbiMeZKUbKDDRr0/UOUtJEsF4hiD7s0FgobbQhC/3+awp5k
sdSTMYlXelep+pjdFX7cNIgjrBNFtqH0ECeuDCcQzDg2zlH+poEPyLaC5+U4RF6r
pfvcxd6xp50fobo8ro7kMuBeclG3JxLjqqdNrkkHcF1DxROMLLKN7CjHZchYC/BK
c+S7JHLFnafxiTncMLhv3s4viey05mohW6SxeLw4qcWHfFlz+qyfZwMvZA==
=d/GI
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull more KVM updates from Paolo Bonzini:
"PPC:
- Improvements and bugfixes for secure VM support, giving reduced
startup time and memory hotplug support.
- Locking fixes in nested KVM code
- Increase number of guests supported by HV KVM to 4094
- Preliminary POWER10 support
ARM:
- Split the VHE and nVHE hypervisor code bases, build the EL2 code
separately, allowing for the VHE code to now be built with
instrumentation
- Level-based TLB invalidation support
- Restructure of the vcpu register storage to accomodate the NV code
- Pointer Authentication available for guests on nVHE hosts
- Simplification of the system register table parsing
- MMU cleanups and fixes
- A number of post-32bit cleanups and other fixes
MIPS:
- compilation fixes
x86:
- bugfixes
- support for the SERIALIZE instruction"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (70 commits)
KVM: MIPS/VZ: Fix build error caused by 'kvm_run' cleanup
x86/kvm/hyper-v: Synic default SCONTROL MSR needs to be enabled
MIPS: KVM: Convert a fallthrough comment to fallthrough
MIPS: VZ: Only include loongson_regs.h for CPU_LOONGSON64
x86: Expose SERIALIZE for supported cpuid
KVM: x86: Don't attempt to load PDPTRs when 64-bit mode is enabled
KVM: arm64: Move S1PTW S2 fault logic out of io_mem_abort()
KVM: arm64: Don't skip cache maintenance for read-only memslots
KVM: arm64: Handle data and instruction external aborts the same way
KVM: arm64: Rename kvm_vcpu_dabt_isextabt()
KVM: arm: Add trace name for ARM_NISV
KVM: arm64: Ensure that all nVHE hyp code is in .hyp.text
KVM: arm64: Substitute RANDOMIZE_BASE for HARDEN_EL2_VECTORS
KVM: arm64: Make nVHE ASLR conditional on RANDOMIZE_BASE
KVM: PPC: Book3S HV: Rework secure mem slot dropping
KVM: PPC: Book3S HV: Move kvmppc_svm_page_out up
KVM: PPC: Book3S HV: Migrate hot plugged memory
KVM: PPC: Book3S HV: In H_SVM_INIT_DONE, migrate remaining normal-GFNs to secure-GFNs
KVM: PPC: Book3S HV: Track the state GFNs associated with secure VMs
KVM: PPC: Book3S HV: Disable page merging in H_SVM_INIT_START
...
Merge more updates from Andrew Morton:
- most of the rest of MM (memcg, hugetlb, vmscan, proc, compaction,
mempolicy, oom-kill, hugetlbfs, migration, thp, cma, util,
memory-hotplug, cleanups, uaccess, migration, gup, pagemap),
- various other subsystems (alpha, misc, sparse, bitmap, lib, bitops,
checkpatch, autofs, minix, nilfs, ufs, fat, signals, kmod, coredump,
exec, kdump, rapidio, panic, kcov, kgdb, ipc).
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (164 commits)
mm/gup: remove task_struct pointer for all gup code
mm: clean up the last pieces of page fault accountings
mm/xtensa: use general page fault accounting
mm/x86: use general page fault accounting
mm/sparc64: use general page fault accounting
mm/sparc32: use general page fault accounting
mm/sh: use general page fault accounting
mm/s390: use general page fault accounting
mm/riscv: use general page fault accounting
mm/powerpc: use general page fault accounting
mm/parisc: use general page fault accounting
mm/openrisc: use general page fault accounting
mm/nios2: use general page fault accounting
mm/nds32: use general page fault accounting
mm/mips: use general page fault accounting
mm/microblaze: use general page fault accounting
mm/m68k: use general page fault accounting
mm/ia64: use general page fault accounting
mm/hexagon: use general page fault accounting
mm/csky: use general page fault accounting
...
Use the general page fault accounting by passing regs into
handle_mm_fault().
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Link: http://lkml.kernel.org/r/20200707225021.200906-23-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: Page fault accounting cleanups", v5.
This is v5 of the pf accounting cleanup series. It originates from Gerald
Schaefer's report on an issue a week ago regarding to incorrect page fault
accountings for retried page fault after commit 4064b98270 ("mm: allow
VM_FAULT_RETRY for multiple times"):
https://lore.kernel.org/lkml/20200610174811.44b94525@thinkpad/
What this series did:
- Correct page fault accounting: we do accounting for a page fault
(no matter whether it's from #PF handling, or gup, or anything else)
only with the one that completed the fault. For example, page fault
retries should not be counted in page fault counters. Same to the
perf events.
- Unify definition of PERF_COUNT_SW_PAGE_FAULTS: currently this perf
event is used in an adhoc way across different archs.
Case (1): for many archs it's done at the entry of a page fault
handler, so that it will also cover e.g. errornous faults.
Case (2): for some other archs, it is only accounted when the page
fault is resolved successfully.
Case (3): there're still quite some archs that have not enabled
this perf event.
Since this series will touch merely all the archs, we unify this
perf event to always follow case (1), which is the one that makes most
sense. And since we moved the accounting into handle_mm_fault, the
other two MAJ/MIN perf events are well taken care of naturally.
- Unify definition of "major faults": the definition of "major
fault" is slightly changed when used in accounting (not
VM_FAULT_MAJOR). More information in patch 1.
- Always account the page fault onto the one that triggered the page
fault. This does not matter much for #PF handlings, but mostly for
gup. More information on this in patch 25.
Patchset layout:
Patch 1: Introduced the accounting in handle_mm_fault(), not enabled.
Patch 2-23: Enable the new accounting for arch #PF handlers one by one.
Patch 24: Enable the new accounting for the rest outliers (gup, iommu, etc.)
Patch 25: Cleanup GUP task_struct pointer since it's not needed any more
This patch (of 25):
This is a preparation patch to move page fault accountings into the
general code in handle_mm_fault(). This includes both the per task
flt_maj/flt_min counters, and the major/minor page fault perf events. To
do this, the pt_regs pointer is passed into handle_mm_fault().
PERF_COUNT_SW_PAGE_FAULTS should still be kept in per-arch page fault
handlers.
So far, all the pt_regs pointer that passed into handle_mm_fault() is
NULL, which means this patch should have no intented functional change.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200707225021.200906-1-peterx@redhat.com
Link: http://lkml.kernel.org/r/20200707225021.200906-2-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
segment_eq is only used to implement uaccess_kernel. Just open code
uaccess_kernel in the arch uaccess headers and remove one layer of
indirection.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Greentime Hu <green.hu@gmail.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Link: http://lkml.kernel.org/r/20200710135706.537715-5-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is to introduce a general dummy helper. memory_add_physaddr_to_nid()
is a fallback option to get the nid in case NUMA_NO_NID is detected.
After this patch, arm64/sh/s390 can simply use the general dummy version.
PowerPC/x86/ia64 will still use their specific version.
This is the preparation to set a fallback value for dev_dax->target_node.
Signed-off-by: Jia He <justin.he@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Chuhong Yuan <hslester96@gmail.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Cc: Kaly Xin <Kaly.Xin@arm.com>
Link: http://lkml.kernel.org/r/20200710031619.18762-2-justin.he@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some of our servers spend significant time at kernel boot initializing
memory block sysfs directories and then creating symlinks between them and
the corresponding nodes. The slowness happens because the machines get
stuck with the smallest supported memory block size on x86 (128M), which
results in 16,288 directories to cover the 2T of installed RAM. The
search for each memory block is noticeable even with commit 4fb6eabf10
("drivers/base/memory.c: cache memory blocks in xarray to accelerate
lookup").
Commit 078eb6aa50 ("x86/mm/memory_hotplug: determine block size based on
the end of boot memory") chooses the block size based on alignment with
memory end. That addresses hotplug failures in qemu guests, but for bare
metal systems whose memory end isn't aligned to even the smallest size, it
leaves them at 128M.
Make kernels that aren't running on a hypervisor use the largest supported
size (2G) to minimize overhead on big machines. Kernel boot goes 7%
faster on the aforementioned servers, shaving off half a second.
[daniel.m.jordan@oracle.com: v3]
Link: http://lkml.kernel.org/r/20200714205450.945834-1-daniel.m.jordan@oracle.com
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200609225451.3542648-1-daniel.m.jordan@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
IRQ bypass support for vdpa and IFC
MLX5 vdpa driver
Endian-ness fixes for virtio drivers
Misc other fixes
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAl8yVEwPHG1zdEByZWRo
YXQuY29tAAoJECgfDbjSjVRpNPEH/0Dtq1s1V4r/kxtLUoMophv9wuORpWCr98BQ
2aOveTmwTOVdZVOiw2tzTgO9nbWx+cL2HvkU7Aajfpz5hh93Z2VOo2n4a7hBC79f
rlc3GXiG+pMk5RfmqGofIHTU+D6ony4D5SXlUDurLdtEwunyuqZwABiWkZjdclZJ
bv90IL8Upzbz0rxYr7k3z8UepdOCt7r4QS/o7STHZBjJRyylxmO/R2yTnh6PtpRK
Q/z35wJBJ3SKc8X3Fi0VOOSeGNZOiypkkl9ZnLVY5lExNAU1+2MMn2UK119SlCDV
MSxb7quYFF4cksXH1g77GMBNi1uADRh1dtFMZdkKhZGljGxKLxo=
=6VTZ
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio updates from Michael Tsirkin:
- IRQ bypass support for vdpa and IFC
- MLX5 vdpa driver
- Endianness fixes for virtio drivers
- Misc other fixes
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (71 commits)
vdpa/mlx5: fix up endian-ness for mtu
vdpa: Fix pointer math bug in vdpasim_get_config()
vdpa/mlx5: Fix pointer math in mlx5_vdpa_get_config()
vdpa/mlx5: fix memory allocation failure checks
vdpa/mlx5: Fix uninitialised variable in core/mr.c
vdpa_sim: init iommu lock
virtio_config: fix up warnings on parisc
vdpa/mlx5: Add VDPA driver for supported mlx5 devices
vdpa/mlx5: Add shared memory registration code
vdpa/mlx5: Add support library for mlx5 VDPA implementation
vdpa/mlx5: Add hardware descriptive header file
vdpa: Modify get_vq_state() to return error code
net/vdpa: Use struct for set/get vq state
vdpa: remove hard coded virtq num
vdpasim: support batch updating
vhost-vdpa: support IOTLB batching hints
vhost-vdpa: support get/set backend features
vhost: generialize backend features setting/getting
vhost-vdpa: refine ioctl pre-processing
vDPA: dont change vq irq after DRIVER_OK
...
Including:
- Removal of the dev->archdata.iommu (or similar) pointers from
most architectures. Only Sparc is left, but this is private to
Sparc as their drivers don't use the IOMMU-API.
- ARM-SMMU Updates from Will Deacon:
- Support for SMMU-500 implementation in Marvell
Armada-AP806 SoC
- Support for SMMU-500 implementation in NVIDIA Tegra194 SoC
- DT compatible string updates
- Remove unused IOMMU_SYS_CACHE_ONLY flag
- Move ARM-SMMU drivers into their own subdirectory
- Intel VT-d Updates from Lu Baolu:
- Misc tweaks and fixes for vSVA
- Report/response page request events
- Cleanups
- Move the Kconfig and Makefile bits for the AMD and Intel
drivers into their respective subdirectory.
- MT6779 IOMMU Support
- Support for new chipsets in the Renesas IOMMU driver
- Other misc cleanups and fixes (e.g. to improve compile test
coverage)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAl8ygTIACgkQK/BELZcB
GuPZmRAAzSLuUNoQPWrFUbocNuZ/YHUCKdluKdYx26AgtYFwBrwzDAHPdq8HF8Hm
y8w2xiUVVP9uZ8gnDkAuwXBtg+yOnG9sRNFZMNdtCy1Q0ehp0HNsn/6NabxVpSml
QuAmd2PxMMopQRVLOR5YYvZl6JdiZx19W8X+trgwnR9Kghqq+7QXI9+D00jztRxQ
Qvh/9NvIdX3k+5R4ZPJaV6OhaFvxzQzQZwKuO61VqFOWZRH1z9Oo+aXDCWTFUjYN
IClTcG8qOK2W9/SOyYDXMoz30Yf0vcuDxhafi2JJVNcTPRmMWoeqff6yKslp76ea
lTepDcIKld1Ul9NoqfYzhhKiEaLcgMEW2ua6vk5YFVxBBqJfg5qdtDZzBxa0FiNx
TQrZFX3xjtZC6tRyy+eKWOj6vx7l0ONwwDxRc3HdvL+xE+KUdmsg82qHU4cAHRjp
U2dgTdlkTEd56q4BEQxmJAHYMIUrx2QAp6pa2+Jv/Iqpi9PsZ2k+l9Gy6h+rM7dn
Est/1gA4kDhKdCKfTx7g9EL6AAoU50WttxNmwMxrUrXX3fsstfY1fKgyZUPpkL7V
V5iXbbsdMQLHzOF2qiqIIMxMGYxr/x/FJ1DnSJ7j+jAXMF77d2B9iQttzImOVN2c
VXBxcVstWN7/xXjIy13C/83bRKwWqXaaS4cbv3Di0ZGFeD2oAF0=
=3O2Z
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu updates from Joerg Roedel:
- Remove of the dev->archdata.iommu (or similar) pointers from most
architectures. Only Sparc is left, but this is private to Sparc as
their drivers don't use the IOMMU-API.
- ARM-SMMU updates from Will Deacon:
- Support for SMMU-500 implementation in Marvell Armada-AP806 SoC
- Support for SMMU-500 implementation in NVIDIA Tegra194 SoC
- DT compatible string updates
- Remove unused IOMMU_SYS_CACHE_ONLY flag
- Move ARM-SMMU drivers into their own subdirectory
- Intel VT-d updates from Lu Baolu:
- Misc tweaks and fixes for vSVA
- Report/response page request events
- Cleanups
- Move the Kconfig and Makefile bits for the AMD and Intel drivers into
their respective subdirectory.
- MT6779 IOMMU Support
- Support for new chipsets in the Renesas IOMMU driver
- Other misc cleanups and fixes (e.g. to improve compile test coverage)
* tag 'iommu-updates-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (77 commits)
iommu/amd: Move Kconfig and Makefile bits down into amd directory
iommu/vt-d: Move Kconfig and Makefile bits down into intel directory
iommu/arm-smmu: Move Arm SMMU drivers into their own subdirectory
iommu/vt-d: Skip TE disabling on quirky gfx dedicated iommu
iommu: Add gfp parameter to io_pgtable_ops->map()
iommu: Mark __iommu_map_sg() as static
iommu/vt-d: Rename intel-pasid.h to pasid.h
iommu/vt-d: Add page response ops support
iommu/vt-d: Report page request faults for guest SVA
iommu/vt-d: Add a helper to get svm and sdev for pasid
iommu/vt-d: Refactor device_to_iommu() helper
iommu/vt-d: Disable multiple GPASID-dev bind
iommu/vt-d: Warn on out-of-range invalidation address
iommu/vt-d: Fix devTLB flush for vSVA
iommu/vt-d: Handle non-page aligned address
iommu/vt-d: Fix PASID devTLB invalidation
iommu/vt-d: Remove global page support in devTLB flush
iommu/vt-d: Enforce PASID devTLB field mask
iommu: Make some functions static
iommu/amd: Remove double zero check
...
With support for 32-bit pv guests gone pure pv-code no longer needs to
test for highmem. Dropping those tests removes the need for flushing
in some places.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
With 32-bit pv-guest support removed xen-asm_64.S can be merged with
xen-asm.S
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Xen is requiring 64-bit machines today and since Xen 4.14 it can be
built without 32-bit PV guest support. There is no need to carry the
burden of 32-bit PV guest support in the kernel any longer, as new
guests can be either HVM or PVH, or they can use a 64 bit kernel.
Remove the 32-bit Xen PV support from the kernel.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Juergen Gross <jgross@suse.com>