- fix PPC XIVE interrupt delivery
- fix x86 RCU breakage from asynchronous page faults when built without
PREEMPT_COUNT
- fix x86 build with -frecord-gcc-switches
- fix x86 build without X86_LOCAL_APIC
-----BEGIN PGP SIGNATURE-----
iQEcBAABCAAGBQJZ18AmAAoJEED/6hsPKofoPEwH+waDVIeS+s38G8HkiB8PoVww
bAhAV6Aj3muOI49KtwBt+qyC8nOQHpwPCNqjmagOv1GEYSwJ4gKKoJ6Xl9rOsxau
GT0xDgVDbrzIb/PTFL+7bDjsyMxf89utIfoBL8i37uznzB35+QFlvy4mLgKntAh0
1/tYDzgrQxuxH5RF4DbFstoPFjw1kdxpXRzHdngsV13bS87PAG9j7A0l7orLtXZg
qxlTh2SvCSr4B0hOZGG/Pc0aIAxLh8kRD6NaU05raKgzQLJa5sxJ0Yr+RbskfqQb
7B98X1Ygb1BjBOFxy+Je5IamKt4ICTY1B0v1ivs0qZ+mgxG59FWuQlR0pww/8Ug=
=ay5S
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Radim Krčmář:
- fix PPC XIVE interrupt delivery
- fix x86 RCU breakage from asynchronous page faults when built without
PREEMPT_COUNT
- fix x86 build with -frecord-gcc-switches
- fix x86 build without X86_LOCAL_APIC
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: add X86_LOCAL_APIC dependency
x86/kvm: Move kvm_fastop_exception to .fixup section
kvm/x86: Avoid async PF preempting the kernel incorrectly
KVM: PPC: Book3S: Fix server always zero from kvmppc_xive_get_xive()
Pull watchddog clean-up and fixes from Thomas Gleixner:
"The watchdog (hard/softlockup detector) code is pretty much broken in
its current state. The patch series addresses this by removing all
duct tape and refactoring it into a workable state.
The reasons why I ask for inclusion that late in the cycle are:
1) The code causes lockdep splats vs. hotplug locking which get
reported over and over. Unfortunately there is no easy fix.
2) The risk of breakage is minimal because it's already broken
3) As 4.14 is a long term stable kernel, I prefer to have working
watchdog code in that and the lockdep issues resolved. I wouldn't
ask you to pull if 4.14 wouldn't be a LTS kernel or if the
solution would be easy to backport.
4) The series was around before the merge window opened, but then got
delayed due to the UP failure caused by the for_each_cpu()
surprise which we discussed recently.
Changes vs. V1:
- Addressed your review points
- Addressed the warning in the powerpc code which was discovered late
- Changed two function names which made sense up to a certain point
in the series. Now they match what they do in the end.
- Fixed a 'unused variable' warning, which got not detected by the
intel robot. I triggered it when trying all possible related config
combinations manually. Randconfig testing seems not random enough.
The changes have been tested by and reviewed by Don Zickus and tested
and acked by Micheal Ellerman for powerpc"
* 'core-watchdog-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
watchdog/core: Put softlockup_threads_initialized under ifdef guard
watchdog/core: Rename some softlockup_* functions
powerpc/watchdog: Make use of watchdog_nmi_probe()
watchdog/core, powerpc: Lock cpus across reconfiguration
watchdog/core, powerpc: Replace watchdog_nmi_reconfigure()
watchdog/hardlockup/perf: Fix spelling mistake: "permanetely" -> "permanently"
watchdog/hardlockup/perf: Cure UP damage
watchdog/hardlockup: Clean up hotplug locking mess
watchdog/hardlockup/perf: Simplify deferred event destroy
watchdog/hardlockup/perf: Use new perf CPU enable mechanism
watchdog/hardlockup/perf: Implement CPU enable replacement
watchdog/hardlockup/perf: Implement init time detection of perf
watchdog/hardlockup/perf: Implement init time perf validation
watchdog/core: Get rid of the racy update loop
watchdog/core, powerpc: Make watchdog_nmi_reconfigure() two stage
watchdog/sysctl: Clean up sysctl variable name space
watchdog/sysctl: Get rid of the #ifdeffery
watchdog/core: Clean up header mess
watchdog/core: Further simplify sysctl handling
watchdog/core: Get rid of the thread teardown/setup dance
...
The rework of the posted interrupt handling broke building without
support for the local APIC:
ERROR: "boot_cpu_physical_apicid" [arch/x86/kvm/kvm-intel.ko] undefined!
That configuration is probably not particularly useful anyway, so
we can avoid the randconfig failures by adding a Kconfig dependency.
Fixes: 8b306e2f3c ("KVM: VMX: avoid double list add with VT-d posted interrupts")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Pull networking fixes from David Miller:
1) Check iwlwifi 9000 reorder buffer out-of-space condition properly,
from Sara Sharon.
2) Fix RCU splat in qualcomm rmnet driver, from Subash Abhinov
Kasiviswanathan.
3) Fix session and tunnel release races in l2tp, from Guillaume Nault
and Sabrina Dubroca.
4) Fix endian bug in sctp_diag_dump(), from Dan Carpenter.
5) Several mlx5 driver fixes from the Mellanox folks (max flow counters
cap check, invalid memory access in IPoIB support, etc.)
6) tun_get_user() should bail if skb->len is zero, from Alexander
Potapenko.
7) Fix RCU lookups in inetpeer, from Eric Dumazet.
8) Fix locking in packet_do_bund().
9) Handle cb->start() error properly in netlink dump code, from Jason
A. Donenfeld.
10) Handle multicast properly in UDP socket early demux code. From Paolo
Abeni.
11) Several erspan bug fixes in ip_gre, from Xin Long.
12) Fix use-after-free in socket filter code, in order to handle the
fact that listener lock is no longer taken during the three-way TCP
handshake. From Eric Dumazet.
13) Fix infoleak in RTM_GETSTATS, from Nikolay Aleksandrov.
14) Fix tail call generation in x86-64 BPF JIT, from Alexei Starovoitov.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (77 commits)
net: 8021q: skip packets if the vlan is down
bpf: fix bpf_tail_call() x64 JIT
net: stmmac: dwmac-rk: Add RK3128 GMAC support
rndis_host: support Novatel Verizon USB730L
net: rtnetlink: fix info leak in RTM_GETSTATS call
socket, bpf: fix possible use after free
mlxsw: spectrum_router: Track RIF of IPIP next hops
mlxsw: spectrum_router: Move VRF refcounting
net: hns3: Fix an error handling path in 'hclge_rss_init_hw()'
net: mvpp2: Fix clock resource by adding an optional bus clock
r8152: add Linksys USB3GIGV1 id
l2tp: fix l2tp_eth module loading
ip_gre: erspan device should keep dst
ip_gre: set tunnel hlen properly in erspan_tunnel_init
ip_gre: check packet length and mtu correctly in erspan_xmit
ip_gre: get key from session_id correctly in erspan_rcv
tipc: use only positive error codes in messages
ppp: fix __percpu annotation
udp: perform source validation for mcast early demux
IPv4: early demux can return an error code
...
When compiling the kernel with the '-frecord-gcc-switches' flag, objtool
complains:
arch/x86/kvm/emulate.o: warning: objtool: .GCC.command.line+0x0: special: can't find new instruction
And also the kernel fails to link.
The problem is that the 'kvm_fastop_exception' code gets placed into the
throwaway '.GCC.command.line' section instead of '.text'.
Exception fixup code is conventionally placed in the '.fixup' section,
so put it there where it belongs.
Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Currently, in PREEMPT_COUNT=n kernel, kvm_async_pf_task_wait() could call
schedule() to reschedule in some cases. This could result in
accidentally ending the current RCU read-side critical section early,
causing random memory corruption in the guest, or otherwise preempting
the currently running task inside between preempt_disable and
preempt_enable.
The difficulty to handle this well is because we don't know whether an
async PF delivered in a preemptible section or RCU read-side critical section
for PREEMPT_COUNT=n, since preempt_disable()/enable() and rcu_read_lock/unlock()
are both no-ops in that case.
To cure this, we treat any async PF interrupting a kernel context as one
that cannot be preempted, preventing kvm_async_pf_task_wait() from choosing
the schedule() path in that case.
To do so, a second parameter for kvm_async_pf_task_wait() is introduced,
so that we know whether it's called from a context interrupting the
kernel, and the parameter is set properly in all the callsites.
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
- bpf prog_array just like all other types of bpf array accepts 32-bit index.
Clarify that in the comment.
- fix x64 JIT of bpf_tail_call which was incorrectly loading 8 instead of 4 bytes
- tighten corresponding check in the interpreter to stay consistent
The JIT bug can be triggered after introduction of BPF_F_NUMA_NODE flag
in commit 96eabe7a40 in 4.14. Before that the map_flags would stay zero and
though JIT code is wrong it will check bounds correctly.
Hence two fixes tags. All other JITs don't have this problem.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Fixes: 96eabe7a40 ("bpf: Allow selecting numa node during map creation")
Fixes: b52f00e6a7 ("x86: bpf_jit: implement bpf_tail_call() helper")
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull x86 fixes from Thomas Gleixner:
"This contains the following fixes and improvements:
- Avoid dereferencing an unprotected VMA pointer in the fault signal
generation code
- Fix inline asm call constraints for GCC 4.4
- Use existing register variable to retrieve the stack pointer
instead of forcing the compiler to create another indirect access
which results in excessive extra 'mov %rsp, %<dst>' instructions
- Disable branch profiling for the memory encryption code to prevent
an early boot crash
- Fix a sparse warning caused by casting the __user annotation in
__get_user_asm_u64() away
- Fix an off by one error in the loop termination of the error patch
in the x86 sysfs init code
- Add missing CPU IDs to various Intel specific drivers to enable the
functionality on recent hardware
- More (init) constification in the numachip code"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/asm: Use register variable to get stack pointer value
x86/mm: Disable branch profiling in mem_encrypt.c
x86/asm: Fix inline asm call constraints for GCC 4.4
perf/x86/intel/uncore: Correct num_boxes for IIO and IRP
perf/x86/intel/rapl: Add missing CPU IDs
perf/x86/msr: Add missing CPU IDs
perf/x86/intel/cstate: Add missing CPU IDs
x86: Don't cast away the __user in __get_user_asm_u64()
x86/sysfs: Fix off-by-one error in loop termination
x86/mm: Fix fault error path using unsafe vma pointer
x86/numachip: Add const and __initconst to numachip2_clockevent
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABAgAGBQJZzmQ4AAoJELDendYovxMvL1oIAIBiL7SCEX4mlCjNBYBGw8N+
4pcYUcKPu07JeAQGC7SOEjcwjrSUw+b6NJIZLCHcPAG/JyejwBAbuztmRgqsIqN9
sVa/7GjecigsE+Jw3gT1OHDxxLMsyk2pa+poeTVdjjqFNOGRzWhG3D5dZGgOUMkF
o8KaPgh2jyA2rg6SnxEDXy9aEpDFOO6Yb9cxApwdC+Y399zPEdqauEzFunxzIoa+
S155tI9rr2HcXUp/DxAk/C6PaSmKfEszuKKyvvjFE8latHCaUEJ+HLacURuJUu7C
pEc2gOTOo4dkYyDLLIQeCyGbRnH4B1GF9cv0vF//1gfAJzVGtJxmwj1qlezVDCQ=
=jfCe
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.14c-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
- avoid a warning when compiling with clang
- consider read-only bits in xen-pciback when writing to a BAR
- fix a boot crash of pv-domains
* tag 'for-linus-4.14c-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/mmu: Call xen_cleanhighmap() with 4MB aligned for page tables mapping
xen-pciback: relax BAR sizing write value check
x86/xen: clean up clang build warning
that was finally triggered by PCID support.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJZzmJqAAoJEL/70l94x66D99oH/R4hOMfzDxFOgW3LnaCQJvwo
n1+tH3as0dfdkpggZ+UmJuKnbVJ0625+qozenrdYkKtk1YiyIIQWG3vdsz4HBfzp
CYK2NVVymf0dg8DQaluz6iB1R28ke12PggzyFv01s1QyENBDA8J38pslZarPM2OA
tnpRKC6B59/VmRD0PWS6yRmTXY+HfzWlWg4JMraq2VdybbEXJhh8BNfjjNn30DkZ
SW8kHq60AUd5Arhb3cmiPiXZCQ7odqF2u2mEcBmnA9hAacaGEheSzKCUOaEIjmZV
5/jTyG1tZkN7CbrG81ryuoa8A6qTOSyHxo1QkzAmE/p+s2IzGfzzLqmtfIsAWkE=
=1lM1
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"Mixed bugfixes. Perhaps the most interesting one is a latent bug that
was finally triggered by PCID support"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm/x86: Handle async PF in RCU read-side critical sections
KVM: nVMX: Fix nested #PF intends to break L1's vmlauch/vmresume
KVM: VMX: use cmpxchg64
KVM: VMX: simplify and fix vmx_vcpu_pi_load
KVM: VMX: avoid double list add with VT-d posted interrupts
KVM: VMX: extract __pi_post_block
KVM: PPC: Book3S HV: Check for updated HDSISR on P9 HDSI exception
KVM: nVMX: fix HOST_CR3/HOST_CR4 cache
Currently we use current_stack_pointer() function to get the value
of the stack pointer register. Since commit:
f5caf621ee ("x86/asm: Fix inline asm call constraints for Clang")
... we have a stack register variable declared. It can be used instead of
current_stack_pointer() function which allows to optimize away some
excessive "mov %rsp, %<dst>" instructions:
-mov %rsp,%rdx
-sub %rdx,%rax
-cmp $0x3fff,%rax
-ja ffffffff810722fd <ist_begin_non_atomic+0x2d>
+sub %rsp,%rax
+cmp $0x3fff,%rax
+ja ffffffff810722fa <ist_begin_non_atomic+0x2a>
Remove current_stack_pointer(), rename __asm_call_sp to current_stack_pointer
and use it instead of the removed function.
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170929141537.29167-1-aryabinin@virtuozzo.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Some routines in mem_encrypt.c are called very early in the boot process,
e.g. sme_encrypt_kernel(). When CONFIG_TRACE_BRANCH_PROFILING=y is defined
the resulting branch profiling associated with the check to see if SME is
active results in a kernel crash. Disable branch profiling for
mem_encrypt.c by defining DISABLE_BRANCH_PROFILING before including any
header files.
Reported-by: kernel test robot <lkp@01.org>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170929162419.6016.53390.stgit@tlendack-t1.amdoffice.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
------------[ cut here ]------------
WARNING: CPU: 4 PID: 5280 at /home/kernel/linux/arch/x86/kvm//vmx.c:11394 nested_vmx_vmexit+0xc2b/0xd70 [kvm_intel]
CPU: 4 PID: 5280 Comm: qemu-system-x86 Tainted: G W OE 4.13.0+ #17
RIP: 0010:nested_vmx_vmexit+0xc2b/0xd70 [kvm_intel]
Call Trace:
? emulator_read_emulated+0x15/0x20 [kvm]
? segmented_read+0xae/0xf0 [kvm]
vmx_inject_page_fault_nested+0x60/0x70 [kvm_intel]
? vmx_inject_page_fault_nested+0x60/0x70 [kvm_intel]
x86_emulate_instruction+0x733/0x810 [kvm]
vmx_handle_exit+0x2f4/0xda0 [kvm_intel]
? kvm_arch_vcpu_ioctl_run+0xd2f/0x1c60 [kvm]
kvm_arch_vcpu_ioctl_run+0xdab/0x1c60 [kvm]
? kvm_arch_vcpu_load+0x62/0x230 [kvm]
kvm_vcpu_ioctl+0x340/0x700 [kvm]
? kvm_vcpu_ioctl+0x340/0x700 [kvm]
? __fget+0xfc/0x210
do_vfs_ioctl+0xa4/0x6a0
? __fget+0x11d/0x210
SyS_ioctl+0x79/0x90
entry_SYSCALL_64_fastpath+0x23/0xc2
A nested #PF is triggered during L0 emulating instruction for L2. However, it
doesn't consider we should not break L1's vmlauch/vmresme. This patch fixes
it by queuing the #PF exception instead ,requesting an immediate VM exit from
L2 and keeping the exception for L1 pending for a subsequent nested VM exit.
This should actually work all the time, making vmx_inject_page_fault_nested
totally unnecessary. However, that's not working yet, so this patch can work
around the issue in the meanwhile.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The kernel test bot (run by Xiaolong Ye) reported that the following commit:
f5caf621ee ("x86/asm: Fix inline asm call constraints for Clang")
is causing double faults in a kernel compiled with GCC 4.4.
Linus subsequently diagnosed the crash pattern and the buggy commit and found that
the issue is with this code:
register unsigned int __asm_call_sp asm("esp");
#define ASM_CALL_CONSTRAINT "+r" (__asm_call_sp)
Even on a 64-bit kernel, it's using ESP instead of RSP. That causes GCC
to produce the following bogus code:
ffffffff8147461d: 89 e0 mov %esp,%eax
ffffffff8147461f: 4c 89 f7 mov %r14,%rdi
ffffffff81474622: 4c 89 fe mov %r15,%rsi
ffffffff81474625: ba 20 00 00 00 mov $0x20,%edx
ffffffff8147462a: 89 c4 mov %eax,%esp
ffffffff8147462c: e8 bf 52 05 00 callq ffffffff814c98f0 <copy_user_generic_unrolled>
Despite the absurdity of it backing up and restoring the stack pointer
for no reason, the bug is actually the fact that it's only backing up
and restoring the lower 32 bits of the stack pointer. The upper 32 bits
are getting cleared out, corrupting the stack pointer.
So change the '__asm_call_sp' register variable to be associated with
the actual full-size stack pointer.
This also requires changing the __ASM_SEL() macro to be based on the
actual compiled arch size, rather than the CONFIG value, because
CONFIG_X86_64 compiles some files with '-m32' (e.g., realmode and vdso).
Otherwise Clang fails to build the kernel because it complains about the
use of a 64-bit register (RSP) in a 32-bit file.
Reported-and-Bisected-and-Tested-by: kernel test robot <xiaolong.ye@intel.com>
Diagnosed-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: LKP <lkp@01.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Miguel Bernal Marin <miguel.bernal.marin@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: f5caf621ee ("x86/asm: Fix inline asm call constraints for Clang")
Link: http://lkml.kernel.org/r/20170928215826.6sdpmwtkiydiytim@treble
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When bootup a PVM guest with large memory(Ex.240GB), XEN provided initial
mapping overlaps with kernel module virtual space. When mapping in this space
is cleared by xen_cleanhighmap(), in certain case there could be an 2MB mapping
left. This is due to XEN initialize 4MB aligned mapping but xen_cleanhighmap()
finish at 2MB boundary.
When module loading is just on top of the 2MB space, got below warning:
WARNING: at mm/vmalloc.c:106 vmap_pte_range+0x14e/0x190()
Call Trace:
[<ffffffff81117083>] warn_alloc_failed+0xf3/0x160
[<ffffffff81146022>] __vmalloc_area_node+0x182/0x1c0
[<ffffffff810ac91e>] ? module_alloc_update_bounds+0x1e/0x80
[<ffffffff81145df7>] __vmalloc_node_range+0xa7/0x110
[<ffffffff810ac91e>] ? module_alloc_update_bounds+0x1e/0x80
[<ffffffff8103ca54>] module_alloc+0x64/0x70
[<ffffffff810ac91e>] ? module_alloc_update_bounds+0x1e/0x80
[<ffffffff810ac91e>] module_alloc_update_bounds+0x1e/0x80
[<ffffffff810ac9a7>] move_module+0x27/0x150
[<ffffffff810aefa0>] layout_and_allocate+0x120/0x1b0
[<ffffffff810af0a8>] load_module+0x78/0x640
[<ffffffff811ff90b>] ? security_file_permission+0x8b/0x90
[<ffffffff810af6d2>] sys_init_module+0x62/0x1e0
[<ffffffff815154c2>] system_call_fastpath+0x16/0x1b
Then the mapping of 2MB is cleared, finally oops when the page in that space is
accessed.
BUG: unable to handle kernel paging request at ffff880022600000
IP: [<ffffffff81260877>] clear_page_c_e+0x7/0x10
PGD 1788067 PUD 178c067 PMD 22434067 PTE 0
Oops: 0002 [#1] SMP
Call Trace:
[<ffffffff81116ef7>] ? prep_new_page+0x127/0x1c0
[<ffffffff81117d42>] get_page_from_freelist+0x1e2/0x550
[<ffffffff81133010>] ? ii_iovec_copy_to_user+0x90/0x140
[<ffffffff81119c9d>] __alloc_pages_nodemask+0x12d/0x230
[<ffffffff81155516>] alloc_pages_vma+0xc6/0x1a0
[<ffffffff81006ffd>] ? pte_mfn_to_pfn+0x7d/0x100
[<ffffffff81134cfb>] do_anonymous_page+0x16b/0x350
[<ffffffff81139c34>] handle_pte_fault+0x1e4/0x200
[<ffffffff8100712e>] ? xen_pmd_val+0xe/0x10
[<ffffffff810052c9>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
[<ffffffff81139dab>] handle_mm_fault+0x15b/0x270
[<ffffffff81510c10>] do_page_fault+0x140/0x470
[<ffffffff8150d7d5>] page_fault+0x25/0x30
Call xen_cleanhighmap() with 4MB aligned for page tables mapping to fix it.
The unnecessory call of xen_cleanhighmap() in DEBUG mode is also removed.
-v2: add comment about XEN alignment from Juergen.
References: https://lists.xen.org/archives/html/xen-devel/2012-07/msg01562.html
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
[boris: added 'xen/mmu' tag to commit subject]
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
The simplify part: do not touch pi_desc.nv, we can set it when the
VCPU is first created. Likewise, pi_desc.sn is only handled by
vmx_vcpu_pi_load, do not touch it in __pi_post_block.
The fix part: do not check kvm_arch_has_assigned_device, instead
check the SN bit to figure out whether vmx_vcpu_pi_put ran before.
This matches what the previous patch did in pi_post_block.
Cc: Huangweidong <weidong.huang@huawei.com>
Cc: Gonglei <arei.gonglei@huawei.com>
Cc: wangxin <wangxinxin.wang@huawei.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Tested-by: Longpeng (Mike) <longpeng2@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In some cases, for example involving hot-unplug of assigned
devices, pi_post_block can forget to remove the vCPU from the
blocked_vcpu_list. When this happens, the next call to
pi_pre_block corrupts the list.
Fix this in two ways. First, check vcpu->pre_pcpu in pi_pre_block
and WARN instead of adding the element twice in the list. Second,
always do the list removal in pi_post_block if vcpu->pre_pcpu is
set (not -1).
The new code keeps interrupts disabled for the whole duration of
pi_pre_block/pi_post_block. This is not strictly necessary, but
easier to follow. For the same reason, PI.ON is checked only
after the cmpxchg, and to handle it we just call the post-block
code. This removes duplication of the list removal code.
Cc: Huangweidong <weidong.huang@huawei.com>
Cc: Gonglei <arei.gonglei@huawei.com>
Cc: wangxin <wangxinxin.wang@huawei.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Tested-by: Longpeng (Mike) <longpeng2@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This is in preparation to verify the full xstate header as supplied by user-space.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kevin Hao <haokexin@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Halcrow <mhalcrow@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: kernel-hardening@lists.openwall.com
Link: http://lkml.kernel.org/r/20170924105913.9157-8-mingo@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is in preparation to verify the full xstate header as supplied by user-space.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kevin Hao <haokexin@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Halcrow <mhalcrow@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: kernel-hardening@lists.openwall.com
Link: http://lkml.kernel.org/r/20170924105913.9157-5-mingo@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move validation of user-supplied xstate_header into a helper function,
in preparation of calling it from both the ptrace and sigreturn syscall
paths.
The new function also considers it to be an error if *any* reserved bits
are set, whereas before we were just clearing most of them silently.
This should reduce the chance of bugs that fail to correctly validate
user-supplied XSAVE areas. It also will expose any broken userspace
programs that set the other reserved bits; this is desirable because
such programs will lose compatibility with future CPUs and kernels if
those bits are ever used for anything. (There shouldn't be any such
programs, and in fact in the case where the compacted format is in use
we were already validating xfeatures. But you never know...)
Signed-off-by: Eric Biggers <ebiggers@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kevin Hao <haokexin@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Halcrow <mhalcrow@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: kernel-hardening@lists.openwall.com
Link: http://lkml.kernel.org/r/20170924105913.9157-2-mingo@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As per the new nomenclature we don't 'activate' the FPU state
anymore, we initialize it. So drop the _activate_fpstate name
from these functions, which were a bit of a mouthful anyway,
and name them:
fpu__prepare_read()
fpu__prepare_write()
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rename this function to better express that it's all about
initializing the FPU state of a task which goes hand in hand
with the fpu::initialized field.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
Link: http://lkml.kernel.org/r/20170923130016.21448-33-mingo@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
fpu__copy() has a preempt_disable()/enable() pair, which it had to do to
be able to atomically unlazy the current task when doing an FNSAVE.
But we don't unlazy tasks anymore, we always do direct saves/restores of
FPU context.
So remove both the unnecessary critical section, and update the comments.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
Link: http://lkml.kernel.org/r/20170923130016.21448-32-mingo@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We don't do any lazy restore anymore, what we have are two pieces of optimization:
- no-FPU tasks that don't save/restore the FPU context (kernel threads are such)
- cached FPU registers maintained via the fpu->last_cpu field. This means that
if an FPU task context switches to a non-FPU task then we can maintain the
FPU registers as an in-FPU copies (cache), and skip the restoration of them
once we switch back to the original FPU-using task.
Update all the comments that still referred to old 'lazy' and 'unlazy' concepts.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
Link: http://lkml.kernel.org/r/20170923130016.21448-31-mingo@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The x86 FPU code used to have a complex state machine where both the FPU
registers and the FPU state context could be 'active' (or inactive)
independently of each other - which enabled features like lazy FPU restore.
Much of this complexity is gone in the current code: now we basically can
have FPU-less tasks (kernel threads) that don't use (and save/restore) FPU
state at all, plus full FPU users that save/restore directly with no laziness
whatsoever.
But the fpu::fpstate_active still carries bits of the old complexity - meanwhile
this flag has become a simple flag that shows whether the FPU context saving
area in the thread struct is initialized and used, or not.
Rename it to fpu::initialized to express this simplicity in the name as well.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
Link: http://lkml.kernel.org/r/20170923130016.21448-30-mingo@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
fpu__activate_fpstate_read() can be called for the current task
when coredumping - or for stopped tasks when ptrace-ing.
Implement this properly in the code and update the comments.
This also fixes an incorrect (but harmless) warning introduced by
one of the earlier patches.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
Link: http://lkml.kernel.org/r/20170923130016.21448-28-mingo@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
An off-by-one error in loop terminantion conditions in
create_setup_data_nodes() will lead to memory leak when
create_setup_data_node() failed.
Signed-off-by: Sean Fu <fxinrong@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1505090001-1157-1-git-send-email-fxinrong@gmail.com
commit 7b2d0dbac4 ("x86/mm/pkeys: Pass VMA down in to fault signal
generation code") passes down a vma pointer to the error path, but that is
done once the mmap_sem is released when calling mm_fault_error() from
__do_page_fault().
This is dangerous as the vma structure is no more safe to be used once the
mmap_sem has been released. As only the protection key value is required in
the error processing, we could just pass down this value.
Fix it by passing a pointer to a protection key value down to the fault
signal generation code. The use of a pointer allows to keep the check
generating a warning message in fill_sig_info_pkey() when the vma was not
known. If the pointer is valid, the protection value can be accessed by
deferencing the pointer.
[ tglx: Made *pkey u32 as that's the type which is passed in siginfo ]
Fixes: 7b2d0dbac4 ("x86/mm/pkeys: Pass VMA down in to fault signal generation code")
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1504513935-12742-1-git-send-email-ldufour@linux.vnet.ibm.com
Userspace can change the FPU state of a task using the ptrace() or
rt_sigreturn() system calls. Because reserved bits in the FPU state can
cause the XRSTOR instruction to fail, the kernel has to carefully
validate that no reserved bits or other invalid values are being set.
Unfortunately, there have been bugs in this validation code. For
example, we were not checking that the 'xcomp_bv' field in the
xstate_header was 0. As-is, such bugs are exploitable to read the FPU
registers of other processes on the system. To do so, an attacker can
create a task, assign to it an invalid FPU state, then spin in a loop
and monitor the values of the FPU registers. Because the task's FPU
registers are not being restored, sometimes the FPU registers will have
the values from another process.
This is likely to continue to be a problem in the future because the
validation done by the CPU instructions like XRSTOR is not immediately
visible to kernel developers. Nor will invalid FPU states ever be
encountered during ordinary use --- they will only be seen during
fuzzing or exploits. There can even be reserved bits outside the
xstate_header which are easy to forget about. For example, the MXCSR
register contains reserved bits, which were not validated by the
KVM_SET_XSAVE ioctl until commit a575813bfe ("KVM: x86: Fix load
damaged SSEx MXCSR register").
Therefore, mitigate this class of vulnerability by restoring the FPU
registers from init_fpstate if restoring from the task's state fails.
We actually used to do this, but it was (perhaps unwisely) removed by
commit 9ccc27a5d2 ("x86/fpu: Remove error return values from
copy_kernel_to_*regs() functions"). This new patch is also a bit
different. First, it only clears the registers, not also the bad
in-memory state; this is simpler and makes it easier to make the
mitigation cover all callers of __copy_kernel_to_fpregs(). Second, it
does the register clearing in an exception handler so that no extra
instructions are added to context switches. In fact, we *remove*
instructions, since previously we were always zeroing the register
containing 'err' even if CONFIG_X86_DEBUG_FPU was disabled.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Kevin Hao <haokexin@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Halcrow <mhalcrow@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: kernel-hardening@lists.openwall.com
Link: http://lkml.kernel.org/r/20170922174156.16780-4-ebiggers3@gmail.com
Link: http://lkml.kernel.org/r/20170923130016.21448-27-mingo@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On x86, userspace can use the ptrace() or rt_sigreturn() system calls to
set a task's extended state (xstate) or "FPU" registers. ptrace() can
set them for another task using the PTRACE_SETREGSET request with
NT_X86_XSTATE, while rt_sigreturn() can set them for the current task.
In either case, registers can be set to any value, but the kernel
assumes that the XSAVE area itself remains valid in the sense that the
CPU can restore it.
However, in the case where the kernel is using the uncompacted xstate
format (which it does whenever the XSAVES instruction is unavailable),
it was possible for userspace to set the xcomp_bv field in the
xstate_header to an arbitrary value. However, all bits in that field
are reserved in the uncompacted case, so when switching to a task with
nonzero xcomp_bv, the XRSTOR instruction failed with a #GP fault. This
caused the WARN_ON_FPU(err) in copy_kernel_to_xregs() to be hit. In
addition, since the error is otherwise ignored, the FPU registers from
the task previously executing on the CPU were leaked.
Fix the bug by checking that the user-supplied value of xcomp_bv is 0 in
the uncompacted case, and returning an error otherwise.
The reason for validating xcomp_bv rather than simply overwriting it
with 0 is that we want userspace to see an error if it (incorrectly)
provides an XSAVE area in compacted format rather than in uncompacted
format.
Note that as before, in case of error we clear the task's FPU state.
This is perhaps non-ideal, especially for PTRACE_SETREGSET; it might be
better to return an error before changing anything. But it seems the
"clear on error" behavior is fine for now, and it's a little tricky to
do otherwise because it would mean we couldn't simply copy the full
userspace state into kernel memory in one __copy_from_user().
This bug was found by syzkaller, which hit the above-mentioned
WARN_ON_FPU():
WARNING: CPU: 1 PID: 0 at ./arch/x86/include/asm/fpu/internal.h:373 __switch_to+0x5b5/0x5d0
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.13.0 #453
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff9ba2bc8e42c0 task.stack: ffffa78cc036c000
RIP: 0010:__switch_to+0x5b5/0x5d0
RSP: 0000:ffffa78cc08bbb88 EFLAGS: 00010082
RAX: 00000000fffffffe RBX: ffff9ba2b8bf2180 RCX: 00000000c0000100
RDX: 00000000ffffffff RSI: 000000005cb10700 RDI: ffff9ba2b8bf36c0
RBP: ffffa78cc08bbbd0 R08: 00000000929fdf46 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: ffff9ba2bc8e42c0
R13: 0000000000000000 R14: ffff9ba2b8bf3680 R15: ffff9ba2bf5d7b40
FS: 00007f7e5cb10700(0000) GS:ffff9ba2bf400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004005cc CR3: 0000000079fd5000 CR4: 00000000001406e0
Call Trace:
Code: 84 00 00 00 00 00 e9 11 fd ff ff 0f ff 66 0f 1f 84 00 00 00 00 00 e9 e7 fa ff ff 0f ff 66 0f 1f 84 00 00 00 00 00 e9 c2 fa ff ff <0f> ff 66 0f 1f 84 00 00 00 00 00 e9 d4 fc ff ff 66 66 2e 0f 1f
Here is a C reproducer. The expected behavior is that the program spin
forever with no output. However, on a buggy kernel running on a
processor with the "xsave" feature but without the "xsaves" feature
(e.g. Sandy Bridge through Broadwell for Intel), within a second or two
the program reports that the xmm registers were corrupted, i.e. were not
restored correctly. With CONFIG_X86_DEBUG_FPU=y it also hits the above
kernel warning.
#define _GNU_SOURCE
#include <stdbool.h>
#include <inttypes.h>
#include <linux/elf.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>
int main(void)
{
int pid = fork();
uint64_t xstate[512];
struct iovec iov = { .iov_base = xstate, .iov_len = sizeof(xstate) };
if (pid == 0) {
bool tracee = true;
for (int i = 0; i < sysconf(_SC_NPROCESSORS_ONLN) && tracee; i++)
tracee = (fork() != 0);
uint32_t xmm0[4] = { [0 ... 3] = tracee ? 0x00000000 : 0xDEADBEEF };
asm volatile(" movdqu %0, %%xmm0\n"
" mov %0, %%rbx\n"
"1: movdqu %%xmm0, %0\n"
" mov %0, %%rax\n"
" cmp %%rax, %%rbx\n"
" je 1b\n"
: "+m" (xmm0) : : "rax", "rbx", "xmm0");
printf("BUG: xmm registers corrupted! tracee=%d, xmm0=%08X%08X%08X%08X\n",
tracee, xmm0[0], xmm0[1], xmm0[2], xmm0[3]);
} else {
usleep(100000);
ptrace(PTRACE_ATTACH, pid, 0, 0);
wait(NULL);
ptrace(PTRACE_GETREGSET, pid, NT_X86_XSTATE, &iov);
xstate[65] = -1;
ptrace(PTRACE_SETREGSET, pid, NT_X86_XSTATE, &iov);
ptrace(PTRACE_CONT, pid, 0, 0);
wait(NULL);
}
return 1;
}
Note: the program only tests for the bug using the ptrace() system call.
The bug can also be reproduced using the rt_sigreturn() system call, but
only when called from a 32-bit program, since for 64-bit programs the
kernel restores the FPU state from the signal frame by doing XRSTOR
directly from userspace memory (with proper error checking).
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <stable@vger.kernel.org> [v3.17+]
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Kevin Hao <haokexin@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Halcrow <mhalcrow@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: kernel-hardening@lists.openwall.com
Fixes: 0b29643a58 ("x86/xsaves: Change compacted format xsave area header")
Link: http://lkml.kernel.org/r/20170922174156.16780-2-ebiggers3@gmail.com
Link: http://lkml.kernel.org/r/20170923130016.21448-25-mingo@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 fixes from Ingo Molnar:
"Another round of CR3/PCID related fixes (I think this addresses all
but one of the known problems with PCID support), an objtool fix plus
a Clang fix that (finally) solves all Clang quirks to build a bootable
x86 kernel as-is"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/asm: Fix inline asm call constraints for Clang
objtool: Handle another GCC stack pointer adjustment bug
x86/mm/32: Load a sane CR3 before cpu_init() on secondary CPUs
x86/mm/32: Move setup_clear_cpu_cap(X86_FEATURE_PCID) earlier
x86/mm/64: Stop using CR3.PCID == 0 in ASID-aware code
x86/mm: Factor out CR3-building code
copy_xregs_to_kernel checks if the alternatives have been already
patched.
This WARN_ON() is always executed in every context switch.
All the other checks in fpu internal.h are WARN_ON_FPU(), but
this one is plain WARN_ON(). I assume it was forgotten to switch it.
So switch it to WARN_ON_FPU() too to avoid some unnecessary code
in the context switch, and a potentially expensive cache line miss for the
global variable.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
Link: http://lkml.kernel.org/r/20170329062605.4970-1-andi@firstfloor.org
Link: http://lkml.kernel.org/r/20170923130016.21448-24-mingo@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>