linux/arch/x86/include/asm/vdso/gettimeofday.h

323 lines
8.1 KiB
C
Raw Normal View History

/* SPDX-License-Identifier: GPL-2.0 */
/*
* Fast user context implementation of clock_gettime, gettimeofday, and time.
*
* Copyright (C) 2019 ARM Limited.
* Copyright 2006 Andi Kleen, SUSE Labs.
* 32 Bit compat layer by Stefani Seibold <stefani@seibold.net>
* sponsored by Rohde & Schwarz GmbH & Co. KG Munich/Germany
*/
#ifndef __ASM_VDSO_GETTIMEOFDAY_H
#define __ASM_VDSO_GETTIMEOFDAY_H
#ifndef __ASSEMBLY__
#include <uapi/linux/time.h>
#include <asm/vgtod.h>
#include <asm/vvar.h>
#include <asm/unistd.h>
#include <asm/msr.h>
#include <asm/pvclock.h>
clocksource/drivers: Continue making Hyper-V clocksource ISA agnostic Continue consolidating Hyper-V clock and timer code into an ISA independent Hyper-V clocksource driver. Move the existing clocksource code under drivers/hv and arch/x86 to the new clocksource driver while separating out the ISA dependencies. Update Hyper-V initialization to call initialization and cleanup routines since the Hyper-V synthetic clock is not independently enumerated in ACPI. Update Hyper-V clocksource users in KVM and VDSO to get definitions from the new include file. No behavior is changed and no new functionality is added. Suggested-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: "bp@alien8.de" <bp@alien8.de> Cc: "will.deacon@arm.com" <will.deacon@arm.com> Cc: "catalin.marinas@arm.com" <catalin.marinas@arm.com> Cc: "mark.rutland@arm.com" <mark.rutland@arm.com> Cc: "linux-arm-kernel@lists.infradead.org" <linux-arm-kernel@lists.infradead.org> Cc: "gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org> Cc: "linux-hyperv@vger.kernel.org" <linux-hyperv@vger.kernel.org> Cc: "olaf@aepfle.de" <olaf@aepfle.de> Cc: "apw@canonical.com" <apw@canonical.com> Cc: "jasowang@redhat.com" <jasowang@redhat.com> Cc: "marcelo.cerri@canonical.com" <marcelo.cerri@canonical.com> Cc: Sunil Muthuswamy <sunilmut@microsoft.com> Cc: KY Srinivasan <kys@microsoft.com> Cc: "sashal@kernel.org" <sashal@kernel.org> Cc: "vincenzo.frascino@arm.com" <vincenzo.frascino@arm.com> Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org> Cc: "linux-mips@vger.kernel.org" <linux-mips@vger.kernel.org> Cc: "linux-kselftest@vger.kernel.org" <linux-kselftest@vger.kernel.org> Cc: "arnd@arndb.de" <arnd@arndb.de> Cc: "linux@armlinux.org.uk" <linux@armlinux.org.uk> Cc: "ralf@linux-mips.org" <ralf@linux-mips.org> Cc: "paul.burton@mips.com" <paul.burton@mips.com> Cc: "daniel.lezcano@linaro.org" <daniel.lezcano@linaro.org> Cc: "salyzyn@android.com" <salyzyn@android.com> Cc: "pcc@google.com" <pcc@google.com> Cc: "shuah@kernel.org" <shuah@kernel.org> Cc: "0x7f454c46@gmail.com" <0x7f454c46@gmail.com> Cc: "linux@rasmusvillemoes.dk" <linux@rasmusvillemoes.dk> Cc: "huw@codeweavers.com" <huw@codeweavers.com> Cc: "sfr@canb.auug.org.au" <sfr@canb.auug.org.au> Cc: "pbonzini@redhat.com" <pbonzini@redhat.com> Cc: "rkrcmar@redhat.com" <rkrcmar@redhat.com> Cc: "kvm@vger.kernel.org" <kvm@vger.kernel.org> Link: https://lkml.kernel.org/r/1561955054-1838-3-git-send-email-mikelley@microsoft.com
2019-07-01 04:26:06 +00:00
#include <clocksource/hyperv_timer.h>
#define __vdso_data (VVAR(_vdso_data))
x86/vdso: Add time napespace page To support time namespaces in the VDSO with a minimal impact on regular non time namespace affected tasks, the namespace handling needs to be hidden in a slow path. The most obvious place is vdso_seq_begin(). If a task belongs to a time namespace then the VVAR page which contains the system wide VDSO data is replaced with a namespace specific page which has the same layout as the VVAR page. That page has vdso_data->seq set to 1 to enforce the slow path and vdso_data->clock_mode set to VCLOCK_TIMENS to enforce the time namespace handling path. The extra check in the case that vdso_data->seq is odd, e.g. a concurrent update of the VDSO data is in progress, is not really affecting regular tasks which are not part of a time namespace as the task is spin waiting for the update to finish and vdso_data->seq to become even again. If a time namespace task hits that code path, it invokes the corresponding time getter function which retrieves the real VVAR page, reads host time and then adds the offset for the requested clock which is stored in the special VVAR page. Allocate the time namespace page among VVAR pages and place vdso_data on it. Provide __arch_get_timens_vdso_data() helper for VDSO code to get the code-relative position of VVARs on that special page. Co-developed-by: Andrei Vagin <avagin@openvz.org> Signed-off-by: Andrei Vagin <avagin@openvz.org> Signed-off-by: Dmitry Safonov <dima@arista.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20191112012724.250792-23-dima@arista.com
2019-11-12 01:27:11 +00:00
#define __timens_vdso_data (TIMENS(_vdso_data))
#define VDSO_HAS_TIME 1
#define VDSO_HAS_CLOCK_GETRES 1
/*
* Declare the memory-mapped vclock data pages. These come from hypervisors.
* If we ever reintroduce something like direct access to an MMIO clock like
* the HPET again, it will go here as well.
*
* A load from any of these pages will segfault if the clock in question is
* disabled, so appropriate compiler barriers and checks need to be used
* to prevent stray loads.
*
* These declarations MUST NOT be const. The compiler will assume that
* an extern const variable has genuinely constant contents, and the
* resulting code won't work, since the whole point is that these pages
* change over time, possibly while we're accessing them.
*/
#ifdef CONFIG_PARAVIRT_CLOCK
/*
* This is the vCPU 0 pvclock page. We only use pvclock from the vDSO
* if the hypervisor tells us that all vCPUs can get valid data from the
* vCPU 0 page.
*/
extern struct pvclock_vsyscall_time_info pvclock_page
__attribute__((visibility("hidden")));
#endif
#ifdef CONFIG_HYPERV_TIMER
extern struct ms_hyperv_tsc_page hvclock_page
__attribute__((visibility("hidden")));
#endif
x86/vdso: Add time napespace page To support time namespaces in the VDSO with a minimal impact on regular non time namespace affected tasks, the namespace handling needs to be hidden in a slow path. The most obvious place is vdso_seq_begin(). If a task belongs to a time namespace then the VVAR page which contains the system wide VDSO data is replaced with a namespace specific page which has the same layout as the VVAR page. That page has vdso_data->seq set to 1 to enforce the slow path and vdso_data->clock_mode set to VCLOCK_TIMENS to enforce the time namespace handling path. The extra check in the case that vdso_data->seq is odd, e.g. a concurrent update of the VDSO data is in progress, is not really affecting regular tasks which are not part of a time namespace as the task is spin waiting for the update to finish and vdso_data->seq to become even again. If a time namespace task hits that code path, it invokes the corresponding time getter function which retrieves the real VVAR page, reads host time and then adds the offset for the requested clock which is stored in the special VVAR page. Allocate the time namespace page among VVAR pages and place vdso_data on it. Provide __arch_get_timens_vdso_data() helper for VDSO code to get the code-relative position of VVARs on that special page. Co-developed-by: Andrei Vagin <avagin@openvz.org> Signed-off-by: Andrei Vagin <avagin@openvz.org> Signed-off-by: Dmitry Safonov <dima@arista.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20191112012724.250792-23-dima@arista.com
2019-11-12 01:27:11 +00:00
#ifdef CONFIG_TIME_NS
static __always_inline const struct vdso_data *__arch_get_timens_vdso_data(void)
{
return __timens_vdso_data;
}
#endif
#ifndef BUILD_VDSO32
static __always_inline
long clock_gettime_fallback(clockid_t _clkid, struct __kernel_timespec *_ts)
{
long ret;
asm ("syscall" : "=a" (ret), "=m" (*_ts) :
"0" (__NR_clock_gettime), "D" (_clkid), "S" (_ts) :
"rcx", "r11");
return ret;
}
static __always_inline
long gettimeofday_fallback(struct __kernel_old_timeval *_tv,
struct timezone *_tz)
{
long ret;
asm("syscall" : "=a" (ret) :
"0" (__NR_gettimeofday), "D" (_tv), "S" (_tz) : "memory");
return ret;
}
static __always_inline
long clock_getres_fallback(clockid_t _clkid, struct __kernel_timespec *_ts)
{
long ret;
asm ("syscall" : "=a" (ret), "=m" (*_ts) :
"0" (__NR_clock_getres), "D" (_clkid), "S" (_ts) :
"rcx", "r11");
return ret;
}
#else
static __always_inline
long clock_gettime_fallback(clockid_t _clkid, struct __kernel_timespec *_ts)
{
long ret;
asm (
"mov %%ebx, %%edx \n"
"mov %[clock], %%ebx \n"
"call __kernel_vsyscall \n"
"mov %%edx, %%ebx \n"
: "=a" (ret), "=m" (*_ts)
: "0" (__NR_clock_gettime64), [clock] "g" (_clkid), "c" (_ts)
: "edx");
return ret;
}
static __always_inline
long clock_gettime32_fallback(clockid_t _clkid, struct old_timespec32 *_ts)
{
long ret;
asm (
"mov %%ebx, %%edx \n"
"mov %[clock], %%ebx \n"
"call __kernel_vsyscall \n"
"mov %%edx, %%ebx \n"
: "=a" (ret), "=m" (*_ts)
: "0" (__NR_clock_gettime), [clock] "g" (_clkid), "c" (_ts)
: "edx");
return ret;
}
static __always_inline
long gettimeofday_fallback(struct __kernel_old_timeval *_tv,
struct timezone *_tz)
{
long ret;
asm(
"mov %%ebx, %%edx \n"
"mov %2, %%ebx \n"
"call __kernel_vsyscall \n"
"mov %%edx, %%ebx \n"
: "=a" (ret)
: "0" (__NR_gettimeofday), "g" (_tv), "c" (_tz)
: "memory", "edx");
return ret;
}
static __always_inline long
clock_getres_fallback(clockid_t _clkid, struct __kernel_timespec *_ts)
{
long ret;
asm (
"mov %%ebx, %%edx \n"
"mov %[clock], %%ebx \n"
"call __kernel_vsyscall \n"
"mov %%edx, %%ebx \n"
: "=a" (ret), "=m" (*_ts)
: "0" (__NR_clock_getres_time64), [clock] "g" (_clkid), "c" (_ts)
: "edx");
return ret;
}
static __always_inline
long clock_getres32_fallback(clockid_t _clkid, struct old_timespec32 *_ts)
{
long ret;
asm (
"mov %%ebx, %%edx \n"
"mov %[clock], %%ebx \n"
"call __kernel_vsyscall \n"
"mov %%edx, %%ebx \n"
: "=a" (ret), "=m" (*_ts)
: "0" (__NR_clock_getres), [clock] "g" (_clkid), "c" (_ts)
: "edx");
return ret;
}
#endif
#ifdef CONFIG_PARAVIRT_CLOCK
static u64 vread_pvclock(void)
{
const struct pvclock_vcpu_time_info *pvti = &pvclock_page.pvti;
u32 version;
u64 ret;
/*
* Note: The kernel and hypervisor must guarantee that cpu ID
* number maps 1:1 to per-CPU pvclock time info.
*
* Because the hypervisor is entirely unaware of guest userspace
* preemption, it cannot guarantee that per-CPU pvclock time
* info is updated if the underlying CPU changes or that that
* version is increased whenever underlying CPU changes.
*
* On KVM, we are guaranteed that pvti updates for any vCPU are
* atomic as seen by *all* vCPUs. This is an even stronger
* guarantee than we get with a normal seqlock.
*
* On Xen, we don't appear to have that guarantee, but Xen still
* supplies a valid seqlock using the version field.
*
* We only do pvclock vdso timing at all if
* PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
* mean that all vCPUs have matching pvti and that the TSC is
* synced, so we can just look at vCPU 0's pvti.
*/
do {
version = pvclock_read_begin(pvti);
if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT)))
return U64_MAX;
ret = __pvclock_read_cycles(pvti, rdtsc_ordered());
} while (pvclock_read_retry(pvti, version));
return ret;
}
#endif
#ifdef CONFIG_HYPERV_TIMER
static u64 vread_hvclock(void)
{
return hv_read_tsc_page(&hvclock_page);
}
#endif
static inline u64 __arch_get_hw_counter(s32 clock_mode,
const struct vdso_data *vd)
{
if (likely(clock_mode == VDSO_CLOCKMODE_TSC))
return (u64)rdtsc_ordered();
/*
* For any memory-mapped vclock type, we need to make sure that gcc
* doesn't cleverly hoist a load before the mode check. Otherwise we
* might end up touching the memory-mapped page even if the vclock in
* question isn't enabled, which will segfault. Hence the barriers.
*/
#ifdef CONFIG_PARAVIRT_CLOCK
if (clock_mode == VDSO_CLOCKMODE_PVCLOCK) {
barrier();
return vread_pvclock();
}
#endif
#ifdef CONFIG_HYPERV_TIMER
if (clock_mode == VDSO_CLOCKMODE_HVCLOCK) {
barrier();
return vread_hvclock();
}
#endif
return U64_MAX;
}
static __always_inline const struct vdso_data *__arch_get_vdso_data(void)
{
return __vdso_data;
}
x86/vdso: Unbreak paravirt VDSO clocks The conversion of x86 VDSO to the generic clock mode storage broke the paravirt and hyperv clocksource logic. These clock sources have their own internal sequence counter to validate the clocksource at the point of reading it. This is necessary because the hypervisor can invalidate the clocksource asynchronously so a check during the VDSO data update is not sufficient. If the internal check during read invalidates the clocksource the read return U64_MAX. The original code checked this efficiently by testing whether the result (casted to signed) is negative, i.e. bit 63 is set. This was done that way because an extra indicator for the validity had more overhead. The conversion broke this check because the check was replaced by a check for a valid VDSO clock mode. The wreckage manifests itself when the paravirt clock is installed as a valid VDSO clock and during runtime invalidated by the hypervisor, e.g. after a host suspend/resume cycle. After the invalidation the read function returns U64_MAX which is used as cycles and makes the clock jump by ~2200 seconds, and become stale until the 2200 seconds have elapsed where it starts to jump again. The period of this effect depends on the shift/mult pair of the clocksource and the jumps and staleness are an artifact of undefined but reproducible behaviour of math overflow. Implement an x86 version of the new vdso_cycles_ok() inline which adds this check back and a variant of vdso_clocksource_ok() which lets the compiler optimize it out to avoid the extra conditional. That's suboptimal when the system does not have a VDSO capable clocksource, but that's not the case which is optimized for. Fixes: 5d51bee725cc ("clocksource: Add common vdso clock mode storage") Reported-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Miklos Szeredi <mszeredi@redhat.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200606221532.080560273@linutronix.de
2020-06-06 21:51:17 +00:00
static inline bool arch_vdso_clocksource_ok(const struct vdso_data *vd)
{
return true;
}
#define vdso_clocksource_ok arch_vdso_clocksource_ok
/*
* Clocksource read value validation to handle PV and HyperV clocksources
* which can be invalidated asynchronously and indicate invalidation by
* returning U64_MAX, which can be effectively tested by checking for a
* negative value after casting it to s64.
*/
static inline bool arch_vdso_cycles_ok(u64 cycles)
{
return (s64)cycles >= 0;
}
#define vdso_cycles_ok arch_vdso_cycles_ok
lib/vdso: Make delta calculation work correctly The x86 vdso implementation on which the generic vdso library is based on has subtle (unfortunately undocumented) twists: 1) The code assumes that the clocksource mask is U64_MAX which means that no bits are masked. Which is true for any valid x86 VDSO clocksource. Stupidly it still did the mask operation for no reason and at the wrong place right after reading the clocksource. 2) It contains a sanity check to catch the case where slightly unsynchronized TSC values can be observed which would cause the delta calculation to make a huge jump. It therefore checks whether the current TSC value is larger than the value on which the current conversion is based on. If it's not larger the base value is used to prevent time jumps. #1 Is not only stupid for the X86 case because it does the masking for no reason it is also completely wrong for clocksources with a smaller mask which can legitimately wrap around during a conversion period. The core timekeeping code does it correct by applying the mask after the delta calculation: (now - base) & mask #2 is equally broken for clocksources which have smaller masks and can wrap around during a conversion period because there the now > base check is just wrong and causes stale time stamps and time going backwards issues. Unbreak it by: 1) Removing the mask operation from the clocksource read which makes the fallback detection work for all clocksources 2) Replacing the conditional delta calculation with a overrideable inline function. #2 could reuse clocksource_delta() from the timekeeping code but that results in a significant performance hit for the x86 VSDO. The timekeeping core code must have the non optimized version as it has to operate correctly with clocksources which have smaller masks as well to handle the case where TSC is discarded as timekeeper clocksource and replaced by HPET or pmtimer. For the VDSO there is no replacement clocksource. If TSC is unusable the syscall is enforced which does the right thing. To accommodate to the needs of various architectures provide an override-able inline function which defaults to the regular delta calculation with masking: (now - base) & mask Override it for x86 with the non-masking and checking version. This unbreaks the ARM64 syscall fallback operation, allows to use clocksources with arbitrary width and preserves the performance optimization for x86. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: linux-arch@vger.kernel.org Cc: LAK <linux-arm-kernel@lists.infradead.org> Cc: linux-mips@vger.kernel.org Cc: linux-kselftest@vger.kernel.org Cc: catalin.marinas@arm.com Cc: Will Deacon <will.deacon@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: linux@armlinux.org.uk Cc: Ralf Baechle <ralf@linux-mips.org> Cc: paul.burton@mips.com Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: salyzyn@android.com Cc: pcc@google.com Cc: shuah@kernel.org Cc: 0x7f454c46@gmail.com Cc: linux@rasmusvillemoes.dk Cc: huw@codeweavers.com Cc: sthotton@marvell.com Cc: andre.przywara@arm.com Cc: Andy Lutomirski <luto@kernel.org> Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1906261159230.32342@nanos.tec.linutronix.de
2019-06-26 10:02:00 +00:00
/*
* x86 specific delta calculation.
*
* The regular implementation assumes that clocksource reads are globally
* monotonic. The TSC can be slightly off across sockets which can cause
* the regular delta calculation (@cycles - @last) to return a huge time
* jump.
*
* Therefore it needs to be verified that @cycles are greater than
* @last. If not then use @last, which is the base time of the current
* conversion period.
*
* This variant also removes the masking of the subtraction because the
* clocksource mask of all VDSO capable clocksources on x86 is U64_MAX
* which would result in a pointless operation. The compiler cannot
* optimize it away as the mask comes from the vdso data and is not compile
* time constant.
*/
static __always_inline
u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, u32 mult)
{
if (cycles > last)
return (cycles - last) * mult;
return 0;
}
#define vdso_calc_delta vdso_calc_delta
#endif /* !__ASSEMBLY__ */
#endif /* __ASM_VDSO_GETTIMEOFDAY_H */