linux/arch/x86/kernel
Thomas Gleixner 54ff7e595d x86: hpet: Work around hardware stupidity
This more or less reverts commits 08be979 (x86: Force HPET
readback_cmp for all ATI chipsets) and 30a564be (x86, hpet: Restrict
read back to affected ATI chipsets) to the status of commit 8da854c
(x86, hpet: Erratum workaround for read after write of HPET
comparator).

The delta to commit 8da854c is mostly comments and the change from
WARN_ONCE to printk_once as we know the call path of this function
already.

This needs really in depth explanation:

First of all the HPET design is a complete failure. Having a counter
compare register which generates an interrupt on matching values
forces the software to do at least one superfluous readback of the
counter register.

While it is nice in theory to program "absolute" time events it is
practically useless because the timer runs at some absurd frequency
which can never be matched to real world units. So we are forced to
calculate a relative delta and this forces a readout of the actual
counter value, adding the delta and programming the compare
register. When the delta is small enough we run into the danger that
we program a compare value which is already in the past. Due to the
compare for equal nature of HPET we need to read back the counter
value after writing the compare rehgister (btw. this is necessary for
absolute timeouts as well) to make sure that we did not miss the timer
event. We try to work around that by setting the minimum delta to a
value which is larger than the theoretical time which elapses between
the counter readout and the compare register write, but that's only
true in theory. A NMI or SMI which hits between the readout and the
write can easily push us beyond that limit. This would result in
waiting for the next HPET timer interrupt until the 32bit wraparound
of the counter happens which takes about 306 seconds.

So we designed the next event function to look like:

   match = read_cnt() + delta;
   write_compare_ref(match);
   return read_cnt() < match ? 0 : -ETIME;

At some point we got into trouble with certain ATI chipsets. Even the
above "safe" procedure failed. The reason was that the write to the
compare register was delayed probably for performance reasons. The
theory was that they wanted to avoid the synchronization of the write
with the HPET clock, which is understandable. So the write does not
hit the compare register directly instead it goes to some intermediate
register which is copied to the real compare register in sync with the
HPET clock. That opens another window for hitting the dreaded "wait
for a wraparound" problem.

To work around that "optimization" we added a read back of the compare
register which either enforced the update of the just written value or
just delayed the readout of the counter enough to avoid the issue. We
unfortunately never got any affirmative info from ATI/AMD about this.

One thing is sure, that we nuked the performance "optimization" that
way completely and I'm pretty sure that the result is worse than
before some HW folks came up with those.

Just for paranoia reasons I added a check whether the read back
compare register value was the same as the value we wrote right
before. That paranoia check triggered a couple of years after it was
added on an Intel ICH9 chipset. Venki added a workaround (commit
8da854c) which was reading the compare register twice when the first
check failed. We considered this to be a penalty in general and
restricted the readback (thus the wasted CPU cycles) to the known to
be affected ATI chipsets.

This turned out to be a utterly wrong decision. 2.6.35 testers
experienced massive problems and finally one of them bisected it down
to commit 30a564be which spured some further investigation.

Finally we got confirmation that the write to the compare register can
be delayed by up to two HPET clock cycles which explains the problems
nicely. All we can do about this is to go back to Venki's initial
workaround in a slightly modified version.

Just for the record I need to say, that all of this could have been
avoided if hardware designers and of course the HPET committee would
have thought about the consequences for a split second. It's out of my
comprehension why designing a working timer is so hard. There are two
ways to achieve it:

 1) Use a counter wrap around aware compare_reg <= counter_reg
    implementation instead of the easy compare_reg == counter_reg

    Downsides:

	- It needs more silicon.

	- It needs a readout of the counter to apply a relative
	  timeout. This is necessary as the counter does not run in
	  any useful (and adjustable) frequency and there is no
	  guarantee that the counter which is used for timer events is
	  the same which is used for reading the actual time (and
	  therefor for calculating the delta)

    Upsides:

	- None

  2) Use a simple down counter for relative timer events

    Downsides:

	- Absolute timeouts are not possible, which is not a problem
	  at all in the context of an OS and the expected
	  max. latencies/jitter (also see Downsides of #1)

   Upsides:

	- It needs less or equal silicon.

	- It works ALWAYS

	- It is way faster than a compare register based solution (One
	  write versus one write plus at least one and up to four
	  reads)

I would not be so grumpy about all of this, if I would not have been
ignored for many years when pointing out these flaws to various
hardware folks. I really hate timers (at least those which seem to be
designed by janitors).

Though finally we got a reasonable explanation plus a solution and I
want to thank all the folks involved in chasing it down and providing
valuable input to this.

Bisected-by: Nix <nix@esperi.org.uk>
Reported-by: Artur Skawina <art.08.09@gmail.com>
Reported-by: Damien Wyart <damien.wyart@free.fr>
Reported-by: John Drescher <drescherjm@gmail.com>
Cc: Venkatesh Pallipadi <venki@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Cc: stable@kernel.org
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-09-15 00:55:13 +02:00
..
acpi Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-08-06 10:07:34 -07:00
apic x86, UV: Fix initialization of max_pnode 2010-09-10 17:15:49 +02:00
cpu Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-09-08 11:14:10 -07:00
.gitignore
alternative.c x86, alternatives: BUG on encountering an invalid CPU feature number 2010-07-13 14:57:50 -07:00
amd_iommu_init.c x86/amd-iommu: Fall back to GART if initialization fails 2010-06-01 10:20:15 +02:00
amd_iommu.c x86/amd-iommu: Export cache-coherency capability 2010-07-27 17:14:24 +02:00
apb_timer.c x86, mrst: add more timer config options 2010-05-19 13:45:39 -07:00
aperture_64.c x86, gcc-4.6: Fix set but not read variables 2010-07-20 15:38:30 -07:00
apm_32.c update email address 2010-07-19 10:56:54 +02:00
asm-offsets_32.c
asm-offsets_64.c tracing: Define NR_syscalls for x86_64 2009-08-26 21:29:58 +02:00
asm-offsets.c
audit_64.c
bios_uv.c Merge branch 'x86-uv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-02-28 11:00:55 -08:00
bootflag.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
check.c
cpuid.c x86: convert cpu notifier to return encapsulate errno value 2010-05-27 09:12:48 -07:00
crash_dump_32.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
crash_dump_64.c
crash.c x86, UV: Make kdump avoid stack dumps 2010-07-21 11:33:27 -07:00
doublefault_32.c x86: Use get_desc_base() 2009-07-19 18:27:51 +02:00
dumpstack_32.c x86: Unify dumpstack.h and stacktrace.h 2010-06-08 23:29:52 +02:00
dumpstack_64.c x86: Unify dumpstack.h and stacktrace.h 2010-06-08 23:29:52 +02:00
dumpstack.c x86: Unify dumpstack.h and stacktrace.h 2010-06-08 23:29:52 +02:00
e820.c suspend: Move NVS save/restore code to generic suspend functionality 2010-06-10 11:02:34 -04:00
early_printk.c earlyprintk,vga,kdb: Fix \b and \r for earlyprintk=vga with kdb 2010-05-20 21:04:31 -05:00
early-quirks.c x86: hpet: Work around hardware stupidity 2010-09-15 00:55:13 +02:00
efi_32.c
efi_64.c x86: Make 64-bit efi_ioremap use ioremap on MMIO regions 2009-08-03 13:34:25 -07:00
efi_stub_32.S
efi_stub_64.S
efi.c x86: Remove trailing spaces in messages 2010-02-07 17:47:51 +01:00
entry_32.S Merge branch 'x86-alternatives-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-08-06 16:24:17 -07:00
entry_64.S Mark arguments to certain syscalls as being const 2010-08-13 16:53:13 -07:00
ftrace.c Merge branch 'tracing/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into tracing/core 2010-02-27 10:06:10 +01:00
head32.c fix typos concerning "initiali[zs]e" 2010-06-16 18:05:05 +02:00
head64.c x86: Make sure free_init_pages() frees pages on page boundary 2010-03-29 18:55:33 +02:00
head_32.S x86-32: Separate 1:1 pagetables from swapper_pg_dir 2010-08-18 09:17:20 -07:00
head_64.S x86-64: Simplify loading initial_gs 2010-07-21 21:23:51 -07:00
head.c
hpet.c x86: hpet: Work around hardware stupidity 2010-09-15 00:55:13 +02:00
hw_breakpoint.c x86: Support for instruction breakpoints 2010-06-24 23:35:27 +02:00
i386_ksyms_32.c x86: Don't generate cmpxchg8b_emu if CONFIG_X86_CMPXCHG64=y 2009-10-01 08:42:24 +02:00
i387.c Merge branch 'kvm-updates/2.6.36' of git://git.kernel.org/pub/scm/virt/kvm/kvm 2010-08-22 11:27:36 -07:00
i8237.c
i8253.c i8253: Convert i8253_lock to raw_spinlock 2010-03-02 10:28:38 +01:00
i8259.c x86, i8259: Only register sysdev if we have a real 8259 PIC 2010-07-20 15:27:33 -07:00
init_task.c Rename .data.cacheline_aligned to .data..cacheline_aligned. 2010-03-03 11:25:58 +01:00
io_delay.c
ioport.c x86-64, paravirt: Call set_iopl_mask() on 64 bits 2009-12-09 16:54:08 -08:00
irq_32.c x86: Unify fixup_irqs() for 32-bit and 64-bit kernels 2009-11-02 15:56:34 +01:00
irq_64.c x86: Unify fixup_irqs() for 32-bit and 64-bit kernels 2009-11-02 15:56:34 +01:00
irq.c genirq: Convert irq_desc.lock to raw_spinlock 2009-12-14 23:55:33 +01:00
irqinit.c x86: Merge simd_math_error() into math_error() 2010-05-03 13:39:29 -07:00
k8.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
kdebugfs.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
kgdb.c kgdb: add missing __percpu markup in arch/x86/kernel/kgdb.c 2010-08-16 15:58:30 -05:00
kprobes.c kprobes/x86: Fix the return address of multiple kretprobes 2010-08-19 12:49:56 +02:00
kvm.c KVM guest: do not batch pte updates from interrupt context 2009-09-10 18:10:50 +03:00
kvmclock.c x86, paravirt: don't compute pvclock adjustments if we trust the tsc 2010-05-19 11:41:05 +03:00
ldt.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
machine_kexec_32.c Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-12-08 13:27:33 -08:00
machine_kexec_64.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
Makefile x86, olpc: Add support for calling into OpenFirmware 2010-06-18 14:54:36 -07:00
mca_32.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
microcode_amd.c Revert "x86: ucode-amd: Load ucode-patches once ..." 2010-01-23 06:21:59 +01:00
microcode_core.c driver core: add devname module aliases to allow module on-demand auto-loading 2010-05-25 15:08:26 -07:00
microcode_intel.c x86: Improve Intel microcode loader performance 2010-03-11 13:49:06 +01:00
mmconf-fam10h_64.c x86: Move range related operation to one file 2010-02-10 17:47:17 -08:00
module.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
mpparse.c x86, apic: Map the local apic when parsing the MP table. 2010-08-05 16:26:42 -07:00
mrst.c Merge branch 'x86-mrst-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-08-06 10:06:28 -07:00
msr.c x86: convert cpu notifier to return encapsulate errno value 2010-05-27 09:12:48 -07:00
olpc_ofw.c x86, olpc: Constify an olpc_ofw() arg 2010-07-30 18:02:21 -07:00
olpc.c x86, olpc: Constify an olpc_ofw() arg 2010-07-30 18:02:21 -07:00
paravirt_patch_32.c
paravirt_patch_64.c
paravirt-spinlocks.c locking: Convert __raw_spin* functions to arch_spin* 2009-12-14 23:55:32 +01:00
paravirt.c x86, paravirt: Remove kmap_atomic_pte paravirt op. 2010-02-27 14:41:35 -08:00
pci-calgary_64.c x86, Calgary: Limit the max PHB number to 256 2010-06-30 22:41:42 -07:00
pci-dma.c x86: Detect whether we should use Xen SWIOTLB. 2010-08-02 15:18:33 -04:00
pci-gart_64.c Merge branch 'iommu/fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/linux-2.6-iommu into x86/urgent 2010-04-13 13:24:54 +02:00
pci-nommu.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
pci-swiotlb.c x86: remove unnecessary sync_single_range_* in swiotlb_dma_ops 2010-05-27 09:12:52 -07:00
pcspeaker.c
pmtimer_64.c
probe_roms_32.c
process_32.c x86, perf: Add power_end event to process_*.c cpu_idle routine 2010-06-18 11:35:10 +02:00
process_64.c x86, perf: Add power_end event to process_*.c cpu_idle routine 2010-06-18 11:35:10 +02:00
process.c Make do_execve() take a const filename pointer 2010-08-17 18:07:43 -07:00
ptrace.c hw-breakpoints: Tag ptrace breakpoint as exclude_kernel 2010-05-01 04:32:07 +02:00
pvclock.c x86, paravirt: don't compute pvclock adjustments if we trust the tsc 2010-05-19 11:41:05 +03:00
quirks.c x86: Force HPET readback_cmp for all ATI chipsets 2010-07-15 17:10:02 +02:00
reboot_fixups_32.c cs5535: move the DIVIL MSR definition into linux/cs5535.h 2009-12-15 08:53:28 -08:00
reboot.c x86: Fix rebooting on Dell Precision WorkStation T7400 2010-06-20 09:24:13 +02:00
relocate_kernel_32.S
relocate_kernel_64.S
rtc.c Merge branch 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-09-18 14:05:47 -07:00
scx200_32.c
setup_percpu.c Fix up trivial spelling errors ('taht' -> 'that') 2010-07-21 09:25:42 -07:00
setup.c x86-32: Separate 1:1 pagetables from swapper_pg_dir 2010-08-18 09:17:20 -07:00
sfi.c x86, irq: Rename gsi_end gsi_top, and fix off by one errors 2010-06-09 13:34:06 -07:00
signal.c x86: Merge sys_sigaltstack 2009-12-09 16:28:59 -08:00
smp.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
smpboot.c x86, hotplug: Serialize CPU hotplug to avoid bringup concurrency issues 2010-08-19 14:47:43 -07:00
stacktrace.c x86: Unify save_stack_address() and save_stack_address_nosched() 2010-06-09 17:32:19 +02:00
step.c x86, ptrace: Fix block-step 2010-03-26 11:33:57 +01:00
sys_i386_32.c Make do_execve() take a const filename pointer 2010-08-17 18:07:43 -07:00
sys_x86_64.c improve sys_newuname() for compat architectures 2010-03-12 15:52:32 -08:00
syscall_64.c
syscall_table_32.S x86: fix up system call numbering nit 2010-08-10 15:35:10 -07:00
tboot.c Merge branch 'kvm-updates/2.6.35' of git://git.kernel.org/pub/scm/virt/kvm/kvm 2010-05-21 17:16:21 -07:00
tce_64.c
test_nx.c
test_rodata.c
time.c x86: Convert i8259_lock to raw_spinlock 2010-02-16 18:21:32 +01:00
tlb_uv.c x86, UV: Initialize BAU hub map 2010-08-01 09:18:41 +02:00
tls.c
tls.h
topology.c
trampoline_32.S x86: cpuinit-annotate SMP boot trampolines properly 2009-09-20 20:23:37 +02:00
trampoline_64.S x86: Fix Suspend to RAM freeze on Acer Aspire 1511Lmi laptop 2009-10-12 18:06:48 +02:00
trampoline.c x86, mm: Fix CONFIG_VMSPLIT_1G and 2G_OPT trampoline 2010-08-24 23:05:17 -07:00
traps.c Merge branch 'perf/nmi' into perf/core 2010-08-05 08:45:05 +02:00
tsc_sync.c locking: Convert __raw_spin* functions to arch_spin* 2009-12-14 23:55:32 +01:00
tsc.c x86, tsc, sched: Recompute cyc2ns_offset's during resume from sleep states 2010-08-20 14:59:02 +02:00
uv_irq.c Merge branch 'x86-uv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-05-18 09:46:35 -07:00
uv_sysfs.c x86: Remove trailing spaces in messages 2010-02-07 17:47:51 +01:00
uv_time.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
verify_cpu_64.S x86: Use symbolic MSR names 2010-07-21 21:23:40 -07:00
visws_quirks.c Merge remote branch 'origin/x86/apic' into x86/mrst 2010-02-22 16:25:18 -08:00
vm86_32.c x86, 32-bit: Convert sys_vm86 & sys_vm86old 2009-12-09 16:29:23 -08:00
vmi_32.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
vmiclock_32.c Merge branch 'for-next' into for-linus 2010-03-08 16:55:37 +01:00
vmlinux.lds.S Merge branch 'for-35' of git://repo.or.cz/linux-kbuild 2010-06-01 08:55:52 -07:00
vsmp_64.c
vsyscall_64.c timkeeping: Fix update_vsyscall to provide wall_to_monotonic offset 2010-07-27 12:40:54 +02:00
x86_init.c x86: Add i8042 pre-detection hook to x86_platform_ops 2010-07-07 17:05:06 -07:00
x8664_ksyms_64.c x86-64: Don't export init_level4_pgt 2010-04-28 17:25:47 -07:00
xsave.c Merge branch 'x86-xsave-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-08-06 16:25:13 -07:00