Enable local interrupts before requesting C1 on the last two generations
of Intel Xeon platforms: Sky Lake, Cascade Lake, Cooper Lake, Ice Lake.
This decreases average C1 interrupt latency by about 5-10%, as measured
with the 'wult' tool.
The '->enter()' function of the driver enters C-states with local
interrupts disabled by executing the 'monitor' and 'mwait' pair of
instructions. If an interrupt happens, the CPU exits the C-state and
continues executing instructions after 'mwait'. It does not jump to
the interrupt handler, because local interrupts are disabled. The
cpuidle subsystem enables interrupts a bit later, after doing some
housekeeping.
With this patch, we enable local interrupts before requesting C1. In
this case, if the CPU wakes up because of an interrupt, it will jump
to the interrupt handler right away. The cpuidle housekeeping will be
done after the pending interrupt(s) are handled.
Enabling interrupts before entering a C-state has measurable impact
for faster C-states, like C1. Deeper, but slower C-states like C6 do
not really benefit from this sort of change, because their latency is
a lot higher comparing to the delay added by cpuidle housekeeping.
This change was also tested with cyclictest and dbench. In case of Ice
Lake, the average cyclictest latency decreased by 5.1%, and the average
'dbench' throughput increased by about 0.8%. Both tests were run for 4
hours with only C1 enabled (all other idle states, including 'POLL',
were disabled). CPU frequency was pinned to HFM, and uncore frequency
was pinned to the maximum value. The other platforms had similar
single-digit percentage improvements.
It is worth noting that this patch affects 'cpuidle' statistics a tiny
bit. Before this patch, C1 residency did not include the interrupt
handling time, but with this patch, it will include it. This is similar
to what happens in case of the 'POLL' state, which also runs with
interrupts enabled.
Suggested-by: Len Brown <len.brown@intel.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Because cpuidle assumes worst-case C-state parameters, PC6 parameters
are used for describing C6, which is worst-case for requesting CC6.
When PC6 is enabled, this is appropriate. But if PC6 is disabled
in the BIOS, the exit latency and target residency should be adjusted
accordingly.
Exit latency:
Previously the C6 exit latency was measured as the PC6 exit latency.
With PC6 disabled, the C6 exit latency should be the one of CC6.
Target residency:
With PC6 disabled, the idle duration within [CC6, PC6) would make the
idle governor choose C1E over C6. This would cause low energy-efficiency.
We should lower the bar to request C6 when PC6 is disabled.
To fill this gap, check if PC6 is disabled in the BIOS in the
MSR_PKG_CST_CONFIG_CONTROL(0xe2) register. If so, use the CC6 exit latency
for C6 and set target_residency to 3 times of the new exit latency. [This
is consistent with how intel_idle driver uses _CST to calculate the
target_residency.] As a result, the OS would be more likely to choose C6
over C1E when PC6 is disabled, which is reasonable, because if C6 is
enabled, it implies that the user cares about energy, so choosing C6 more
frequently makes sense.
The new CC6 exit latency of 92us was measured with wult[1] on SKX via NIC
wakeup as the 99.99th percentile. Also CLX and CPX both have the same CPU
model number as SkX, but their CC6 exit latencies are similar to the SKX
one, 96us and 89us respectively, so reuse the SKX value for them.
There is a concern that it might be better to use a more generic approach
instead of optimizing every platform. However, if the required code
complexity and different PC6 bit interpretation on different platforms
are taken into account, tuning the code per platform seems to be an
acceptable tradeoff.
Link: https://intel.github.io/wult/ # [1]
Suggested-by: Len Brown <len.brown@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This patch adds Icelake Xeon D support to the intel_idle driver.
Since Icelake D and Icelake SP C-state characteristics the same,
we use Icelake SP C-states table for Icelake D as well.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Acked-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Change IceLake Xeon C6 latency from 128 us to 170 us. The latency
was measured with the "wult" tool and corresponds to the 99.99th
percentile when measuring with the "nic" method. Note, the 128 us
figure correspond to the median latency, but in intel_idle we use
the "worst case" latency figure instead.
C6 target residency was increased from 384 us to 600 us, which may
result in less C6 residency in some workloads. This value was tested
and compared to values 384, and 1000. Value 600 is a reasonable
tradeoff between power and performance.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Acked-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Defining DEBUG should only be done in development.
So remove DEBUG.
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add C-state table for the SnowRidge SoC which is found on Intel Jacobsville
platforms.
The following has been changed.
1. C1E latency changed from 10us to 15us. It was measured using the
open source "wult" tool (the "nic" method, 15us is the 99.99th
percentile).
2. C1E power break even changed from 20us to 25us, which may result
in less C1E residency in some workloads.
3. C6 latency changed from 50us to 130us. Measured the same way as C1E.
The C6 C-state is supported only by some SnowRidge revisions, so add a C-state
table commentary about this.
On SnowRidge, C6 support is enumerated via the usual mechanism: "mwait" leaf of
the "cpuid" instruction. The 'intel_idle' driver does check this leaf, so even
though C6 is present in the table, the driver will only use it if the CPU does
support it.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
cpuidle->enter() callbacks should not call into tracing because RCU
has already been disabled. Instead of doing the broadcast thing
itself, simply advertise to the cpuidle core that those states stop
the timer.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20201123143510.GR3021@hirez.programming.kicks-ass.net
Currently intel_idle driver gets the c-state information from ACPI
_CST if the processor model is not recognized by it. However the
c-state in _CST starts with index 1 which is different from the
index in intel_idle driver's internal c-state table.
While intel_idle_max_cstate_reached() was previously introduced to
deal with intel_idle driver's internal c-state table, re-using
this function directly on _CST is incorrect.
Fix this by subtracting 1 from the index when checking max_cstate
in the _CST case.
For example, append intel_idle.max_cstate=1 in boot command line,
Before the patch:
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
POLL
After the patch:
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
/sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1_ACPI
Fixes: 18734958e9 ("intel_idle: Use ACPI _CST for processor models without C-state tables")
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Cc: 5.6+ <stable@vger.kernel.org> # 5.6+
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
e6d4f08a67 ("intel_idle: Use ACPI _CST on server systems") avoids
enabling c-states that have been disabled by the platform with the
exception of C1E.
Unfortunately, BIOS implementations are not always consistent in terms
of how capabilities are advertised and control cannot always be handed
over. If control cannot be handed over then intel_idle reports that "ACPI
_CST not found or not usable" but does not clear acpi_state_table.count
meaning the information is still partially used.
This patch ignores ACPI information if CST control cannot be requested from
the platform. This was only observed on a number of Haswell platforms that
had identical CPUs but not identical BIOS versions. While this problem
may be rare overall, 24 separate test cases bisected to this specific
commit across 4 separate test machines and is worth addressing. If the
situation occurs, the kernel behaves as it did before commit e6d4f08a67
and uses any c-states that are discovered.
The affected test cases were all ones that involved a small number of
processes -- exec microbenchmark, pipe microbenchmark, git test suite,
netperf, tbench with one client and system call microbenchmark. Each
case benefits from being able to use turboboost which is prevented if the
lower c-states are unavailable. This may mask real regressions specific
to older hardware so it is worth addressing.
C-state status before and after the patch
5.9.0-vanilla POLL latency:0 disabled:0 default:enabled
5.9.0-vanilla C1 latency:2 disabled:0 default:enabled
5.9.0-vanilla C1E latency:10 disabled:0 default:enabled
5.9.0-vanilla C3 latency:33 disabled:1 default:disabled
5.9.0-vanilla C6 latency:133 disabled:1 default:disabled
5.9.0-ignore-cst-v1r1 POLL latency:0 disabled:0 default:enabled
5.9.0-ignore-cst-v1r1 C1 latency:2 disabled:0 default:enabled
5.9.0-ignore-cst-v1r1 C1E latency:10 disabled:0 default:enabled
5.9.0-ignore-cst-v1r1 C3 latency:33 disabled:0 default:enabled
5.9.0-ignore-cst-v1r1 C6 latency:133 disabled:0 default:enabled
Patch enables C3/C6.
Netperf UDP_STREAM
netperf-udp
5.5.0 5.9.0
vanilla ignore-cst-v1r1
Hmean send-64 193.41 ( 0.00%) 226.54 * 17.13%*
Hmean send-128 392.16 ( 0.00%) 450.54 * 14.89%*
Hmean send-256 769.94 ( 0.00%) 881.85 * 14.53%*
Hmean send-1024 2994.21 ( 0.00%) 3468.95 * 15.85%*
Hmean send-2048 5725.60 ( 0.00%) 6628.99 * 15.78%*
Hmean send-3312 8468.36 ( 0.00%) 10288.02 * 21.49%*
Hmean send-4096 10135.46 ( 0.00%) 12387.57 * 22.22%*
Hmean send-8192 17142.07 ( 0.00%) 19748.11 * 15.20%*
Hmean send-16384 28539.71 ( 0.00%) 30084.45 * 5.41%*
Fixes: e6d4f08a67 ("intel_idle: Use ACPI _CST on server systems")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: 5.6+ <stable@vger.kernel.org> # 5.6+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Intel SDM does not explicitly say that entering a C-state via MWAIT will
implicitly flush CPU caches as appropriate for that C-state. However,
documentation for individual Intel CPU generations does mention this
behavior.
Since intel_idle binds to any Intel CPU with MWAIT, list this assumption
of MWAIT behavior.
In passing, reword opening comment to make it clear that the driver can
load on any old and future Intel CPU with MWAIT.
Signed-off-by: Alexander Monakov <amonakov@ispras.ru>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This allows moving the leave_mm() call into generic code before
rcu_idle_enter(). Gets rid of more trace_*_rcuidle() users.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Marco Elver <elver@google.com>
Link: https://lkml.kernel.org/r/20200821085348.369441600@infradead.org
On ICX platform, the C1E auto-promotion is enabled by default.
As a result, the CPU might fall into C1E more offen than previous
platforms. Besides, the C1E is not exposed to sysfs on ICX, which
is inconsistent with previous server platforms.
So disable C1E auto-promotion and expose C1E as a separate idle
state, so the C1E and C6 can be disabled via sysfs when necessary.
Beside C1 and C1E, the exit latency of C6 was measured
by a dedicated tool. However the exit latency(41us) exposed
by _CST is much smaller than the one we measured(128us). This
is probably due to the _CST uses the exit latency when woken
up from PC0+C6, rather than PC6+C6 when C6 was measured. Choose
the latter as we need the longest latency in theory.
Reported-by: kernel test robot <lkp@intel.com>
Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Acked-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Reviewed-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Control Flow Integrity(CFI) is a security mechanism that disallows
changes to the original control flow graph of a compiled binary,
making it significantly harder to perform such attacks.
init_state_node() assign same function callback to different
function pointer declarations.
static int init_state_node(struct cpuidle_state *idle_state,
const struct of_device_id *matches,
struct device_node *state_node) { ...
idle_state->enter = match_id->data; ...
idle_state->enter_s2idle = match_id->data; }
Function declarations:
struct cpuidle_state { ...
int (*enter) (struct cpuidle_device *dev,
struct cpuidle_driver *drv,
int index);
void (*enter_s2idle) (struct cpuidle_device *dev,
struct cpuidle_driver *drv,
int index); };
In this case, either enter() or enter_s2idle() would cause CFI check
failed since they use same callee.
Align function prototype of enter() since it needs return value for
some use cases. The return value of enter_s2idle() is no
need currently.
Signed-off-by: Neal Liu <neal.liu@mediatek.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.
In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:
git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
xargs perl -pi -e \
's/\buninitialized_var\(([^\)]+)\)/\1/g;
s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'
drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.
No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.
[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/
Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
The value of the lapic_timer_always_reliable static variable in
the intel_idle driver reflects the boot_cpu_has(X86_FEATURE_ARAT)
value and so it also reflects the static_cpu_has(X86_FEATURE_ARAT)
value.
Hence, the lapic_timer_always_reliable check in intel_idle() is
redundant and apart from this lapic_timer_always_reliable is only
used in two places in which boot_cpu_has(X86_FEATURE_ARAT) can be
used directly.
Eliminate the lapic_timer_always_reliable variable in accordance
with the above observations.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull perf updates from Ingo Molnar:
"The main changes in this cycle were:
Kernel side changes:
- A couple of x86/cpu cleanups and changes were grandfathered in due
to patch dependencies. These clean up the set of CPU model/family
matching macros with a consistent namespace and C99 initializer
style.
- A bunch of updates to various low level PMU drivers:
* AMD Family 19h L3 uncore PMU
* Intel Tiger Lake uncore support
* misc fixes to LBR TOS sampling
- optprobe fixes
- perf/cgroup: optimize cgroup event sched-in processing
- misc cleanups and fixes
Tooling side changes are to:
- perf {annotate,expr,record,report,stat,test}
- perl scripting
- libapi, libperf and libtraceevent
- vendor events on Intel and S390, ARM cs-etm
- Intel PT updates
- Documentation changes and updates to core facilities
- misc cleanups, fixes and other enhancements"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (89 commits)
cpufreq/intel_pstate: Fix wrong macro conversion
x86/cpu: Cleanup the now unused CPU match macros
hwrng: via_rng: Convert to new X86 CPU match macros
crypto: Convert to new CPU match macros
ASoC: Intel: Convert to new X86 CPU match macros
powercap/intel_rapl: Convert to new X86 CPU match macros
PCI: intel-mid: Convert to new X86 CPU match macros
mmc: sdhci-acpi: Convert to new X86 CPU match macros
intel_idle: Convert to new X86 CPU match macros
extcon: axp288: Convert to new X86 CPU match macros
thermal: Convert to new X86 CPU match macros
hwmon: Convert to new X86 CPU match macros
platform/x86: Convert to new CPU match macros
EDAC: Convert to new X86 CPU match macros
cpufreq: Convert to new X86 CPU match macros
ACPI: Convert to new X86 CPU match macros
x86/platform: Convert to new CPU match macros
x86/kernel: Convert to new CPU match macros
x86/kvm: Convert to new CPU match macros
x86/perf/events: Convert to new CPU match macros
...
The new macro set has a consistent namespace and uses C99 initializers
instead of the grufty C89 ones.
Get rid the of the local macro wrappers for consistency.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lkml.kernel.org/r/20200320131510.193755545@linutronix.de
Update the copyright notice in intel_idle.c to cover the recent
changes, drop the description of a "known limitation" that is not
a limitation any more and bump up the driver version number.
No functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Use the BIT() macro for defining CPUIDLE_FLAG_TLB_FLUSHED instead of
the hex bit encoding of the same value.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Turn the description comments of some functions in the intel_idle
driver into proper kerneldoc ones and clean them up.
No functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reorder declarations of static variables so that the __initdata ones
are declared together.
No functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add __initdata or __initconst annotations to the static data
structures that are only used during the initialization of the
driver.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Annotate static variables cpuidle_state_table and mwait_substates
with __initdata, because they are only used during the initialization
of the driver.
Also notice that static variable icpu could be annotated analogously
and the structure pointed to by it could be __initconst, but two of
its fields are accessed via icpu in intel_idle_cpu_init() and
auto_demotion_disable(), so introduce two new static variables,
auto_demotion_disable_flags and disable_promotion_to_c1e, to hold
the values of these fields, set them during the initialization and
use them in those functions instead of accessing the source data
structure via icpu.
That allows icpu to be annotated with __initdata, so do that,
and it will also allow some __initconst annotations to be added
subsequently.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Move the definitions of intel_idle() and intel_idle_s2idle() before
the definitions of cpuidle_state structures referring to them to
avoid having to use additional declarations of them (and drop those
declarations).
No functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add proper kerneldoc descriptions to intel_idle() and
intel_idle_s2idle(), annotate the latter with __cpuidle and
reorder the declarations of local variables in both of them to
reflect the mwait_idle_with_hints() arguments order.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The lapic_timer_always_reliable variable really takes only two values
and some arithmetic in intel_idle() related to comparing it with the
target C-state's MWAIT hint value is unnecessary.
Simplify the code by replacing lapic_timer_always_reliable with
a bool variable lapic_timer_always_reliable and dropping the
LAPIC_TIMER_ALWAYS_RELIABLE symbol along with the excess
computations in intel_idle().
While at it, add a comment explaining the branch taken in intel_idle()
if the LAPIC timer is only reliable in C1 and modify the related debug
message in intel_idle_init() accordingly (the modification of this
message in the only expected functional impact of the change made
here).
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
In certain system configurations it may not be desirable to use some
C-states assumed to be available by intel_idle and the driver needs
to be prevented from using them even before the cpuidle sysfs
interface becomes accessible to user space. Currently, the only way
to achieve that is by setting the 'max_cstate' module parameter to a
value lower than the index of the shallowest of the C-states in
question, but that may be overly intrusive, because it effectively
makes all of the idle states deeper than the 'max_cstate' one go
away (and the C-state to avoid may be in the middle of the range
normally regarded as available).
To allow that limitation to be overcome, introduce a new module
parameter called 'states_off' to represent a list of idle states to
be disabled by default in the form of a bitmask and update the
documentation to cover it.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
For diagnostics, it is generally useful to be able to make intel_idle
take the system's ACPI tables into consideration even if that is not
required for the processor model in there, so introduce a new module
parameter, 'use_acpi', to make that happen and update the documentation
to cover it.
While at it, fix the 'no_acpi' module parameter name in the
documentation.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull x86 cpu-features updates from Ingo Molnar:
"The biggest change in this cycle was a large series from Sean
Christopherson to clean up the handling of VMX features. This both
fixes bugs/inconsistencies and makes the code more coherent and
future-proof.
There are also two cleanups and a minor TSX syslog messages
enhancement"
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/cpu: Remove redundant cpu_detect_cache_sizes() call
x86/cpu: Print "VMX disabled" error message iff KVM is enabled
KVM: VMX: Allow KVM_INTEL when building for Centaur and/or Zhaoxin CPUs
perf/x86: Provide stubs of KVM helpers for non-Intel CPUs
KVM: VMX: Use VMX_FEATURE_* flags to define VMCS control bits
KVM: VMX: Check for full VMX support when verifying CPU compatibility
KVM: VMX: Use VMX feature flag to query BIOS enabling
KVM: VMX: Drop initialization of IA32_FEAT_CTL MSR
x86/cpufeatures: Add flag to track whether MSR IA32_FEAT_CTL is configured
x86/cpu: Set synthetic VMX cpufeatures during init_ia32_feat_ctl()
x86/cpu: Print VMX flags in /proc/cpuinfo using VMX_FEATURES_*
x86/cpu: Detect VMX features on Intel, Centaur and Zhaoxin CPUs
x86/vmx: Introduce VMX_FEATURES_*
x86/cpu: Clear VMX feature flag if VMX is not fully enabled
x86/zhaoxin: Use common IA32_FEAT_CTL MSR initialization
x86/centaur: Use common IA32_FEAT_CTL MSR initialization
x86/mce: WARN once if IA32_FEAT_CTL MSR is left unlocked
x86/intel: Initialize IA32_FEAT_CTL MSR at boot
tools/x86: Sync msr-index.h from kernel sources
selftests, kvm: Replace manual MSR defs with common msr-index.h
...
Move the irtl_ns_units[] definition into irtl_2_usec() which is the
only user of it, use div_u64() for the division in there (as the
divisor is small enough) and use the NSEC_PER_USEC symbol for the
divisor. Also convert the irtl_2_usec() comment to a proper
kerneldo one.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Move intel_idle_verify_cstate(), auto_demotion_disable() and
c1e_promotion_disable() closer to their callers.
While at it, annotate intel_idle_verify_cstate() with __init,
as it is only used during the initialization of the driver.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Annotate the functions that are only used at the initialization time
with __init and the data structures used by them with __initdata or
__initconst.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Move intel_idle_cpuidle_devices_uninit() closer to its caller,
intel_idle_init(), add the __init modifier to its header, drop a
redundant local variable from it and fix up its kerneldoc comment.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Notice that intel_idle_state_table_update() only needs to be called
if icpu is not NULL, so fold it into intel_idle_init_cstates_icpu(),
and pass a pointer to the driver object to
intel_idle_cpuidle_driver_init() as an argument instead of
referencing it locally in there.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Instead of comparing intel_idle_cpuidle_devices with NULL apply
the "!" (not) operator to it when checking it against NULL.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
There is no particular reason why intel_idle_probe() needs to be
a separate function and folding it into intel_idle_init() causes
the code to be somewhat easier to follow, so do just that.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The __setup_broadcast_timer() static function is only called in one
place and "true" is passed to it as the argument in there, so
effectively it is a wrapper arround tick_broadcast_enable().
To simplify the code, call tick_broadcast_enable() directly instead
of __setup_broadcast_timer() and drop the latter.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
As pointed out by Boris, the defines for bits in IA32_FEATURE_CONTROL
are quite a mouthful, especially the VMX bits which must differentiate
between enabling VMX inside and outside SMX (TXT) operation. Rename the
MSR and its bit defines to abbreviate FEATURE_CONTROL as FEAT_CTL to
make them a little friendlier on the eyes.
Arguably, the MSR itself should keep the full IA32_FEATURE_CONTROL name
to match Intel's SDM, but a future patch will add a dedicated Kconfig,
file and functions for the MSR. Using the full name for those assets is
rather unwieldy, so bite the bullet and use IA32_FEAT_CTL so that its
nomenclature is consistent throughout the kernel.
Opportunistically, fix a few other annoyances with the defines:
- Relocate the bit defines so that they immediately follow the MSR
define, e.g. aren't mistaken as belonging to MISC_FEATURE_CONTROL.
- Add whitespace around the block of feature control defines to make
it clear they're all related.
- Use BIT() instead of manually encoding the bit shift.
- Use "VMX" instead of "VMXON" to match the SDM.
- Append "_ENABLED" to the LMCE (Local Machine Check Exception) bit to
be consistent with the kernel's verbiage used for all other feature
control bits. Note, the SDM refers to the LMCE bit as LMCE_ON,
likely to differentiate it from IA32_MCG_EXT_CTL.LMCE_EN. Ignore
the (literal) one-off usage of _ON, the SDM is simply "wrong".
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20191221044513.21680-2-sean.j.christopherson@intel.com
In many cases, especially on server systems, it is desirable to avoid
enabling C-states that have been disabled in the platform firmware
(BIOS) setup, except for C1E.
As a rule, the C-states disabled this way are not listed by ACPI
_CST, so if that is used by intel_idle along with the specific
table of C-states that it has for the given processor, the C-states
disabled through the platform firmware will not be enabled by default
by intel_idle.
Accordingly, set the use_acpi flag (introduced previously) in all
server processor profiles defined in intel_idle (so as to make it use
ACPI _CST to decide which C-states to enable by default) and set
the CPUIDLE_FLAG_ALWAYS_ENABLE flag (also introduced previously)
for C1E in all C-states tables in intel_idle that contain C1 too
(so that C1E is enabled regardless of whether or not it is listed
by ACPI _CST).
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add a new module parameter called "no_acpi" to the intel_idle driver
to allow the driver to be prevented from using ACPI _CST via kernel
command line.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Update the intel_idle driver to get the C-states information from ACPI
_CST in some cases in which the processor is known to the driver, as long as
that information is available and the new use_acpi flag is set in the
profile of the processor in question.
In the cases when there is a specific table of C-states for the given
processor in the driver, that table is used as the primary source of
information on the available C-states, but if ACPI _CST is present,
the C-states that are not listed by it will not be enabled by default
(they still can be enabled later by user space via sysfs, though).
The new CPUIDLE_FLAG_ALWAYS_ENABLE flag can be used for marking
C-states that should be enabled by default even if they are not
listed by ACPI _CST.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Modify the intel_idle driver to get the C-states information from ACPI
_CST if the processor model is not recognized by it.
The processor is still required to support MWAIT and the information
from ACPI _CST will only be used if all of the C-states listed by
_CST are of the ACPI_CSTATE_FFH type (which means that they are
expected to be entered via MWAIT).
Moreover, the driver assumes that the _CST information is the same
for all CPUs in the system, so it is sufficient to evaluate _CST for
one of them and extract the common list of C-states from there.
Also _CST is evaluated once at the system initialization time and
the driver does not respond to _CST change notifications (that can
be changed in the future).
The main functional difference between intel_idle with this change
and the ACPI processor driver is that the former sets the target
residency to be equal to the exit latency (provided by _CST) for
C1-type C-states and to 3 times the exit latency value for the other
C-state types, whereas the latter obtains the target residency by
multiplying the exit latency by the same number (2 by default) for
all C-state types. Therefore it is expected that in general using
the former instead of the latter on the same system will lead to
improved energy-efficiency.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Move the C-state verification and checks from
intel_idle_cpuidle_driver_init() to a separate function,
intel_idle_verify_cstate(), and make the former call it after
checking the CPUIDLE_FLAG_UNUSABLE state flag.
Also combine the drv->states[] updates with the incrementation of
drv->state_count.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
After recent cpuidle updates the "disabled" field in struct
cpuidle_state is only used by two drivers (intel_idle and shmobile
cpuidle) for marking unusable idle states, but that may as well be
achieved with the help of a state flag, so define an "unusable" idle
state flag, CPUIDLE_FLAG_UNUSABLE, make the drivers in question use
it instead of the "disabled" field and make the core set
CPUIDLE_STATE_DISABLED_BY_DRIVER for the idle states with that flag
set.
After the above changes, the "disabled" field in struct cpuidle_state
is not used any more, so drop it.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently big microservers have _XEON_D while small microservers have
_X, Make it uniformly: _D.
for i in `git grep -l "\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*_\(X\|XEON_D\)"`
do
sed -i -e 's/\(\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*ATOM.*\)_X/\1_D/g' \
-e 's/\(\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*\)_XEON_D/\1_D/g' ${i}
done
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: x86@kernel.org
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/20190827195122.677152989@infradead.org
Currently big core clients with extra graphics on have:
- _G
- _GT3E
Make it uniformly: _G
for i in `git grep -l "\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*_GT3E"`
do
sed -i -e 's/\(\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*\)_GT3E/\1_G/g' ${i}
done
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: x86@kernel.org
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/20190827195122.622802314@infradead.org
Currently big core mobile chips have either:
- _L
- _ULT
- _MOBILE
Make it uniformly: _L.
for i in `git grep -l "\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*_\(MOBILE\|ULT\)"`
do
sed -i -e 's/\(\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*\)_\(MOBILE\|ULT\)/\1_L/g' ${i}
done
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: x86@kernel.org
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190827195122.568978530@infradead.org
Currently the big core client models either have:
- no OPTDIFF
- _CORE
- _DESKTOP
Make it uniformly: 'no OPTDIFF'.
for i in `git grep -l "\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*_\(CORE\|DESKTOP\)"`
do
sed -i -e 's/\(\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*\)_\(CORE\|DESKTOP\)/\1/g' ${i}
done
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: x86@kernel.org
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190827195122.513945586@infradead.org