s2idle works like a regular suspend with freezing processes and freezing
devices. All CPUs except the control CPU go into idle. Once this is
completed the control CPU kicks all other CPUs out of idle, so that they
reenter the idle loop and then enter s2idle state. The control CPU then
issues an swait() on the suspend state and therefore enters the idle loop
as well.
Due to being kicked out of idle, the other CPUs leave their NOHZ states,
which means the tick is active and the corresponding hrtimer is programmed
to the next jiffie.
On entering s2idle the CPUs shut down their local clockevent device to
prevent wakeups. The last CPU which enters s2idle shuts down its local
clockevent and freezes timekeeping.
On resume, one of the CPUs receives the wakeup interrupt, unfreezes
timekeeping and its local clockevent and starts the resume process. At that
point all other CPUs are still in s2idle with their clockevents switched
off. They only resume when they are kicked by another CPU or after resuming
devices and then receiving a device interrupt.
That means there is no guarantee that all CPUs will wakeup directly on
resume. As a consequence there is no guarantee that timers which are queued
on those CPUs and should expire directly after resume, are handled. Also
timer list timers which are remotely queued to one of those CPUs after
resume will not result in a reprogramming IPI as the tick is
active. Queueing a hrtimer will also not result in a reprogramming IPI
because the first hrtimer event is already in the past.
The recent introduction of the timer pull model (7ee9887703 ("timers:
Implement the hierarchical pull model")) amplifies this problem, if the
current migrator is one of the non woken up CPUs. When a non pinned timer
list timer is queued and the queuing CPU goes idle, it relies on the still
suspended migrator CPU to expire the timer which will happen by chance.
The problem exists since commit 8d89835b04 ("PM: suspend: Do not pause
cpuidle in the suspend-to-idle path"). There the cpuidle_pause() call which
in turn invoked a wakeup for all idle CPUs was moved to a later point in
the resume process. This might not be reached or reached very late because
it waits on a timer of a still suspended CPU.
Address this by kicking all CPUs out of idle after the control CPU returns
from swait() so that they resume their timers and restore consistent system
state.
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218641
Fixes: 8d89835b04 ("PM: suspend: Do not pause cpuidle in the suspend-to-idle path")
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mario Limonciello <mario.limonciello@amd.com>
Cc: 5.16+ <stable@kernel.org> # 5.16+
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subsytem:
- rtc_class is now const
Drivers:
- ds1511: driver cleanup, set date and time range and alarm offset limit
- max31335: fix interrupt handler
- pcf8523: improve suspend support
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEBqsFVZXh8s/0O5JiY6TcMGxwOjIFAmX8uCcACgkQY6TcMGxw
OjI1hA//fK1Cs2eDL2E7e8bQCJtDJ+gdzVT210LwT9TCOgUV+arqLjF/i/pq8m5j
CK/ukk5OLZL3v5dS7rsOZAiTg1N9Q+3SEoaTP029kRfA2SxOE2isR9W5nNAvyJC0
L4Y4YEcEMsJpluK94G1/4ult1We9vCGSlqlNknChp3BL5ELxg7T/Ea5wda2OLSRx
+ZgQpB6O+LIIQucm8mgCAlucMJrAj4+qEF7HO0AeElDh+md4OvjrF8Y30vTfs82L
hGazWQgu8J4Jmy4u6q00s5IhfsvwNZVYgaASv8ivB8/9qgQLktZ6HFHRSwloRXg2
o9OqcM/OR7Lw4taH/6pnVQ37UUAQ/Vy9QJy2GHza1uDzlBBjtqVz8zFqheu/rmnJ
3AX7hlowBx9fPpxmStP2Ac9WIvtepV25eOlDDSlDiS8/4T/DRJHwpRQghSTDnFJ4
fRU1HFbaj3ZKRgBzu5zJU4XvQO4X/DKxSHVMy+rVtsndWzDa8q11QZrQJAP4ZpLM
/+zfar2RFnsnop1Dg9cz/hq0gWbQUMJ16qN3cFYl3jVVzC3JSld7B0+DhZdII1Lg
YsCXuUQusnOTFm9GtyC6doHgmC3Dy2wentz6Vn6ynyBOR4NKjJPkI7PICdgzkIDi
AJI9qSxQcT6gOSw2PAlzOcYb7AD3cOJCeg3ukt7aDl8S38RKR3w=
=Is7n
-----END PGP SIGNATURE-----
Merge tag 'rtc-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux
Pull RTC updates from Alexandre Belloni:
"Subsytem:
- rtc_class is now const
Drivers:
- ds1511: cleanup, set date and time range and alarm offset limit
- max31335: fix interrupt handler
- pcf8523: improve suspend support"
* tag 'rtc-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (28 commits)
MAINTAINER: Include linux-arm-msm for Qualcomm RTC patches
dt-bindings: rtc: zynqmp: Add support for Versal/Versal NET SoCs
rtc: class: make rtc_class constant
dt-bindings: rtc: abx80x: Improve checks on trickle charger constraints
MAINTAINERS: adjust file entry in ARM/Mediatek RTC DRIVER
rtc: nct3018y: fix possible NULL dereference
rtc: max31335: fix interrupt status reg
rtc: mt6397: select IRQ_DOMAIN instead of depending on it
dt-bindings: rtc: abx80x: convert to yaml
rtc: m41t80: Use the unified property API get the wakeup-source property
dt-bindings: at91rm9260-rtt: add sam9x7 compatible
dt-bindings: rtc: convert MT7622 RTC to the json-schema
dt-bindings: rtc: convert MT2717 RTC to the json-schema
rtc: pcf8523: add suspend handlers for alarm IRQ
rtc: ds1511: set alarm offset limit
rtc: ds1511: set range
rtc: ds1511: drop inline/noinline hints
rtc: ds1511: rename pdata
rtc: ds1511: implement ds1511_rtc_read_alarm properly
rtc: ds1511: remove partial alarm support
...
The EM only supports power in uW. Make sure that it is not possible to
register some downstream driver which doesn't provide power in uW.
The only exception is artificial EM, but that EM is ignored by the rest of
kernel frameworks (thermal, powercap, etc).
Reported-by: PoShao Chen <poshao.chen@mediatek.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
- Allow the Energy Model to be updated dynamically (Lukasz Luba).
- Add support for LZ4 compression algorithm to the hibernation image
creation and loading code (Nikhil V).
- Fix and clean up system suspend statistics collection (Rafael
Wysocki).
- Simplify device suspend and resume handling in the power management
core code (Rafael Wysocki).
- Fix PCI hibernation support description (Yiwei Lin).
- Make hibernation take set_memory_ro() return values into account as
appropriate (Christophe Leroy).
- Set mem_sleep_current during kernel command line setup to avoid an
ordering issue with handling it (Maulik Shah).
- Fix wake IRQs handling when pm_runtime_force_suspend() is used as a
driver's system suspend callback (Qingliang Li).
- Simplify pm_runtime_get_if_active() usage and add a replacement for
pm_runtime_put_autosuspend() (Sakari Ailus).
- Add a tracepoint for runtime_status changes tracking (Vilas Bhat).
- Fix section title markdown in the runtime PM documentation (Yiwei
Lin).
- Enable preferred core support in the amd-pstate cpufreq driver (Meng
Li).
- Fix min_perf assignment in amd_pstate_adjust_perf() and make the
min/max limit perf values in amd-pstate always stay within the
(highest perf, lowest perf) range (Tor Vic, Meng Li).
- Allow intel_pstate to assign model-specific values to strings used in
the EPP sysfs interface and make it do so on Meteor Lake (Srinivas
Pandruvada).
- Drop long-unused cpudata::prev_cummulative_iowait from the
intel_pstate cpufreq driver (Jiri Slaby).
- Prevent scaling_cur_freq from exceeding scaling_max_freq when the
latter is an inefficient frequency (Shivnandan Kumar).
- Change default transition delay in cpufreq to 2ms (Qais Yousef).
- Remove references to 10ms minimum sampling rate from comments in the
cpufreq code (Pierre Gondois).
- Honour transition_latency over transition_delay_us in cpufreq (Qais
Yousef).
- Stop unregistering cpufreq cooling on CPU hot-remove (Viresh Kumar).
- General enhancements / cleanups to ARM cpufreq drivers (tianyu2,
Nícolas F. R. A. Prado, Erick Archer, Arnd Bergmann, Anastasia
Belova).
- Update cpufreq-dt-platdev to block/approve devices (Richard Acayan).
- Make the SCMI cpufreq driver get a transition delay value from
firmware (Pierre Gondois).
- Prevent the haltpoll cpuidle governor from shrinking guest
poll_limit_ns below grow_start (Parshuram Sangle).
- Avoid potential overflow in integer multiplication when computing
cpuidle state parameters (C Cheng).
- Adjust MWAIT hint target C-state computation in the ACPI cpuidle
driver and in intel_idle to return a correct value for C0 (He
Rongguang).
- Address multiple issues in the TPMI RAPL driver and add support for
new platforms (Lunar Lake-M, Arrow Lake) to Intel RAPL (Zhang Rui).
- Fix freq_qos_add_request() return value check in dtpm_cpu (Daniel
Lezcano).
- Fix kernel-doc for dtpm_create_hierarchy() (Yang Li).
- Fix file leak in get_pkg_num() in x86_energy_perf_policy (Samasth
Norway Ananda).
- Fix cpupower-frequency-info.1 man page typo (Jan Kratochvil).
- Fix a couple of warnings in the OPP core code related to W=1
builds (Viresh Kumar).
- Move dev_pm_opp_{init|free}_cpufreq_table() to pm_opp.h (Viresh
Kumar).
- Extend dev_pm_opp_data with turbo support (Sibi Sankar).
- dt-bindings: drop maxItems from inner items (David Heidelberg).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmXvI/ISHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRx24sP/jxg6fOGme8raHQvpTXG3/H56wlGzQ4P
YUvvKUXnfD3yf1zNISsUl7VQebZqDt8rygkwSdymXlUVZX1eubN0RpCFc0F8GZuc
THG/YQhYQr/9zro3FpKhfDj5evk21PCQzjf+dGvfQF9qVMxNPG1JzEFK6PnolT5X
2BvkonY1XFWZjCMbZ83B/jt35lTDb0cmeNbCpfD5UJgcnxmMOtZYpORdyfPWTJpG
GVCwmAFVVXxXlust/AIpt3mmOpKzSA9GnrtJkhtQe5GN+Y4OjnJiFJmTC7EfCctj
JlWgVUA716mtFMUrjXgjfI54firF2oQpqaSa2HG/V/A96JWQqjarGz5dAV1IrPEt
ZmYpvMe4E90S411wF1OWyrEqjXUuDnH1OWUvUdWSt4E7DhFw3esDi/jLW2tyVKAT
hIy+/O4wzbDSTX/h9Cgt1Qjhew6lKUIwvhEXclB3fuJ+JoviWNkC9lnK93e2H0A3
VYfkd/lpUD74035l0FrCJ/49MjX9kqrsn+TipHsIlSXAi8ZRdKbVvxOTD8RYudcI
GvCiDDrkMgNwGlyedgbtTBUepCvSg93b+vVmRj7YMPtBhioOUo3qCn6wpqhxfnth
9BCnPW7JxqUw/NJdlk9hKumaUZq+MK8G+kdYcIDg6xmAkWSUVP2QKlWavfMCxqRP
+dN6T2iHsKFe
=UePT
-----END PGP SIGNATURE-----
Merge tag 'pm-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"From the functional perspective, the most significant change here is
the addition of support for Energy Models that can be updated
dynamically at run time.
There is also the addition of LZ4 compression support for hibernation,
the new preferred core support in amd-pstate, new platforms support in
the Intel RAPL driver, new model-specific EPP handling in intel_pstate
and more.
Apart from that, the cpufreq default transition delay is reduced from
10 ms to 2 ms (along with some related adjustments), the system
suspend statistics code undergoes a significant rework and there is a
usual bunch of fixes and code cleanups all over.
Specifics:
- Allow the Energy Model to be updated dynamically (Lukasz Luba)
- Add support for LZ4 compression algorithm to the hibernation image
creation and loading code (Nikhil V)
- Fix and clean up system suspend statistics collection (Rafael
Wysocki)
- Simplify device suspend and resume handling in the power management
core code (Rafael Wysocki)
- Fix PCI hibernation support description (Yiwei Lin)
- Make hibernation take set_memory_ro() return values into account as
appropriate (Christophe Leroy)
- Set mem_sleep_current during kernel command line setup to avoid an
ordering issue with handling it (Maulik Shah)
- Fix wake IRQs handling when pm_runtime_force_suspend() is used as a
driver's system suspend callback (Qingliang Li)
- Simplify pm_runtime_get_if_active() usage and add a replacement for
pm_runtime_put_autosuspend() (Sakari Ailus)
- Add a tracepoint for runtime_status changes tracking (Vilas Bhat)
- Fix section title markdown in the runtime PM documentation (Yiwei
Lin)
- Enable preferred core support in the amd-pstate cpufreq driver
(Meng Li)
- Fix min_perf assignment in amd_pstate_adjust_perf() and make the
min/max limit perf values in amd-pstate always stay within the
(highest perf, lowest perf) range (Tor Vic, Meng Li)
- Allow intel_pstate to assign model-specific values to strings used
in the EPP sysfs interface and make it do so on Meteor Lake
(Srinivas Pandruvada)
- Drop long-unused cpudata::prev_cummulative_iowait from the
intel_pstate cpufreq driver (Jiri Slaby)
- Prevent scaling_cur_freq from exceeding scaling_max_freq when the
latter is an inefficient frequency (Shivnandan Kumar)
- Change default transition delay in cpufreq to 2ms (Qais Yousef)
- Remove references to 10ms minimum sampling rate from comments in
the cpufreq code (Pierre Gondois)
- Honour transition_latency over transition_delay_us in cpufreq (Qais
Yousef)
- Stop unregistering cpufreq cooling on CPU hot-remove (Viresh Kumar)
- General enhancements / cleanups to ARM cpufreq drivers (tianyu2,
Nícolas F. R. A. Prado, Erick Archer, Arnd Bergmann, Anastasia
Belova)
- Update cpufreq-dt-platdev to block/approve devices (Richard Acayan)
- Make the SCMI cpufreq driver get a transition delay value from
firmware (Pierre Gondois)
- Prevent the haltpoll cpuidle governor from shrinking guest
poll_limit_ns below grow_start (Parshuram Sangle)
- Avoid potential overflow in integer multiplication when computing
cpuidle state parameters (C Cheng)
- Adjust MWAIT hint target C-state computation in the ACPI cpuidle
driver and in intel_idle to return a correct value for C0 (He
Rongguang)
- Address multiple issues in the TPMI RAPL driver and add support for
new platforms (Lunar Lake-M, Arrow Lake) to Intel RAPL (Zhang Rui)
- Fix freq_qos_add_request() return value check in dtpm_cpu (Daniel
Lezcano)
- Fix kernel-doc for dtpm_create_hierarchy() (Yang Li)
- Fix file leak in get_pkg_num() in x86_energy_perf_policy (Samasth
Norway Ananda)
- Fix cpupower-frequency-info.1 man page typo (Jan Kratochvil)
- Fix a couple of warnings in the OPP core code related to W=1 builds
(Viresh Kumar)
- Move dev_pm_opp_{init|free}_cpufreq_table() to pm_opp.h (Viresh
Kumar)
- Extend dev_pm_opp_data with turbo support (Sibi Sankar)
- dt-bindings: drop maxItems from inner items (David Heidelberg)"
* tag 'pm-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (95 commits)
dt-bindings: opp: drop maxItems from inner items
OPP: debugfs: Fix warning around icc_get_name()
OPP: debugfs: Fix warning with W=1 builds
cpufreq: Move dev_pm_opp_{init|free}_cpufreq_table() to pm_opp.h
OPP: Extend dev_pm_opp_data with turbo support
Fix cpupower-frequency-info.1 man page typo
cpufreq: scmi: Set transition_delay_us
firmware: arm_scmi: Populate fast channel rate_limit
firmware: arm_scmi: Populate perf commands rate_limit
cpuidle: ACPI/intel: fix MWAIT hint target C-state computation
PM: sleep: wakeirq: fix wake irq warning in system suspend
powercap: dtpm: Fix kernel-doc for dtpm_create_hierarchy() function
cpufreq: Don't unregister cpufreq cooling on CPU hotplug
PM: suspend: Set mem_sleep_current during kernel command line setup
cpufreq: Honour transition_latency over transition_delay_us
cpufreq: Limit resolving a frequency to policy min/max
Documentation: PM: Fix runtime_pm.rst markdown syntax
cpufreq: amd-pstate: adjust min/max limit perf
cpufreq: Remove references to 10ms min sampling rate
cpufreq: intel_pstate: Update default EPPs for Meteor Lake
...
Merge Enery Model changes for 6.9-rc1:
- Allow the Energy Model to be updated dynamically (Lukasz Luba).
* pm-em: (24 commits)
PM: EM: Fix nr_states warnings in static checks
Documentation: EM: Update with runtime modification design
PM: EM: Add em_dev_compute_costs()
PM: EM: Remove old table
PM: EM: Change debugfs configuration to use runtime EM table data
drivers/thermal/devfreq_cooling: Use new Energy Model interface
drivers/thermal/cpufreq_cooling: Use new Energy Model interface
powercap/dtpm_devfreq: Use new Energy Model interface to get table
powercap/dtpm_cpu: Use new Energy Model interface to get table
PM: EM: Optimize em_cpu_energy() and remove division
PM: EM: Support late CPUs booting and capacity adjustment
PM: EM: Add performance field to struct em_perf_state and optimize
PM: EM: Add em_perf_state_from_pd() to get performance states table
PM: EM: Introduce em_dev_update_perf_domain() for EM updates
PM: EM: Add functions for memory allocations for new EM tables
PM: EM: Use runtime modified EM for CPUs energy estimation in EAS
PM: EM: Introduce runtime modifiable table
PM: EM: Split the allocation and initialization of the EM table
PM: EM: Check if the get_cost() callback is present in em_compute_costs()
PM: EM: Introduce em_compute_costs()
...
Since commit 43a7206b09 ("driver core: class: make class_register() take
a const *"), the driver core allows for struct class to be in read-only
memory, so move the rtc_class structure to be declared at build time
placing it into read-only memory, instead of having to be dynamically
allocated at boot time.
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Link: https://lore.kernel.org/r/20240305-class_cleanup-abelloni-v1-1-944c026137c8@marliere.net
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
psci_init_system_suspend() invokes suspend_set_ops() very early during
bootup even before kernel command line for mem_sleep_default is setup.
This leads to kernel command line mem_sleep_default=s2idle not working
as mem_sleep_current gets changed to deep via suspend_set_ops() and never
changes back to s2idle.
Set mem_sleep_current along with mem_sleep_default during kernel command
line setup as default suspend mode.
Fixes: faf7ec4a92 ("drivers: firmware: psci: add system suspend support")
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Maulik Shah <quic_mkshah@quicinc.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
During the static checks nr_states has been mentioned by the kernel test
robot. Fix the warning in those 2 places.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
set_memory_ro() and set_memory_rw() can fail, leaving memory
unprotected.
Take the returned value into account and abort in case of
failure.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently the default compression algorithm is selected based on
compile time options. Introduce a module parameter "hibernate.compressor"
to override this behaviour.
Different compression algorithms have different characteristics and
hibernation may benefit when it uses any of these algorithms, especially
when a secondary algorithm(LZ4) offers better decompression speeds over
a default algorithm(LZO), which in turn reduces hibernation image
restore time.
Users can override the default algorithm in two ways:
1) Passing "hibernate.compressor" as kernel command line parameter.
Usage:
LZO: hibernate.compressor=lzo
LZ4: hibernate.compressor=lz4
2) Specifying the algorithm at runtime.
Usage:
LZO: echo lzo > /sys/module/hibernate/parameters/compressor
LZ4: echo lz4 > /sys/module/hibernate/parameters/compressor
Currently LZO and LZ4 are the supported algorithms. LZO is the default
compression algorithm used with hibernation.
Signed-off-by: Nikhil V <quic_nprakash@quicinc.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The device drivers can modify EM at runtime by providing a new EM table.
The EM is used by the EAS and the em_perf_state::cost stores
pre-calculated value to avoid overhead. This patch provides the API for
device drivers to calculate the cost values properly (and not duplicate
the same code).
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Remove the old EM table which wasn't able to modify the data. Clean the
unneeded function and refactor the code a bit.
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Dump the runtime EM table values which can be modified in time. In order
to do that allocate chunk of debug memory which can be later freed
automatically thanks to devm_kcalloc().
This design can handle the fact that the EM table memory can change
after EM update, so debug code cannot use the pointer from initialization
phase.
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The Energy Model (EM) can be modified at runtime which brings new
possibilities. The em_cpu_energy() is called by the Energy Aware Scheduler
(EAS) in its hot path. The energy calculation uses power value for
a given performance state (ps) and the CPU busy time as percentage for that
given frequency.
It is possible to avoid the division by 'scale_cpu' at runtime, because
EM is updated whenever new max capacity CPU is set in the system.
Use that feature and do the needed division during the calculation of the
coefficient 'ps->cost'. That enhanced 'ps->cost' value can be then just
multiplied simply by utilization:
pd_nrg = ps->cost * \Sum cpu_util
to get the needed energy for whole Performance Domain (PD).
With this optimization and earlier removal of map_util_freq(), the
em_cpu_energy() should run faster on the Big CPU by 1.43x and on the Little
CPU by 1.69x (RockPi 4B board).
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The patch adds needed infrastructure to handle the late CPUs boot, which
might change the previous CPUs capacity values. With this changes the new
CPUs which try to register EM will trigger the needed re-calculations for
other CPUs EMs. Thanks to that the em_per_state::performance values will
be aligned with the CPU capacity information after all CPUs finish the
boot and EM registrations.
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The performance doesn't scale linearly with the frequency. Also, it may
be different in different workloads. Some CPUs are designed to be
particularly good at some applications e.g. images or video processing
and other CPUs in different. When those different types of CPUs are
combined in one SoC they should be properly modeled to get max of the HW
in Energy Aware Scheduler (EAS). The Energy Model (EM) provides the
power vs. performance curves to the EAS, but assumes the CPUs capacity
is fixed and scales linearly with the frequency. This patch allows to
adjust the curve on the 'performance' axis as well.
Code speed optimization:
Removing map_util_freq() allows to avoid one division and one
multiplication operations from the EAS hot code path.
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add API function em_dev_update_perf_domain() which allows the EM to be
changed safely.
Concurrent updaters are serialized with a mutex and the removal of memory
that will not be used any more is carried out with the help of RCU.
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The runtime modified EM table can be provided from drivers. Create
mechanism which allows safely allocate and free the table for device
drivers. The same table can be used by the EAS in task scheduler code
paths, so make sure the memory is not freed when the device driver module
is unloaded.
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The new runtime table can be populated with a new power data to better
reflect the actual efficiency of the device e.g. CPU. The power can vary
over time e.g. due to the SoC temperature change. Higher temperature can
increase power values. For longer running scenarios, such as game or
camera, when also other devices are used (e.g. GPU, ISP) the CPU power can
change. The new EM framework is able to addresses this issue and change
the EM data at runtime safely.
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Split the process of allocation and data initialization for the EM table.
The upcoming changes for modifiable EM will use it.
This change is not expected to alter the general functionality.
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subsequent changes will introduce a case in which 'cb->get_cost' may
not be set in em_compute_costs(), so add a check to ensure that it is
not NULL before attempting to dereference it.
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Move the EM costs computation code into a new dedicated function,
em_compute_costs(), that can be reused in other places in the future.
This change is not expected to alter the general functionality.
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The Energy Model might be updated at runtime and the energy efficiency
for each OPP may change. Thus, there is a need to update also the
cpufreq framework and make it aligned to the new values. In order to
do that, use a first active CPU from the Performance Domain. This is
needed since the first CPU in the cpumask might be offline when we
run this code path.
Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
In order to prepare the code for the modifiable EM perf_state table,
make em_cpufreq_update_efficiencies() take a pointer to the EM table
as its second argument and modify it to use that new argument instead
of the 'table' member of dev->em_pd.
No functional impact.
Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Fix missing newline for the string long in the error code path.
Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Extend the support for LZ4 compression to be used with hibernation.
The main idea is that different compression algorithms
have different characteristics and hibernation may benefit when it uses
any of these algorithms: a default algorithm, having higher
compression rate but is slower(compression/decompression) and a
secondary algorithm, that is faster(compression/decompression) but has
lower compression rate.
LZ4 algorithm has better decompression speeds over LZO. This reduces
the hibernation image restore time.
As per test results:
LZO LZ4
Size before Compression(bytes) 682696704 682393600
Size after Compression(bytes) 146502402 155993547
Decompression Rate 335.02 MB/s 501.05 MB/s
Restore time 4.4s 3.8s
LZO is the default compression algorithm used for hibernation. Enable
CONFIG_HIBERNATION_COMP_LZ4 to set the default compressor as LZ4.
Signed-off-by: Nikhil V <quic_nprakash@quicinc.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently for hibernation, LZO is the only compression algorithm
available and uses the existing LZO library calls. However, there
is no flexibility to switch to other algorithms which provides better
results. The main idea is that different compression algorithms have
different characteristics and hibernation may benefit when it uses
alternate algorithms.
By moving to crypto based APIs, it lays a foundation to use other
compression algorithms for hibernation. There are no functional changes
introduced by this approach.
Signed-off-by: Nikhil V <quic_nprakash@quicinc.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Renaming lzo* to generic names, except for lzo_xxx() APIs. This is
used in the next patch where we move to crypto based APIs for
compression. There are no functional changes introduced by this
approach.
Signed-off-by: Nikhil V <quic_nprakash@quicinc.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Because dpm_save_failed_dev() may be called simultaneously by multiple
failing device PM functions, the state of the suspend_stats fields
updated by it may become inconsistent.
Prevent that from happening by using a lock in dpm_save_failed_dev().
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
It is not necessary to define struct suspend_stats in a header file and the
suspend_stats variable in the core device system-wide PM code. They both
can be defined in kernel/power/main.c, next to the sysfs and debugfs code
accessing suspend_stats, which can be static.
Modify the code in question in accordance with the above observation and
replace the static inline functions manipulating suspend_stats with
regular ones defined in kernel/power/main.c.
While at it, move the enum suspend_stat_step to the end of suspend.h which
is a more suitable place for it.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Change the type of the "success" and "fail" fields in struct
suspend_stats to unsigned int, because they cannot be negative.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Instead of using a set of individual struct suspend_stats fields
representing suspend step failure counters, use an array of counters
indexed by enum suspend_stat_step for this purpose, which allows
dpm_save_failed_step() to increment the appropriate counter
automatically, so that its callers don't need to do that directly.
It also allows suspend_stats_show() to carry out a loop over the
counters array to print their values.
Because the counters cannot become negative, use unsigned int for
representing them.
The only user-observable impact of this change is a different
ordering of entries in the suspend_stats debugfs file which is not
expected to matter.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Replace suspend_step_name() in the suspend statistics code with an array
of suspend step names which has fewer lines of code and less overhead.
While at it, remove two unnecessary line breaks in suspend_stats_show()
and adjust some white space in there to the kernel coding style for a
more consistent code layout.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Function swsusp_close() does not have any parameters, so remove the
description of parameter @exclusive to prevent this warning.
swap.c:1573: warning: Excess function parameter 'exclusive' description in 'swsusp_close'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
[ rjw: Subject edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
With the freezer changes introduced by commit f5d39b0208
("freezer,sched: Rewrite core freezer logic"), the comment in
unlock_system_sleep() has become obsolete, there is no need to
retain it.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
kmap_atomic() has been deprecated in favor of kmap_local_page().
kmap_atomic() disables page-faults and preemption (the latter
only for !PREEMPT_RT kernels).The code between the mapping and
un-mapping in this patch does not depend on the above-mentioned
side effects.So simply replaced kmap_atomic() with kmap_local_page().
Signed-off-by: Chen Haonan <chen.haonan2@zte.com.cn>
[ rjw: Subject edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
An S4 (suspend to disk) test on the LoongArch 3A6000 platform sometimes
fails with the following error messaged in the dmesg log:
Invalid LZO compressed length
That happens because when compressing/decompressing the image, the
synchronization between the control thread and the compress/decompress/crc
thread is based on a relaxed ordering interface, which is unreliable, and the
following situation may occur:
CPU 0 CPU 1
save_image_lzo lzo_compress_threadfn
atomic_set(&d->stop, 1);
atomic_read(&data[thr].stop)
data[thr].cmp = data[thr].cmp_len;
WRITE data[thr].cmp_len
Then CPU0 gets a stale cmp_len and writes it to disk. During resume from S4,
wrong cmp_len is loaded.
To maintain data consistency between the two threads, use the acquire/release
variants of atomic set and read operations.
Fixes: 081a9d043c ("PM / Hibernate: Improve performance of LZO/plain hibernation, checksum image")
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Hongchen Zhang <zhanghongchen@loongson.cn>
Co-developed-by: Weihao Li <liweihao@loongson.cn>
Signed-off-by: Weihao Li <liweihao@loongson.cn>
[ rjw: Subject rewrite and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Wakeup events that occur in the hibernation process's
hibernation_platform_enter() cannot wake up the system. Although the
current hibernation framework will execute part of the recovery process
after a wakeup event occurs, it ultimately performs a shutdown operation
because the system does not check the return value of
hibernation_platform_enter(). In short, if a wakeup event occurs before
putting the system into the final low-power state, it will be missed.
To solve this problem, check the return value of
hibernation_platform_enter(). When it returns -EAGAIN or -EBUSY (indicate
the occurrence of a wakeup event), execute the hibernation recovery
process, discard the previously saved image, and ultimately return to the
working state.
Signed-off-by: Chris Feng <chris.feng@mediatek.com>
[ rjw: Rephrase the message printed when going back to the working state ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The error variable in snapshot_write_next() gets a value before it is
used, so don't initialize it to 0 upfront.
Signed-off-by: Li zeming <zeming@nfschina.com>
[ rjw: Subject and changelog rewrite ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
'error' first receives the function result before it is used, and it
does not need to be assigned a value during definition.
Signed-off-by: Li zeming <zeming@nfschina.com>
[ rjw: Subject rewrite ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
It is not necessary to intialize the error variable in
create_basic_memory_bitmaps(), because it is only read after
being assigned a value.
Signed-off-by: Wang chaodong <chaodong@nfschina.com>
[ rjw: Subject and changelog rewrite ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
- Add support for several Qualcomm SoC versions and other similar
changes (Christian Marangi, Dmitry Baryshkov, Luca Weiss, Neil
Armstrong, Richard Acayan, Robert Marko, Rohit Agarwal, Stephan
Gerhold and Varadarajan Narayanan).
- Clean up the tegra cpufreq driver (Sumit Gupta).
- Use of_property_read_reg() to parse "reg" in pmac32 driver (Rob
Herring).
- Add support for TI's am62p5 Soc (Bryan Brattlof).
- Make ARM_BRCMSTB_AVS_CPUFREQ depends on !ARM_SCMI_CPUFREQ (Florian
Fainelli).
- Update Kconfig to mention i.MX7 as well (Alexander Stein).
- Revise global turbo disable check in intel_pstate (Srinivas
Pandruvada).
- Carry out initialization of sg_cpu in the schedutil cpufreq governor
in one loop (Liao Chang).
- Simplify the condition for storing 'down_threshold' in the
conservative cpufreq governor (Liao Chang).
- Use fine-grained mutex in the userspace cpufreq governor (Liao
Chang).
- Move is_managed indicator in the userspace cpufreq governor into a
per-policy structure (Liao Chang).
- Rebuild sched-domains when removing cpufreq driver (Pierre Gondois).
- Fix buffer overflow detection in trans_stats() (Christian Marangi).
- Switch to dev_pm_opp_find_freq_(ceil/floor)_indexed() APIs to support
specific devices like UFS which handle multiple clocks through OPP
(Operating Performance Point) framework (Manivannan Sadhasivam).
- Add perf support to the Rockchip DFI (DDR Monitor Module) devfreq-
event driver:
* Generalize rockchip-dfi.c to support new RK3568/RK3588 using
different DDR type (Sascha Hauer).
* Convert DT binding document format to yaml (Sascha Hauer).
* Add perf support for DFI (a unit suitable for measuring DDR
utilization) to rockchip-dfi.c to extend DFI usage (Sascha Hauer).
- Add locking to the OPP handling code in the Mediatek CCI devfreq
driver, because the voltage of shared OPP might be changed by
multiple drivers (Mark Tseng, Dan Carpenter).
- Use device_get_match_data() in the Samsung Exynos PPMU devfreq-event
driver (Rob Herring).
- Extend support for the opp-level beyond required-opps (Ulf Hansson).
- Add dev_pm_opp_find_level_floor() (Krishna chaitanya chundru).
- dt-bindings: Allow opp-peak-kBpsfor kryo CPUs, support Qualcomm Krait
SoCs and document named opp-microvolt property (Bjorn Andersson,
Dmitry Baryshkov and Christian Marangi).
- Fix -Wunsequenced warning _of_add_opp_table_v1() (Nathan Chancellor).
- General cleanup of OPP code (Viresh Kumar).
- Use __get_safe_page() rather than touching the list in hibernation
snapshot code (Brian Geffon).
- Fix symbol export for _SIMPLE_ variants of _PM_OPS() (Raag Jadav).
- Clean up sync_read handling in snapshot_write_next() (Brian Geffon).
- Fix kerneldoc comments for swsusp_check() and swsusp_close() to
better match code (Christoph Hellwig).
- Downgrade BIOS locked limits pr_warn() in the Intel RAPL power
capping driver to pr_debug() (Ville Syrjälä).
- Change the minimum python version for the intel_pstate_tracer utility
from 2.7 to 3.6 (Doug Smythies).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmU6bqYSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxzLcP/Avv9PgVRVqZlJ1Rs2fqIcyOY+j5qrvx
xRiO3TBwdAzRy49ItnQY4W/CHk/skGY4vFhiluZE+OTUlx1fmPKeQLFpel+1+PvW
vLezQ9v18sH7d2Kd6gJO5k9xsyu5ZMHEkwiejA/tmS2vTs5ne4wB7ONTJObYx5iB
9Hrg6jLnk7MmolQqvQB6vmpej1eeWmuu7AXlg2OsXqYsCEnhS5iGBq86E35LvlKA
Pnef/B2ZP9RaFg2dVapSZwubn0FkUtd29ifhtGC7Fw5LM8WCRc/KHAWZwMe4dcMf
38uKux28xIEalGZm9zMhKO8gHGdfF/v1C46/hBvgjavwVJF3AUNXnsfc+v5SerDp
tXx1xghGyM/blbHUdTfzZc4l5TyqsjhkBMSCMEQcj9QYjsCY0pTZmwLz8F0BAv4D
0FukGf5jK987RBGvaHY90UCE+NvokOyJDckuSHQffrAZWghnhSgbZxMD5oiIjRYR
BioM5wQsL+wOxWdUGAOVhK6wKj32kf2XjBqWdEBk70qcpbvEmc0N8t1BSd+TzzoK
qM2hnyo+yxvv98wi/cglcJeZ1mbL+s1agTh7jFTkC23ap/GrZEw0EB5xdj4NbzOk
hO1OXas8J1LA1GFwL0WoLDyY0gvGDYFWkh0yeu0SUgxTVwKapyG03OMPQATN5M/y
cp+PK3ibS8Mb
=h8my
-----END PGP SIGNATURE-----
Merge tag 'pm-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"These add new hardware support (new Qualcomm SoC versions in cpufreq,
RK3568/RK3588 in devfreq), extend the OPP (operating performance
points) framework, improve cpufreq governors, fix issues and clean up
code (most of the changes are in cpufreq and devfreq).
Specifics:
- Add support for several Qualcomm SoC versions and other similar
changes (Christian Marangi, Dmitry Baryshkov, Luca Weiss, Neil
Armstrong, Richard Acayan, Robert Marko, Rohit Agarwal, Stephan
Gerhold and Varadarajan Narayanan)
- Clean up the tegra cpufreq driver (Sumit Gupta)
- Use of_property_read_reg() to parse "reg" in pmac32 driver (Rob
Herring)
- Add support for TI's am62p5 Soc (Bryan Brattlof)
- Make ARM_BRCMSTB_AVS_CPUFREQ depends on !ARM_SCMI_CPUFREQ (Florian
Fainelli)
- Update Kconfig to mention i.MX7 as well (Alexander Stein)
- Revise global turbo disable check in intel_pstate (Srinivas
Pandruvada)
- Carry out initialization of sg_cpu in the schedutil cpufreq
governor in one loop (Liao Chang)
- Simplify the condition for storing 'down_threshold' in the
conservative cpufreq governor (Liao Chang)
- Use fine-grained mutex in the userspace cpufreq governor (Liao
Chang)
- Move is_managed indicator in the userspace cpufreq governor into a
per-policy structure (Liao Chang)
- Rebuild sched-domains when removing cpufreq driver (Pierre Gondois)
- Fix buffer overflow detection in trans_stats() (Christian Marangi)
- Switch to dev_pm_opp_find_freq_(ceil/floor)_indexed() APIs to
support specific devices like UFS which handle multiple clocks
through OPP (Operating Performance Point) framework (Manivannan
Sadhasivam)
- Add perf support to the Rockchip DFI (DDR Monitor Module) devfreq-
event driver:
* Generalize rockchip-dfi.c to support new RK3568/RK3588 using
different DDR type (Sascha Hauer).
* Convert DT binding document format to yaml (Sascha Hauer).
* Add perf support for DFI (a unit suitable for measuring DDR
utilization) to rockchip-dfi.c to extend DFI usage (Sascha
Hauer)
- Add locking to the OPP handling code in the Mediatek CCI devfreq
driver, because the voltage of shared OPP might be changed by
multiple drivers (Mark Tseng, Dan Carpenter)
- Use device_get_match_data() in the Samsung Exynos PPMU
devfreq-event driver (Rob Herring)
- Extend support for the opp-level beyond required-opps (Ulf Hansson)
- Add dev_pm_opp_find_level_floor() (Krishna chaitanya chundru)
- dt-bindings: Allow opp-peak-kBpsfor kryo CPUs, support Qualcomm
Krait SoCs and document named opp-microvolt property (Bjorn
Andersson, Dmitry Baryshkov and Christian Marangi)
- Fix -Wunsequenced warning _of_add_opp_table_v1() (Nathan
Chancellor)
- General cleanup of OPP code (Viresh Kumar)
- Use __get_safe_page() rather than touching the list in hibernation
snapshot code (Brian Geffon)
- Fix symbol export for _SIMPLE_ variants of _PM_OPS() (Raag Jadav)
- Clean up sync_read handling in snapshot_write_next() (Brian Geffon)
- Fix kerneldoc comments for swsusp_check() and swsusp_close() to
better match code (Christoph Hellwig)
- Downgrade BIOS locked limits pr_warn() in the Intel RAPL power
capping driver to pr_debug() (Ville Syrjälä)
- Change the minimum python version for the intel_pstate_tracer
utility from 2.7 to 3.6 (Doug Smythies)"
* tag 'pm-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (82 commits)
dt-bindings: cpufreq: qcom-hw: document SM8650 CPUFREQ Hardware
cpufreq: arm: Kconfig: Add i.MX7 to supported SoC for ARM_IMX_CPUFREQ_DT
cpufreq: qcom-nvmem: add support for IPQ8064
cpufreq: qcom-nvmem: also accept operating-points-v2-krait-cpu
cpufreq: qcom-nvmem: drop pvs_ver for format a fuses
dt-bindings: cpufreq: qcom-cpufreq-nvmem: Document krait-cpu
cpufreq: qcom-nvmem: add support for IPQ6018
dt-bindings: cpufreq: qcom-cpufreq-nvmem: document IPQ6018
cpufreq: qcom-nvmem: Add MSM8909
cpufreq: qcom-nvmem: Simplify driver data allocation
powercap: intel_rapl: Downgrade BIOS locked limits pr_warn() to pr_debug()
cpufreq: stats: Fix buffer overflow detection in trans_stats()
dt-bindings: devfreq: event: rockchip,dfi: Add rk3588 support
dt-bindings: devfreq: event: rockchip,dfi: Add rk3568 support
dt-bindings: devfreq: event: convert Rockchip DFI binding to yaml
PM / devfreq: rockchip-dfi: add support for RK3588
PM / devfreq: rockchip-dfi: account for multiple DDRMON_CTRL registers
PM / devfreq: rockchip-dfi: make register stride SoC specific
PM / devfreq: rockchip-dfi: Add perf support
PM / devfreq: rockchip-dfi: give variable a better name
...
snapshot_test argument is now unused in swsusp_close() and
load_image_and_restore(). Drop it
CC: linux-pm@vger.kernel.org
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: "Rafael J. Wysocki" <rafael@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-17-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
Convert hibernation code to use bdev_open_by_dev().
CC: linux-pm@vger.kernel.org
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: "Rafael J. Wysocki" <rafael@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-16-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
The following crash is observed 100% of the time during resume from
the hibernation on a x86 QEMU system.
[ 12.931887] ? __die_body+0x1a/0x60
[ 12.932324] ? page_fault_oops+0x156/0x420
[ 12.932824] ? search_exception_tables+0x37/0x50
[ 12.933389] ? fixup_exception+0x21/0x300
[ 12.933889] ? exc_page_fault+0x69/0x150
[ 12.934371] ? asm_exc_page_fault+0x26/0x30
[ 12.934869] ? get_buffer.constprop.0+0xac/0x100
[ 12.935428] snapshot_write_next+0x7c/0x9f0
[ 12.935929] ? submit_bio_noacct_nocheck+0x2c2/0x370
[ 12.936530] ? submit_bio_noacct+0x44/0x2c0
[ 12.937035] ? hib_submit_io+0xa5/0x110
[ 12.937501] load_image+0x83/0x1a0
[ 12.937919] swsusp_read+0x17f/0x1d0
[ 12.938355] ? create_basic_memory_bitmaps+0x1b7/0x240
[ 12.938967] load_image_and_restore+0x45/0xc0
[ 12.939494] software_resume+0x13c/0x180
[ 12.939994] resume_store+0xa3/0x1d0
The commit being fixed introduced a bug in copying the zero bitmap
to safe pages. A temporary bitmap is allocated with PG_ANY flag in
prepare_image() to make a copy of zero bitmap after the unsafe pages
are marked. Freeing this temporary bitmap with PG_UNSAFE_KEEP later
results in an inconsistent state of unsafe pages. Since free bit is
left as is for this temporary bitmap after free, these pages are
treated as unsafe pages when they are allocated again. This results
in incorrect calculation of the number of pages pre-allocated for the
image.
nr_pages = (nr_zero_pages + nr_copy_pages) - nr_highmem - allocated_unsafe_pages;
The allocate_unsafe_pages is estimated to be higher than the actual
which results in running short of pages in safe_pages_list. Hence the
crash is observed in get_buffer() due to NULL pointer access of
safe_pages_list.
Fix this issue by creating the temporary zero bitmap from safe pages
(free bit not set) so that the corresponding free bits can be cleared
while freeing this bitmap.
Fixes: 005e8dddd4 ("PM: hibernate: don't store zero pages in the image file")
Suggested-by:: Brian Geffon <bgeffon@google.com>
Signed-off-by: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
Reviewed-by: Brian Geffon <bgeffon@google.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The comments for both swsusp_check() and swsusp_close() don't actually
describe what they are doing.
Just removing the comments would probably better, but as the file is
full of useless kerneldoc comments for non-exported symbols this fits
in better with the style.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
In snapshot_write_next(), sync_read is set and unset in three different
spots unnecessiarly. As a result there is a subtle bug where the first
page after the meta data has been loaded unconditionally sets sync_read
to 0. If this first PFN was actually a highmem page, then the returned
buffer will be the global "buffer," and the page needs to be loaded
synchronously.
That is, I'm not sure we can always assume the following to be safe:
handle->buffer = get_buffer(&orig_bm, &ca);
handle->sync_read = 0;
Because get_buffer() can call get_highmem_page_buffer() which can
return 'buffer'.
The easiest way to address this is just set sync_read before
snapshot_write_next() returns if handle->buffer == buffer.
Signed-off-by: Brian Geffon <bgeffon@google.com>
Fixes: 8357376d3d ("[PATCH] swsusp: Improve handling of highmem")
Cc: All applicable <stable@vger.kernel.org>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
We found at least one situation where the safe pages list was empty and
get_buffer() would gladly try to use a NULL pointer.
Signed-off-by: Brian Geffon <bgeffon@google.com>
Fixes: 8357376d3d ("[PATCH] swsusp: Improve handling of highmem")
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>