A set of scheduler updates:

  - Prevent PSI state corruption when schedule() races with cgroup move.  A
    recent commit combined two PSI callbacks to reduce the number of cgroup
    tree updates, but missed that schedule() can drop rq::lock for load
    balancing, which opens the race window for cgroup_move_task() which then
    observes half updated state. The fix is to solely use task::psi_flags
    instead of looking at the potentially mismatching scheduler state.
 
  - Prevent an out-of-bounds access in uclamp caused by a rounding division
    which can lead to an off-by-one error exceeding the buckets array size.
 
  - Prevent unfairness caused by missing load decay when a task is attached
    to a cfs runqueue. The old load of the task is attached to the runqueue
    and never removed. Fix it by enforcing the load update through the
    hierarchy for unthrottled run queue instances.
 
  - A documentation fix for the 'sched_verbose' command line option.

Merge tag 'sched-urgent-2021-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Thomas Gleixner:
 "A set of scheduler updates:

   - Prevent PSI state corruption when schedule() races with cgroup
     move.

     A recent commit combined two PSI callbacks to reduce the number of
     cgroup tree updates, but missed that schedule() can drop rq::lock
     for load balancing, which opens the race window for
     cgroup_move_task() which then observes half updated state.

     The fix is to solely use task::psi_flags instead of looking at the
     potentially mismatching scheduler state.

   - Prevent an out-of-bounds access in uclamp caused by a rounding
     division which can lead to an off-by-one error exceeding the
     buckets array size.

   - Prevent unfairness caused by missing load decay when a task is
     attached to a cfs runqueue.

     The old load of the task was attached to the runqueue and never
     removed. Fix it by enforcing the load update through the hierarchy
     for unthrottled run queue instances.

   - A documentation fix for the 'sched_verbose' command line option"

* tag 'sched-urgent-2021-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix unfairness caused by missing load decay
  sched: Fix out-of-bound access in uclamp
  psi: Fix psi state corruption when schedule() races with cgroup move
  sched,doc: sched_debug_verbose cmdline should be sched_verbose
This commit is contained in:
Linus Torvalds 2021-05-09 13:14:34 -07:00
commit 9819f682e4
4 changed files with 37 additions and 15 deletions

@@ -74,7 +74,7 @@ for a given topology level by creating a sched_domain_topology_level array and
 calling set_sched_topology() with this array as the parameter.

 The sched-domains debugging infrastructure can be enabled by enabling
-CONFIG_SCHED_DEBUG and adding 'sched_debug_verbose' to your cmdline. If you
+CONFIG_SCHED_DEBUG and adding 'sched_verbose' to your cmdline. If you
 forgot to tweak your cmdline, you can also flip the
 /sys/kernel/debug/sched/verbose knob. This enables an error checking parse of
 the sched domains which should catch most possible errors (described above). It

@@ -938,7 +938,7 @@ DEFINE_STATIC_KEY_FALSE(sched_uclamp_used);

 static inline unsigned int uclamp_bucket_id(unsigned int clamp_value)
 {
-	return clamp_value / UCLAMP_BUCKET_DELTA;
+	return min_t(unsigned int, clamp_value / UCLAMP_BUCKET_DELTA, UCLAMP_BUCKETS - 1);
 }

 static inline unsigned int uclamp_none(enum uclamp_id clamp_id)

@@ -10878,16 +10878,22 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq;

+	list_add_leaf_cfs_rq(cfs_rq_of(se));
+
 	/* Start to propagate at parent */
 	se = se->parent;

 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);

-		if (cfs_rq_throttled(cfs_rq))
-			break;
+		if (!cfs_rq_throttled(cfs_rq)){
+			update_load_avg(cfs_rq, se, UPDATE_TG);
+			list_add_leaf_cfs_rq(cfs_rq);
+			continue;
+		}

-		update_load_avg(cfs_rq, se, UPDATE_TG);
+		if (list_add_leaf_cfs_rq(cfs_rq))
+			break;
 	}
 }
 #else

@@ -972,7 +972,7 @@ void psi_cgroup_free(struct cgroup *cgroup)
  */
 void cgroup_move_task(struct task_struct *task, struct css_set *to)
 {
-	unsigned int task_flags = 0;
+	unsigned int task_flags;
 	struct rq_flags rf;
 	struct rq *rq;

@@ -987,15 +987,31 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)

 	rq = task_rq_lock(task, &rf);

-	if (task_on_rq_queued(task)) {
-		task_flags = TSK_RUNNING;
-		if (task_current(rq, task))
-			task_flags |= TSK_ONCPU;
-	} else if (task->in_iowait)
-		task_flags = TSK_IOWAIT;
-
-	if (task->in_memstall)
-		task_flags |= TSK_MEMSTALL;
+	/*
+	 * We may race with schedule() dropping the rq lock between
+	 * deactivating prev and switching to next. Because the psi
+	 * updates from the deactivation are deferred to the switch
+	 * callback to save cgroup tree updates, the task's scheduling
+	 * state here is not coherent with its psi state:
+	 *
+	 * schedule()                   cgroup_move_task()
+	 *   rq_lock()
+	 *   deactivate_task()
+	 *     p->on_rq = 0
+	 *     psi_dequeue() // defers TSK_RUNNING & TSK_IOWAIT updates
+	 *   pick_next_task()
+	 *     rq_unlock()
+	 *                              rq_lock()
+	 *                              psi_task_change() // old cgroup
+	 *                              task->cgroups = to
+	 *                              psi_task_change() // new cgroup
+	 *                              rq_unlock()
+	 *     rq_lock()
+	 *   psi_sched_switch() // does deferred updates in new cgroup
+	 *
+	 * Don't rely on the scheduling state. Use psi_flags instead.
+	 */
+	task_flags = task->psi_flags;

 	if (task_flags)
 		psi_task_change(task, task_flags, 0);