Commit Graph

32420 Commits

Author SHA1 Message Date
Christian Brauner
b612e5df45
clone3: add CLONE_CLEAR_SIGHAND
Reset all signal handlers of the child not set to SIG_IGN to SIG_DFL.
Mutually exclusive with CLONE_SIGHAND to not disturb other thread's
signal handler.

In the spirit of closer cooperation between glibc developers and kernel
developers (cf. [2]) this patchset came out of a discussion on the glibc
mailing list for improving posix_spawn() (cf. [1], [3], [4]). Kernel
support for this feature has been explicitly requested by glibc and I
see no reason not to help them with this.

The child helper process on Linux posix_spawn must ensure that no signal
handlers are enabled, so the signal disposition must be either SIG_DFL
or SIG_IGN. However, it requires a sigprocmask to obtain the current
signal mask and at least _NSIG sigaction calls to reset the signal
handlers for each posix_spawn call or complex state tracking that might
lead to data corruption in glibc. Adding this flags lets glibc avoid
these problems.

[1]: https://www.sourceware.org/ml/libc-alpha/2019-10/msg00149.html
[3]: https://www.sourceware.org/ml/libc-alpha/2019-10/msg00158.html
[4]: https://www.sourceware.org/ml/libc-alpha/2019-10/msg00160.html
[2]: https://lwn.net/Articles/799331/
     '[...] by asking for better cooperation with the C-library projects
     in general. They should be copied on patches containing ABI
     changes, for example. I noted that there are often times where
     C-library developers wish the kernel community had done things
     differently; how could those be avoided in the future? Members of
     the audience suggested that more glibc developers should perhaps
     join the linux-api list. The other suggestion was to "copy Florian
     on everything".'
Cc: Florian Weimer <fweimer@redhat.com>
Cc: libc-alpha@sourceware.org
Cc: linux-api@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20191014104538.3096-1-christian.brauner@ubuntu.com
2019-10-21 21:46:47 +02:00
Thomas Richter
5e6c3c7b1e perf/aux: Fix tracking of auxiliary trace buffer allocation
The following commit from the v5.4 merge window:

  d44248a413 ("perf/core: Rework memory accounting in perf_mmap()")

... breaks auxiliary trace buffer tracking.

If I run command 'perf record -e rbd000' to record samples and saving
them in the **auxiliary** trace buffer then the value of 'locked_vm' becomes
negative after all trace buffers have been allocated and released:

During allocation the values increase:

  [52.250027] perf_mmap user->locked_vm:0x87 pinned_vm:0x0 ret:0
  [52.250115] perf_mmap user->locked_vm:0x107 pinned_vm:0x0 ret:0
  [52.250251] perf_mmap user->locked_vm:0x188 pinned_vm:0x0 ret:0
  [52.250326] perf_mmap user->locked_vm:0x208 pinned_vm:0x0 ret:0
  [52.250441] perf_mmap user->locked_vm:0x289 pinned_vm:0x0 ret:0
  [52.250498] perf_mmap user->locked_vm:0x309 pinned_vm:0x0 ret:0
  [52.250613] perf_mmap user->locked_vm:0x38a pinned_vm:0x0 ret:0
  [52.250715] perf_mmap user->locked_vm:0x408 pinned_vm:0x2 ret:0
  [52.250834] perf_mmap user->locked_vm:0x408 pinned_vm:0x83 ret:0
  [52.250915] perf_mmap user->locked_vm:0x408 pinned_vm:0x103 ret:0
  [52.251061] perf_mmap user->locked_vm:0x408 pinned_vm:0x184 ret:0
  [52.251146] perf_mmap user->locked_vm:0x408 pinned_vm:0x204 ret:0
  [52.251299] perf_mmap user->locked_vm:0x408 pinned_vm:0x285 ret:0
  [52.251383] perf_mmap user->locked_vm:0x408 pinned_vm:0x305 ret:0
  [52.251544] perf_mmap user->locked_vm:0x408 pinned_vm:0x386 ret:0
  [52.251634] perf_mmap user->locked_vm:0x408 pinned_vm:0x406 ret:0
  [52.253018] perf_mmap user->locked_vm:0x408 pinned_vm:0x487 ret:0
  [52.253197] perf_mmap user->locked_vm:0x408 pinned_vm:0x508 ret:0
  [52.253374] perf_mmap user->locked_vm:0x408 pinned_vm:0x589 ret:0
  [52.253550] perf_mmap user->locked_vm:0x408 pinned_vm:0x60a ret:0
  [52.253726] perf_mmap user->locked_vm:0x408 pinned_vm:0x68b ret:0
  [52.253903] perf_mmap user->locked_vm:0x408 pinned_vm:0x70c ret:0
  [52.254084] perf_mmap user->locked_vm:0x408 pinned_vm:0x78d ret:0
  [52.254263] perf_mmap user->locked_vm:0x408 pinned_vm:0x80e ret:0

The value of user->locked_vm increases to a limit then the memory
is tracked by pinned_vm.

During deallocation the size is subtracted from pinned_vm until
it hits a limit. Then a larger value is subtracted from locked_vm
leading to a large number (because of type unsigned):

  [64.267797] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x78d
  [64.267826] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x70c
  [64.267848] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x68b
  [64.267869] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x60a
  [64.267891] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x589
  [64.267911] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x508
  [64.267933] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x487
  [64.267952] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x406
  [64.268883] perf_mmap_close mmap_user->locked_vm:0x307 pinned_vm:0x406
  [64.269117] perf_mmap_close mmap_user->locked_vm:0x206 pinned_vm:0x406
  [64.269433] perf_mmap_close mmap_user->locked_vm:0x105 pinned_vm:0x406
  [64.269536] perf_mmap_close mmap_user->locked_vm:0x4 pinned_vm:0x404
  [64.269797] perf_mmap_close mmap_user->locked_vm:0xffffffffffffff84 pinned_vm:0x303
  [64.270105] perf_mmap_close mmap_user->locked_vm:0xffffffffffffff04 pinned_vm:0x202
  [64.270374] perf_mmap_close mmap_user->locked_vm:0xfffffffffffffe84 pinned_vm:0x101
  [64.270628] perf_mmap_close mmap_user->locked_vm:0xfffffffffffffe04 pinned_vm:0x0

This value sticks for the user until system is rebooted, causing
follow-on system calls using locked_vm resource limit to fail.

Note: There is no issue using the normal trace buffer.

In fact the issue is in perf_mmap_close(). During allocation auxiliary
trace buffer memory is either traced as 'extra' and added to 'pinned_vm'
or trace as 'user_extra' and added to 'locked_vm'. This applies for
normal trace buffers and auxiliary trace buffer.

However in function perf_mmap_close() all auxiliary trace buffer is
subtraced from 'locked_vm' and never from 'pinned_vm'. This breaks the
ballance.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: gor@linux.ibm.com
Cc: hechaol@fb.com
Cc: heiko.carstens@de.ibm.com
Cc: linux-perf-users@vger.kernel.org
Cc: songliubraving@fb.com
Fixes: d44248a413 ("perf/core: Rework memory accounting in perf_mmap()")
Link: https://lkml.kernel.org/r/20191021083354.67868-1-tmricht@linux.ibm.com
[ Minor readability edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 11:31:24 +02:00
Vincent Guittot
57abff067a sched/fair: Rework find_idlest_group()
The slow wake up path computes per sched_group statisics to select the
idlest group, which is quite similar to what load_balance() is doing
for selecting busiest group. Rework find_idlest_group() to classify the
sched_group and select the idlest one following the same steps as
load_balance().

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: riel@surriel.com
Cc: srikar@linux.vnet.ibm.com
Cc: valentin.schneider@arm.com
Link: https://lkml.kernel.org/r/1571405198-27570-12-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:55 +02:00
Vincent Guittot
fc1273f4ce sched/fair: Optimize find_idlest_group()
find_idlest_group() now reads CPU's load_avg in two different ways.

Consolidate the function to read and use load_avg only once and simplify
the algorithm to only look for the group with lowest load_avg.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: riel@surriel.com
Cc: srikar@linux.vnet.ibm.com
Cc: valentin.schneider@arm.com
Link: https://lkml.kernel.org/r/1571405198-27570-11-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:55 +02:00
Vincent Guittot
11f10e5420 sched/fair: Use load instead of runnable load in wakeup path
Runnable load was originally introduced to take into account the case where
blocked load biases the wake up path which may end to select an overloaded
CPU with a large number of runnable tasks instead of an underutilized
CPU with a huge blocked load.

Tha wake up path now starts looking for idle CPUs before comparing
runnable load and it's worth aligning the wake up path with the
load_balance() logic.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: riel@surriel.com
Cc: srikar@linux.vnet.ibm.com
Cc: valentin.schneider@arm.com
Link: https://lkml.kernel.org/r/1571405198-27570-10-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:55 +02:00
Vincent Guittot
c63be7be59 sched/fair: Use utilization to select misfit task
Utilization is used to detect a misfit task but the load is then used to
select the task on the CPU which can lead to select a small task with
high weight instead of the task that triggered the misfit migration.

Check that task can't fit the CPU's capacity when selecting the misfit
task instead of using the load.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Valentin Schneider <valentin.schneider@arm.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: riel@surriel.com
Cc: srikar@linux.vnet.ibm.com
Link: https://lkml.kernel.org/r/1571405198-27570-9-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:54 +02:00
Vincent Guittot
2ab4092fc8 sched/fair: Spread out tasks evenly when not overloaded
When there is only one CPU per group, using the idle CPUs to evenly spread
tasks doesn't make sense and nr_running is a better metrics.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: riel@surriel.com
Cc: srikar@linux.vnet.ibm.com
Cc: valentin.schneider@arm.com
Link: https://lkml.kernel.org/r/1571405198-27570-8-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:54 +02:00
Vincent Guittot
b0fb1eb4f0 sched/fair: Use load instead of runnable load in load_balance()
'runnable load' was originally introduced to take into account the case
where blocked load biases the load balance decision which was selecting
underutilized groups with huge blocked load whereas other groups were
overloaded.

The load is now only used when groups are overloaded. In this case,
it's worth being conservative and taking into account the sleeping
tasks that might wake up on the CPU.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: riel@surriel.com
Cc: srikar@linux.vnet.ibm.com
Cc: valentin.schneider@arm.com
Link: https://lkml.kernel.org/r/1571405198-27570-7-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:54 +02:00
Vincent Guittot
5e23e47443 sched/fair: Use rq->nr_running when balancing load
CFS load_balance() only takes care of CFS tasks whereas CPUs can be used by
other scheduling classes. Typically, a CFS task preempted by an RT or deadline
task will not get a chance to be pulled by another CPU because
load_balance() doesn't take into account tasks from other classes.
Add sum of nr_running in the statistics and use it to detect such
situations.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: riel@surriel.com
Cc: srikar@linux.vnet.ibm.com
Cc: valentin.schneider@arm.com
Link: https://lkml.kernel.org/r/1571405198-27570-6-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:54 +02:00
Vincent Guittot
0b0695f2b3 sched/fair: Rework load_balance()
The load_balance() algorithm contains some heuristics which have become
meaningless since the rework of the scheduler's metrics like the
introduction of PELT.

Furthermore, load is an ill-suited metric for solving certain task
placement imbalance scenarios.

For instance, in the presence of idle CPUs, we should simply try to get at
least one task per CPU, whereas the current load-based algorithm can actually
leave idle CPUs alone simply because the load is somewhat balanced.

The current algorithm ends up creating virtual and meaningless values like
the avg_load_per_task or tweaks the state of a group to make it overloaded
whereas it's not, in order to try to migrate tasks.

load_balance() should better qualify the imbalance of the group and clearly
define what has to be moved to fix this imbalance.

The type of sched_group has been extended to better reflect the type of
imbalance. We now have:

	group_has_spare
	group_fully_busy
	group_misfit_task
	group_asym_packing
	group_imbalanced
	group_overloaded

Based on the type of sched_group, load_balance now sets what it wants to
move in order to fix the imbalance. It can be some load as before but also
some utilization, a number of task or a type of task:

	migrate_task
	migrate_util
	migrate_load
	migrate_misfit

This new load_balance() algorithm fixes several pending wrong tasks
placement:

 - the 1 task per CPU case with asymmetric system
 - the case of cfs task preempted by other class
 - the case of tasks not evenly spread on groups with spare capacity

Also the load balance decisions have been consolidated in the 3 functions
below after removing the few bypasses and hacks of the current code:

 - update_sd_pick_busiest() select the busiest sched_group.
 - find_busiest_group() checks if there is an imbalance between local and
   busiest group.
 - calculate_imbalance() decides what have to be moved.

Finally, the now unused field total_running of struct sd_lb_stats has been
removed.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: riel@surriel.com
Cc: srikar@linux.vnet.ibm.com
Cc: valentin.schneider@arm.com
Link: https://lkml.kernel.org/r/1571405198-27570-5-git-send-email-vincent.guittot@linaro.org
[ Small readability and spelling updates. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:53 +02:00
Vincent Guittot
fcf0553db6 sched/fair: Remove meaningless imbalance calculation
Clean up load_balance() and remove meaningless calculation and fields before
adding a new algorithm.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: srikar@linux.vnet.ibm.com
Cc: valentin.schneider@arm.com
Link: https://lkml.kernel.org/r/1571405198-27570-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:53 +02:00
Vincent Guittot
a349834703 sched/fair: Rename sg_lb_stats::sum_nr_running to sum_h_nr_running
Rename sum_nr_running to sum_h_nr_running because it effectively tracks
cfs->h_nr_running so we can use sum_nr_running to track rq->nr_running
when needed.

There are no functional changes.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: srikar@linux.vnet.ibm.com
Link: https://lkml.kernel.org/r/1571405198-27570-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:53 +02:00
Vincent Guittot
490ba971d8 sched/fair: Clean up asym packing
Clean up asym packing to follow the default load balance behavior:

- classify the group by creating a group_asym_packing field.
- calculate the imbalance in calculate_imbalance() instead of bypassing it.

We don't need to test twice same conditions anymore to detect asym packing
and we consolidate the calculation of imbalance in calculate_imbalance().

There is no functional changes.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: srikar@linux.vnet.ibm.com
Cc: valentin.schneider@arm.com
Link: https://lkml.kernel.org/r/1571405198-27570-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:53 +02:00
Rafael J. Wysocki
77751a466e PM: QoS: Introduce frequency QoS
Introduce frequency QoS, based on the "raw" low-level PM QoS, to
represent min and max frequency requests and aggregate constraints.

The min and max frequency requests are to be represented by
struct freq_qos_request objects and the aggregate constraints are to
be represented by struct freq_constraints objects.  The latter are
expected to be initialized with the help of freq_constraints_init().

The freq_qos_read_value() helper is defined to retrieve the aggregate
constraints values from a given struct freq_constraints object and
there are the freq_qos_add_request(), freq_qos_update_request() and
freq_qos_remove_request() helpers to manipulate the min and max
frequency requests.  It is assumed that the the helpers will not
run concurrently with each other for the same struct freq_qos_request
object, so if that may be the case, their uses must ensure proper
synchronization between them (e.g. through locking).

In addition, freq_qos_add_notifier() and freq_qos_remove_notifier()
are provided to add and remove notifiers that will trigger on aggregate
constraint changes to and from a given struct freq_constraints object,
respectively.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2019-10-21 02:05:21 +02:00
Andy Whitcroft
da6043fe85 PM / hibernate: memory_bm_find_bit(): Tighten node optimisation
When looking for a bit by number we make use of the cached result from the
preceding lookup to speed up operation.  Firstly we check if the requested
pfn is within the cached zone and if not lookup the new zone.  We then
check if the offset for that pfn falls within the existing cached node.
This happens regardless of whether the node is within the zone we are
now scanning.  With certain memory layouts it is possible for this to
false trigger creating a temporary alias for the pfn to a different bit.
This leads the hibernation code to free memory which it was never allocated
with the expected fallout.

Ensure the zone we are scanning matches the cached zone before considering
the cached node.

Deep thanks go to Andrea for many, many, many hours of hacking and testing
that went into cornering this bug.

Reported-by: Andrea Righi <andrea.righi@canonical.com>
Tested-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-10-21 01:24:27 +02:00
David S. Miller
2f184393e0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Several cases of overlapping changes which were for the most
part trivially resolvable.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-20 10:43:00 -07:00
Linus Torvalds
e2ab4ef83f Kbuild fixes for v5.4 (2nd)
- fix a bashism of setlocalversion
 
  - do not use the too new --sort option of tar
 -----BEGIN PGP SIGNATURE-----
 
 iQJSBAABCgA8FiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl2sh9MeHHlhbWFkYS5t
 YXNhaGlyb0Bzb2Npb25leHQuY29tAAoJED2LAQed4NsGO90P/13dNgaibb3JcCvp
 +ugn+o26V+aqZ8uDLPPifUNFLLnDfGmgORAOe0Z2mGI2CqMDaWMYjvyZNF1KzOdD
 +WFZubiTUevYyM9AOG6O6EOGEDRumgvdNhnJIP+aCUNKGM50zzhuAa23y6Xip1NZ
 qJY5AYu9KiH9jTXtZAdGReuMdvNRsMZuvmI4Qnc21utItVxDFEQKGdTlRoJaGEdN
 Zc/f1HzWwlr7VWZiy0iMRSCAAtAA3TN3S1bw5QCYsuJb3LRl2EIOKMeSinRTYWRc
 zU5CHrkRSX4/M/o8VantTOZueuGMfbaOAl1KEDh8DUPDcrl+gmL4s8aFuDW+5d2+
 7pPlEe/FNZOAGVBDgiK59N3bl2Rn/Uyg9VGwFc8PU3VeY5VbS97ayBaLFcua2zwX
 88mtsZoq+fN3qsq8gfbb21qauTbEibyrk8lzjHoQl07TW82GFWG/GXSt9dQAm1sb
 w/M15WdaONKOzkYXGGCo0l6vdXx1+qqV/LZQcaWhfcoeU5YVVEmSRwCakCxxSWUE
 rvWtn0vljgH8kyb0gnUP6qLJC9clwMv58sl/mG65/D7zs9B74g16BkX0tU8vLSpr
 Ww1BpGtPwuFPp7+kEzWemtlulhDvow2xIX7bNIkAk1IIVpyCmU98AKpivR7G9Cvt
 4cZPSGRU+s2ou/hrtjryzvmwhGUG
 =KNWF
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-fixes-v5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull more Kbuild fixes from Masahiro Yamada:

 - fix a bashism of setlocalversion

 - do not use the too new --sort option of tar

* tag 'kbuild-fixes-v5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kheaders: substituting --sort in archive creation
  scripts: setlocalversion: fix a bashism
  kbuild: update comment about KBUILD_ALLDIRS
2019-10-20 12:36:57 -04:00
Linus Torvalds
188768f3c0 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull hrtimer fixlet from Thomas Gleixner:
 "A single commit annotating the lockcless access to timer->base with
  READ_ONCE() and adding the WRITE_ONCE() counterparts for completeness"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  hrtimer: Annotate lockless access to timer->base
2019-10-20 06:25:12 -04:00
Linus Torvalds
589f1222e0 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull stop-machine fix from Thomas Gleixner:
 "A single fix, amending stop machine with WRITE/READ_ONCE() to address
  the fallout of KCSAN"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  stop_machine: Avoid potential race behaviour
2019-10-20 06:22:25 -04:00
Song Liu
aa5de305c9 kernel/events/uprobes.c: only do FOLL_SPLIT_PMD for uprobe register
Attaching uprobe to text section in THP splits the PMD mapped page table
into PTE mapped entries.  On uprobe detach, we would like to regroup PMD
mapped page table entry to regain performance benefit of THP.

However, the regroup is broken For perf_event based trace_uprobe.  This
is because perf_event based trace_uprobe calls uprobe_unregister twice
on close: first in TRACE_REG_PERF_CLOSE, then in
TRACE_REG_PERF_UNREGISTER.  The second call will split the PMD mapped
page table entry, which is not the desired behavior.

Fix this by only use FOLL_SPLIT_PMD for uprobe register case.

Add a WARN() to confirm uprobe unregister never work on huge pages, and
abort the operation when this WARN() triggers.

Link: http://lkml.kernel.org/r/20191017164223.2762148-6-songliubraving@fb.com
Fixes: 5a52c9df62 ("uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT")
Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:33 -04:00
Toke Høiland-Jørgensen
05679ca6fe xdp: Prevent overflow in devmap_hash cost calculation for 32-bit builds
Tetsuo pointed out that without an explicit cast, the cost calculation for
devmap_hash type maps could overflow on 32-bit builds. This adds the
missing cast.

Fixes: 6f9d451ab1 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20191017105702.2807093-1-toke@redhat.com
2019-10-18 16:18:42 -07:00
YueHaibing
1f5343c0ae bpf: Fix build error without CONFIG_NET
If CONFIG_NET is n, building fails:

kernel/trace/bpf_trace.o: In function `raw_tp_prog_func_proto':
bpf_trace.c:(.text+0x1a34): undefined reference to `bpf_skb_output_proto'

Wrap it into a #ifdef to fix this.

Fixes: a7658e1a41 ("bpf: Check types of arguments passed into helpers")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20191018090344.26936-1-yuehaibing@huawei.com
2019-10-18 20:57:07 +02:00
Alexei Starovoitov
c108e3c1bd bpf: Fix bpf_attr.attach_btf_id check
Only raw_tracepoint program type can have bpf_attr.attach_btf_id >= 0.
Make sure to reject other program types that accidentally set it to non-zero.

Fixes: ccfe29eb29 ("bpf: Add attach_btf_id attribute to program load")
Reported-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20191018060933.2950231-1-ast@kernel.org
2019-10-18 20:55:54 +02:00
Zhengjun Xing
9fa8c9c647 tracing: Fix "gfp_t" format for synthetic events
In the format of synthetic events, the "gfp_t" is shown as "signed:1",
but in fact the "gfp_t" is "unsigned", should be shown as "signed:0".

The issue can be reproduced by the following commands:

echo 'memlatency u64 lat; unsigned int order; gfp_t gfp_flags; int migratetype' > /sys/kernel/debug/tracing/synthetic_events
cat  /sys/kernel/debug/tracing/events/synthetic/memlatency/format

name: memlatency
ID: 2233
format:
        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;

        field:u64 lat;  offset:8;       size:8; signed:0;
        field:unsigned int order;       offset:16;      size:4; signed:0;
        field:gfp_t gfp_flags;  offset:24;      size:4; signed:1;
        field:int migratetype;  offset:32;      size:4; signed:1;

print fmt: "lat=%llu, order=%u, gfp_flags=%x, migratetype=%d", REC->lat, REC->order, REC->gfp_flags, REC->migratetype

Link: http://lkml.kernel.org/r/20191018012034.6404-1-zhengjun.xing@linux.intel.com

Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Zhengjun Xing <zhengjun.xing@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-10-18 14:42:53 -04:00
Linus Torvalds
e59b76ff67 Power management fixes for 5.4-rc4
- Fix possible NULL pointer dereference in the ACPI processor
    scaling initialization code introduced by a recent cpufreq
    update (Rafael Wysocki).
 
  - Fix possible deadlock due to suspending cpufreq too late during
    system shutdown (Rafael Wysocki).
 
  - Make the PCI device system resume code path be more consistent
    with its PM-runtime counterpart to fix an issue with missing
    delay on transitions from D3cold to D0 during system resume from
    suspend-to-idle on some systems (Rafael Wysocki).
 
  - Drop Dell XPS13 9360 from the LPS0 Idle _DSM blacklist to make it
    use suspend-to-idle by default (Mario Limonciello).
 
  - Fix build warning in the core system suspend support code (Ben
    Dooks).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl2pgDcSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxsCQQAJ6L+CC5kEmjfkFeUvCR7xe37wEdeCYc
 NL13rDuKAKQDK6ayJtHp6LUPpSiotOzvjohFplLyn8dAZ7x8V0/KOOGeoL35GSlN
 u18AxkRXT79AzrW1SoDm2TnCCZYV0aW5RupGzPd4vso+E5Q5S3e9Ugk3PBxQfTYH
 YKOr7SpBBICma8ZgplR4tUTLb8brhR/NUAIOVyze52gUKh366kZbDD5tn8eHUxXq
 CBj4ryyV88aeQH2sM1sfvdIlDgFgHrO5GOzcDCCKrEtddqP/hSZyHU+OCH35Vu1v
 it96AEx58IEmd8I18tNZtu5rm1b89YkHeuoh/UF92zCYvk4RVg+Dw3AF0NZVjMQo
 vO7xyO93H48TClHcCwxIOhdI16bXidYfVR1JTeTINvaR7mNZtUddSCEFVKwCazbP
 FC55pVPeUx+HbC6TBQJ9gOSxO2yc2f+uLUXfnKgOpeFZpQuX/WhzuWqUlXgBF2QK
 FHOdfVvX1p1tE9U+3sMLyx1Sln+gqtd2P6GnGzewqIAl43LyJBC4l1s+nkE1b5fc
 7pa0rEqLu4NtGj5O+YQvp+luTQ9FcVL9K5tTuqV4V1Kt+/bzjfIvaEJpwHs44yrZ
 tP2/YCCiA84Hy8SqS4Txk+D14XVYuRguTMBzcmWj2POKFTbDsIhC9bAEZgfJZJNK
 /S2jf5CTnZtl
 =67IL
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These include a fix for a recent regression in the ACPI CPU
performance scaling code, a PCI device power management fix,
a system shutdown fix related to cpufreq, a removal of an ACPI
suspend-to-idle blacklist entry and a build warning fix.

Specifics:

   - Fix possible NULL pointer dereference in the ACPI processor scaling
     initialization code introduced by a recent cpufreq update (Rafael
     Wysocki).

   - Fix possible deadlock due to suspending cpufreq too late during
     system shutdown (Rafael Wysocki).

   - Make the PCI device system resume code path be more consistent with
     its PM-runtime counterpart to fix an issue with missing delay on
     transitions from D3cold to D0 during system resume from
     suspend-to-idle on some systems (Rafael Wysocki).

   - Drop Dell XPS13 9360 from the LPS0 Idle _DSM blacklist to make it
     use suspend-to-idle by default (Mario Limonciello).

   - Fix build warning in the core system suspend support code (Ben
     Dooks)"

* tag 'pm-5.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: processor: Avoid NULL pointer dereferences at init time
  PCI: PM: Fix pci_power_up()
  PM: sleep: include <linux/pm_runtime.h> for pm_wq
  cpufreq: Avoid cpufreq_suspend() deadlock on system shutdown
  ACPI: PM: Drop Dell XPS13 9360 from LPS0 Idle _DSM blacklist
2019-10-18 08:34:04 -07:00
Kefeng Wang
3da2e1fd46 trace: Use pr_warn instead of pr_warning
As said in commit f2c2cbcc35 ("powerpc: Use pr_warn instead of
pr_warning"), removing pr_warning so all logging messages use a
consistent <prefix>_warn style. Let's do it.

Link: http://lkml.kernel.org/r/20191018031850.48498-26-wangkefeng.wang@huawei.com
To: linux-kernel@vger.kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-10-18 15:01:57 +02:00
Kefeng Wang
fc65104c7c dma-debug: Use pr_warn instead of pr_warning
As said in commit f2c2cbcc35 ("powerpc: Use pr_warn instead of
pr_warning"), removing pr_warning so all logging messages use a
consistent <prefix>_warn style. Let's do it.

Link: http://lkml.kernel.org/r/20191018031850.48498-25-wangkefeng.wang@huawei.com
To: linux-kernel@vger.kernel.org
Cc: Christoph Hellwig <hch@lst.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-10-18 15:01:56 +02:00
Rafael J. Wysocki
b23eb5c74e Merge branches 'pm-cpufreq' and 'pm-sleep'
* pm-cpufreq:
  ACPI: processor: Avoid NULL pointer dereferences at init time
  cpufreq: Avoid cpufreq_suspend() deadlock on system shutdown

* pm-sleep:
  PM: sleep: include <linux/pm_runtime.h> for pm_wq
  ACPI: PM: Drop Dell XPS13 9360 from LPS0 Idle _DSM blacklist
2019-10-18 10:27:55 +02:00
Yunfeng Ye
d7e78706e4 perf/ring_buffer: Matching the memory allocate and free, in rb_alloc()
Currently perf_mmap_alloc_page() is used to allocate memory in
rb_alloc(), but using free_page() to free memory in the failure path.

It's better to use perf_mmap_free_page() instead.

Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <jolsa@redhat.co>
Cc: <acme@kernel.org>
Cc: <mingo@redhat.com>
Cc: <mark.rutland@arm.com>
Cc: <namhyung@kernel.org>
Cc: <alexander.shishkin@linux.intel.com>
Link: https://lkml.kernel.org/r/575c7e8c-90c7-4e3a-b41d-f894d8cdbd7f@huawei.com
2019-10-17 21:31:55 +02:00
Yunfeng Ye
8a9f91c51e perf/ring_buffer: Modify the parameter type of perf_mmap_free_page()
In perf_mmap_free_page(), the unsigned long type is converted to the
pointer type, but where the call is made, the pointer type is converted
to the unsigned long type. There is no need to do these operations.

Modify the parameter type of perf_mmap_free_page() to pointer type.

Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <jolsa@redhat.co>
Cc: <acme@kernel.org>
Cc: <mingo@redhat.com>
Cc: <mark.rutland@arm.com>
Cc: <namhyung@kernel.org>
Cc: <alexander.shishkin@linux.intel.com>
Link: https://lkml.kernel.org/r/e6ae3f0c-d04c-50f9-544a-aee3b30330cd@huawei.com
2019-10-17 21:31:55 +02:00
Joel Fernandes (Google)
da97e18458 perf_event: Add support for LSM and SELinux checks
In current mainline, the degree of access to perf_event_open(2) system
call depends on the perf_event_paranoid sysctl.  This has a number of
limitations:

1. The sysctl is only a single value. Many types of accesses are controlled
   based on the single value thus making the control very limited and
   coarse grained.
2. The sysctl is global, so if the sysctl is changed, then that means
   all processes get access to perf_event_open(2) opening the door to
   security issues.

This patch adds LSM and SELinux access checking which will be used in
Android to access perf_event_open(2) for the purposes of attaching BPF
programs to tracepoints, perf profiling and other operations from
userspace. These operations are intended for production systems.

5 new LSM hooks are added:
1. perf_event_open: This controls access during the perf_event_open(2)
   syscall itself. The hook is called from all the places that the
   perf_event_paranoid sysctl is checked to keep it consistent with the
   systctl. The hook gets passed a 'type' argument which controls CPU,
   kernel and tracepoint accesses (in this context, CPU, kernel and
   tracepoint have the same semantics as the perf_event_paranoid sysctl).
   Additionally, I added an 'open' type which is similar to
   perf_event_paranoid sysctl == 3 patch carried in Android and several other
   distros but was rejected in mainline [1] in 2016.

2. perf_event_alloc: This allocates a new security object for the event
   which stores the current SID within the event. It will be useful when
   the perf event's FD is passed through IPC to another process which may
   try to read the FD. Appropriate security checks will limit access.

3. perf_event_free: Called when the event is closed.

4. perf_event_read: Called from the read(2) and mmap(2) syscalls for the event.

5. perf_event_write: Called from the ioctl(2) syscalls for the event.

[1] https://lwn.net/Articles/696240/

Since Peter had suggest LSM hooks in 2016 [1], I am adding his
Suggested-by tag below.

To use this patch, we set the perf_event_paranoid sysctl to -1 and then
apply selinux checking as appropriate (default deny everything, and then
add policy rules to give access to domains that need it). In the future
we can remove the perf_event_paranoid sysctl altogether.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Co-developed-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: James Morris <jmorris@namei.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: rostedt@goodmis.org
Cc: Yonghong Song <yhs@fb.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: jeffv@google.com
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: primiano@google.com
Cc: Song Liu <songliubraving@fb.com>
Cc: rsavitski@google.com
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Matthew Garrett <matthewgarrett@google.com>
Link: https://lkml.kernel.org/r/20191014170308.70668-1-joel@joelfernandes.org
2019-10-17 21:31:55 +02:00
Valentin Schneider
9ae7ab20b4 sched/topology: Don't set SD_BALANCE_WAKE on cpuset domain relax
As pointed out in commit

  182a85f8a1 ("sched: Disable wakeup balancing")

SD_BALANCE_WAKE is a tad too aggressive, and is usually left unset.

However, it turns out cpuset domain relaxation will unconditionally set it
on domains below the relaxation level. This made sense back when
SD_BALANCE_WAKE was set unconditionally, but it no longer is the case.

We can improve things slightly by noticing that set_domain_attribute() is
always called after sd_init(), so rather than setting flags we can rely on
whatever sd_init() is doing and only clear certain flags when above the
relaxation level.

While at it, slightly clean up the function and flip the relax level
check to be more human readable.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: mingo@kernel.org
Cc: vincent.guittot@linaro.org
Cc: juri.lelli@redhat.com
Cc: seto.hidetoshi@jp.fujitsu.com
Cc: qperret@google.com
Cc: Dietmar.Eggemann@arm.com
Cc: morten.rasmussen@arm.com
Link: https://lkml.kernel.org/r/20191014164408.32596-1-valentin.schneider@arm.com
2019-10-17 21:31:54 +02:00
Alexei Starovoitov
a7658e1a41 bpf: Check types of arguments passed into helpers
Introduce new helper that reuses existing skb perf_event output
implementation, but can be called from raw_tracepoint programs
that receive 'struct sk_buff *' as tracepoint argument or
can walk other kernel data structures to skb pointer.

In order to do that teach verifier to resolve true C types
of bpf helpers into in-kernel BTF ids.
The type of kernel pointer passed by raw tracepoint into bpf
program will be tracked by the verifier all the way until
it's passed into helper function.
For example:
kfree_skb() kernel function calls trace_kfree_skb(skb, loc);
bpf programs receives that skb pointer and may eventually
pass it into bpf_skb_output() bpf helper which in-kernel is
implemented via bpf_skb_event_output() kernel function.
Its first argument in the kernel is 'struct sk_buff *'.
The verifier makes sure that types match all the way.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191016032505.2089704-11-ast@kernel.org
2019-10-17 16:44:36 +02:00
Alexei Starovoitov
3dec541b2e bpf: Add support for BTF pointers to x86 JIT
Pointer to BTF object is a pointer to kernel object or NULL.
Such pointers can only be used by BPF_LDX instructions.
The verifier changed their opcode from LDX|MEM|size
to LDX|PROBE_MEM|size to make JITing easier.
The number of entries in extable is the number of BPF_LDX insns
that access kernel memory via "pointer to BTF type".
Only these load instructions can fault.
Since x86 extable is relative it has to be allocated in the same
memory region as JITed code.
Allocate it prior to last pass of JITing and let the last pass populate it.
Pointer to extable in bpf_prog_aux is necessary to make page fault
handling fast.
Page fault handling is done in two steps:
1. bpf_prog_kallsyms_find() finds BPF program that page faulted.
   It's done by walking rb tree.
2. then extable for given bpf program is binary searched.
This process is similar to how page faulting is done for kernel modules.
The exception handler skips over faulting x86 instruction and
initializes destination register with zero. This mimics exact
behavior of bpf_probe_read (when probe_kernel_read faults dest is zeroed).

JITs for other architectures can add support in similar way.
Until then they will reject unknown opcode and fallback to interpreter.

Since extable should be aligned and placed near JITed code
make bpf_jit_binary_alloc() return 4 byte aligned image offset,
so that extable aligning formula in bpf_int_jit_compile() doesn't need
to rely on internal implementation of bpf_jit_binary_alloc().
On x86 gcc defaults to 16-byte alignment for regular kernel functions
due to better performance. JITed code may be aligned to 16 in the future,
but it will use 4 in the meantime.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191016032505.2089704-10-ast@kernel.org
2019-10-17 16:44:36 +02:00
Alexei Starovoitov
2a02759ef5 bpf: Add support for BTF pointers to interpreter
Pointer to BTF object is a pointer to kernel object or NULL.
The memory access in the interpreter has to be done via probe_kernel_read
to avoid page faults.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191016032505.2089704-9-ast@kernel.org
2019-10-17 16:44:36 +02:00
Alexei Starovoitov
ac4414b5ca bpf: Attach raw_tp program with BTF via type name
BTF type id specified at program load time has all
necessary information to attach that program to raw tracepoint.
Use kernel type name to find raw tracepoint.

Add missing CHECK_ATTR() condition.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191016032505.2089704-8-ast@kernel.org
2019-10-17 16:44:35 +02:00
Alexei Starovoitov
9e15db6613 bpf: Implement accurate raw_tp context access via BTF
libbpf analyzes bpf C program, searches in-kernel BTF for given type name
and stores it into expected_attach_type.
The kernel verifier expects this btf_id to point to something like:
typedef void (*btf_trace_kfree_skb)(void *, struct sk_buff *skb, void *loc);
which represents signature of raw_tracepoint "kfree_skb".

Then btf_ctx_access() matches ctx+0 access in bpf program with 'skb'
and 'ctx+8' access with 'loc' arguments of "kfree_skb" tracepoint.
In first case it passes btf_id of 'struct sk_buff *' back to the verifier core
and 'void *' in second case.

Then the verifier tracks PTR_TO_BTF_ID as any other pointer type.
Like PTR_TO_SOCKET points to 'struct bpf_sock',
PTR_TO_TCP_SOCK points to 'struct bpf_tcp_sock', and so on.
PTR_TO_BTF_ID points to in-kernel structs.
If 1234 is btf_id of 'struct sk_buff' in vmlinux's BTF
then PTR_TO_BTF_ID#1234 points to one of in kernel skbs.

When PTR_TO_BTF_ID#1234 is dereferenced (like r2 = *(u64 *)r1 + 32)
the btf_struct_access() checks which field of 'struct sk_buff' is
at offset 32. Checks that size of access matches type definition
of the field and continues to track the dereferenced type.
If that field was a pointer to 'struct net_device' the r2's type
will be PTR_TO_BTF_ID#456. Where 456 is btf_id of 'struct net_device'
in vmlinux's BTF.

Such verifier analysis prevents "cheating" in BPF C program.
The program cannot cast arbitrary pointer to 'struct sk_buff *'
and access it. C compiler would allow type cast, of course,
but the verifier will notice type mismatch based on BPF assembly
and in-kernel BTF.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191016032505.2089704-7-ast@kernel.org
2019-10-17 16:44:35 +02:00
Alexei Starovoitov
ccfe29eb29 bpf: Add attach_btf_id attribute to program load
Add attach_btf_id attribute to prog_load command.
It's similar to existing expected_attach_type attribute which is
used in several cgroup based program types.
Unfortunately expected_attach_type is ignored for
tracing programs and cannot be reused for new purpose.
Hence introduce attach_btf_id to verify bpf programs against
given in-kernel BTF type id at load time.
It is strictly checked to be valid for raw_tp programs only.
In a later patches it will become:
btf_id == 0 semantics of existing raw_tp progs.
btd_id > 0 raw_tp with BTF and additional type safety.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191016032505.2089704-5-ast@kernel.org
2019-10-17 16:44:35 +02:00
Alexei Starovoitov
8580ac9404 bpf: Process in-kernel BTF
If in-kernel BTF exists parse it and prepare 'struct btf *btf_vmlinux'
for further use by the verifier.
In-kernel BTF is trusted just like kallsyms and other build artifacts
embedded into vmlinux.
Yet run this BTF image through BTF verifier to make sure
that it is valid and it wasn't mangled during the build.
If in-kernel BTF is incorrect it means either gcc or pahole or kernel
are buggy. In such case disallow loading BPF programs.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191016032505.2089704-4-ast@kernel.org
2019-10-17 16:44:35 +02:00
Christian Brauner
1e1d0f0b1a
pid: use pid_has_task() in pidfd_open()
Use the new pid_has_task() helper in pidfd_open(). This simplifies the
code and avoids taking rcu_read_{lock,unlock}() and leads to overall
nicer code.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20191017101832.5985-5-christian.brauner@ubuntu.com
2019-10-17 15:37:00 +02:00
Christian Brauner
1722c14a20
exit: use pid_has_task() in do_wait()
Replace hlist_empty() with the new pid_has_task() helper which is more
idiomatic, easier to grep for, and unifies how callers perform this check.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20191017101832.5985-4-christian.brauner@ubuntu.com
2019-10-17 15:36:58 +02:00
Christian Brauner
1d416a113f
pid: use pid_has_task() in __change_pid()
Replace hlist_empty() with the new pid_has_task() helper which is more
idiomatic, easier to grep for, and unifies how callers perform this
check.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20191017101832.5985-3-christian.brauner@ubuntu.com
2019-10-17 15:36:56 +02:00
Christian Brauner
3d6d8da48d
pidfd: check pid has attached task in fdinfo
Currently, when a task is dead we still print the pid it used to use in
the fdinfo files of its pidfds. This doesn't make much sense since the
pid may have already been reused. So verify that the task is still alive
by introducing the pid_has_task() helper which will be used by other
callers in follow-up patches.
If the task is not alive anymore, we will print -1. This allows us to
differentiate between a task not being present in a given pid namespace
- in which case we already print 0 - and a task having been reaped.

Note that this uses PIDTYPE_PID for the check. Technically, we could've
checked PIDTYPE_TGID since pidfds currently only refer to thread-group
leaders but if they won't anymore in the future then this check becomes
problematic without it being immediately obvious to non-experts imho. If
a thread is created via clone(CLONE_THREAD) than struct pid has a single
non-empty list pid->tasks[PIDTYPE_PID] and this pid can't be used as a
PIDTYPE_TGID meaning pid->tasks[PIDTYPE_TGID] will return NULL even
though the thread-group leader might still be very much alive. So
checking PIDTYPE_PID is fine and is easier to maintain should we ever
allow pidfds to refer to threads.

Cc: Jann Horn <jannh@google.com>
Cc: Christian Kellner <christian@kellner.me>
Cc: linux-api@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20191017101832.5985-1-christian.brauner@ubuntu.com
2019-10-17 15:36:50 +02:00
Mark Rutland
b1fc583335 stop_machine: Avoid potential race behaviour
Both multi_cpu_stop() and set_state() access multi_stop_data::state
racily using plain accesses. These are subject to compiler
transformations which could break the intended behaviour of the code,
and this situation is detected by KCSAN on both arm64 and x86 (splats
below).

Improve matters by using READ_ONCE() and WRITE_ONCE() to ensure that the
compiler cannot elide, replay, or tear loads and stores.

In multi_cpu_stop() the two loads of multi_stop_data::state are expected to
be a consistent value, so snapshot the value into a temporary variable to
ensure this.

The state transitions are serialized by atomic manipulation of
multi_stop_data::num_threads, and other fields in multi_stop_data are not
modified while subject to concurrent reads.

KCSAN splat on arm64:

| BUG: KCSAN: data-race in multi_cpu_stop+0xa8/0x198 and set_state+0x80/0xb0
|
| write to 0xffff00001003bd00 of 4 bytes by task 24 on cpu 3:
|  set_state+0x80/0xb0
|  multi_cpu_stop+0x16c/0x198
|  cpu_stopper_thread+0x170/0x298
|  smpboot_thread_fn+0x40c/0x560
|  kthread+0x1a8/0x1b0
|  ret_from_fork+0x10/0x18
|
| read to 0xffff00001003bd00 of 4 bytes by task 14 on cpu 1:
|  multi_cpu_stop+0xa8/0x198
|  cpu_stopper_thread+0x170/0x298
|  smpboot_thread_fn+0x40c/0x560
|  kthread+0x1a8/0x1b0
|  ret_from_fork+0x10/0x18
|
| Reported by Kernel Concurrency Sanitizer on:
| CPU: 1 PID: 14 Comm: migration/1 Not tainted 5.3.0-00007-g67ab35a199f4-dirty #3
| Hardware name: linux,dummy-virt (DT)

KCSAN splat on x86:

| write to 0xffffb0bac0013e18 of 4 bytes by task 19 on cpu 2:
|  set_state kernel/stop_machine.c:170 [inline]
|  ack_state kernel/stop_machine.c:177 [inline]
|  multi_cpu_stop+0x1a4/0x220 kernel/stop_machine.c:227
|  cpu_stopper_thread+0x19e/0x280 kernel/stop_machine.c:516
|  smpboot_thread_fn+0x1a8/0x300 kernel/smpboot.c:165
|  kthread+0x1b5/0x200 kernel/kthread.c:255
|  ret_from_fork+0x35/0x40 arch/x86/entry/entry_64.S:352
|
| read to 0xffffb0bac0013e18 of 4 bytes by task 44 on cpu 7:
|  multi_cpu_stop+0xb4/0x220 kernel/stop_machine.c:213
|  cpu_stopper_thread+0x19e/0x280 kernel/stop_machine.c:516
|  smpboot_thread_fn+0x1a8/0x300 kernel/smpboot.c:165
|  kthread+0x1b5/0x200 kernel/kthread.c:255
|  ret_from_fork+0x35/0x40 arch/x86/entry/entry_64.S:352
|
| Reported by Kernel Concurrency Sanitizer on:
| CPU: 7 PID: 44 Comm: migration/7 Not tainted 5.3.0+ #1
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marco Elver <elver@google.com>
Link: https://lkml.kernel.org/r/20191007104536.27276-1-mark.rutland@arm.com
2019-10-17 12:47:12 +02:00
Dmitry Goldin
700dea5a0b kheaders: substituting --sort in archive creation
The option --sort=ORDER was only introduced in tar 1.28 (2014), which
is rather new and might not be available in some setups.

This patch tries to replicate the previous behaviour as closely as
possible to fix the kheaders build for older environments. It does
not produce identical archives compared to the previous version due
to minor sorting differences but produces reproducible results itself
in my tests.

Reported-by: Andreas Schwab <schwab@suse.de>
Signed-off-by: Dmitry Goldin <dgoldin+lkml@protonmail.ch>
Tested-by: Andreas Schwab <schwab@suse.de>
Tested-by: Quentin Perret <qperret@google.com>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-10-17 09:08:19 +09:00
Song Liu
eac9153f2b bpf/stackmap: Fix deadlock with rq_lock in bpf_get_stack()
bpf stackmap with build-id lookup (BPF_F_STACK_BUILD_ID) can trigger A-A
deadlock on rq_lock():

rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[...]
Call Trace:
 try_to_wake_up+0x1ad/0x590
 wake_up_q+0x54/0x80
 rwsem_wake+0x8a/0xb0
 bpf_get_stack+0x13c/0x150
 bpf_prog_fbdaf42eded9fe46_on_event+0x5e3/0x1000
 bpf_overflow_handler+0x60/0x100
 __perf_event_overflow+0x4f/0xf0
 perf_swevent_overflow+0x99/0xc0
 ___perf_sw_event+0xe7/0x120
 __schedule+0x47d/0x620
 schedule+0x29/0x90
 futex_wait_queue_me+0xb9/0x110
 futex_wait+0x139/0x230
 do_futex+0x2ac/0xa50
 __x64_sys_futex+0x13c/0x180
 do_syscall_64+0x42/0x100
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

This can be reproduced by:
1. Start a multi-thread program that does parallel mmap() and malloc();
2. taskset the program to 2 CPUs;
3. Attach bpf program to trace_sched_switch and gather stackmap with
   build-id, e.g. with trace.py from bcc tools:
   trace.py -U -p <pid> -s <some-bin,some-lib> t:sched:sched_switch

A sample reproducer is attached at the end.

This could also trigger deadlock with other locks that are nested with
rq_lock.

Fix this by checking whether irqs are disabled. Since rq_lock and all
other nested locks are irq safe, it is safe to do up_read() when irqs are
not disable. If the irqs are disabled, postpone up_read() in irq_work.

Fixes: 615755a77b ("bpf: extend stackmap to save binary_build_id+offset instead of address")
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20191014171223.357174-1-songliubraving@fb.com

Reproducer:
============================ 8< ============================

char *filename;

void *worker(void *p)
{
        void *ptr;
        int fd;
        char *pptr;

        fd = open(filename, O_RDONLY);
        if (fd < 0)
                return NULL;
        while (1) {
                struct timespec ts = {0, 1000 + rand() % 2000};

                ptr = mmap(NULL, 4096 * 64, PROT_READ, MAP_PRIVATE, fd, 0);
                usleep(1);
                if (ptr == MAP_FAILED) {
                        printf("failed to mmap\n");
                        break;
                }
                munmap(ptr, 4096 * 64);
                usleep(1);
                pptr = malloc(1);
                usleep(1);
                pptr[0] = 1;
                usleep(1);
                free(pptr);
                usleep(1);
                nanosleep(&ts, NULL);
        }
        close(fd);
        return NULL;
}

int main(int argc, char *argv[])
{
        void *ptr;
        int i;
        pthread_t threads[THREAD_COUNT];

        if (argc < 2)
                return 0;

        filename = argv[1];

        for (i = 0; i < THREAD_COUNT; i++) {
                if (pthread_create(threads + i, NULL, worker, NULL)) {
                        fprintf(stderr, "Error creating thread\n");
                        return 0;
                }
        }

        for (i = 0; i < THREAD_COUNT; i++)
                pthread_join(threads[i], NULL);
        return 0;
}
============================ 8< ============================
2019-10-16 10:37:52 -07:00
Ben Dooks
bc88f85c6c kthread: make __kthread_queue_delayed_work static
The __kthread_queue_delayed_work is not exported so
make it static, to avoid the following sparse warning:

  kernel/kthread.c:869:6: warning: symbol '__kthread_queue_delayed_work' was not declared. Should it be static?

Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-16 09:20:58 -07:00
Linus Torvalds
02755af0f3 Merge branch 'parisc-5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc fixes from Helge Deller:

 - Fix a parisc-specific fallout of Christoph's
   dma_set_mask_and_coherent() patches (Sven)

 - Fix a vmap memory leak in ioremap()/ioremap() (Helge)

 - Some minor cleanups and documentation updates (Nick, Helge)

* 'parisc-5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: Remove 32-bit DMA enforcement from sba_iommu
  parisc: Fix vmap memory leak in ioremap()/iounmap()
  parisc: prefer __section from compiler_attributes.h
  parisc: sysctl.c: Use CONFIG_PARISC instead of __hppa_ define
  MAINTAINERS: Add hp_sdc drivers to parisc arch
2019-10-15 09:37:01 -07:00
Christian Kellner
15d42eb26b pidfd: add NSpid entries to fdinfo
Currently, the fdinfo file contains the Pid field which shows the
pid a given pidfd refers to in the pid namespace of the procfs
instance. If pid namespaces are configured, also show an NSpid field
for easy retrieval of the pid in all descendant pid namespaces. If
the pid namespace of the process is not a descendant of the pid
namespace of the procfs instance 0 will be shown as its first NSpid
entry and no other entries will be shown. Add a block comment to
pidfd_show_fdinfo with a detailed explanation of Pid and NSpid fields.

Co-developed-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Christian Kellner <christian@kellner.me>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20191014162034.2185-1-ckellner@redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2019-10-15 12:17:00 +02:00
Helge Deller
b67114db64 parisc: sysctl.c: Use CONFIG_PARISC instead of __hppa_ define
Signed-off-by: Helge Deller <deller@gmx.de>
2019-10-14 21:43:54 +02:00
David S. Miller
a98d62c3ee Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2019-10-14

The following pull-request contains BPF updates for your *net-next* tree.

12 days of development and
85 files changed, 1889 insertions(+), 1020 deletions(-)

The main changes are:

1) auto-generation of bpf_helper_defs.h, from Andrii.

2) split of bpf_helpers.h into bpf_{helpers, helper_defs, endian, tracing}.h
   and move into libbpf, from Andrii.

3) Track contents of read-only maps as scalars in the verifier, from Andrii.

4) small x86 JIT optimization, from Daniel.

5) cross compilation support, from Ivan.

6) bpf flow_dissector enhancements, from Jakub and Stanislav.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-14 12:17:21 -07:00
Eric Dumazet
ff229eee3d hrtimer: Annotate lockless access to timer->base
Followup to commit dd2261ed45 ("hrtimer: Protect lockless access
to timer->base")

lock_hrtimer_base() fetches timer->base without lock exclusion.

Compiler is allowed to read timer->base twice (even if considered dumb)
which could end up trying to lock migration_base and return
&migration_base.

  base = timer->base;
  if (likely(base != &migration_base)) {

       /* compiler reads timer->base again, and now (base == &migration_base)

       raw_spin_lock_irqsave(&base->cpu_base->lock, *flags);
       if (likely(base == timer->base))
            return base; /* == &migration_base ! */

Similarly the write sides must use WRITE_ONCE() to avoid store tearing.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191008173204.180879-1-edumazet@google.com
2019-10-14 15:51:49 +02:00
Linus Torvalds
d4615e5a46 A few tracing fixes:
- Removed locked down from tracefs itself and moved it to the trace
    directory. Having the open functions there do the lockdown checks.
 
  - Fixed a few races with opening an instance file and the instance being
    deleted (Discovered during the locked down updates). Kept separate
    from the clean up code such that they can be backported to stable
    easier.
 
  - Cleaned up and consolidated the checks done when opening a trace
    file, as there were multiple checks that need to be done, and it
    did not make sense having them done in each open instance.
 
  - Fixed a regression in the record mcount code.
 
  - Small hw_lat detector tracer fixes.
 
  - A trace_pipe read fix due to not initializing trace_seq.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXaNhphQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6quDIAP4v08ARNdIh+r+c4AOBm3xsOuE/d9GB
 I56ydnskm+x2JQD6Ap9ivXe9yDBIErFeHNtCoq7pM8YDI4YoYIB30N0GfwM=
 =7oAu
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "A few tracing fixes:

   - Remove lockdown from tracefs itself and moved it to the trace
     directory. Have the open functions there do the lockdown checks.

   - Fix a few races with opening an instance file and the instance
     being deleted (Discovered during the lockdown updates). Kept
     separate from the clean up code such that they can be backported to
     stable easier.

   - Clean up and consolidated the checks done when opening a trace
     file, as there were multiple checks that need to be done, and it
     did not make sense having them done in each open instance.

   - Fix a regression in the record mcount code.

   - Small hw_lat detector tracer fixes.

   - A trace_pipe read fix due to not initializing trace_seq"

* tag 'trace-v5.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Initialize iter->seq after zeroing in tracing_read_pipe()
  tracing/hwlat: Don't ignore outer-loop duration when calculating max_latency
  tracing/hwlat: Report total time spent in all NMIs during the sample
  recordmcount: Fix nop_mcount() function
  tracing: Do not create tracefs files if tracefs lockdown is in effect
  tracing: Add locked_down checks to the open calls of files created for tracefs
  tracing: Add tracing_check_open_get_tr()
  tracing: Have trace events system open call tracing_open_generic_tr()
  tracing: Get trace_array reference for available_tracers files
  ftrace: Get a reference counter for the trace_array on filter files
  tracefs: Revert ccbd54ff54 ("tracefs: Restrict tracefs when the kernel is locked down")
2019-10-13 14:47:10 -07:00
Petr Mladek
d303de1fcf tracing: Initialize iter->seq after zeroing in tracing_read_pipe()
A customer reported the following softlockup:

[899688.160002] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [test.sh:16464]
[899688.160002] CPU: 0 PID: 16464 Comm: test.sh Not tainted 4.12.14-6.23-azure #1 SLE12-SP4
[899688.160002] RIP: 0010:up_write+0x1a/0x30
[899688.160002] Kernel panic - not syncing: softlockup: hung tasks
[899688.160002] RIP: 0010:up_write+0x1a/0x30
[899688.160002] RSP: 0018:ffffa86784d4fde8 EFLAGS: 00000257 ORIG_RAX: ffffffffffffff12
[899688.160002] RAX: ffffffff970fea00 RBX: 0000000000000001 RCX: 0000000000000000
[899688.160002] RDX: ffffffff00000001 RSI: 0000000000000080 RDI: ffffffff970fea00
[899688.160002] RBP: ffffffffffffffff R08: ffffffffffffffff R09: 0000000000000000
[899688.160002] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b59014720d8
[899688.160002] R13: ffff8b59014720c0 R14: ffff8b5901471090 R15: ffff8b5901470000
[899688.160002]  tracing_read_pipe+0x336/0x3c0
[899688.160002]  __vfs_read+0x26/0x140
[899688.160002]  vfs_read+0x87/0x130
[899688.160002]  SyS_read+0x42/0x90
[899688.160002]  do_syscall_64+0x74/0x160

It caught the process in the middle of trace_access_unlock(). There is
no loop. So, it must be looping in the caller tracing_read_pipe()
via the "waitagain" label.

Crashdump analyze uncovered that iter->seq was completely zeroed
at this point, including iter->seq.seq.size. It means that
print_trace_line() was never able to print anything and
there was no forward progress.

The culprit seems to be in the code:

	/* reset all but tr, trace, and overruns */
	memset(&iter->seq, 0,
	       sizeof(struct trace_iterator) -
	       offsetof(struct trace_iterator, seq));

It was added by the commit 53d0aa7730 ("ftrace:
add logic to record overruns"). It was v2.6.27-rc1.
It was the time when iter->seq looked like:

     struct trace_seq {
	unsigned char		buffer[PAGE_SIZE];
	unsigned int		len;
     };

There was no "size" variable and zeroing was perfectly fine.

The solution is to reinitialize the structure after or without
zeroing.

Link: http://lkml.kernel.org/r/20191011142134.11997-1-pmladek@suse.com

Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-10-12 20:49:34 -04:00
Srivatsa S. Bhat (VMware)
fc64e4ad80 tracing/hwlat: Don't ignore outer-loop duration when calculating max_latency
max_latency is intended to record the maximum ever observed hardware
latency, which may occur in either part of the loop (inner/outer). So
we need to also consider the outer-loop sample when updating
max_latency.

Link: http://lkml.kernel.org/r/157073345463.17189.18124025522664682811.stgit@srivatsa-ubuntu

Fixes: e7c15cd8a1 ("tracing: Added hardware latency tracer")
Cc: stable@vger.kernel.org
Signed-off-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-10-12 20:49:33 -04:00
Srivatsa S. Bhat (VMware)
98dc19c114 tracing/hwlat: Report total time spent in all NMIs during the sample
nmi_total_ts is supposed to record the total time spent in *all* NMIs
that occur on the given CPU during the (active portion of the)
sampling window. However, the code seems to be overwriting this
variable for each NMI, thereby only recording the time spent in the
most recent NMI. Fix it by accumulating the duration instead.

Link: http://lkml.kernel.org/r/157073343544.17189.13911783866738671133.stgit@srivatsa-ubuntu

Fixes: 7b2c862501 ("tracing: Add NMI tracing in hwlat detector")
Cc: stable@vger.kernel.org
Signed-off-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-10-12 20:49:33 -04:00
Steven Rostedt (VMware)
17911ff38a tracing: Add locked_down checks to the open calls of files created for tracefs
Added various checks on open tracefs calls to see if tracefs is in lockdown
mode, and if so, to return -EPERM.

Note, the event format files (which are basically standard on all machines)
as well as the enabled_functions file (which shows what is currently being
traced) are not lockde down. Perhaps they should be, but it seems counter
intuitive to lockdown information to help you know if the system has been
modified.

Link: http://lkml.kernel.org/r/CAHk-=wj7fGPKUspr579Cii-w_y60PtRaiDgKuxVtBAMK0VNNkA@mail.gmail.com

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-10-12 20:48:06 -04:00
Steven Rostedt (VMware)
8530dec63e tracing: Add tracing_check_open_get_tr()
Currently, most files in the tracefs directory test if tracing_disabled is
set. If so, it should return -ENODEV. The tracing_disabled is called when
tracing is found to be broken. Originally it was done in case the ring
buffer was found to be corrupted, and we wanted to prevent reading it from
crashing the kernel. But it's also called if a tracing selftest fails on
boot. It's a one way switch. That is, once it is triggered, tracing is
disabled until reboot.

As most tracefs files can also be used by instances in the tracefs
directory, they need to be carefully done. Each instance has a trace_array
associated to it, and when the instance is removed, the trace_array is
freed. But if an instance is opened with a reference to the trace_array,
then it requires looking up the trace_array to get its ref counter (as there
could be a race with it being deleted and the open itself). Once it is
found, a reference is added to prevent the instance from being removed (and
the trace_array associated with it freed).

Combine the two checks (tracing_disabled and trace_array_get()) into a
single helper function. This will also make it easier to add lockdown to
tracefs later.

Link: http://lkml.kernel.org/r/20191011135458.7399da44@gandalf.local.home

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-10-12 20:44:07 -04:00
Steven Rostedt (VMware)
aa07d71f1b tracing: Have trace events system open call tracing_open_generic_tr()
Instead of having the trace events system open call open code the taking of
the trace_array descriptor (with trace_array_get()) and then calling
trace_open_generic(), have it use the tracing_open_generic_tr() that does
the combination of the two. This requires making tracing_open_generic_tr()
global.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-10-12 20:43:00 -04:00
Steven Rostedt (VMware)
194c2c74f5 tracing: Get trace_array reference for available_tracers files
As instances may have different tracers available, we need to look at the
trace_array descriptor that shows the list of the available tracers for the
instance. But there's a race between opening the file and an admin
deleting the instance. The trace_array_get() needs to be called before
accessing the trace_array.

Cc: stable@vger.kernel.org
Fixes: 607e2ea167 ("tracing: Set up infrastructure to allow tracers for instances")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-10-12 20:40:50 -04:00
Steven Rostedt (VMware)
9ef16693af ftrace: Get a reference counter for the trace_array on filter files
The ftrace set_ftrace_filter and set_ftrace_notrace files are specific for
an instance now. They need to take a reference to the instance otherwise
there could be a race between accessing the files and deleting the instance.

It wasn't until the :mod: caching where these file operations started
referencing the trace_array directly.

Cc: stable@vger.kernel.org
Fixes: 673feb9d76 ("ftrace: Add :mod: caching infrastructure to trace_array")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-10-12 20:40:21 -04:00
Linus Torvalds
328fefadd9 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Two fixes: a guest-cputime accounting fix, and a cgroup bandwidth
  quota precision fix"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/vtime: Fix guest/system mis-accounting on task switch
  sched/fair: Scale bandwidth quota and period without losing quota/period ratio precision
2019-10-12 15:29:54 -07:00
Linus Torvalds
465a7e291f Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Mostly tooling fixes, but also a couple of updates for new Intel
  models (which are technically hw-enablement, but to users it's a fix
  to perf behavior on those new CPUs - hope this is fine), an AUX
  inheritance fix, event time-sharing fix, and a fix for lost non-perf
  NMI events on AMD systems"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
  perf/x86/cstate: Add Tiger Lake CPU support
  perf/x86/msr: Add Tiger Lake CPU support
  perf/x86/intel: Add Tiger Lake CPU support
  perf/x86/cstate: Update C-state counters for Ice Lake
  perf/x86/msr: Add new CPU model numbers for Ice Lake
  perf/x86/cstate: Add Comet Lake CPU support
  perf/x86/msr: Add Comet Lake CPU support
  perf/x86/intel: Add Comet Lake CPU support
  perf/x86/amd: Change/fix NMI latency mitigation to use a timestamp
  perf/core: Fix corner case in perf_rotate_context()
  perf/core: Rework memory accounting in perf_mmap()
  perf/core: Fix inheritance of aux_output groups
  perf annotate: Don't return -1 for error when doing BPF disassembly
  perf annotate: Return appropriate error code for allocation failures
  perf annotate: Fix arch specific ->init() failure errors
  perf annotate: Propagate the symbol__annotate() error return
  perf annotate: Fix the signedness of failure returns
  perf annotate: Propagate perf_env__arch() error
  perf evsel: Fall back to global 'perf_env' in perf_evsel__env()
  perf tools: Propagate get_cpuid() error
  ...
2019-10-12 15:15:17 -07:00
Andrii Nakryiko
2dedd7d216 bpf: Fix cast to pointer from integer of different size warning
Fix "warning: cast to pointer from integer of different size" when
casting u64 addr to void *.

Fixes: a23740ec43 ("bpf: Track contents of read-only maps as scalars")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191011172053.2980619-1-andriin@fb.com
2019-10-11 22:28:47 +02:00
Linus Torvalds
297cbcccc2 for-linus-20191010
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl2f5MIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpvscD/4v8E1s1rt6JwqM0Fa27UjRdhfGnc8ad8vs
 fD7rf3ZmLkoM1apVopMcAscUH726wU4qbxwEUDEntxxv2wJHuZdSZ64zhFJ17uis
 uJ2pF4MpK/m6DnHZu/SAU4t9aU+l6SBqX0tS1bFycecPgGRk46jrVX5tNJggt0Fy
 hqmx3ACWbkGiFDERT2AAQ69WHfmzeI9aUjx3jJY2eLnK7OjjEpyoEBs0j/AHl3ep
 kydhItU5NSFCv94X7vmZy/dvQ5hE4/1HTFfg79fOZcywQi1AN5DafKxiM2kgaSJ0
 jW58i+AFVtUPysNpVsxvjAgqGwDX/UJkOkggPd6V8/6LMfEvBKY4YNXlUEbqTN3Y
 pqn19/cgdKHaQpHKqwettcQujc71kry/yHsaudD+g2fi0efYi3d4qxIp9XA0TF03
 z6jzp8Hfo2SKbwapIFPa7Wqj86ZpbBxtROibCA17WKSNzn0UR3pJmEigo4l364ow
 nJpvZChLDHZXjovgzISmUnbR+O1yP0+ZnI9b7kgNp0UV4SI5ajf6f2T7667dcQs0
 J1GNt4QvqPza3R0z1SuoEi6tbc3GyMj7NZyIseNOXR/NtqXEWtiNvDIuZqs6Wn/T
 4GhaF0Mjqc17B3UEkdU1z09HL0JR40vUrGYE4lDxHhPWd0YngDGJJX2pZG2Y0WBp
 VQ20AzijzQ==
 =wZnt
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20191010' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Fix wbt performance regression introduced with the blk-rq-qos
   refactoring (Harshad)

 - Fix io_uring fileset removal inadvertently killing the workqueue (me)

 - Fix io_uring typo in linked command nonblock submission (Pavel)

 - Remove spurious io_uring wakeups on request free (Pavel)

 - Fix null_blk zoned command error return (Keith)

 - Don't use freezable workqueues for backing_dev, also means we can
   revert a previous libata hack (Mika)

 - Fix nbd sysfs mutex dropped too soon at removal time (Xiubo)

* tag 'for-linus-20191010' of git://git.kernel.dk/linux-block:
  nbd: fix possible sysfs duplicate warning
  null_blk: Fix zoned command return code
  io_uring: only flush workqueues on fileset removal
  io_uring: remove wait loop spurious wakeups
  blk-wbt: fix performance regression in wbt scale_up/scale_down
  Revert "libata, freezer: avoid block device removal while system is frozen"
  bdi: Do not use freezable workqueue
  io_uring: fix reversed nonblock flag for link submission
2019-10-11 08:45:32 -07:00
Oleg Nesterov
937c6b27c7 cgroup: freezer: call cgroup_enter_frozen() with preemption disabled in ptrace_stop()
ptrace_stop() does preempt_enable_no_resched() to avoid the preemption,
but after that cgroup_enter_frozen() does spin_lock/unlock and this adds
another preemption point.

Reported-and-tested-by: Bruce Ashfield <bruce.ashfield@gmail.com>
Fixes: 76f969e894 ("cgroup: cgroup v2 freezer")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-10-11 08:39:57 -07:00
Andrii Nakryiko
a23740ec43 bpf: Track contents of read-only maps as scalars
Maps that are read-only both from BPF program side and user space side
have their contents constant, so verifier can track referenced values
precisely and use that knowledge for dead code elimination, branch
pruning, etc. This patch teaches BPF verifier how to do this.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20191009201458.2679171-2-andriin@fb.com
2019-10-11 01:49:15 +02:00
Christian Brauner
fb3c5386b3 seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE
This allows the seccomp notifier to continue a syscall. A positive
discussion about this feature was triggered by a post to the
ksummit-discuss mailing list (cf. [3]) and took place during KSummit
(cf. [1]) and again at the containers/checkpoint-restore
micro-conference at Linux Plumbers.

Recently we landed seccomp support for SECCOMP_RET_USER_NOTIF (cf. [4])
which enables a process (watchee) to retrieve an fd for its seccomp
filter. This fd can then be handed to another (usually more privileged)
process (watcher). The watcher will then be able to receive seccomp
messages about the syscalls having been performed by the watchee.

This feature is heavily used in some userspace workloads. For example,
it is currently used to intercept mknod() syscalls in user namespaces
aka in containers.
The mknod() syscall can be easily filtered based on dev_t. This allows
us to only intercept a very specific subset of mknod() syscalls.
Furthermore, mknod() is not possible in user namespaces toto coelo and
so intercepting and denying syscalls that are not in the whitelist on
accident is not a big deal. The watchee won't notice a difference.

In contrast to mknod(), a lot of other syscall we intercept (e.g.
setxattr()) cannot be easily filtered like mknod() because they have
pointer arguments. Additionally, some of them might actually succeed in
user namespaces (e.g. setxattr() for all "user.*" xattrs). Since we
currently cannot tell seccomp to continue from a user notifier we are
stuck with performing all of the syscalls in lieu of the container. This
is a huge security liability since it is extremely difficult to
correctly assume all of the necessary privileges of the calling task
such that the syscall can be successfully emulated without escaping
other additional security restrictions (think missing CAP_MKNOD for
mknod(), or MS_NODEV on a filesystem etc.). This can be solved by
telling seccomp to resume the syscall.

One thing that came up in the discussion was the problem that another
thread could change the memory after userspace has decided to let the
syscall continue which is a well known TOCTOU with seccomp which is
present in other ways already.
The discussion showed that this feature is already very useful for any
syscall without pointer arguments. For any accidentally intercepted
non-pointer syscall it is safe to continue.
For syscalls with pointer arguments there is a race but for any cautious
userspace and the main usec cases the race doesn't matter. The notifier
is intended to be used in a scenario where a more privileged watcher
supervises the syscalls of lesser privileged watchee to allow it to get
around kernel-enforced limitations by performing the syscall for it
whenever deemed save by the watcher. Hence, if a user tricks the watcher
into allowing a syscall they will either get a deny based on
kernel-enforced restrictions later or they will have changed the
arguments in such a way that they manage to perform a syscall with
arguments that they would've been allowed to do anyway.
In general, it is good to point out again, that the notifier fd was not
intended to allow userspace to implement a security policy but rather to
work around kernel security mechanisms in cases where the watcher knows
that a given action is safe to perform.

/* References */
[1]: https://linuxplumbersconf.org/event/4/contributions/560
[2]: https://linuxplumbersconf.org/event/4/contributions/477
[3]: https://lore.kernel.org/r/20190719093538.dhyopljyr5ns33qx@brauner.io
[4]: commit 6a21cc50f0 ("seccomp: add a return code to trap to userspace")

Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Tycho Andersen <tycho@tycho.ws>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
CC: Tyler Hicks <tyhicks@canonical.com>
Link: https://lore.kernel.org/r/20190920083007.11475-2-christian.brauner@ubuntu.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2019-10-10 14:45:51 -07:00
Douglas Anderson
2277b49258 kdb: Fix stack crawling on 'running' CPUs that aren't the master
In kdb when you do 'btc' (back trace on CPU) it doesn't necessarily
give you the right info.  Specifically on many architectures
(including arm64, where I tested) you can't dump the stack of a
"running" process that isn't the process running on the current CPU.
This can be seen by this:

 echo SOFTLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
 # wait 2 seconds
 <sysrq>g

Here's what I see now on rk3399-gru-kevin.  I see the stack crawl for
the CPU that handled the sysrq but everything else just shows me stuck
in __switch_to() which is bogus:

======

[0]kdb> btc
btc: cpu status: Currently on cpu 0
Available cpus: 0, 1-3(I), 4, 5(I)
Stack traceback for pid 0
0xffffff801101a9c0        0        0  1    0   R  0xffffff801101b3b0 *swapper/0
Call trace:
 dump_backtrace+0x0/0x138
 ...
 kgdb_compiled_brk_fn+0x34/0x44
 ...
 sysrq_handle_dbg+0x34/0x5c
Stack traceback for pid 0
0xffffffc0f175a040        0        0  1    1   I  0xffffffc0f175aa30  swapper/1
Call trace:
 __switch_to+0x1e4/0x240
 0xffffffc0f65616c0
Stack traceback for pid 0
0xffffffc0f175d040        0        0  1    2   I  0xffffffc0f175da30  swapper/2
Call trace:
 __switch_to+0x1e4/0x240
 0xffffffc0f65806c0
Stack traceback for pid 0
0xffffffc0f175b040        0        0  1    3   I  0xffffffc0f175ba30  swapper/3
Call trace:
 __switch_to+0x1e4/0x240
 0xffffffc0f659f6c0
Stack traceback for pid 1474
0xffffffc0dde8b040     1474      727  1    4   R  0xffffffc0dde8ba30  bash
Call trace:
 __switch_to+0x1e4/0x240
 __schedule+0x464/0x618
 0xffffffc0dde8b040
Stack traceback for pid 0
0xffffffc0f17b0040        0        0  1    5   I  0xffffffc0f17b0a30  swapper/5
Call trace:
 __switch_to+0x1e4/0x240
 0xffffffc0f65dd6c0

===

The problem is that 'btc' eventually boils down to
  show_stack(task_struct, NULL);

...and show_stack() doesn't work for "running" CPUs because their
registers haven't been stashed.

On x86 things might work better (I haven't tested) because kdb has a
special case for x86 in kdb_show_stack() where it passes the stack
pointer to show_stack().  This wouldn't work on arm64 where the stack
crawling function seems needs the "fp" and "pc", not the "sp" which is
presumably why arm64's show_stack() function totally ignores the "sp"
parameter.

NOTE: we _can_ get a good stack dump for all the cpus if we manually
switch each one to the kdb master and do a back trace.  AKA:
  cpu 4
  bt
...will give the expected trace.  That's because now arm64's
dump_backtrace will now see that "tsk == current" and go through a
different path.

In this patch I fix the problems by catching a request to stack crawl
a task that's running on a CPU and then I ask that CPU to do the stack
crawl.

NOTE: this will (presumably) change what stack crawls are printed for
x86 machines.  Now kdb functions will show up in the stack crawl.
Presumably this is OK but if it's not we can go back and add a special
case for x86 again.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2019-10-10 16:28:48 +01:00
Douglas Anderson
55a7e23f46 kdb: Fix "btc <cpu>" crash if the CPU didn't round up
I noticed that when I did "btc <cpu>" and the CPU I passed in hadn't
rounded up that I'd crash.  I was going to copy the same fix from
commit 162bc7f5af ("kdb: Don't back trace on a cpu that didn't round
up") into the "not all the CPUs" case, but decided it'd be better to
clean things up a little bit.

This consolidates the two code paths.  It is _slightly_ wasteful in in
that the checks for "cpu" being too small or being offline isn't
really needed when we're iterating over all online CPUs, but that
really shouldn't hurt.  Better to have the same code path.

While at it, eliminate at least one slightly ugly (and totally
needless) recursive use of kdb_parse().

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2019-10-10 16:28:14 +01:00
Douglas Anderson
54af3e39ee kdb: Remove unused "argcount" param from kdb_bt1(); make btaprompt bool
The kdb_bt1() had a mysterious "argcount" parameter passed in (always
the number 5, by the way) and never used.  Presumably this is just old
cruft.  Remove it.  While at it, upgrade the btaprompt parameter to a
full fledged bool instead of an int.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2019-10-10 16:28:08 +01:00
Douglas Anderson
0f8b5b6d56 kgdb: Remove unused DCPU_SSTEP definition
From doing a 'git log --patch kernel/debug', it looks as if DCPU_SSTEP
has never been used.  Presumably it used to be used back when kgdb was
out of tree and nobody thought to delete the definition when the usage
went away.  Delete.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Jason Wessel <jason.wessel@windriver.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2019-10-10 16:27:52 +01:00
Ben Dooks
f49249d58a PM: sleep: include <linux/pm_runtime.h> for pm_wq
Include the <linux/runtime_pm.h> for the definition of
pm_wq to avoid the following warning:

kernel/power/main.c:890:25: warning: symbol 'pm_wq' was not declared. Should it be static?

Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-10-10 11:11:56 +02:00
Qian Cai
5facae4f35 locking/lockdep: Remove unused @nested argument from lock_release()
Since the following commit:

  b4adfe8e05 ("locking/lockdep: Remove unused argument in __lock_release")

@nested is no longer used in lock_release(), so remove it from all
lock_release() calls and friends.

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: airlied@linux.ie
Cc: akpm@linux-foundation.org
Cc: alexander.levin@microsoft.com
Cc: daniel@iogearbox.net
Cc: davem@davemloft.net
Cc: dri-devel@lists.freedesktop.org
Cc: duyuyang@gmail.com
Cc: gregkh@linuxfoundation.org
Cc: hannes@cmpxchg.org
Cc: intel-gfx@lists.freedesktop.org
Cc: jack@suse.com
Cc: jlbec@evilplan.or
Cc: joonas.lahtinen@linux.intel.com
Cc: joseph.qi@linux.alibaba.com
Cc: jslaby@suse.com
Cc: juri.lelli@redhat.com
Cc: maarten.lankhorst@linux.intel.com
Cc: mark@fasheh.com
Cc: mhocko@kernel.org
Cc: mripard@kernel.org
Cc: ocfs2-devel@oss.oracle.com
Cc: rodrigo.vivi@intel.com
Cc: sean@poorly.run
Cc: st@kernel.org
Cc: tj@kernel.org
Cc: tytso@mit.edu
Cc: vdavydov.dev@gmail.com
Cc: vincent.guittot@linaro.org
Cc: viro@zeniv.linux.org.uk
Link: https://lkml.kernel.org/r/1568909380-32199-1-git-send-email-cai@lca.pw
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-09 12:46:10 +02:00
Song Liu
7fa343b7fd perf/core: Fix corner case in perf_rotate_context()
In perf_rotate_context(), when the first cpu flexible event fail to
schedule, cpu_rotate is 1, while cpu_event is NULL. Since cpu_event is
NULL, perf_rotate_context will _NOT_ call cpu_ctx_sched_out(), thus
cpuctx->ctx.is_active will have EVENT_FLEXIBLE set. Then, the next
perf_event_sched_in() will skip all cpu flexible events because of the
EVENT_FLEXIBLE bit.

In the next call of perf_rotate_context(), cpu_rotate stays 1, and
cpu_event stays NULL, so this process repeats. The end result is, flexible
events on this cpu will not be scheduled (until another event being added
to the cpuctx).

Here is an easy repro of this issue. On Intel CPUs, where ref-cycles
could only use one counter, run one pinned event for ref-cycles, one
flexible event for ref-cycles, and one flexible event for cycles. The
flexible ref-cycles is never scheduled, which is expected. However,
because of this issue, the cycles event is never scheduled either.

 $ perf stat -e ref-cycles:D,ref-cycles,cycles -C 5 -I 1000

           time             counts unit events
    1.000152973         15,412,480      ref-cycles:D
    1.000152973      <not counted>      ref-cycles     (0.00%)
    1.000152973      <not counted>      cycles         (0.00%)
    2.000486957         18,263,120      ref-cycles:D
    2.000486957      <not counted>      ref-cycles     (0.00%)
    2.000486957      <not counted>      cycles         (0.00%)

To fix this, when the flexible_active list is empty, try rotate the
first event in the flexible_groups. Also, rename ctx_first_active() to
ctx_event_to_rotate(), which is more accurate.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <kernel-team@fb.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 8d5bce0c37 ("perf/core: Optimize perf_rotate_context() event scheduling")
Link: https://lkml.kernel.org/r/20191008165949.920548-1-songliubraving@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-09 12:44:13 +02:00
Song Liu
d44248a413 perf/core: Rework memory accounting in perf_mmap()
perf_mmap() always increases user->locked_vm. As a result, "extra" could
grow bigger than "user_extra", which doesn't make sense. Here is an
example case:

(Note: Assume "user_lock_limit" is very small.)

  | # of perf_mmap calls |vma->vm_mm->pinned_vm|user->locked_vm|
  | 0                    | 0                   | 0             |
  | 1                    | user_extra          | user_extra    |
  | 2                    | 3 * user_extra      | 2 * user_extra|
  | 3                    | 6 * user_extra      | 3 * user_extra|
  | 4                    | 10 * user_extra     | 4 * user_extra|

Fix this by maintaining proper user_extra and extra.

Reviewed-By: Hechao Li <hechaol@fb.com>
Reported-by: Hechao Li <hechaol@fb.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <kernel-team@fb.com>
Cc: Jie Meng <jmeng@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190904214618.3795672-1-songliubraving@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-09 12:44:12 +02:00
Frederic Weisbecker
8d495477d6 sched/cputime: Spare a seqcount lock/unlock cycle on context switch
On context switch we are locking the vtime seqcount of the scheduling-out
task twice:

 * On vtime_task_switch_common(), when we flush the pending vtime through
   vtime_account_system()

 * On arch_vtime_task_switch() to reset the vtime state.

This is pointless as these actions can be performed without the need
to unlock/lock in the middle. The reason these steps are separated is to
consolidate a very small amount of common code between
CONFIG_VIRT_CPU_ACCOUNTING_GEN and CONFIG_VIRT_CPU_ACCOUNTING_NATIVE.

Performance in this fast path is definitely a priority over artificial
code factorization so split the task switch code between GEN and
NATIVE and mutualize the parts than can run under a single seqcount
locked block.

As a side effect, vtime_account_idle() becomes included in the seqcount
protection. This happens to be a welcome preparation in order to
properly support kcpustat under vtime in the future and fetch
CPUTIME_IDLE without race.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Link: https://lkml.kernel.org/r/20191003161745.28464-3-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-09 12:39:26 +02:00
Frederic Weisbecker
f83eeb1a01 sched/cputime: Rename vtime_account_system() to vtime_account_kernel()
vtime_account_system() decides if we need to account the time to the
system (__vtime_account_system()) or to the guest (vtime_account_guest()).

So this function is a misnomer as we are on a higher level than
"system". All we know when we call that function is that we are
accounting kernel cputime. Whether it belongs to guest or system time
is a lower level detail.

Rename this function to vtime_account_kernel(). This will clarify things
and avoid too many underscored vtime_account_system() versions.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Link: https://lkml.kernel.org/r/20191003161745.28464-2-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-09 12:39:25 +02:00
Frederic Weisbecker
68e7a4d66b sched/vtime: Fix guest/system mis-accounting on task switch
vtime_account_system() assumes that the target task to account cputime
to is always the current task. This is most often true indeed except on
task switch where we call:

	vtime_common_task_switch(prev)
		vtime_account_system(prev)

Here prev is the scheduling-out task where we account the cputime to. It
doesn't match current that is already the scheduling-in task at this
stage of the context switch.

So we end up checking the wrong task flags to determine if we are
accounting guest or system time to the previous task.

As a result the wrong task is used to check if the target is running in
guest mode. We may then spuriously account or leak either system or
guest time on task switch.

Fix this assumption and also turn vtime_guest_enter/exit() to use the
task passed in parameter as well to avoid future similar issues.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpengli@tencent.com>
Fixes: 2a42eb9594 ("sched/cputime: Accumulate vtime on top of nsec clocksource")
Link: https://lkml.kernel.org/r/20190925214242.21873-1-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-09 12:38:03 +02:00
Xuewei Zhang
4929a4e6fa sched/fair: Scale bandwidth quota and period without losing quota/period ratio precision
The quota/period ratio is used to ensure a child task group won't get
more bandwidth than the parent task group, and is calculated as:

  normalized_cfs_quota() = [(quota_us << 20) / period_us]

If the quota/period ratio was changed during this scaling due to
precision loss, it will cause inconsistency between parent and child
task groups.

See below example:

A userspace container manager (kubelet) does three operations:

 1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
 2) Create a few children cgroups.
 3) Set quota to 1,000us and period to 10,000us on a child cgroup.

These operations are expected to succeed. However, if the scaling of
147/128 happens before step 3, quota and period of the parent cgroup
will be changed:

  new_quota: 1148437ns,   1148us
 new_period: 11484375ns, 11484us

And when step 3 comes in, the ratio of the child cgroup will be
104857, which will be larger than the parent cgroup ratio (104821),
and will fail.

Scaling them by a factor of 2 will fix the problem.

Tested-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Xuewei Zhang <xueweiz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Phil Auld <pauld@redhat.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Fixes: 2e8e192263 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
Link: https://lkml.kernel.org/r/20191004001243.140897-1-xueweiz@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-09 12:38:02 +02:00
Linus Torvalds
eda57a0e42 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "The usual shower of hotfixes.

  Chris's memcg patches aren't actually fixes - they're mature but a few
  niggling review issues were late to arrive.

  The ocfs2 fixes are quite old - those took some time to get reviewer
  attention.

  Subsystems affected by this patch series: ocfs2, hotfixes, mm/memcg,
  mm/slab-generic"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)
  mm, sl[ou]b: improve memory accounting
  mm, memcg: make scan aggression always exclude protection
  mm, memcg: make memory.emin the baseline for utilisation determination
  mm, memcg: proportional memory.{low,min} reclaim
  mm/vmpressure.c: fix a signedness bug in vmpressure_register_event()
  mm/page_alloc.c: fix a crash in free_pages_prepare()
  mm/z3fold.c: claim page in the beginning of free
  kernel/sysctl.c: do not override max_threads provided by userspace
  memcg: only record foreign writebacks with dirty pages when memcg is not disabled
  mm: fix -Wmissing-prototypes warnings
  writeback: fix use-after-free in finish_writeback_work()
  mm/memremap: drop unused SECTION_SIZE and SECTION_MASK
  panic: ensure preemption is disabled during panic()
  fs: ocfs2: fix a possible null-pointer dereference in ocfs2_info_scan_inode_alloc()
  fs: ocfs2: fix a possible null-pointer dereference in ocfs2_write_end_nolock()
  fs: ocfs2: fix possible null-pointer dereferences in ocfs2_xa_prepare_entry()
  ocfs2: clear zero in unaligned direct IO
2019-10-07 16:04:19 -07:00
Michal Hocko
b0f53dbc4b kernel/sysctl.c: do not override max_threads provided by userspace
Partially revert 16db3d3f11 ("kernel/sysctl.c: threads-max observe
limits") because the patch is causing a regression to any workload which
needs to override the auto-tuning of the limit provided by kernel.

set_max_threads is implementing a boot time guesstimate to provide a
sensible limit of the concurrently running threads so that runaways will
not deplete all the memory.  This is a good thing in general but there
are workloads which might need to increase this limit for an application
to run (reportedly WebSpher MQ is affected) and that is simply not
possible after the mentioned change.  It is also very dubious to
override an admin decision by an estimation that doesn't have any direct
relation to correctness of the kernel operation.

Fix this by dropping set_max_threads from sysctl_max_threads so any
value is accepted as long as it fits into MAX_THREADS which is important
to check because allowing more threads could break internal robust futex
restriction.  While at it, do not use MIN_THREADS as the lower boundary
because it is also only a heuristic for automatic estimation and admin
might have a good reason to stop new threads to be created even when
below this limit.

This became more severe when we switched x86 from 4k to 8k kernel
stacks.  Starting since 6538b8ea88 ("x86_64: expand kernel stack to
16K") (3.16) we use THREAD_SIZE_ORDER = 2 and that halved the auto-tuned
value.

In the particular case

  3.12
  kernel.threads-max = 515561

  4.4
  kernel.threads-max = 200000

Neither of the two values is really insane on 32GB machine.

I am not sure we want/need to tune the max_thread value further.  If
anything the tuning should be removed altogether if proven not useful in
general.  But we definitely need a way to override this auto-tuning.

Link: http://lkml.kernel.org/r/20190922065801.GB18814@dhcp22.suse.cz
Fixes: 16db3d3f11 ("kernel/sysctl.c: threads-max observe limits")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07 15:47:19 -07:00
Will Deacon
20bb759a66 panic: ensure preemption is disabled during panic()
Calling 'panic()' on a kernel with CONFIG_PREEMPT=y can leave the
calling CPU in an infinite loop, but with interrupts and preemption
enabled.  From this state, userspace can continue to be scheduled,
despite the system being "dead" as far as the kernel is concerned.

This is easily reproducible on arm64 when booting with "nosmp" on the
command line; a couple of shell scripts print out a periodic "Ping"
message whilst another triggers a crash by writing to
/proc/sysrq-trigger:

  | sysrq: Trigger a crash
  | Kernel panic - not syncing: sysrq triggered crash
  | CPU: 0 PID: 1 Comm: init Not tainted 5.2.15 #1
  | Hardware name: linux,dummy-virt (DT)
  | Call trace:
  |  dump_backtrace+0x0/0x148
  |  show_stack+0x14/0x20
  |  dump_stack+0xa0/0xc4
  |  panic+0x140/0x32c
  |  sysrq_handle_reboot+0x0/0x20
  |  __handle_sysrq+0x124/0x190
  |  write_sysrq_trigger+0x64/0x88
  |  proc_reg_write+0x60/0xa8
  |  __vfs_write+0x18/0x40
  |  vfs_write+0xa4/0x1b8
  |  ksys_write+0x64/0xf0
  |  __arm64_sys_write+0x14/0x20
  |  el0_svc_common.constprop.0+0xb0/0x168
  |  el0_svc_handler+0x28/0x78
  |  el0_svc+0x8/0xc
  | Kernel Offset: disabled
  | CPU features: 0x0002,24002004
  | Memory Limit: none
  | ---[ end Kernel panic - not syncing: sysrq triggered crash ]---
  |  Ping 2!
  |  Ping 1!
  |  Ping 1!
  |  Ping 2!

The issue can also be triggered on x86 kernels if CONFIG_SMP=n,
otherwise local interrupts are disabled in 'smp_send_stop()'.

Disable preemption in 'panic()' before re-enabling interrupts.

Link: http://lkml.kernel.org/r/20191002123538.22609-1-will@kernel.org
Link: https://lore.kernel.org/r/BX1W47JXPMR8.58IYW53H6M5N@dragonstone
Signed-off-by: Will Deacon <will@kernel.org>
Reported-by: Xogium <contact@xogium.me>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07 15:47:19 -07:00
Alexander Shishkin
f733c6b508 perf/core: Fix inheritance of aux_output groups
Commit:

  ab43762ef0 ("perf: Allow normal events to output AUX data")

forgets to configure aux_output relation in the inherited groups, which
results in child PEBS events forever failing to schedule.

Fix this by setting up the AUX output link in the inheritance path.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191004125729.32397-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-07 16:50:42 +02:00
Michal Koutný
9a3284fad4 cgroup: Optimize single thread migration
There are reports of users who use thread migrations between cgroups and
they report performance drop after d59cfc09c3 ("sched, cgroup: replace
signal_struct->group_rwsem with a global percpu_rwsem"). The effect is
pronounced on machines with more CPUs.

The migration is affected by forking noise happening in the background,
after the mentioned commit a migrating thread must wait for all
(forking) processes on the system, not only of its threadgroup.

There are several places that need to synchronize with migration:
	a) do_exit,
	b) de_thread,
	c) copy_process,
	d) cgroup_update_dfl_csses,
	e) parallel migration (cgroup_{proc,thread}s_write).

In the case of self-migrating thread, we relax the synchronization on
cgroup_threadgroup_rwsem to avoid the cost of waiting. d) and e) are
excluded with cgroup_mutex, c) does not matter in case of single thread
migration and the executing thread cannot exec(2) or exit(2) while it is
writing into cgroup.threads. In case of do_exit because of signal
delivery, we either exit before the migration or finish the migration
(of not yet PF_EXITING thread) and die afterwards.

This patch handles only the case of self-migration by writing "0" into
cgroup.threads. For simplicity, we always take cgroup_threadgroup_rwsem
with numeric PIDs.

This change improves migration dependent workload performance similar
to per-signal_struct state.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-10-07 07:11:53 -07:00
Michal Koutný
e7c7b1d85d cgroup: Update comments about task exit path
We no longer take cgroup_mutex in cgroup_exit and the exiting tasks are
not moved to init_css_set, reflect that in several comments to prevent
confusion.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-10-07 07:11:53 -07:00
Miaohe Lin
61e867fde2 cgroup: short-circuit current_cgns_cgroup_from_root() on the default hierarchy
Like commit 13d82fb77a ("cgroup: short-circuit cset_cgroup_from_root() on
the default hierarchy"), short-circuit current_cgns_cgroup_from_root() on
the default hierarchy.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-10-07 07:01:09 -07:00
Linus Torvalds
7cdb85df60 dma-mapping regression fix for 5.4-rc2
- revert an incorret hunk from a patch that caused problems
    on various arm boards (Andrey Smirnov)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl2aFRoLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYMBeBAAuTsOh1amMUdsAJN67PJcHP8JkOlR21cjLVaKkvWh
 l5XnXITtlNvyzXH67jZuQL15+rQ/kTOkmSc5bIDZW7+sTW2Rwnq6bIQOZpBYKlol
 U/UTBtk26SliKoinlJekKWAA6o32PJU2oLOsTmqoCqH5k0aeKdNHAFPSw4fU3jbW
 U4Sv0uc6MI+PM9OM3H/T60qQPvziOkeDp4wAZZ5AO/kUbNgzUrbGatNk26QqgNbs
 NsAVQ3X/sgUAwXMtivo9nFUd2fuEIf9GueGVohGiW+2znWQ8AxY76/FgxzXICmMi
 S0YLqPrdlzzZ0K7k8enPvJo2hd0qh3yFtWyGx9fUt+EBXepp/hMTIRAEVUHpiiSg
 PDTU74TVtXwSYvIQA6jR1bwh9+aMyeDWDFzUwFQh34mahAsZsBKhNLAcpN2uNGv7
 XLL3Lqi58eIhaSaqxM4ASIsBS+FIiQiOdqq4eLVx+x6wxjNDTyZ+ynbWdNs8+SYh
 MIyjY3wibMwaXtFUBV6LgYtwBF/1pgFcu9jWz02HT7Od0c+Et04ihcXISH+w9fpB
 O5WFjo0Oag2HoNm1ODOlLu5DY9CSQftrv4zl0yTQgb1vFB3fPdtv43wIQ8SkVhVu
 kwuF4kgIAyRRoe7HCPK/FJjKiYCo6Y+3WZ/4X7ktddCpxjaVYfclv8pMotirCQPU
 SSo=
 =kS6W
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.4-1' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping regression fix from Christoph Hellwig:
 "Revert an incorret hunk from a patch that caused problems on various
  arm boards (Andrey Smirnov)"

* tag 'dma-mapping-5.4-1' of git://git.infradead.org/users/hch/dma-mapping:
  dma-mapping: fix false positive warnings in dma_common_free_remap()
2019-10-06 11:10:15 -07:00
Mika Westerberg
0e48f51cbb Revert "libata, freezer: avoid block device removal while system is frozen"
This reverts commit 85fbd722ad.

The commit was added as a quick band-aid for a hang that happened when a
block device was removed during system suspend. Now that bdi_wq is not
freezable anymore the hang should not be possible and we can get rid of
this hack by reverting it.

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-06 09:11:37 -06:00
Linus Torvalds
2d00aee21a Kbuild fixes for v5.4
- remove unneeded ar-option and KBUILD_ARFLAGS
 
  - remove long-deprecated SUBDIRS
 
  - fix modpost to suppress false-positive warnings for UML builds
 
  - fix namespace.pl to handle relative paths to ${objtree}, ${srctree}
 
  - make setlocalversion work for /bin/sh
 
  - make header archive reproducible
 
  - fix some Makefiles and documents
 -----BEGIN PGP SIGNATURE-----
 
 iQJSBAABCgA8FiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl2YPUEeHHlhbWFkYS5t
 YXNhaGlyb0Bzb2Npb25leHQuY29tAAoJED2LAQed4NsGVu4P/3Qv7Ov3/R4BlgYb
 +LaKupCY/ADE5bRAv/N5AAy37+TJmTLQswN2/JwHflYvTeWd4kZjquFpJkFNwMsk
 Qlb79NQvyM9+NlFfSFjap8HBNb0J8A+92aKmrHmh1sQqJJs6JPZ0MOGoAXmgsJaN
 SPLvhqophKpmYu7Oa0x2aC2kq+1DnCQyMLTOuVCdrtF0tF8w0hiowDz5GOmOi1U6
 VK2ECfzjenFkfbqZOUVBPVfPR9hMpmVBdKdFLwD/iTKVkShZcWmdbxk/ADbemyet
 2njehRF2HGp7opbwM4AxIeIubCqYSkThUpLJarKWk/8W87gksH6pCR8yIq1nOwkO
 l+/GY2YdvkBdDCkSKpLiQxtJaqnZb8Yv1ZPvCfGF09Ba8tFtwX+HSecSkLFHGyJv
 K9FD0XSOFBkQesZWdpIr/xeLwwiuSH80QACrub1Z5Q4OCURaBkKwrO/eDG1/2Xle
 YKGZO2va2VVkeo5bisOZ2vfISwZrtiaGakQ8vTdq/5RO59/JvQjsGB8KbccaKXAu
 Ozk8vVqkwTmLP6gzIEd2Wr/ICNGuAVc0EELT7lj07hcd6rzsCxPWVXqTFsCkGBJe
 587i1jeH1z9oyrHUcP6dhR3joIuOUuUJk1uR7YZq4L4POSvrJnvzMFkSv6tBKL2p
 Uq9qD7mpt/9zl3PART7HK9KYfTGJ
 =fSXc
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-fixes-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

 - remove unneeded ar-option and KBUILD_ARFLAGS

 - remove long-deprecated SUBDIRS

 - fix modpost to suppress false-positive warnings for UML builds

 - fix namespace.pl to handle relative paths to ${objtree}, ${srctree}

 - make setlocalversion work for /bin/sh

 - make header archive reproducible

 - fix some Makefiles and documents

* tag 'kbuild-fixes-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kheaders: make headers archive reproducible
  kbuild: update compile-test header list for v5.4-rc2
  kbuild: two minor updates for Documentation/kbuild/modules.rst
  scripts/setlocalversion: clear local variable to make it work for sh
  namespace: fix namespace.pl script to support relative paths
  video/logo: do not generate unneeded logo C files
  video/logo: remove unneeded *.o pattern from clean-files
  integrity: remove pointless subdir-$(CONFIG_...)
  integrity: remove unneeded, broken attempt to add -fshort-wchar
  modpost: fix static EXPORT_SYMBOL warnings for UML build
  kbuild: correct formatting of header in kbuild module docs
  kbuild: remove SUBDIRS support
  kbuild: remove ar-option and KBUILD_ARFLAGS
2019-10-05 12:56:59 -07:00
Wolfgang M. Reimer
67d64918a1 locking: locktorture: Do not include rwlock.h directly
Including rwlock.h directly will cause kernel builds to fail
if CONFIG_PREEMPT_RT is defined. The correct header file
(rwlock_rt.h OR rwlock.h) will be included by spinlock.h which
is included by locktorture.c anyway.

Remove the include of linux/rwlock.h.

Signed-off-by: Wolfgang M. Reimer <linuxball@gmail.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05 11:50:24 -07:00
Paul E. McKenney
fbbd5e358c rcutorture: Make in-kernel-loop testing more brutal
The rcu_torture_fwd_prog_nr() tests the ability of RCU to tolerate
in-kernel busy loops.  It invokes rcu_torture_fwd_prog_cond_resched()
within its delay loop, which, in PREEMPT && NO_HZ_FULL kernels results
in the occasional direct call to schedule().  Now, this direct call to
schedule() is appropriate for call_rcu() flood testing, in which either
the kernel should restrain itself or userspace transitions will supply
the needed restraint.  But in pure in-kernel loops, the occasional
cond_resched() should do the job.

This commit therefore makes rcu_torture_fwd_prog_nr() use cond_resched()
instead of rcu_torture_fwd_prog_cond_resched() in order to increase the
brutality of this aspect of rcutorture testing.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05 11:50:18 -07:00
Paul E. McKenney
8b5ddf8b99 rcutorture: Separate warnings for each failure type
Currently, each of six different types of failure triggers a
single WARN_ON_ONCE(), and it is then necessary to stare at the
rcu_torture_stats(), Reader Pipe, and Reader Batch lines looking for
inappropriately non-zero values.  This can be annoying and error-prone,
so this commit provides a separate WARN_ON_ONCE() for each of the
six error conditions and adds short comments to each to ease error
identification.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05 11:50:03 -07:00
Ethan Hansen
b3ffb206dd rcu: Remove unused variable rcu_perf_writer_state
The variable rcu_perf_writer_state is declared and initialized,
but is never actually referenced. Remove it to clean code.

Signed-off-by: Ethan Hansen <1ethanhansen@gmail.com>
[ paulmck: Also removed unused macros assigned to that variable. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05 11:49:36 -07:00
Chuhong Yuan
c5d3c8ca22 locktorture: Replace strncmp() with str_has_prefix()
The strncmp() function is error-prone because it is easy to get the
length wrong, especially if the string is subject to change, especially
given the need to account for the terminating nul byte.  This commit
therefore substitutes the newly introduced str_has_prefix(), which
does not require a separately specified length.

Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05 11:48:38 -07:00
Ethan Hansen
ac5f636130 rcu: Remove unused function rcutorture_record_progress()
The function rcutorture_record_progress() is declared in rcu.h, but is
never used.  This commit therefore removes rcutorture_record_progress()
to clean code.

Signed-off-by: Ethan Hansen <1ethanhansen@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05 11:48:13 -07:00
Paul E. McKenney
79ba7ff5a9 rcutorture: Emulate dyntick aspect of userspace nohz_full sojourn
During an actual call_rcu() flood, there would be frequent trips to
userspace (in-kernel call_rcu() floods must be otherwise housebroken).
Userspace execution on nohz_full CPUs implies an RCU dyntick idle/not-idle
transition pair, so this commit adds emulation of that pair.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05 10:46:05 -07:00
Paul E. McKenney
96926686de rcu: Make CPU-hotplug removal operations enable tick
CPU-hotplug removal operations run the multi_cpu_stop() function, which
relies on the scheduler to gain control from whatever is running on the
various online CPUs, including any nohz_full CPUs running long loops in
kernel-mode code.  Lack of the scheduler-clock interrupt on such CPUs
can delay multi_cpu_stop() for several minutes and can also result in
RCU CPU stall warnings.  This commit therefore causes CPU-hotplug removal
operations to enable the scheduler-clock interrupt on all online CPUs.

[ paulmck: Apply Joel Fernandes TICK_DEP_MASK_RCU->TICK_DEP_BIT_RCU fix. ]
[ paulmck: Apply simplifications suggested by Frederic Weisbecker. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05 10:46:05 -07:00
Paul E. McKenney
366237e7b0 stop_machine: Provide RCU quiescent state in multi_cpu_stop()
When multi_cpu_stop() loops waiting for other tasks, it can trigger an RCU
CPU stall warning.  This can be misleading because what is instead needed
is information on whatever task is blocking multi_cpu_stop().  This commit
therefore inserts an RCU quiescent state into the multi_cpu_stop()
function's waitloop.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05 10:46:05 -07:00
Paul E. McKenney
d38e6dc6ed rcutorture: Force on tick for readers and callback flooders
Readers and callback flooders in the rcutorture stress-test suite run for
extended time periods by design.  They do take pains to relinquish the
CPU from time to time, but in some cases this relies on the scheduler
being active, which in turn relies on the scheduler-clock interrupt
firing from time to time.

This commit therefore forces scheduling-clock interrupts within
these loops.  While in the area, this commit also prevents
rcu_torture_reader()'s occasional timed sleeps from delaying shutdown.

[ paulmck: Apply Joel Fernandes TICK_DEP_MASK_RCU->TICK_DEP_BIT_RCU fix. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05 10:46:04 -07:00
Paul E. McKenney
6a949b7af8 rcu: Force on tick when invoking lots of callbacks
Callback invocation can run for a significant time period, and within
CONFIG_NO_HZ_FULL=y kernels, this period will be devoid of scheduler-clock
interrupts.  In-kernel execution without such interrupts can cause all
manner of malfunction, with RCU CPU stall warnings being but one result.

This commit therefore forces scheduling-clock interrupts on whenever more
than a few RCU callbacks are invoked.  Because offloaded callback invocation
can be preempted, this forcing is withdrawn on each context switch.  This
in turn requires that the loop invoking RCU callbacks reiterate the forcing
periodically.

[ paulmck: Apply Joel Fernandes TICK_DEP_MASK_RCU->TICK_DEP_BIT_RCU fix. ]
[ paulmck: Remove NO_HZ_FULL check per Frederic Weisbecker feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05 10:46:04 -07:00
Paul E. McKenney
ae9e557b5b time: Export tick start/stop functions for rcutorture
It turns out that rcutorture needs to ensure that the scheduling-clock
interrupt is enabled in CONFIG_NO_HZ_FULL kernels before starting on
CPU-bound in-kernel processing.  This commit therefore exports
tick_nohz_dep_set_task(), tick_nohz_dep_clear_task(), and
tick_nohz_full_setup() to GPL kernel modules.

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05 10:46:03 -07:00
Frederic Weisbecker
01b4c39901 nohz: Add TICK_DEP_BIT_RCU
If a nohz_full CPU is looping in the kernel, the scheduling-clock tick
might nevertheless remain disabled.  In !PREEMPT kernels, this can
prevent RCU's attempts to enlist the aid of that CPU's executions of
cond_resched(), which can in turn result in an arbitrarily delayed grace
period and thus an OOM.  RCU therefore needs a way to enable a holdout
nohz_full CPU's scheduler-clock interrupt.

This commit therefore provides a new TICK_DEP_BIT_RCU value which RCU can
pass to tick_dep_set_cpu() and friends to force on the scheduler-clock
interrupt for a specified CPU or task.  In some cases, rcutorture needs
to turn on the scheduler-clock tick, so this commit also exports the
relevant symbols to GPL-licensed modules.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05 10:45:16 -07:00
Andrey Smirnov
2cf2aa6a69 dma-mapping: fix false positivse warnings in dma_common_free_remap()
Commit 5cf4537975 ("dma-mapping: introduce a dma_common_find_pages
helper") changed invalid input check in dma_common_free_remap() from:

    if (!area || !area->flags != VM_DMA_COHERENT)

to

    if (!area || !area->flags != VM_DMA_COHERENT || !area->pages)

which seem to produce false positives for memory obtained via
dma_common_contiguous_remap()

This triggers the following warning message when doing "reboot" on ZII
VF610 Dev Board Rev B:

WARNING: CPU: 0 PID: 1 at kernel/dma/remap.c:112 dma_common_free_remap+0x88/0x8c
trying to free invalid coherent area: 9ef82980
Modules linked in:
CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 5.3.0-rc6-next-20190820 #119
Hardware name: Freescale Vybrid VF5xx/VF6xx (Device Tree)
Backtrace:
[<8010d1ec>] (dump_backtrace) from [<8010d588>] (show_stack+0x20/0x24)
 r7:8015ed78 r6:00000009 r5:00000000 r4:9f4d9b14
[<8010d568>] (show_stack) from [<8077e3f0>] (dump_stack+0x24/0x28)
[<8077e3cc>] (dump_stack) from [<801197a0>] (__warn.part.3+0xcc/0xe4)
[<801196d4>] (__warn.part.3) from [<80119830>] (warn_slowpath_fmt+0x78/0x94)
 r6:00000070 r5:808e540c r4:81c03048
[<801197bc>] (warn_slowpath_fmt) from [<8015ed78>] (dma_common_free_remap+0x88/0x8c)
 r3:9ef82980 r2:808e53e0
 r7:00001000 r6:a0b1e000 r5:a0b1e000 r4:00001000
[<8015ecf0>] (dma_common_free_remap) from [<8010fa9c>] (remap_allocator_free+0x60/0x68)
 r5:81c03048 r4:9f4d9b78
[<8010fa3c>] (remap_allocator_free) from [<801100d0>] (__arm_dma_free.constprop.3+0xf8/0x148)
 r5:81c03048 r4:9ef82900
[<8010ffd8>] (__arm_dma_free.constprop.3) from [<80110144>] (arm_dma_free+0x24/0x2c)
 r5:9f563410 r4:80110120
[<80110120>] (arm_dma_free) from [<8015d80c>] (dma_free_attrs+0xa0/0xdc)
[<8015d76c>] (dma_free_attrs) from [<8020f3e4>] (dma_pool_destroy+0xc0/0x154)
 r8:9efa8860 r7:808f02f0 r6:808f02d0 r5:9ef82880 r4:9ef82780
[<8020f324>] (dma_pool_destroy) from [<805525d0>] (ehci_mem_cleanup+0x6c/0x150)
 r7:9f563410 r6:9efa8810 r5:00000000 r4:9efd0148
[<80552564>] (ehci_mem_cleanup) from [<80558e0c>] (ehci_stop+0xac/0xc0)
 r5:9efd0148 r4:9efd0000
[<80558d60>] (ehci_stop) from [<8053c4bc>] (usb_remove_hcd+0xf4/0x1b0)
 r7:9f563410 r6:9efd0074 r5:81c03048 r4:9efd0000
[<8053c3c8>] (usb_remove_hcd) from [<8056361c>] (host_stop+0x48/0xb8)
 r7:9f563410 r6:9efd0000 r5:9f5f4040 r4:9f5f5040
[<805635d4>] (host_stop) from [<80563d0c>] (ci_hdrc_host_destroy+0x34/0x38)
 r7:9f563410 r6:9f5f5040 r5:9efa8800 r4:9f5f4040
[<80563cd8>] (ci_hdrc_host_destroy) from [<8055ef18>] (ci_hdrc_remove+0x50/0x10c)
[<8055eec8>] (ci_hdrc_remove) from [<804a2ed8>] (platform_drv_remove+0x34/0x4c)
 r7:9f563410 r6:81c4f99c r5:9efa8810 r4:9efa8810
[<804a2ea4>] (platform_drv_remove) from [<804a18a8>] (device_release_driver_internal+0xec/0x19c)
 r5:00000000 r4:9efa8810
[<804a17bc>] (device_release_driver_internal) from [<804a1978>] (device_release_driver+0x20/0x24)
 r7:9f563410 r6:81c41ed0 r5:9efa8810 r4:9f4a1dac
[<804a1958>] (device_release_driver) from [<804a01b8>] (bus_remove_device+0xdc/0x108)
[<804a00dc>] (bus_remove_device) from [<8049c204>] (device_del+0x150/0x36c)
 r7:9f563410 r6:81c03048 r5:9efa8854 r4:9efa8810
[<8049c0b4>] (device_del) from [<804a3368>] (platform_device_del.part.2+0x20/0x84)
 r10:9f563414 r9:809177e0 r8:81cb07dc r7:81c78320 r6:9f563454 r5:9efa8800
 r4:9efa8800
[<804a3348>] (platform_device_del.part.2) from [<804a3420>] (platform_device_unregister+0x28/0x34)
 r5:9f563400 r4:9efa8800
[<804a33f8>] (platform_device_unregister) from [<8055dce0>] (ci_hdrc_remove_device+0x1c/0x30)
 r5:9f563400 r4:00000001
[<8055dcc4>] (ci_hdrc_remove_device) from [<805652ac>] (ci_hdrc_imx_remove+0x38/0x118)
 r7:81c78320 r6:9f563454 r5:9f563410 r4:9f541010
[<8056538c>] (ci_hdrc_imx_shutdown) from [<804a2970>] (platform_drv_shutdown+0x2c/0x30)
[<804a2944>] (platform_drv_shutdown) from [<8049e4fc>] (device_shutdown+0x158/0x1f0)
[<8049e3a4>] (device_shutdown) from [<8013ac80>] (kernel_restart_prepare+0x44/0x48)
 r10:00000058 r9:9f4d8000 r8:fee1dead r7:379ce700 r6:81c0b280 r5:81c03048
 r4:00000000
[<8013ac3c>] (kernel_restart_prepare) from [<8013ad14>] (kernel_restart+0x1c/0x60)
[<8013acf8>] (kernel_restart) from [<8013af84>] (__do_sys_reboot+0xe0/0x1d8)
 r5:81c03048 r4:00000000
[<8013aea4>] (__do_sys_reboot) from [<8013b0ec>] (sys_reboot+0x18/0x1c)
 r8:80101204 r7:00000058 r6:00000000 r5:00000000 r4:00000000
[<8013b0d4>] (sys_reboot) from [<80101000>] (ret_fast_syscall+0x0/0x54)
Exception stack(0x9f4d9fa8 to 0x9f4d9ff0)
9fa0:                   00000000 00000000 fee1dead 28121969 01234567 379ce700
9fc0: 00000000 00000000 00000000 00000058 00000000 00000000 00000000 00016d04
9fe0: 00028e0c 7ec87c64 000135ec 76c1f410

Restore original invalid input check in dma_common_free_remap() to
avoid this problem.

Fixes: 5cf4537975 ("dma-mapping: introduce a dma_common_find_pages helper")
Signed-off-by: Andrey Smirnov <andrew.smirnov@gmail.com>
[hch: just revert the offending hunk instead of creating a new helper]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-10-05 10:24:17 +02:00
Dmitry Goldin
86cdd2fdc4 kheaders: make headers archive reproducible
In commit 43d8ce9d65 ("Provide in-kernel headers to make
extending kernel easier") a new mechanism was introduced, for kernels
>=5.2, which embeds the kernel headers in the kernel image or a module
and exposes them in procfs for use by userland tools.

The archive containing the header files has nondeterminism caused by
header files metadata. This patch normalizes the metadata and utilizes
KBUILD_BUILD_TIMESTAMP if provided and otherwise falls back to the
default behaviour.

In commit f7b101d330 ("kheaders: Move from proc to sysfs") it was
modified to use sysfs and the script for generation of the archive was
renamed to what is being patched.

Signed-off-by: Dmitry Goldin <dgoldin+lkml@protonmail.ch>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-10-05 15:29:49 +09:00
Linus Torvalds
e524d16e7e copy-struct-from-user-v5.4-rc2
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXZZIgQAKCRCRxhvAZXjc
 orNOAP98B2nmoxvq8d5Z6PhoyTBC5NIUuJ5h2YMwcX/hAaj5uQEA58NTKtPmOPDR
 2ffUFFerGZ2+brlHgACa0ZKdH27TjAA=
 =QryD
 -----END PGP SIGNATURE-----

Merge tag 'copy-struct-from-user-v5.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull copy_struct_from_user() helper from Christian Brauner:
 "This contains the copy_struct_from_user() helper which got split out
  from the openat2() patchset. It is a generic interface designed to
  copy a struct from userspace.

  The helper will be especially useful for structs versioned by size of
  which we have quite a few. This allows for backwards compatibility,
  i.e. an extended struct can be passed to an older kernel, or a legacy
  struct can be passed to a newer kernel. For the first case (extended
  struct, older kernel) the new fields in an extended struct can be set
  to zero and the struct safely passed to an older kernel.

  The most obvious benefit is that this helper lets us get rid of
  duplicate code present in at least sched_setattr(), perf_event_open(),
  and clone3(). More importantly it will also help to ensure that users
  implementing versioning-by-size end up with the same core semantics.

  This point is especially crucial since we have at least one case where
  versioning-by-size is used but with slighly different semantics:
  sched_setattr(), perf_event_open(), and clone3() all do do similar
  checks to copy_struct_from_user() while rt_sigprocmask(2) always
  rejects differently-sized struct arguments.

  With this pull request we also switch over sched_setattr(),
  perf_event_open(), and clone3() to use the new helper"

* tag 'copy-struct-from-user-v5.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  usercopy: Add parentheses around assignment in test_copy_struct_from_user
  perf_event_open: switch to copy_struct_from_user()
  sched_setattr: switch to copy_struct_from_user()
  clone3: switch to copy_struct_from_user()
  lib: introduce copy_struct_from_user() helper
2019-10-04 10:36:31 -07:00
Tejun Heo
e66b39af00 workqueue: Fix pwq ref leak in rescuer_thread()
008847f66c ("workqueue: allow rescuer thread to do more work.") made
the rescuer worker requeue the pwq immediately if there may be more
work items which need rescuing instead of waiting for the next mayday
timer expiration.  Unfortunately, it doesn't check whether the pwq is
already on the mayday list and unconditionally gets the ref and moves
it onto the list.  This doesn't corrupt the list but creates an
additional reference to the pwq.  It got queued twice but will only be
removed once.

This leak later can trigger pwq refcnt warning on workqueue
destruction and prevent freeing of the workqueue.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Williams, Gerald S" <gerald.s.williams@intel.com>
Cc: NeilBrown <neilb@suse.de>
Cc: stable@vger.kernel.org # v3.19+
2019-10-04 10:23:11 -07:00
Tejun Heo
c29eb85386 workqueue: more destroy_workqueue() fixes
destroy_workqueue() warnings still, at a lower frequency, trigger
spuriously.  The problem seems to be in-flight operations which
haven't reached put_pwq() yet.

* Make sanity check grab all the related locks so that it's
  synchronized against operations which puts pwq at the end.

* Always print out the offending pwq.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Williams, Gerald S" <gerald.s.williams@intel.com>
2019-10-04 10:23:01 -07:00
Linus Torvalds
af0622f6ae for-linus-20191003
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXZZKNgAKCRCRxhvAZXjc
 otfIAPsHUZn+Wfa/8uftNDJ6RLDXDsq6l8xiQTkz+k4YdnDj2AD/aIPjrM950jrS
 W7+8R7CSSQOLmIif6R+S0A1fyFoVlQA=
 =HVz0
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20191003' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull clone3/pidfd fixes from Christian Brauner:
 "This contains a couple of fixes:

   - Fix pidfd selftest compilation (Shuah Kahn)

     Due to a false linking instruction in the Makefile compilation for
     the pidfd selftests would fail on some systems.

   - Fix compilation for glibc on RISC-V systems (Seth Forshee)

     In some scenarios linux/uapi/linux/sched.h is included where
     __ASSEMBLY__ is defined causing a build failure because struct
     clone_args was not guarded by an #ifndef __ASSEMBLY__.

   - Add missing clone3() and struct clone_args kernel-doc (Christian Brauner)

     clone3() and struct clone_args were missing kernel-docs. (The goal
     is to use kernel-doc for any function or type where it's worth it.)
     For struct clone_args this also contains a comment about the fact
     that it's versioned by size"

* tag 'for-linus-20191003' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  sched: add kernel-doc for struct clone_args
  fork: add kernel-doc for clone3
  selftests: pidfd: Fix undefined reference to pthread_create()
  sched: Add __ASSEMBLY__ guards around struct clone_args
2019-10-04 10:18:56 -07:00
Christian Brauner
501bd0166e
fork: add kernel-doc for clone3
Add kernel-doc for the clone3() syscall.

Link: https://lore.kernel.org/r/20191001114701.24661-2-christian.brauner@ubuntu.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2019-10-03 21:18:06 +02:00
Kees Cook
245d73698e audit: Report suspicious O_CREAT usage
This renames the very specific audit_log_link_denied() to
audit_log_path_denied() and adds the AUDIT_* type as an argument. This
allows for the creation of the new AUDIT_ANOM_CREAT that can be used to
report the fifo/regular file creation restrictions that were introduced
in commit 30aba6656f ("namei: allow restricted O_CREAT of FIFOs and
regular files").

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-10-03 13:59:29 -04:00
Linus Torvalds
5021b9182e Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar:
 "Fix a broadcast-timer handling race that can result in spuriously and
  indefinitely delayed hrtimers and even RCU stalls if the system is
  otherwise quiet"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick: broadcast-hrtimer: Fix a race in bc_set_next
2019-10-02 15:54:19 -07:00
Peter Zijlstra
73956fc07d membarrier: Fix RCU locking bug caused by faulty merge
The following commit:

  227a4aadc7 ("sched/membarrier: Fix p->mm->membarrier_state racy load")

got fat fingered by me when merging it with other patches. It meant to move
the RCU section out of the for loop but ended up doing it partially, leaving
a superfluous rcu_read_lock() inside, causing havok.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-tip-commits@vger.kernel.org
Fixes: 227a4aadc7 ("sched/membarrier: Fix p->mm->membarrier_state racy load")
Link: https://lkml.kernel.org/r/20191001085033.GP4519@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-01 21:27:50 +02:00
Aleksa Sarai
c2ba8f41ad perf_event_open: switch to copy_struct_from_user()
Switch perf_event_open() syscall from it's own copying
struct perf_event_attr from userspace to the new dedicated
copy_struct_from_user() helper.

The change is very straightforward, and helps unify the syscall
interface for struct-from-userspace syscalls.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
[christian.brauner@ubuntu.com: improve commit message]
Link: https://lore.kernel.org/r/20191001011055.19283-5-cyphar@cyphar.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2019-10-01 15:45:22 +02:00
Aleksa Sarai
dff3a85fec sched_setattr: switch to copy_struct_from_user()
Switch sched_setattr() syscall from it's own copying struct sched_attr
from userspace to the new dedicated copy_struct_from_user() helper.

The change is very straightforward, and helps unify the syscall
interface for struct-from-userspace syscalls. Ideally we could also
unify sched_getattr(2)-style syscalls as well, but unfortunately the
correct semantics for such syscalls are much less clear (see [1] for
more detail). In future we could come up with a more sane idea for how
the syscall interface should look.

[1]: commit 1251201c0d ("sched/core: Fix uclamp ABI bug, clean up and
     robustify sched_read_attr() ABI logic and code")

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
[christian.brauner@ubuntu.com: improve commit message]
Link: https://lore.kernel.org/r/20191001011055.19283-4-cyphar@cyphar.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2019-10-01 15:45:17 +02:00
Aleksa Sarai
f14c234b4b clone3: switch to copy_struct_from_user()
Switch clone3() syscall from it's own copying struct clone_args from
userspace to the new dedicated copy_struct_from_user() helper.

The change is very straightforward, and helps unify the syscall
interface for struct-from-userspace syscalls. Additionally, explicitly
define CLONE_ARGS_SIZE_VER0 to match the other users of the
struct-extension pattern.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
[christian.brauner@ubuntu.com: improve commit message]
Link: https://lore.kernel.org/r/20191001011055.19283-3-cyphar@cyphar.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2019-10-01 15:45:10 +02:00
Iurii Zaikin
2cb80dbbba kernel/sysctl-test: Add null pointer test for sysctl.c:proc_dointvec()
KUnit tests for initialized data behavior of proc_dointvec that is
explicitly checked in the code. Includes basic parsing tests including
int min/max overflow.

Signed-off-by: Iurii Zaikin <yzaikin@google.com>
Signed-off-by: Brendan Higgins <brendanhiggins@google.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Stephen Boyd <sboyd@kernel.org>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2019-09-30 17:35:01 -06:00
Linus Torvalds
cf4f493b10 A few more tracing fixes:
- Fixed a buffer overflow by checking nr_args correctly in probes
 
  - Fixed a warning that is reported by clang
 
  - Fixed a possible memory leak in error path of filter processing
 
  - Fixed the selftest that checks for failures, but wasn't failing
 
  - Minor clean up on call site output of a memory trace event
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXZEP5hQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qhrSAQDlws8rY/vJN4tKL1YaBTRyS5OW+1B+
 LPLOxm9PBuzt0wEArVunv7iMgvRzp5spbmCqmD8Is2vSf+45KSrb10WU2wo=
 =L37R
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "A few more tracing fixes:

   - Fix a buffer overflow by checking nr_args correctly in probes

   - Fix a warning that is reported by clang

   - Fix a possible memory leak in error path of filter processing

   - Fix the selftest that checks for failures, but wasn't failing

   - Minor clean up on call site output of a memory trace event"

* tag 'trace-v5.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  selftests/ftrace: Fix same probe error test
  mm, tracing: Print symbol name for call_site in trace events
  tracing: Have error path in predicate_parse() free its allocated memory
  tracing: Fix clang -Wint-in-bool-context warnings in IF_ASSIGN macro
  tracing/probe: Fix to check the difference of nr_args before adding probe
2019-09-30 09:29:53 -07:00
Linus Torvalds
02dc96ef6c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Sanity check URB networking device parameters to avoid divide by
    zero, from Oliver Neukum.

 2) Disable global multicast filter in NCSI, otherwise LLDP and IPV6
    don't work properly. Longer term this needs a better fix tho. From
    Vijay Khemka.

 3) Small fixes to selftests (use ping when ping6 is not present, etc.)
    from David Ahern.

 4) Bring back rt_uses_gateway member of struct rtable, it's semantics
    were not well understood and trying to remove it broke things. From
    David Ahern.

 5) Move usbnet snaity checking, ignore endpoints with invalid
    wMaxPacketSize. From Bjørn Mork.

 6) Missing Kconfig deps for sja1105 driver, from Mao Wenan.

 7) Various small fixes to the mlx5 DR steering code, from Alaa Hleihel,
    Alex Vesker, and Yevgeny Kliteynik

 8) Missing CAP_NET_RAW checks in various places, from Ori Nimron.

 9) Fix crash when removing sch_cbs entry while offloading is enabled,
    from Vinicius Costa Gomes.

10) Signedness bug fixes, generally in looking at the result given by
    of_get_phy_mode() and friends. From Dan Crapenter.

11) Disable preemption around BPF_PROG_RUN() calls, from Eric Dumazet.

12) Don't create VRF ipv6 rules if ipv6 is disabled, from David Ahern.

13) Fix quantization code in tcp_bbr, from Kevin Yang.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (127 commits)
  net: tap: clean up an indentation issue
  nfp: abm: fix memory leak in nfp_abm_u32_knode_replace
  tcp: better handle TCP_USER_TIMEOUT in SYN_SENT state
  sk_buff: drop all skb extensions on free and skb scrubbing
  tcp_bbr: fix quantization code to not raise cwnd if not probing bandwidth
  mlxsw: spectrum_flower: Fail in case user specifies multiple mirror actions
  Documentation: Clarify trap's description
  mlxsw: spectrum: Clear VLAN filters during port initialization
  net: ena: clean up indentation issue
  NFC: st95hf: clean up indentation issue
  net: phy: micrel: add Asym Pause workaround for KSZ9021
  net: socionext: ave: Avoid using netdev_err() before calling register_netdev()
  ptp: correctly disable flags on old ioctls
  lib: dimlib: fix help text typos
  net: dsa: microchip: Always set regmap stride to 1
  nfp: flower: fix memory leak in nfp_flower_spawn_vnic_reprs
  nfp: flower: prevent memory leak in nfp_flower_spawn_phy_reprs
  net/sched: Set default of CONFIG_NET_TC_SKB_EXT to N
  vrf: Do not attempt to create IPv6 mcast rule if IPv6 is disabled
  net: sched: sch_sfb: don't call qdisc_put() while holding tree lock
  ...
2019-09-28 17:47:33 -07:00
Navid Emamdoost
96c5c6e6a5 tracing: Have error path in predicate_parse() free its allocated memory
In predicate_parse, there is an error path that is not going to
out_free instead it returns directly which leads to a memory leak.

Link: http://lkml.kernel.org/r/20190920225800.3870-1-navid.emamdoost@gmail.com

Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-09-28 17:13:39 -04:00
Nathan Chancellor
968e517093 tracing: Fix clang -Wint-in-bool-context warnings in IF_ASSIGN macro
After r372664 in clang, the IF_ASSIGN macro causes a couple hundred
warnings along the lines of:

kernel/trace/trace_output.c:1331:2: warning: converting the enum
constant to a boolean [-Wint-in-bool-context]
kernel/trace/trace.h:409:3: note: expanded from macro
'trace_assign_type'
                IF_ASSIGN(var, ent, struct ftrace_graph_ret_entry,
                ^
kernel/trace/trace.h:371:14: note: expanded from macro 'IF_ASSIGN'
                WARN_ON(id && (entry)->type != id);     \
                           ^
264 warnings generated.

This warning can catch issues with constructs like:

    if (state == A || B)

where the developer really meant:

    if (state == A || state == B)

This is currently the only occurrence of the warning in the kernel
tree across defconfig, allyesconfig, allmodconfig for arm32, arm64,
and x86_64. Add the implicit '!= 0' to the WARN_ON statement to fix
the warnings and find potential issues in the future.

Link: 28b38c277a
Link: https://github.com/ClangBuiltLinux/linux/issues/686
Link: http://lkml.kernel.org/r/20190926162258.466321-1-natechancellor@gmail.com

Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-09-28 17:13:39 -04:00
Masami Hiramatsu
d2aea95a1a tracing/probe: Fix to check the difference of nr_args before adding probe
Steven reported that a test triggered:

==================================================================
 BUG: KASAN: slab-out-of-bounds in trace_kprobe_create+0xa9e/0xe40
 Read of size 8 at addr ffff8880c4f25a48 by task ftracetest/4798

 CPU: 2 PID: 4798 Comm: ftracetest Not tainted 5.3.0-rc6-test+ #30
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
 Call Trace:
  dump_stack+0x7c/0xc0
  ? trace_kprobe_create+0xa9e/0xe40
  print_address_description+0x6c/0x332
  ? trace_kprobe_create+0xa9e/0xe40
  ? trace_kprobe_create+0xa9e/0xe40
  __kasan_report.cold.6+0x1a/0x3b
  ? trace_kprobe_create+0xa9e/0xe40
  kasan_report+0xe/0x12
  trace_kprobe_create+0xa9e/0xe40
  ? print_kprobe_event+0x280/0x280
  ? match_held_lock+0x1b/0x240
  ? find_held_lock+0xac/0xd0
  ? fs_reclaim_release.part.112+0x5/0x20
  ? lock_downgrade+0x350/0x350
  ? kasan_unpoison_shadow+0x30/0x40
  ? __kasan_kmalloc.constprop.6+0xc1/0xd0
  ? trace_kprobe_create+0xe40/0xe40
  ? trace_kprobe_create+0xe40/0xe40
  create_or_delete_trace_kprobe+0x2e/0x60
  trace_run_command+0xc3/0xe0
  ? trace_panic_handler+0x20/0x20
  ? kasan_unpoison_shadow+0x30/0x40
  trace_parse_run_command+0xdc/0x163
  vfs_write+0xe1/0x240
  ksys_write+0xba/0x150
  ? __ia32_sys_read+0x50/0x50
  ? tracer_hardirqs_on+0x61/0x180
  ? trace_hardirqs_off_caller+0x43/0x110
  ? mark_held_locks+0x29/0xa0
  ? do_syscall_64+0x14/0x260
  do_syscall_64+0x68/0x260

Fix to check the difference of nr_args before adding probe
on existing probes. This also may set the error log index
bigger than the number of command parameters. In that case
it sets the error position is next to the last parameter.

Link: http://lkml.kernel.org/r/156966474783.3478.13217501608215769150.stgit@devnote2

Fixes: ca89bc071d ("tracing/kprobe: Add multi-probe per event support")
Reported-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-09-28 17:07:53 -04:00
Linus Torvalds
9c5efe9ae7 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:

 - Apply a number of membarrier related fixes and cleanups, which fixes
   a use-after-free race in the membarrier code

 - Introduce proper RCU protection for tasks on the runqueue - to get
   rid of the subtle task_rcu_dereference() interface that was easy to
   get wrong

 - Misc fixes, but also an EAS speedup

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Avoid redundant EAS calculation
  sched/core: Remove double update_max_interval() call on CPU startup
  sched/core: Fix preempt_schedule() interrupt return comment
  sched/fair: Fix -Wunused-but-set-variable warnings
  sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr()
  sched/membarrier: Return -ENOMEM to userspace on memory allocation failure
  sched/membarrier: Skip IPIs when mm->mm_users == 1
  selftests, sched/membarrier: Add multi-threaded test
  sched/membarrier: Fix p->mm->membarrier_state racy load
  sched/membarrier: Call sync_core only before usermode for same mm
  sched/membarrier: Remove redundant check
  sched/membarrier: Fix private expedited registration check
  tasks, sched/core: RCUify the assignment of rq->curr
  tasks, sched/core: With a grace period after finish_task_switch(), remove unnecessary code
  tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue
  tasks: Add a count of task RCU users
  sched/core: Convert vcpu_is_preempted() from macro to an inline function
  sched/fair: Remove unused cfs_rq_clock_task() function
2019-09-28 12:39:07 -07:00
Linus Torvalds
aefcf2f4b5 Merge branch 'next-lockdown' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull kernel lockdown mode from James Morris:
 "This is the latest iteration of the kernel lockdown patchset, from
  Matthew Garrett, David Howells and others.

  From the original description:

    This patchset introduces an optional kernel lockdown feature,
    intended to strengthen the boundary between UID 0 and the kernel.
    When enabled, various pieces of kernel functionality are restricted.
    Applications that rely on low-level access to either hardware or the
    kernel may cease working as a result - therefore this should not be
    enabled without appropriate evaluation beforehand.

    The majority of mainstream distributions have been carrying variants
    of this patchset for many years now, so there's value in providing a
    doesn't meet every distribution requirement, but gets us much closer
    to not requiring external patches.

  There are two major changes since this was last proposed for mainline:

   - Separating lockdown from EFI secure boot. Background discussion is
     covered here: https://lwn.net/Articles/751061/

   -  Implementation as an LSM, with a default stackable lockdown LSM
      module. This allows the lockdown feature to be policy-driven,
      rather than encoding an implicit policy within the mechanism.

  The new locked_down LSM hook is provided to allow LSMs to make a
  policy decision around whether kernel functionality that would allow
  tampering with or examining the runtime state of the kernel should be
  permitted.

  The included lockdown LSM provides an implementation with a simple
  policy intended for general purpose use. This policy provides a coarse
  level of granularity, controllable via the kernel command line:

    lockdown={integrity|confidentiality}

  Enable the kernel lockdown feature. If set to integrity, kernel features
  that allow userland to modify the running kernel are disabled. If set to
  confidentiality, kernel features that allow userland to extract
  confidential information from the kernel are also disabled.

  This may also be controlled via /sys/kernel/security/lockdown and
  overriden by kernel configuration.

  New or existing LSMs may implement finer-grained controls of the
  lockdown features. Refer to the lockdown_reason documentation in
  include/linux/security.h for details.

  The lockdown feature has had signficant design feedback and review
  across many subsystems. This code has been in linux-next for some
  weeks, with a few fixes applied along the way.

  Stephen Rothwell noted that commit 9d1f8be5cf ("bpf: Restrict bpf
  when kernel lockdown is in confidentiality mode") is missing a
  Signed-off-by from its author. Matthew responded that he is providing
  this under category (c) of the DCO"

* 'next-lockdown' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (31 commits)
  kexec: Fix file verification on S390
  security: constify some arrays in lockdown LSM
  lockdown: Print current->comm in restriction messages
  efi: Restrict efivar_ssdt_load when the kernel is locked down
  tracefs: Restrict tracefs when the kernel is locked down
  debugfs: Restrict debugfs when the kernel is locked down
  kexec: Allow kexec_file() with appropriate IMA policy when locked down
  lockdown: Lock down perf when in confidentiality mode
  bpf: Restrict bpf when kernel lockdown is in confidentiality mode
  lockdown: Lock down tracing and perf kprobes when in confidentiality mode
  lockdown: Lock down /proc/kcore
  x86/mmiotrace: Lock down the testmmiotrace module
  lockdown: Lock down module params that specify hardware parameters (eg. ioport)
  lockdown: Lock down TIOCSSERIAL
  lockdown: Prohibit PCMCIA CIS storage when the kernel is locked down
  acpi: Disable ACPI table override if the kernel is locked down
  acpi: Ignore acpi_rsdp kernel param when the kernel has been locked down
  ACPI: Limit access to custom_method when the kernel is locked down
  x86/msr: Restrict MSR access when the kernel is locked down
  x86: Lock down IO port access when the kernel is locked down
  ...
2019-09-28 08:14:15 -07:00
Linus Torvalds
f1f2f614d5 Merge branch 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
Pull integrity updates from Mimi Zohar:
 "The major feature in this time is IMA support for measuring and
  appraising appended file signatures. In addition are a couple of bug
  fixes and code cleanup to use struct_size().

  In addition to the PE/COFF and IMA xattr signatures, the kexec kernel
  image may be signed with an appended signature, using the same
  scripts/sign-file tool that is used to sign kernel modules.

  Similarly, the initramfs may contain an appended signature.

  This contained a lot of refactoring of the existing appended signature
  verification code, so that IMA could retain the existing framework of
  calculating the file hash once, storing it in the IMA measurement list
  and extending the TPM, verifying the file's integrity based on a file
  hash or signature (eg. xattrs), and adding an audit record containing
  the file hash, all based on policy. (The IMA support for appended
  signatures patch set was posted and reviewed 11 times.)

  The support for appended signature paves the way for adding other
  signature verification methods, such as fs-verity, based on a single
  system-wide policy. The file hash used for verifying the signature and
  the signature, itself, can be included in the IMA measurement list"

* 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
  ima: ima_api: Use struct_size() in kzalloc()
  ima: use struct_size() in kzalloc()
  sefltest/ima: support appended signatures (modsig)
  ima: Fix use after free in ima_read_modsig()
  MODSIGN: make new include file self contained
  ima: fix freeing ongoing ahash_request
  ima: always return negative code for error
  ima: Store the measurement again when appraising a modsig
  ima: Define ima-modsig template
  ima: Collect modsig
  ima: Implement support for module-style appended signatures
  ima: Factor xattr_verify() out of ima_appraise_measurement()
  ima: Add modsig appraise_type option for module-style appended signatures
  integrity: Select CONFIG_KEYS instead of depending on it
  PKCS#7: Introduce pkcs7_get_digest()
  PKCS#7: Refactor verify_pkcs7_signature()
  MODSIGN: Export module signature definitions
  ima: initialize the "template" field with the default template
2019-09-27 19:37:27 -07:00
Linus Torvalds
8bbe0dec38 x86 KVM changes:
* The usual accuracy improvements for nested virtualization
 * The usual round of code cleanups from Sean
 * Added back optimizations that were prematurely removed in 5.2
   (the bare minimum needed to fix the regression was in 5.3-rc8,
   here comes the rest)
 * Support for UMWAIT/UMONITOR/TPAUSE
 * Direct L2->L0 TLB flushing when L0 is Hyper-V and L1 is KVM
 * Tell Windows guests if SMT is disabled on the host
 * More accurate detection of vmexit cost
 * Revert a pvqspinlock pessimization
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJdjfaKAAoJEL/70l94x66D8MAH/2thJnM47tYtMTFA4GBFugeH
 mAx8OApWFBo8apOip+8ElFLPQ8FQdZCzr9ti8H4JkuzKxgsxCs1iqEg5pHEKxSTi
 K9kLOZwoFtwgy3XmxC0PIZ9lT2Wx74ruh1HF+QG/YsjKH636UPv2VpmulsTNbm62
 2ryzOb3TlGT/cjf+gv9l6IYIxZa2Ff19PF4i//H8u4YRBj358/jr99CK01iE0M9r
 4NhEKiQZywzREWtKxymGOM6HEbwbWcIa+loYjj2htq8epep6f9Y1zQ0Jcn5+nPA0
 cn1T2gGJAJ0OUahKLwNbz8pzrFDkW+eoQgqCBJZ4RT9Uf8WCESfl14p+/vRkAMg=
 =tk5S
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull more KVM updates from Paolo Bonzini:
 "x86 KVM changes:

   - The usual accuracy improvements for nested virtualization

   - The usual round of code cleanups from Sean

   - Added back optimizations that were prematurely removed in 5.2 (the
     bare minimum needed to fix the regression was in 5.3-rc8, here
     comes the rest)

   - Support for UMWAIT/UMONITOR/TPAUSE

   - Direct L2->L0 TLB flushing when L0 is Hyper-V and L1 is KVM

   - Tell Windows guests if SMT is disabled on the host

   - More accurate detection of vmexit cost

   - Revert a pvqspinlock pessimization"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (56 commits)
  KVM: nVMX: cleanup and fix host 64-bit mode checks
  KVM: vmx: fix build warnings in hv_enable_direct_tlbflush() on i386
  KVM: x86: Don't check kvm_rebooting in __kvm_handle_fault_on_reboot()
  KVM: x86: Drop ____kvm_handle_fault_on_reboot()
  KVM: VMX: Add error handling to VMREAD helper
  KVM: VMX: Optimize VMX instruction error and fault handling
  KVM: x86: Check kvm_rebooting in kvm_spurious_fault()
  KVM: selftests: fix ucall on x86
  Revert "locking/pvqspinlock: Don't wait if vCPU is preempted"
  kvm: nvmx: limit atomic switch MSRs
  kvm: svm: Intercept RDPRU
  kvm: x86: Add "significant index" flag to a few CPUID leaves
  KVM: x86/mmu: Skip invalid pages during zapping iff root_count is zero
  KVM: x86/mmu: Explicitly track only a single invalid mmu generation
  KVM: x86/mmu: Revert "KVM: x86/mmu: Remove is_obsolete() call"
  KVM: x86/mmu: Revert "Revert "KVM: MMU: reclaim the zapped-obsolete page first""
  KVM: x86/mmu: Revert "Revert "KVM: MMU: collapse TLB flushes when zap all pages""
  KVM: x86/mmu: Revert "Revert "KVM: MMU: zap pages in batch""
  KVM: x86/mmu: Revert "Revert "KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages""
  KVM: x86/mmu: Revert "Revert "KVM: MMU: show mmu_valid_gen in shadow page related tracepoints""
  ...
2019-09-27 12:44:26 -07:00
Balasubramani Vivekanandan
b9023b91dd tick: broadcast-hrtimer: Fix a race in bc_set_next
When a cpu requests broadcasting, before starting the tick broadcast
hrtimer, bc_set_next() checks if the timer callback (bc_handler) is active
using hrtimer_try_to_cancel(). But hrtimer_try_to_cancel() does not provide
the required synchronization when the callback is active on other core.

The callback could have already executed tick_handle_oneshot_broadcast()
and could have also returned. But still there is a small time window where
the hrtimer_try_to_cancel() returns -1. In that case bc_set_next() returns
without doing anything, but the next_event of the tick broadcast clock
device is already set to a timeout value.

In the race condition diagram below, CPU #1 is running the timer callback
and CPU #2 is entering idle state and so calls bc_set_next().

In the worst case, the next_event will contain an expiry time, but the
hrtimer will not be started which happens when the racing callback returns
HRTIMER_NORESTART. The hrtimer might never recover if all further requests
from the CPUs to subscribe to tick broadcast have timeout greater than the
next_event of tick broadcast clock device. This leads to cascading of
failures and finally noticed as rcu stall warnings

Here is a depiction of the race condition

CPU #1 (Running timer callback)                   CPU #2 (Enter idle
                                                  and subscribe to
                                                  tick broadcast)
---------------------                             ---------------------

__run_hrtimer()                                   tick_broadcast_enter()

  bc_handler()                                      __tick_broadcast_oneshot_control()

    tick_handle_oneshot_broadcast()

      raw_spin_lock(&tick_broadcast_lock);

      dev->next_event = KTIME_MAX;                  //wait for tick_broadcast_lock
      //next_event for tick broadcast clock
      set to KTIME_MAX since no other cores
      subscribed to tick broadcasting

      raw_spin_unlock(&tick_broadcast_lock);

    if (dev->next_event == KTIME_MAX)
      return HRTIMER_NORESTART
    // callback function exits without
       restarting the hrtimer                      //tick_broadcast_lock acquired
                                                   raw_spin_lock(&tick_broadcast_lock);

                                                   tick_broadcast_set_event()

                                                     clockevents_program_event()

                                                       dev->next_event = expires;

                                                       bc_set_next()

                                                         hrtimer_try_to_cancel()
                                                         //returns -1 since the timer
                                                         callback is active. Exits without
                                                         restarting the timer
  cpu_base->running = NULL;

The comment that hrtimer cannot be armed from within the callback is
wrong. It is fine to start the hrtimer from within the callback. Also it is
safe to start the hrtimer from the enter/exit idle code while the broadcast
handler is active. The enter/exit idle code and the broadcast handler are
synchronized using tick_broadcast_lock. So there is no need for the
existing try to cancel logic. All this can be removed which will eliminate
the race condition as well.

Fixes: 5d1638acb9 ("tick: Introduce hrtimer based broadcast")
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Balasubramani Vivekanandan <balasubramani_vivekanandan@mentor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20190926135101.12102-2-balasubramani_vivekanandan@mentor.com
2019-09-27 14:45:55 +02:00
Allan Zhang
768fb61fcc bpf: Fix bpf_event_output re-entry issue
BPF_PROG_TYPE_SOCK_OPS program can reenter bpf_event_output because it
can be called from atomic and non-atomic contexts since we don't have
bpf_prog_active to prevent it happen.

This patch enables 3 levels of nesting to support normal, irq and nmi
context.

We can easily reproduce the issue by running netperf crr mode with 100
flows and 10 threads from netperf client side.

Here is the whole stack dump:

[  515.228898] WARNING: CPU: 20 PID: 14686 at kernel/trace/bpf_trace.c:549 bpf_event_output+0x1f9/0x220
[  515.228903] CPU: 20 PID: 14686 Comm: tcp_crr Tainted: G        W        4.15.0-smp-fixpanic #44
[  515.228904] Hardware name: Intel TBG,ICH10/Ikaria_QC_1b, BIOS 1.22.0 06/04/2018
[  515.228905] RIP: 0010:bpf_event_output+0x1f9/0x220
[  515.228906] RSP: 0018:ffff9a57ffc03938 EFLAGS: 00010246
[  515.228907] RAX: 0000000000000012 RBX: 0000000000000001 RCX: 0000000000000000
[  515.228907] RDX: 0000000000000000 RSI: 0000000000000096 RDI: ffffffff836b0f80
[  515.228908] RBP: ffff9a57ffc039c8 R08: 0000000000000004 R09: 0000000000000012
[  515.228908] R10: ffff9a57ffc1de40 R11: 0000000000000000 R12: 0000000000000002
[  515.228909] R13: ffff9a57e13bae00 R14: 00000000ffffffff R15: ffff9a57ffc1e2c0
[  515.228910] FS:  00007f5a3e6ec700(0000) GS:ffff9a57ffc00000(0000) knlGS:0000000000000000
[  515.228910] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  515.228911] CR2: 0000537082664fff CR3: 000000061fed6002 CR4: 00000000000226f0
[  515.228911] Call Trace:
[  515.228913]  <IRQ>
[  515.228919]  [<ffffffff82c6c6cb>] bpf_sockopt_event_output+0x3b/0x50
[  515.228923]  [<ffffffff8265daee>] ? bpf_ktime_get_ns+0xe/0x10
[  515.228927]  [<ffffffff8266fda5>] ? __cgroup_bpf_run_filter_sock_ops+0x85/0x100
[  515.228930]  [<ffffffff82cf90a5>] ? tcp_init_transfer+0x125/0x150
[  515.228933]  [<ffffffff82cf9159>] ? tcp_finish_connect+0x89/0x110
[  515.228936]  [<ffffffff82cf98e4>] ? tcp_rcv_state_process+0x704/0x1010
[  515.228939]  [<ffffffff82c6e263>] ? sk_filter_trim_cap+0x53/0x2a0
[  515.228942]  [<ffffffff82d90d1f>] ? tcp_v6_inbound_md5_hash+0x6f/0x1d0
[  515.228945]  [<ffffffff82d92160>] ? tcp_v6_do_rcv+0x1c0/0x460
[  515.228947]  [<ffffffff82d93558>] ? tcp_v6_rcv+0x9f8/0xb30
[  515.228951]  [<ffffffff82d737c0>] ? ip6_route_input+0x190/0x220
[  515.228955]  [<ffffffff82d5f7ad>] ? ip6_protocol_deliver_rcu+0x6d/0x450
[  515.228958]  [<ffffffff82d60246>] ? ip6_rcv_finish+0xb6/0x170
[  515.228961]  [<ffffffff82d5fb90>] ? ip6_protocol_deliver_rcu+0x450/0x450
[  515.228963]  [<ffffffff82d60361>] ? ipv6_rcv+0x61/0xe0
[  515.228966]  [<ffffffff82d60190>] ? ipv6_list_rcv+0x330/0x330
[  515.228969]  [<ffffffff82c4976b>] ? __netif_receive_skb_one_core+0x5b/0xa0
[  515.228972]  [<ffffffff82c497d1>] ? __netif_receive_skb+0x21/0x70
[  515.228975]  [<ffffffff82c4a8d2>] ? process_backlog+0xb2/0x150
[  515.228978]  [<ffffffff82c4aadf>] ? net_rx_action+0x16f/0x410
[  515.228982]  [<ffffffff830000dd>] ? __do_softirq+0xdd/0x305
[  515.228986]  [<ffffffff8252cfdc>] ? irq_exit+0x9c/0xb0
[  515.228989]  [<ffffffff82e02de5>] ? smp_call_function_single_interrupt+0x65/0x120
[  515.228991]  [<ffffffff82e020e1>] ? call_function_single_interrupt+0x81/0x90
[  515.228992]  </IRQ>
[  515.228996]  [<ffffffff82a11ff0>] ? io_serial_in+0x20/0x20
[  515.229000]  [<ffffffff8259c040>] ? console_unlock+0x230/0x490
[  515.229003]  [<ffffffff8259cbaa>] ? vprintk_emit+0x26a/0x2a0
[  515.229006]  [<ffffffff8259cbff>] ? vprintk_default+0x1f/0x30
[  515.229008]  [<ffffffff8259d9f5>] ? vprintk_func+0x35/0x70
[  515.229011]  [<ffffffff8259d4bb>] ? printk+0x50/0x66
[  515.229013]  [<ffffffff82637637>] ? bpf_event_output+0xb7/0x220
[  515.229016]  [<ffffffff82c6c6cb>] ? bpf_sockopt_event_output+0x3b/0x50
[  515.229019]  [<ffffffff8265daee>] ? bpf_ktime_get_ns+0xe/0x10
[  515.229023]  [<ffffffff82c29e87>] ? release_sock+0x97/0xb0
[  515.229026]  [<ffffffff82ce9d6a>] ? tcp_recvmsg+0x31a/0xda0
[  515.229029]  [<ffffffff8266fda5>] ? __cgroup_bpf_run_filter_sock_ops+0x85/0x100
[  515.229032]  [<ffffffff82ce77c1>] ? tcp_set_state+0x191/0x1b0
[  515.229035]  [<ffffffff82ced10e>] ? tcp_disconnect+0x2e/0x600
[  515.229038]  [<ffffffff82cecbbb>] ? tcp_close+0x3eb/0x460
[  515.229040]  [<ffffffff82d21082>] ? inet_release+0x42/0x70
[  515.229043]  [<ffffffff82d58809>] ? inet6_release+0x39/0x50
[  515.229046]  [<ffffffff82c1f32d>] ? __sock_release+0x4d/0xd0
[  515.229049]  [<ffffffff82c1f3e5>] ? sock_close+0x15/0x20
[  515.229052]  [<ffffffff8273b517>] ? __fput+0xe7/0x1f0
[  515.229055]  [<ffffffff8273b66e>] ? ____fput+0xe/0x10
[  515.229058]  [<ffffffff82547bf2>] ? task_work_run+0x82/0xb0
[  515.229061]  [<ffffffff824086df>] ? exit_to_usermode_loop+0x7e/0x11f
[  515.229064]  [<ffffffff82408171>] ? do_syscall_64+0x111/0x130
[  515.229067]  [<ffffffff82e0007c>] ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Fixes: a5a3a828cd ("bpf: add perf event notificaton support for sock_ops")
Signed-off-by: Allan Zhang <allanzhang@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20190925234312.94063-2-allanzhang@google.com
2019-09-27 11:24:29 +02:00
Linus Torvalds
da05b5ea12 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar:
 "Fix a timer expiry bug that would cause spurious delay of timers"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timer: Read jiffies once when forwarding base clk
2019-09-26 15:53:17 -07:00
Linus Torvalds
a7b7b772bb Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull more perf updates from Ingo Molnar:
 "The only kernel change is comment typo fixes.

  The rest is mostly tooling fixes, but also new vendor event additions
  and updates, a bigger libperf/libtraceevent library and a header files
  reorganization that came in a bit late"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (108 commits)
  perf unwind: Fix libunwind build failure on i386 systems
  perf parser: Remove needless include directives
  perf build: Add detection of java-11-openjdk-devel package
  perf jvmti: Include JVMTI support for s390
  perf vendor events: Remove P8 HW events which are not supported
  perf evlist: Fix access of freed id arrays
  perf stat: Fix free memory access / memory leaks in metrics
  perf tools: Replace needless mmap.h with what is needed, event.h
  perf evsel: Move config terms to a separate header
  perf evlist: Remove unused perf_evlist__fprintf() method
  perf evsel: Introduce evsel_fprintf.h
  perf evsel: Remove need for symbol_conf in evsel_fprintf.c
  perf copyfile: Move copyfile routines to separate files
  libperf: Add perf_evlist__poll() function
  libperf: Add perf_evlist__add_pollfd() function
  libperf: Add perf_evlist__alloc_pollfd() function
  libperf: Add libperf_init() call to the tests
  libperf: Merge libperf_set_print() into libperf_init()
  libperf: Add libperf dependency for tests targets
  libperf: Use sys/types.h to get ssize_t, not unistd.h
  ...
2019-09-26 15:38:07 -07:00
Linus Torvalds
7897c04ad0 Srikar Dronamraju fixed a bug in the newmulti probe code.
-----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXYvAlBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qlK6APsECr49j3ew/tRCnzkq0Y09w0TLYeHL
 ax6aAVO1fHX0TgEAhCBwkWh8ZcoxGbu1CDOkQjJAqfTFFSu38Klv1P+3PQg=
 =oSuX
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
 "Srikar Dronamraju fixed a bug in the newmulti probe code"

* tag 'trace-v5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing/probe: Fix same probe event argument matching
2019-09-26 13:07:38 -07:00
Colin Ian King
e3439af4a3 bpf: Clean up indentation issue in BTF kflag processing
There is a statement that is indented one level too deeply, remove
the extraneous tab.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20190925093835.19515-1-colin.king@canonical.com
2019-09-26 17:09:18 +02:00
Kees Cook
2da1ead4d5 bug: consolidate __WARN_FLAGS usage
Instead of having separate tests for __WARN_FLAGS, merge the two #ifdef
blocks and replace the synonym WANT_WARN_ON_SLOWPATH macro.

Link: http://lkml.kernel.org/r/20190819234111.9019-7-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Drew Davenport <ddavenport@chromium.org>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:51:41 -07:00
Kees Cook
d38aba49a9 bug: lift "cut here" out of __warn()
In preparation for cleaning up "cut here", move the "cut here" logic up
out of __warn() and into callers that pass non-NULL args.  For anyone
looking closely, there are two callers that pass NULL args: one already
explicitly prints "cut here".  The remaining case is covered by how a WARN
is built, which will be cleaned up in the next patch.

Link: http://lkml.kernel.org/r/20190819234111.9019-5-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Drew Davenport <ddavenport@chromium.org>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:51:40 -07:00
Kees Cook
f2f84b05e0 bug: consolidate warn_slowpath_fmt() usage
Instead of having a separate helper for no printk output, just consolidate
the logic into warn_slowpath_fmt().

Link: http://lkml.kernel.org/r/20190819234111.9019-4-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Drew Davenport <ddavenport@chromium.org>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:51:40 -07:00
Kees Cook
ee8711336c bug: refactor away warn_slowpath_fmt_taint()
Patch series "Clean up WARN() "cut here" handling", v2.

Christophe Leroy noticed that the fix for missing "cut here" in the WARN()
case was adding explicit printk() calls instead of teaching the exception
handler to add it.  This refactors the bug/warn infrastructure to pass
this information as a new BUGFLAG.

Longer details repeated from the last patch in the series:

bug: move WARN_ON() "cut here" into exception handler

The original cleanup of "cut here" missed the WARN_ON() case (that does
not have a printk message), which was fixed recently by adding an explicit
printk of "cut here".  This had the downside of adding a printk() to every
WARN_ON() caller, which reduces the utility of using an instruction
exception to streamline the resulting code.  By making this a new BUGFLAG,
all of these can be removed and "cut here" can be handled by the exception
handler.

This was very pronounced on PowerPC, but the effect can be seen on x86 as
well.  The resulting text size of a defconfig build shows some small
savings from this patch:

   text    data     bss     dec     hex filename
19691167        5134320 1646664 26472151        193eed7 vmlinux.before
19676362        5134260 1663048 26473670        193f4c6 vmlinux.after

This change also opens the door for creating something like BUG_MSG(),
where a custom printk() before issuing BUG(), without confusing the "cut
here" line.

This patch (of 7):

There's no reason to have specialized helpers for passing the warn taint
down to __warn().  Consolidate and refactor helper macros, removing
__WARN_printf() and warn_slowpath_fmt_taint().

Link: http://lkml.kernel.org/r/20190819234111.9019-2-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Drew Davenport <ddavenport@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:51:40 -07:00
Douglas Anderson
7d92bda271 kgdb: don't use a notifier to enter kgdb at panic; call directly
Right now kgdb/kdb hooks up to debug panics by registering for the panic
notifier.  This works OK except that it means that kgdb/kdb gets called
_after_ the CPUs in the system are taken offline.  That means that if
anything important was happening on those CPUs (like something that might
have contributed to the panic) you can't debug them.

Specifically I ran into a case where I got a panic because a task was
"blocked for more than 120 seconds" which was detected on CPU 2.  I nicely
got shown stack traces in the kernel log for all CPUs including CPU 0,
which was running 'PID: 111 Comm: kworker/0:1H' and was in the middle of
__mmc_switch().

I then ended up at the kdb prompt where switched over to kgdb to try to
look at local variables of the process on CPU 0.  I found that I couldn't.
Digging more, I found that I had no info on any tasks running on CPUs
other than CPU 2 and that asking kdb for help showed me "Error: no saved
data for this cpu".  This was because all the CPUs were offline.

Let's move the entry of kdb/kgdb to a direct call from panic() and stop
using the generic notifier.  Putting a direct call in allows us to order
things more properly and it also doesn't seem like we're breaking any
abstractions by calling into the debugger from the panic function.

Daniel said:

: This patch changes the way kdump and kgdb interact with each other.
: However it would seem rather odd to have both tools simultaneously armed
: and, even if they were, the user still has the option to use panic_timeout
: to force a kdump to happen.  Thus I think the change of order is
: acceptable.

Link: http://lkml.kernel.org/r/20190703170354.217312-1-dianders@chromium.org
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Feng Tang <feng.tang@intel.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:51:40 -07:00
Tetsuo Handa
7c3a6aedcd kexec: bail out upon SIGKILL when allocating memory.
syzbot found that a thread can stall for minutes inside kexec_load() after
that thread was killed by SIGKILL [1].  It turned out that the reproducer
was trying to allocate 2408MB of memory using kimage_alloc_page() from
kimage_load_normal_segment().  Let's check for SIGKILL before doing memory
allocation.

[1] https://syzkaller.appspot.com/bug?id=a0e3436829698d5824231251fad9d8e998f94f5e

Link: http://lkml.kernel.org/r/993c9185-d324-2640-d061-bed2dd18b1f7@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+8ab2d0f39fb79fe6ca40@syzkaller.appspotmail.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:51:40 -07:00
Sai Praneeth Prakhya
8495f7e673 fork: improve error message for corrupted page tables
When a user process exits, the kernel cleans up the mm_struct of the user
process and during cleanup, check_mm() checks the page tables of the user
process for corruption (E.g: unexpected page flags set/cleared).  For
corrupted page tables, the error message printed by check_mm() isn't very
clear as it prints the loop index instead of page table type (E.g:
Resident file mapping pages vs Resident shared memory pages).  The loop
index in check_mm() is used to index rss_stat[] which represents
individual memory type stats.  Hence, instead of printing index, print
memory type, thereby improving error message.

Without patch:
--------------
[  204.836425] mm/pgtable-generic.c:29: bad p4d 0000000089eb4e92(800000025f941467)
[  204.836544] BUG: Bad rss-counter state mm:00000000f75895ea idx:0 val:2
[  204.836615] BUG: Bad rss-counter state mm:00000000f75895ea idx:1 val:5
[  204.836685] BUG: non-zero pgtables_bytes on freeing mm: 20480

With patch:
-----------
[   69.815453] mm/pgtable-generic.c:29: bad p4d 0000000084653642(800000025ca37467)
[   69.815872] BUG: Bad rss-counter state mm:00000000014a6c03 type:MM_FILEPAGES val:2
[   69.815962] BUG: Bad rss-counter state mm:00000000014a6c03 type:MM_ANONPAGES val:5
[   69.816050] BUG: non-zero pgtables_bytes on freeing mm: 20480

Also, change print function (from printk(KERN_ALERT, ..) to pr_alert()) so
that it matches the other print statement.

Link: http://lkml.kernel.org/r/da75b5153f617f4c5739c08ee6ebeb3d19db0fbc.1565123758.git.sai.praneeth.prakhya@intel.com
Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:51:40 -07:00
Valdis Kletnieks
0f74914071 kernel/elfcore.c: include proper prototypes
When building with W=1, gcc properly complains that there's no prototypes:

  CC      kernel/elfcore.o
kernel/elfcore.c:7:17: warning: no previous prototype for 'elf_core_extra_phdrs' [-Wmissing-prototypes]
    7 | Elf_Half __weak elf_core_extra_phdrs(void)
      |                 ^~~~~~~~~~~~~~~~~~~~
kernel/elfcore.c:12:12: warning: no previous prototype for 'elf_core_write_extra_phdrs' [-Wmissing-prototypes]
   12 | int __weak elf_core_write_extra_phdrs(struct coredump_params *cprm, loff_t offset)
      |            ^~~~~~~~~~~~~~~~~~~~~~~~~~
kernel/elfcore.c:17:12: warning: no previous prototype for 'elf_core_write_extra_data' [-Wmissing-prototypes]
   17 | int __weak elf_core_write_extra_data(struct coredump_params *cprm)
      |            ^~~~~~~~~~~~~~~~~~~~~~~~~
kernel/elfcore.c:22:15: warning: no previous prototype for 'elf_core_extra_data_size' [-Wmissing-prototypes]
   22 | size_t __weak elf_core_extra_data_size(void)
      |               ^~~~~~~~~~~~~~~~~~~~~~~~

Provide the include file so gcc is happy, and we don't have potential code drift

Link: http://lkml.kernel.org/r/29875.1565224705@turing-police
Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:51:39 -07:00
Jonathan Lemon
fcd30ae066 bpf/xskmap: Return ERR_PTR for failure case instead of NULL.
When kzalloc() failed, NULL was returned to the caller, which
tested the pointer with IS_ERR(), which didn't match, so the
pointer was used later, resulting in a NULL dereference.

Return ERR_PTR(-ENOMEM) instead of NULL.

Reported-by: syzbot+491c1b7565ba9069ecae@syzkaller.appspotmail.com
Fixes: 0402acd683 ("xsk: remove AF_XDP socket from map when the socket is released")
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-09-25 22:14:16 +02:00
Quentin Perret
4892f51ad5 sched/fair: Avoid redundant EAS calculation
The EAS wake-up path computes the system energy for several CPU
candidates: the CPU with maximum spare capacity in each performance
domain, and the prev_cpu. However, if prev_cpu also happens to be the
CPU with maximum spare capacity in its performance domain, the energy
calculation is still done twice, unnecessarily.

Add a condition to filter out this corner case before doing the energy
calculation.

Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Quentin Perret <qperret@qperret.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: morten.rasmussen@arm.com
Cc: qais.yousef@arm.com
Cc: rjw@rjwysocki.net
Cc: tkjos@google.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Fixes: eb92692b25 ("sched/fair: Speed-up energy-aware wake-ups")
Link: https://lkml.kernel.org/r/20190920094115.GA11503@qperret.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:32 +02:00
Valentin Schneider
9fc41acc89 sched/core: Remove double update_max_interval() call on CPU startup
update_max_interval() is called in both CPUHP_AP_SCHED_STARTING's startup
and teardown callbacks, but it turns out it's also called at the end of
the startup callback of CPUHP_AP_ACTIVE (which is further down the
startup sequence).

There's no point in repeating this interval update in the startup sequence
since the CPU will remain online until it goes down the teardown path.

Remove the redundant call in sched_cpu_activate() (CPUHP_AP_ACTIVE).

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: vincent.guittot@linaro.org
Link: https://lkml.kernel.org/r/20190923093017.11755-1-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:32 +02:00
Valentin Schneider
a49b4f4012 sched/core: Fix preempt_schedule() interrupt return comment
preempt_schedule_irq() is the one that should be called on return from
interrupt, clean up the comment to avoid any ambiguity.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-m68k@lists.linux-m68k.org
Cc: linux-riscv@lists.infradead.org
Cc: uclinux-h8-devel@lists.sourceforge.jp
Link: https://lkml.kernel.org/r/20190923143620.29334-2-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:32 +02:00
Qian Cai
763a9ec06c sched/fair: Fix -Wunused-but-set-variable warnings
Commit:

   de53fd7aed ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")

introduced a few compilation warnings:

  kernel/sched/fair.c: In function '__refill_cfs_bandwidth_runtime':
  kernel/sched/fair.c:4365:6: warning: variable 'now' set but not used [-Wunused-but-set-variable]
  kernel/sched/fair.c: In function 'start_cfs_bandwidth':
  kernel/sched/fair.c:4992:6: warning: variable 'overrun' set but not used [-Wunused-but-set-variable]

Also, __refill_cfs_bandwidth_runtime() does no longer update the
expiration time, so fix the comments accordingly.

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Reviewed-by: Dave Chiluk <chiluk+linux@indeed.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: pauld@redhat.com
Fixes: de53fd7aed ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")
Link: https://lkml.kernel.org/r/1566326455-8038-1-git-send-email-cai@lca.pw
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:31 +02:00
KeMeng Shi
714e501e16 sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr()
An oops can be triggered in the scheduler when running qemu on arm64:

 Unable to handle kernel paging request at virtual address ffff000008effe40
 Internal error: Oops: 96000007 [#1] SMP
 Process migration/0 (pid: 12, stack limit = 0x00000000084e3736)
 pstate: 20000085 (nzCv daIf -PAN -UAO)
 pc : __ll_sc___cmpxchg_case_acq_4+0x4/0x20
 lr : move_queued_task.isra.21+0x124/0x298
 ...
 Call trace:
  __ll_sc___cmpxchg_case_acq_4+0x4/0x20
  __migrate_task+0xc8/0xe0
  migration_cpu_stop+0x170/0x180
  cpu_stopper_thread+0xec/0x178
  smpboot_thread_fn+0x1ac/0x1e8
  kthread+0x134/0x138
  ret_from_fork+0x10/0x18

__set_cpus_allowed_ptr() will choose an active dest_cpu in affinity mask to
migrage the process if process is not currently running on any one of the
CPUs specified in affinity mask. __set_cpus_allowed_ptr() will choose an
invalid dest_cpu (dest_cpu >= nr_cpu_ids, 1024 in my virtual machine) if
CPUS in an affinity mask are deactived by cpu_down after cpumask_intersects
check. cpumask_test_cpu() of dest_cpu afterwards is overflown and may pass if
corresponding bit is coincidentally set. As a consequence, kernel will
access an invalid rq address associate with the invalid CPU in
migration_cpu_stop->__migrate_task->move_queued_task and the Oops occurs.

The reproduce the crash:

  1) A process repeatedly binds itself to cpu0 and cpu1 in turn by calling
  sched_setaffinity.

  2) A shell script repeatedly does "echo 0 > /sys/devices/system/cpu/cpu1/online"
  and "echo 1 > /sys/devices/system/cpu/cpu1/online" in turn.

  3) Oops appears if the invalid CPU is set in memory after tested cpumask.

Signed-off-by: KeMeng Shi <shikemeng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1568616808-16808-1-git-send-email-shikemeng@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:31 +02:00
Mathieu Desnoyers
c172e0a3e8 sched/membarrier: Return -ENOMEM to userspace on memory allocation failure
Remove the IPI fallback code from membarrier to deal with very
infrequent cpumask memory allocation failure. Use GFP_KERNEL rather
than GFP_NOWAIT, and relax the blocking guarantees for the expedited
membarrier system call commands, allowing it to block if waiting for
memory to be made available.

In addition, now -ENOMEM can be returned to user-space if the cpumask
memory allocation fails.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190919173705.2181-8-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:31 +02:00
Mathieu Desnoyers
c6d68c1c4a sched/membarrier: Skip IPIs when mm->mm_users == 1
If there is only a single mm_user for the mm, the private expedited
membarrier command can skip the IPIs, because only a single thread
is using the mm.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190919173705.2181-7-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:31 +02:00
Mathieu Desnoyers
227a4aadc7 sched/membarrier: Fix p->mm->membarrier_state racy load
The membarrier_state field is located within the mm_struct, which
is not guaranteed to exist when used from runqueue-lock-free iteration
on runqueues by the membarrier system call.

Copy the membarrier_state from the mm_struct into the scheduler runqueue
when the scheduler switches between mm.

When registering membarrier for mm, after setting the registration bit
in the mm membarrier state, issue a synchronize_rcu() to ensure the
scheduler observes the change. In order to take care of the case
where a runqueue keeps executing the target mm without swapping to
other mm, iterate over each runqueue and issue an IPI to copy the
membarrier_state from the mm_struct into each runqueue which have the
same mm which state has just been modified.

Move the mm membarrier_state field closer to pgd in mm_struct to use
a cache line already touched by the scheduler switch_mm.

The membarrier_execve() (now membarrier_exec_mmap) hook now needs to
clear the runqueue's membarrier state in addition to clear the mm
membarrier state, so move its implementation into the scheduler
membarrier code so it can access the runqueue structure.

Add memory barrier in membarrier_exec_mmap() prior to clearing
the membarrier state, ensuring memory accesses executed prior to exec
are not reordered with the stores clearing the membarrier state.

As suggested by Linus, move all membarrier.c RCU read-side locks outside
of the for each cpu loops.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190919173705.2181-5-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:30 +02:00
Mathieu Desnoyers
09554009c0 sched/membarrier: Remove redundant check
Checking that the number of threads is 1 is redundant with checking
mm_users == 1.

No change in functionality intended.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190919173705.2181-3-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:30 +02:00
Mathieu Desnoyers
fc0d77387c sched/membarrier: Fix private expedited registration check
Fix a logic flaw in the way membarrier_register_private_expedited()
handles ready state checks for private expedited sync core and private
expedited registrations.

If a private expedited membarrier registration is first performed, and
then a private expedited sync_core registration is performed, the ready
state check will skip the second registration when it really should not.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190919173705.2181-2-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:30 +02:00
Eric W. Biederman
5311a98fef tasks, sched/core: RCUify the assignment of rq->curr
The current task on the runqueue is currently read with rcu_dereference().

To obtain ordinary RCU semantics for an rcu_dereference() of rq->curr it needs
to be paired with rcu_assign_pointer() of rq->curr.  Which provides the
memory barrier necessary to order assignments to the task_struct
and the assignment to rq->curr.

Unfortunately the assignment of rq->curr in __schedule is a hot path,
and it has already been show that additional barriers in that code
will reduce the performance of the scheduler.  So I will attempt to
describe below why you can effectively have ordinary RCU semantics
without any additional barriers.

The assignment of rq->curr in init_idle is a slow path called once
per cpu and that can use rcu_assign_pointer() without any concerns.

As I write this there are effectively two users of rcu_dereference() on
rq->curr.  There is the membarrier code in kernel/sched/membarrier.c
that only looks at "->mm" after the rcu_dereference().  Then there is
task_numa_compare() in kernel/sched/fair.c.  My best reading of the
code shows that task_numa_compare only access: "->flags",
"->cpus_ptr", "->numa_group", "->numa_faults[]",
"->total_numa_faults", and "->se.cfs_rq".

The code in __schedule() essentially does:
	rq_lock(...);
	smp_mb__after_spinlock();

	next = pick_next_task(...);
	rq->curr = next;

	context_switch(prev, next);

At the start of the function the rq_lock/smp_mb__after_spinlock
pair provides a full memory barrier.  Further there is a full memory barrier
in context_switch().

This means that any task that has already run and modified itself (the
common case) has already seen two memory barriers before __schedule()
runs and begins executing.  A task that modifies itself then sees a
third full memory barrier pair with the rq_lock();

For a brand new task that is enqueued with wake_up_new_task() there
are the memory barriers present from the taking and release the
pi_lock and the rq_lock as the processes is enqueued as well as the
full memory barrier at the start of __schedule() assuming __schedule()
happens on the same cpu.

This means that by the time we reach the assignment of rq->curr
except for values on the task struct modified in pick_next_task
the code has the same guarantees as if it used rcu_assign_pointer().

Reading through all of the implementations of pick_next_task it
appears pick_next_task is limited to modifying the task_struct fields
"->se", "->rt", "->dl".  These fields are the sched_entity structures
of the varies schedulers.

Further "->se.cfs_rq" is only changed in cgroup attach/move operations
initialized by userspace.

Unless I have missed something this means that in practice that the
users of "rcu_dereference(rq->curr)" get normal RCU semantics of
rcu_dereference() for the fields the care about, despite the
assignment of rq->curr in __schedule() ot using rcu_assign_pointer.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20190903200603.GW2349@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:29 +02:00
Eric W. Biederman
154abafc68 tasks, sched/core: With a grace period after finish_task_switch(), remove unnecessary code
Remove work arounds that were written before there was a grace period
after tasks left the runqueue in finish_task_switch().

In particular now that there tasks exiting the runqueue exprience
a RCU grace period none of the work performed by task_rcu_dereference()
excpet the rcu_dereference() is necessary so replace task_rcu_dereference()
with rcu_dereference().

Remove the code in rcuwait_wait_event() that checks to ensure the current
task has not exited.  It is no longer necessary as it is guaranteed
that any running task will experience a RCU grace period after it
leaves the run queueue.

Remove the comment in rcuwait_wake_up() as it is no longer relevant.

Ref: 8f95c90ceb ("sched/wait, RCU: Introduce rcuwait machinery")
Ref: 150593bf86 ("sched/api: Introduce task_rcu_dereference() and try_get_task_struct()")
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/87lfurdpk9.fsf_-_@x220.int.ebiederm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:29 +02:00
Eric W. Biederman
0ff7b2cfba tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue
In the ordinary case today the RCU grace period for a task_struct is
triggered when another process wait's for it's zombine and causes the
kernel to call release_task().  As the waiting task has to receive a
signal and then act upon it before this happens, typically this will
occur after the original task as been removed from the runqueue.

Unfortunaty in some cases such as self reaping tasks it can be shown
that release_task() will be called starting the grace period for
task_struct long before the task leaves the runqueue.

Therefore use put_task_struct_rcu_user() in finish_task_switch() to
guarantee that the there is a RCU lifetime after the task
leaves the runqueue.

Besides the change in the start of the RCU grace period for the
task_struct this change may cause perf_event_delayed_put and
trace_sched_process_free.  The function perf_event_delayed_put boils
down to just a WARN_ON for cases that I assume never show happen.  So
I don't see any problem with delaying it.

The function trace_sched_process_free is a trace point and thus
visible to user space.  Occassionally userspace has the strangest
dependencies so this has a miniscule chance of causing a regression.
This change only changes the timing of when the tracepoint is called.
The change in timing arguably gives userspace a more accurate picture
of what is going on.  So I don't expect there to be a regression.

In the case where a task self reaps we are pretty much guaranteed that
the RCU grace period is delayed.  So we should get quite a bit of
coverage in of this worst case for the change in a normal threaded
workload.  So I expect any issues to turn up quickly or not at all.

I have lightly tested this change and everything appears to work
fine.

Inspired-by: Linus Torvalds <torvalds@linux-foundation.org>
Inspired-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/87r24jdpl5.fsf_-_@x220.int.ebiederm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:29 +02:00
Eric W. Biederman
3fbd7ee285 tasks: Add a count of task RCU users
Add a count of the number of RCU users (currently 1) of the task
struct so that we can later add the scheduler case and get rid of the
very subtle task_rcu_dereference(), and just use rcu_dereference().

As suggested by Oleg have the count overlap rcu_head so that no
additional space in task_struct is required.

Inspired-by: Linus Torvalds <torvalds@linux-foundation.org>
Inspired-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/87woebdplt.fsf_-_@x220.int.ebiederm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:29 +02:00
Srikar Dronamraju
f8d7ab2bde tracing/probe: Fix same probe event argument matching
Commit fe60b0ce8e ("tracing/probe: Reject exactly same probe event")
tries to reject a event which matches an already existing probe.

However it currently continues to match arguments and rejects adding a
probe even when the arguments don't match. Fix this by only rejecting a
probe if and only if all the arguments match.

Link: http://lkml.kernel.org/r/20190924114906.14038-1-srikar@linux.vnet.ibm.com

Fixes: fe60b0ce8e ("tracing/probe: Reject exactly same probe event")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-09-25 06:34:06 -04:00
Wanpeng Li
89340d0935 Revert "locking/pvqspinlock: Don't wait if vCPU is preempted"
This patch reverts commit 75437bb304 (locking/pvqspinlock: Don't
wait if vCPU is preempted).  A large performance regression was caused
by this commit.  on over-subscription scenarios.

The test was run on a Xeon Skylake box, 2 sockets, 40 cores, 80 threads,
with three VMs of 80 vCPUs each.  The score of ebizzy -M is reduced from
13000-14000 records/s to 1700-1800 records/s:

          Host                Guest                score

vanilla w/o kvm optimizations     upstream    1700-1800 records/s
vanilla w/o kvm optimizations     revert      13000-14000 records/s
vanilla w/ kvm optimizations      upstream    4500-5000 records/s
vanilla w/ kvm optimizations      revert      14000-15500 records/s

Exit from aggressive wait-early mechanism can result in premature yield
and extra scheduling latency.

Actually, only 6% of wait_early events are caused by vcpu_is_preempted()
being true.  However, when one vCPU voluntarily releases its vCPU, all
the subsequently waiters in the queue will do the same and the cascading
effect leads to bad performance.

kvm optimizations:
[1] commit d73eb57b80 (KVM: Boost vCPUs that are delivering interrupts)
[2] commit 266e85a5ec (KVM: X86: Boost queue head vCPU to mitigate lock waiter preemption)

Tested-by: loobinliu@tencent.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: loobinliu@tencent.com
Cc: stable@vger.kernel.org
Fixes: 75437bb304 (locking/pvqspinlock: Don't wait if vCPU is preempted)
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-25 10:22:37 +02:00
Linus Torvalds
9c9fa97a8e Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - a few hot fixes

 - ocfs2 updates

 - almost all of -mm (slab-generic, slab, slub, kmemleak, kasan,
   cleanups, debug, pagecache, memcg, gup, pagemap, memory-hotplug,
   sparsemem, vmalloc, initialization, z3fold, compaction, mempolicy,
   oom-kill, hugetlb, migration, thp, mmap, madvise, shmem, zswap,
   zsmalloc)

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (132 commits)
  mm/zsmalloc.c: fix a -Wunused-function warning
  zswap: do not map same object twice
  zswap: use movable memory if zpool support allocate movable memory
  zpool: add malloc_support_movable to zpool_driver
  shmem: fix obsolete comment in shmem_getpage_gfp()
  mm/madvise: reduce code duplication in error handling paths
  mm: mmap: increase sockets maximum memory size pgoff for 32bits
  mm/mmap.c: refine find_vma_prev() with rb_last()
  riscv: make mmap allocation top-down by default
  mips: use generic mmap top-down layout and brk randomization
  mips: replace arch specific way to determine 32bit task with generic version
  mips: adjust brk randomization offset to fit generic version
  mips: use STACK_TOP when computing mmap base address
  mips: properly account for stack randomization and stack guard gap
  arm: use generic mmap top-down layout and brk randomization
  arm: use STACK_TOP when computing mmap base address
  arm: properly account for stack randomization and stack guard gap
  arm64, mm: make randomization selected by generic topdown mmap layout
  arm64, mm: move generic mmap layout functions to mm
  arm64: consider stack randomization for mmap base only when necessary
  ...
2019-09-24 16:10:23 -07:00
Alexandre Ghiti
67f3977f80 arm64, mm: move generic mmap layout functions to mm
arm64 handles top-down mmap layout in a way that can be easily reused by
other architectures, so make it available in mm.  It then introduces a new
config ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT that can be set by other
architectures to benefit from those functions.  Note that this new config
depends on MMU being enabled, if selected without MMU support, a warning
will be thrown.

Link: http://lkml.kernel.org/r/20190730055113.23635-5-alex@ghiti.fr
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: James Hogan <jhogan@kernel.org>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:11 -07:00
Song Liu
f385cb85a4 uprobe: collapse THP pmd after removing all uprobes
After all uprobes are removed from the huge page (with PTE pgtable), it is
possible to collapse the pmd and benefit from THP again.  This patch does
the collapse by calling collapse_pte_mapped_thp().

Link: http://lkml.kernel.org/r/20190815164525.1848545-7-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: kbuild test robot <lkp@intel.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:11 -07:00
Song Liu
5a52c9df62 uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT
Use the newly added FOLL_SPLIT_PMD in uprobe.  This preserves the huge
page when the uprobe is enabled.  When the uprobe is disabled, newer
instances of the same application could still benefit from huge page.

For the next step, we will enable khugepaged to regroup the pmd, so that
existing instances of the application could also benefit from huge page
after the uprobe is disabled.

Link: http://lkml.kernel.org/r/20190815164525.1848545-5-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:11 -07:00
Song Liu
fb4fb04ff4 uprobe: use original page when all uprobes are removed
Currently, uprobe swaps the target page with a anonymous page in both
install_breakpoint() and remove_breakpoint().  When all uprobes on a page
are removed, the given mm is still using an anonymous page (not the
original page).

This patch allows uprobe to use original page when possible (all uprobes
on the page are already removed, and the original page is in page cache
and uptodate).

As suggested by Oleg, we unmap the old_page and let the original page
fault in.

Link: http://lkml.kernel.org/r/20190815164525.1848545-3-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:11 -07:00
David Hildenbrand
00ff9a91bd mm/memory_hotplug.c: use PFN_UP / PFN_DOWN in walk_system_ram_range()
Patch series "mm/memory_hotplug: online_pages() cleanups", v2.

Some cleanups (+ one fix for a special case) in the context of
online_pages().

This patch (of 5):

This makes it clearer that we will never call func() with duplicate PFNs
in case we have multiple sub-page memory resources.  All unaligned parts
of PFNs are completely discarded.

Link: http://lkml.kernel.org/r/20190814154109.3448-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:09 -07:00
Nicholas Piggin
13224794cb mm: remove quicklist page table caches
Patch series "mm: remove quicklist page table caches".

A while ago Nicholas proposed to remove quicklist page table caches [1].

I've rebased his patch on the curren upstream and switched ia64 and sh to
use generic versions of PTE allocation.

[1] https://lore.kernel.org/linux-mm/20190711030339.20892-1-npiggin@gmail.com

This patch (of 3):

Remove page table allocator "quicklists".  These have been around for a
long time, but have not got much traction in the last decade and are only
used on ia64 and sh architectures.

The numbers in the initial commit look interesting but probably don't
apply anymore.  If anybody wants to resurrect this it's in the git
history, but it's unhelpful to have this code and divergent allocator
behaviour for minor archs.

Also it might be better to instead make more general improvements to page
allocator if this is still so slow.

Link: http://lkml.kernel.org/r/1565250728-21721-2-git-send-email-rppt@linux.ibm.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:09 -07:00
Linus Torvalds
0b36c9eed2 Merge branch 'work.mount3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more mount API conversions from Al Viro:
 "Assorted conversions of options parsing to new API.

  gfs2 is probably the most serious one here; the rest is trivial stuff.

  Other things in what used to be #work.mount are going to wait for the
  next cycle (and preferably go via git trees of the filesystems
  involved)"

* 'work.mount3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  gfs2: Convert gfs2 to fs_context
  vfs: Convert spufs to use the new mount API
  vfs: Convert hypfs to use the new mount API
  hypfs: Fix error number left in struct pointer member
  vfs: Convert functionfs to use the new mount API
  vfs: Convert bpf to use the new mount API
2019-09-24 12:33:34 -07:00
Vitaly Kuznetsov
e1572f1d08 cpu/SMT: create and export cpu_smt_possible()
KVM needs to know if SMT is theoretically possible, this means it is
supported and not forcefully disabled ('nosmt=force'). Create and
export cpu_smt_possible() answering this question.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24 13:37:28 +02:00
Linus Torvalds
9f7582d15f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching
Pull livepatching fix from Jiri Kosina:
 "Error handling fix in livepatching module notifier, from Miroslav
  Benes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching:
  livepatch: Nullify obj->mod in klp_module_coming()'s error path
2019-09-23 12:21:03 -07:00
Linus Torvalds
e070355664 Modules updates for v5.4
Summary of modules changes for the 5.4 merge window:
 
 - Introduce exported symbol namespaces.
 
   This new feature allows subsystem maintainers to partition and
   categorize their exported symbols into explicit namespaces. Module
   authors are now required to import the namespaces they need.
 
   Some of the main motivations of this feature include: allowing kernel
   developers to better manage the export surface, allow subsystem
   maintainers to explicitly state that usage of some exported symbols
   should only be limited to certain users (think: inter-module or
   inter-driver symbols, debugging symbols, etc), as well as more easily
   limiting the availability of namespaced symbols to other parts of the
   kernel. With the module import requirement, it is also easier to spot
   the misuse of exported symbols during patch review. Two new macros are
   introduced: EXPORT_SYMBOL_NS() and EXPORT_SYMBOL_NS_GPL(). The API is
   thoroughly documented in Documentation/kbuild/namespaces.rst.
 
 - Some small code and kbuild cleanups here and there.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJdh3n8AAoJEMBFfjjOO8Fy94kP+QHZF39QDvLbxAzEYAETAS+o
 CFu6wix/DrAwFkTU/kX1eAsAwDBEz0xkMciR4BsLX3sIafUVERxtDXVAui/dA1+6
 zfw2c3ObyVwPEk6aUPFprgkj+08gxujsJFlYTsQQUhtRbmxg6R7hD6t6ANxiHaY2
 AQe5TzOWXoIa2hHO+7rPMqf8l6qiFCaL0s3v5jrmBXa5mHmc4PVy95h1J6xQVw2u
 b+SlvKeylHv+OtCtvthkAJS3hfS35J/1TNb/RNYIvh60IfEguEuFsGuQ9JiSSAZS
 pv1cJ+I5d4v8Y/md1rZpdjTJL9gCrq/UUC67+UkejCOn0C+7XM2eR4Bu/jWvdMSn
 ZQDHcPhFSIfmP7FaKomPogaBbw1sI1FvM5930pPJzHnyO9+cefBXe7rWaaB+y0At
 GAxOtmk1dKh01BT7YO/C0oVuX87csWd74NHypVsbs0TgQo5jBFdZRheyDrq5YB+s
 tVK+5H0nqQrCcfo/TvhcsZlgITTGtgTPenaW99/i7qNa9mRUtxC/VkE+aob6HNRF
 1iBxxopOTxGN8akyKOVumtkuTQH3EJfouZee//pWbXLzyDmScg/k67vuao8kxbyq
 NA1piFAGJAHFsHATxrbvNOq6jZ5bfUT8pwSTs83JppuR++8Hxk7zaShS3/LvsvHt
 6ist/epOwTZ7oiNQ04nj
 =72Uy
 -----END PGP SIGNATURE-----

Merge tag 'modules-for-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux

Pull modules updates from Jessica Yu:
 "The main bulk of this pull request introduces a new exported symbol
  namespaces feature. The number of exported symbols is increasingly
  growing with each release (we're at about 31k exports as of 5.3-rc7)
  and we currently have no way of visualizing how these symbols are
  "clustered" or making sense of this huge export surface.

  Namespacing exported symbols allows kernel developers to more
  explicitly partition and categorize exported symbols, as well as more
  easily limiting the availability of namespaced symbols to other parts
  of the kernel. For starters, we have introduced the USB_STORAGE
  namespace to demonstrate the API's usage. I have briefly summarized
  the feature and its main motivations in the tag below.

  Summary:

   - Introduce exported symbol namespaces.

     This new feature allows subsystem maintainers to partition and
     categorize their exported symbols into explicit namespaces. Module
     authors are now required to import the namespaces they need.

     Some of the main motivations of this feature include: allowing
     kernel developers to better manage the export surface, allow
     subsystem maintainers to explicitly state that usage of some
     exported symbols should only be limited to certain users (think:
     inter-module or inter-driver symbols, debugging symbols, etc), as
     well as more easily limiting the availability of namespaced symbols
     to other parts of the kernel.

     With the module import requirement, it is also easier to spot the
     misuse of exported symbols during patch review.

     Two new macros are introduced: EXPORT_SYMBOL_NS() and
     EXPORT_SYMBOL_NS_GPL(). The API is thoroughly documented in
     Documentation/kbuild/namespaces.rst.

   - Some small code and kbuild cleanups here and there"

* tag 'modules-for-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
  module: Remove leftover '#undef' from export header
  module: remove unneeded casts in cmp_name()
  module: move CONFIG_UNUSED_SYMBOLS to the sub-menu of MODULES
  module: remove redundant 'depends on MODULES'
  module: Fix link failure due to invalid relocation on namespace offset
  usb-storage: export symbols in USB_STORAGE namespace
  usb-storage: remove single-use define for debugging
  docs: Add documentation for Symbol Namespaces
  scripts: Coccinelle script for namespace dependencies.
  modpost: add support for generating namespace dependencies
  export: allow definition default namespaces in Makefiles or sources
  module: add config option MODULE_ALLOW_MISSING_NAMESPACE_IMPORTS
  modpost: add support for symbol namespaces
  module: add support for symbol namespaces.
  export: explicitly align struct kernel_symbol
  module: support reading multiple values per modinfo tag
2019-09-22 10:34:46 -07:00
Linus Torvalds
9dca3432ee This pull request contains the following changes for UML:
- virtio support
 - Fixes for our new time travel mode
 - Various improvements to make lockdep and kasan work better
 - SPDX header updates
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAl2Fx9kWHHJpY2hhcmRA
 c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wa2kD/9UJ5JOe6yBeMfPO5Vv8vpJRc10
 0gS8qDbzfutrWddq1wUvEaNCIQY4NOf4tsqjauYHpTUA/0AWwruz++iyI9u3XWEQ
 0b+ZMhKXkws3UgPwWIxrgLr0106wz6Xuz6d36nqpAc6F4MJhC3LqUCC9yEp3hxMd
 pSF65ueQXp7NKfOAqqKU1m3FnfmyBTpsL5PpA6OEZn//kt/Qz5PhIjHpC3JwIBQb
 z0OUhE/6mmWb66wtqHIx4Zd2ybLLnsfby24q+1e8J2B+gcORxhubvgCIGY+PU98o
 EW3N4aMevUdgG9MJbnlZUgWeZ1bsByail2z8aFElRKefT2xkEnjxfQZgKahI6LnO
 jzLm9pk3RjTiZxvYkEbgRAjBkZD514M6FvOlyrHtLxMDfWE6/z71VKDqFjEyeIHQ
 QpDjwEjdJTxVHr4Ol+VnZe1lE5zXLNuCFT5qdPQBqyr8g151T7jwYXnGK2SqGo2D
 UQ6/KnaN+pgM7BaqcNtwciKk3Xjng0BDLfdZs7z8F3bzv53rg2mpQt5iPm+nWFPa
 aNt4B3FKXv3+YnjuSbi5NlvKKK9alRcvZTOk8jFjwOVmFJXlvMCzegZnuTxtqU+j
 XpwmUlsT6aMV7vPZN2ta7y1bjOijzZIjL0O7rP4Obxwfp3dTGGYX/T6vW8F2o9V6
 evyx/KSD6nqlY1bvwQ==
 =oxpp
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml

Pull UML updates from Richard Weinberger:

 - virtio support

 - fixes for our new time travel mode

 - various improvements to make lockdep and kasan work better

 - SPDX header updates

* tag 'for-linus-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml: (25 commits)
  um: irq: Fix LAST_IRQ usage in init_IRQ()
  um: Add SPDX headers for files in arch/um/include
  um: Add SPDX headers for files in arch/um/os-Linux
  um: Add SPDX headers to files in arch/um/kernel/
  um: Add SPDX headers for files in arch/um/drivers
  um: virtio: Implement VHOST_USER_PROTOCOL_F_REPLY_ACK
  um: virtio: Implement VHOST_USER_PROTOCOL_F_SLAVE_REQ
  um: drivers: Add virtio vhost-user driver
  um: Use real DMA barriers
  um: Don't use generic barrier.h
  um: time-travel: Restrict time update in IRQ handler
  um: time-travel: Fix periodic timers
  um: Enable CONFIG_CONSTRUCTORS
  um: Place (soft)irq text with macros
  um: Fix VDSO compiler warning
  um: Implement TRACE_IRQFLAGS_SUPPORT
  um: Remove misleading #define ARCh_IRQ_ENABLED
  um: Avoid using uninitialized regs
  um: Remove sig_info[SIGALRM]
  um: Error handling fixes in vector drivers
  ...
2019-09-21 11:07:02 -07:00
Linus Torvalds
84da111de0 hmm related patches for 5.4
This is more cleanup and consolidation of the hmm APIs and the very
 strongly related mmu_notifier interfaces. Many places across the tree
 using these interfaces are touched in the process. Beyond that a cleanup
 to the page walker API and a few memremap related changes round out the
 series:
 
 - General improvement of hmm_range_fault() and related APIs, more
   documentation, bug fixes from testing, API simplification &
   consolidation, and unused API removal
 
 - Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE, and
   make them internal kconfig selects
 
 - Hoist a lot of code related to mmu notifier attachment out of drivers by
   using a refcount get/put attachment idiom and remove the convoluted
   mmu_notifier_unregister_no_release() and related APIs.
 
 - General API improvement for the migrate_vma API and revision of its only
   user in nouveau
 
 - Annotate mmu_notifiers with lockdep and sleeping region debugging
 
 Two series unrelated to HMM or mmu_notifiers came along due to
 dependencies:
 
 - Allow pagemap's memremap_pages family of APIs to work without providing
   a struct device
 
 - Make walk_page_range() and related use a constant structure for function
   pointers
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl1/nnkACgkQOG33FX4g
 mxqaRg//c6FqowV1pQlLutvAOAgMdpzfZ9eaaDKngy9RVQxz+k/MmJrdRH/p/mMA
 Pq93A1XfwtraGKErHegFXGEDk4XhOustVAVFwvjyXO41dTUdoFVUkti6ftbrl/rS
 6CT+X90jlvrwdRY7QBeuo7lxx7z8Qkqbk1O1kc1IOracjKfNJS+y6LTamy6weM3g
 tIMHI65PkxpRzN36DV9uCN5dMwFzJ73DWHp1b0acnDIigkl6u5zp6orAJVWRjyQX
 nmEd3/IOvdxaubAoAvboNS5CyVb4yS9xshWWMbH6AulKJv3Glca1Aa7QuSpBoN8v
 wy4c9+umzqRgzgUJUe1xwN9P49oBNhJpgBSu8MUlgBA4IOc3rDl/Tw0b5KCFVfkH
 yHkp8n6MP8VsRrzXTC6Kx0vdjIkAO8SUeylVJczAcVSyHIo6/JUJCVDeFLSTVymh
 EGWJ7zX2iRhUbssJ6/izQTTQyCH3YIyZ5QtqByWuX2U7ZrfkqS3/EnBW1Q+j+gPF
 Z2yW8iT6k0iENw6s8psE9czexuywa/Lttz94IyNlOQ8rJTiQqB9wLaAvg9hvUk7a
 kuspL+JGIZkrL3ouCeO/VA6xnaP+Q7nR8geWBRb8zKGHmtWrb5Gwmt6t+vTnCC2l
 olIDebrnnxwfBQhEJ5219W+M1pBpjiTpqK/UdBd92A4+sOOhOD0=
 =FRGg
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull hmm updates from Jason Gunthorpe:
 "This is more cleanup and consolidation of the hmm APIs and the very
  strongly related mmu_notifier interfaces. Many places across the tree
  using these interfaces are touched in the process. Beyond that a
  cleanup to the page walker API and a few memremap related changes
  round out the series:

   - General improvement of hmm_range_fault() and related APIs, more
     documentation, bug fixes from testing, API simplification &
     consolidation, and unused API removal

   - Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE,
     and make them internal kconfig selects

   - Hoist a lot of code related to mmu notifier attachment out of
     drivers by using a refcount get/put attachment idiom and remove the
     convoluted mmu_notifier_unregister_no_release() and related APIs.

   - General API improvement for the migrate_vma API and revision of its
     only user in nouveau

   - Annotate mmu_notifiers with lockdep and sleeping region debugging

  Two series unrelated to HMM or mmu_notifiers came along due to
  dependencies:

   - Allow pagemap's memremap_pages family of APIs to work without
     providing a struct device

   - Make walk_page_range() and related use a constant structure for
     function pointers"

* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (75 commits)
  libnvdimm: Enable unit test infrastructure compile checks
  mm, notifier: Catch sleeping/blocking for !blockable
  kernel.h: Add non_block_start/end()
  drm/radeon: guard against calling an unpaired radeon_mn_unregister()
  csky: add missing brackets in a macro for tlb.h
  pagewalk: use lockdep_assert_held for locking validation
  pagewalk: separate function pointers from iterator data
  mm: split out a new pagewalk.h header from mm.h
  mm/mmu_notifiers: annotate with might_sleep()
  mm/mmu_notifiers: prime lockdep
  mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end
  mm/mmu_notifiers: remove the __mmu_notifier_invalidate_range_start/end exports
  mm/hmm: hmm_range_fault() infinite loop
  mm/hmm: hmm_range_fault() NULL pointer bug
  mm/hmm: fix hmm_range_fault()'s handling of swapped out pages
  mm/mmu_notifiers: remove unregister_no_release
  RDMA/odp: remove ib_ucontext from ib_umem
  RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm'
  RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
  RDMA/mlx5: Use ib_umem_start instead of umem.address
  ...
2019-09-21 10:07:42 -07:00
Linus Torvalds
56c1e83434 Printk changes for 5.4
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAl1/akAACgkQUqAMR0iA
 lPLK+Q//fDKalQodEQU2qwP2eLaVwCn5lKu0ab29JwQgA21i5f1AUibPBLvQoHYX
 CyclUSzylz0AO37V1nSWwmDs80TY0nhbgQUURglKQQpkOyZiU/7FtQnVLm2qUUgp
 Jd6hjbFEGQUdfaY21MXzCY5XgYWIPQ7I0exFBd17cbj4ad9mjs28SSLwtK+ogSxL
 S9pfZdB9DWJBeA+d0N9cA6Q4V+u2w4JPcaoYcsPcxbyf3oxxQJZF6FCkaJjXRG+d
 NDzRKfB5+vBJqLFs8/wjiokuGl7YSJwBEe+q02JrSMEBfJb/9+9Uh8NUy6yIFCdL
 DmZaK6zasb9IW6vGmpfVO2fHoSwXYF1Le8YtEzD37+HZRYFnv6+AvpsOgr7hGPpR
 7Ci6U6yEckMxwtKxiPyf+isUV9R8fPWpWnCdz/jVIVTlSP/o3tA8RG8S+cOd66pK
 /TF2DZoMXdwcpB+MgCeCusM/f8jNkAYQPh74SHZQGcY8gdxu/hkv0URU1IVEkdku
 o/+Y2HvgPh2Pd3fqOQU1QKSUCQ0OcVyAt4fg59YpOQTDr6TMZ6xy/TEuFRzzAe8e
 LNLxsGrN1Kqn3JGTOAZgPnbFc5N1eP9/GoGh8+nQaNJe202QjbTUIKEOioOXu5VR
 aFGPvD5aEdlXj+Ms/Igv8qc5/ogXAg4afkqC8Hkk8XaUXHqt3ZI=
 =Ui4A
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk

Pull printk updates from Petr Mladek:

 - Fix off-by-one error when calculating messages that might fit into
   kmsg buffer. It causes occasional omitting of the last message.

 - Add missing pointer check in %pD format modifier handling.

 - Some clean up

* tag 'printk-for-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
  ABI: Update dev-kmsg documentation to match current kernel behaviour
  printk: Replace strncmp() with str_has_prefix()
  lib/test_printf: Remove obvious comments from %pd and %pD tests
  lib/test_printf: Add test of null/invalid pointer dereference for dentry
  vsprintf: Prevent crash when dereferencing invalid pointers for %pD
  printk: Do not lose last line in kmsg buffer dump
2019-09-21 09:34:29 -07:00
Tejun Heo
30ae2fc0a7 workqueue: Minor follow-ups to the rescuer destruction change
* Now that wq->rescuer may be cleared while rescuer is still there,
  switch show_pwq() debug printout to test worker->rescue_wq to
  identify rescuers intead of testing wq->rescuer.

* Update comment on ->rescuer locking.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com>
2019-09-20 14:09:14 -07:00
Tejun Heo
8efe1223d7 workqueue: Fix missing kfree(rescuer) in destroy_workqueue()
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Qian Cai <cai@lca.pw>
Fixes: def98c84b6 ("workqueue: Fix spurious sanity check failures in destroy_workqueue()")
2019-09-20 13:39:57 -07:00
Roy Ben Shlomo
9f014e3a66 perf/core: Fix several typos in comments
Fix typos in a few functions' documentation comments.

Signed-off-by: Roy Ben Shlomo <royb@sentinelone.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: royb@sentinelone.com
Link: http://lore.kernel.org/lkml/20190920171254.31373-1-royb@sentinelone.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-09-20 16:05:20 -03:00
Linus Torvalds
45824fc0da powerpc updates for 5.4
- Initial support for running on a system with an Ultravisor, which is software
    that runs below the hypervisor and protects guests against some attacks by
    the hypervisor.
 
  - Support for building the kernel to run as a "Secure Virtual Machine", ie. as
    a guest capable of running on a system with an Ultravisor.
 
  - Some changes to our DMA code on bare metal, to allow devices with medium
    sized DMA masks (> 32 && < 59 bits) to use more than 2GB of DMA space.
 
  - Support for firmware assisted crash dumps on bare metal (powernv).
 
  - Two series fixing bugs in and refactoring our PCI EEH code.
 
  - A large series refactoring our exception entry code to use gas macros, both
    to make it more readable and also enable some future optimisations.
 
 As well as many cleanups and other minor features & fixups.
 
 Thanks to:
   Adam Zerella, Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan, Aneesh
   Kumar K.V, Anju T Sudhakar, Anshuman Khandual, Balbir Singh, Benjamin
   Herrenschmidt, Cédric Le Goater, Christophe JAILLET, Christophe Leroy,
   Christopher M. Riedl, Christoph Hellwig, Claudio Carvalho, Daniel Axtens,
   David Gibson, David Hildenbrand, Desnes A. Nunes do Rosario, Ganesh Goudar,
   Gautham R. Shenoy, Greg Kurz, Guerney Hunt, Gustavo Romero, Halil Pasic, Hari
   Bathini, Joakim Tjernlund, Jonathan Neuschafer, Jordan Niethe, Leonardo Bras,
   Lianbo Jiang, Madhavan Srinivasan, Mahesh Salgaonkar, Mahesh Salgaonkar,
   Masahiro Yamada, Maxiwell S. Garcia, Michael Anderson, Nathan Chancellor,
   Nathan Lynch, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Qian Cai, Ram
   Pai, Ravi Bangoria, Reza Arbab, Ryan Grimm, Sam Bobroff, Santosh Sivaraj,
   Segher Boessenkool, Sukadev Bhattiprolu, Thiago Bauermann, Thiago Jung
   Bauermann, Thomas Gleixner, Tom Lendacky, Vasant Hegde.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl2EtEcTHG1wZUBlbGxl
 cm1hbi5pZC5hdQAKCRBR6+o8yOGlgPfsD/9uXyBXn3anI/H08+mk74k5gCsmMQpn
 D442CD/ByogZcccp23yBTlhawtCE03hcHnCLygn0Xgd8a4YvHts/RGHUe3fPHqlG
 bEyZ7jsLVz5ebNZQP7r4eGs2pSzCajwJy2N9HJ/C1ojf15rrfRxoVJtnyhE2wXpm
 DL+6o2K+nUCB3gTQ1Inr3DnWzoGOOUfNTOea2u+J+yfHwGRqOBYpevwqiwy5eelK
 aRjUJCqMTvrzra49MeFwjo0Nt3/Y8UNcwA+JlGdeR8bRuWhFrYmyBRiZEKPaujNO
 5EAfghBBlB0KQCqvF/tRM/c0OftHqK59AMobP9T7u9oOaBXeF/FpZX/iXjzNDPsN
 j9Oo2tKLTu/YVEXqBFuREGP+znANr1Wo4CFyOG8SbvYz0HFjR6XbtRJsS+0e8GWl
 kqX5/ZhYz3lBnKSNe9jgWOrh/J0KCSFigBTEWJT3xsn4YE8x8kK2l9KPqAIldWEP
 sKb2UjGS7v0NKq+NvShH88Q9AeQUEIjTcg/9aDDQDe6FaRQ7KiF8bUxSdwSPi+Fn
 j0lnF6i+1ATWZKuCr85veVi7C5qoe/+MqalnmP7MxULyzgXLLxUgN0SzEYO6QofK
 LQK/VaH2XVr5+M5YAb7K4/NX5gbM3s1bKrCiUy4EyHNvgG7gricYdbz6HgAjKpR7
 oP0rHfgmVYvF1g==
 =WlW+
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "This is a bit late, partly due to me travelling, and partly due to a
  power outage knocking out some of my test systems *while* I was
  travelling.

   - Initial support for running on a system with an Ultravisor, which
     is software that runs below the hypervisor and protects guests
     against some attacks by the hypervisor.

   - Support for building the kernel to run as a "Secure Virtual
     Machine", ie. as a guest capable of running on a system with an
     Ultravisor.

   - Some changes to our DMA code on bare metal, to allow devices with
     medium sized DMA masks (> 32 && < 59 bits) to use more than 2GB of
     DMA space.

   - Support for firmware assisted crash dumps on bare metal (powernv).

   - Two series fixing bugs in and refactoring our PCI EEH code.

   - A large series refactoring our exception entry code to use gas
     macros, both to make it more readable and also enable some future
     optimisations.

  As well as many cleanups and other minor features & fixups.

  Thanks to: Adam Zerella, Alexey Kardashevskiy, Alistair Popple, Andrew
  Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Anshuman Khandual,
  Balbir Singh, Benjamin Herrenschmidt, Cédric Le Goater, Christophe
  JAILLET, Christophe Leroy, Christopher M. Riedl, Christoph Hellwig,
  Claudio Carvalho, Daniel Axtens, David Gibson, David Hildenbrand,
  Desnes A. Nunes do Rosario, Ganesh Goudar, Gautham R. Shenoy, Greg
  Kurz, Guerney Hunt, Gustavo Romero, Halil Pasic, Hari Bathini, Joakim
  Tjernlund, Jonathan Neuschafer, Jordan Niethe, Leonardo Bras, Lianbo
  Jiang, Madhavan Srinivasan, Mahesh Salgaonkar, Mahesh Salgaonkar,
  Masahiro Yamada, Maxiwell S. Garcia, Michael Anderson, Nathan
  Chancellor, Nathan Lynch, Naveen N. Rao, Nicholas Piggin, Oliver
  O'Halloran, Qian Cai, Ram Pai, Ravi Bangoria, Reza Arbab, Ryan Grimm,
  Sam Bobroff, Santosh Sivaraj, Segher Boessenkool, Sukadev Bhattiprolu,
  Thiago Bauermann, Thiago Jung Bauermann, Thomas Gleixner, Tom
  Lendacky, Vasant Hegde"

* tag 'powerpc-5.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (264 commits)
  powerpc/mm/mce: Keep irqs disabled during lockless page table walk
  powerpc: Use ftrace_graph_ret_addr() when unwinding
  powerpc/ftrace: Enable HAVE_FUNCTION_GRAPH_RET_ADDR_PTR
  ftrace: Look up the address of return_to_handler() using helpers
  powerpc: dump kernel log before carrying out fadump or kdump
  docs: powerpc: Add missing documentation reference
  powerpc/xmon: Fix output of XIVE IPI
  powerpc/xmon: Improve output of XIVE interrupts
  powerpc/mm/radix: remove useless kernel messages
  powerpc/fadump: support holes in kernel boot memory area
  powerpc/fadump: remove RMA_START and RMA_END macros
  powerpc/fadump: update documentation about option to release opalcore
  powerpc/fadump: consider f/w load area
  powerpc/opalcore: provide an option to invalidate /sys/firmware/opal/core file
  powerpc/opalcore: export /sys/firmware/opal/core for analysing opal crashes
  powerpc/fadump: update documentation about CONFIG_PRESERVE_FA_DUMP
  powerpc/fadump: add support to preserve crash data on FADUMP disabled kernel
  powerpc/fadump: improve how crashed kernel's memory is reserved
  powerpc/fadump: consider reserved ranges while releasing memory
  powerpc/fadump: make crash memory ranges array allocation generic
  ...
2019-09-20 11:48:06 -07:00
Linus Torvalds
45979a956b Tracing updates:
- Addition of multiprobes to kprobe and uprobe events
    Allows for more than one probe attached to the same location
 
  - Addition of adding immediates to probe parameters
 
  - Clean up of the recordmcount.c code. This brings us closer
    to merging recordmcount into objtool, and reuse code.
 
  - Other small clean ups
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXYQoqhQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qlIxAP9VVABbpuvOYqxKuFgyP62ituSXPLkL
 gZv4I5Zse4b6/gD/eksFXY/OHo7jp6aQiHvxotUkAiFFE9iHzi0JscdMJgo=
 =WqrT
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:

 - Addition of multiprobes to kprobe and uprobe events (allows for more
   than one probe attached to the same location)

 - Addition of adding immediates to probe parameters

 - Clean up of the recordmcount.c code. This brings us closer to merging
   recordmcount into objtool, and reuse code.

 - Other small clean ups

* tag 'trace-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (33 commits)
  selftests/ftrace: Update kprobe event error testcase
  tracing/probe: Reject exactly same probe event
  tracing/probe: Fix to allow user to enable events on unloaded modules
  selftests/ftrace: Select an existing function in kprobe_eventname test
  tracing/kprobe: Fix NULL pointer access in trace_porbe_unlink()
  tracing: Make sure variable reference alias has correct var_ref_idx
  tracing: Be more clever when dumping hex in __print_hex()
  ftrace: Simplify ftrace hash lookup code in clear_func_from_hash()
  tracing: Add "gfp_t" support in synthetic_events
  tracing: Rename tracing_reset() to tracing_reset_cpu()
  tracing: Document the stack trace algorithm in the comments
  tracing/arm64: Have max stack tracer handle the case of return address after data
  recordmcount: Clarify what cleanup() does
  recordmcount: Remove redundant cleanup() calls
  recordmcount: Kernel style formatting
  recordmcount: Kernel style function signature formatting
  recordmcount: Rewrite error/success handling
  selftests/ftrace: Add syntax error test for multiprobe
  selftests/ftrace: Add syntax error test for immediates
  selftests/ftrace: Add a testcase for kprobe multiprobe event
  ...
2019-09-20 11:19:48 -07:00
Linus Torvalds
3207598ab0 kgdb patches for 5.4-rc1
It has been a quiet dev cycle for kgdb. There has been some good stuff
 for kdb on the mailing list but unfortunately the patches caused a
 couple of problems with the kdb pager so I had to drop those and they
 will have to wait for next time!
 
 That just leaves us with just a couple of very tiny clean ups for now:
 
  * Fix a broken comment
  * Use str_has_prefix() for the grep "pipe" in kdb
 
 Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEELzVBU1D3lWq6cKzwfOMlXTn3iKEFAl2Do/AACgkQfOMlXTn3
 iKF8Lw//cpxBlzZlBpkWDS7Uiuvsrzy8M7vZndYNsrnY89kjYOm0JynIT+8CZFXk
 tLKWyL3OZxUTWUvouTSeqyJAbLJYOB3wBPCMy7aVg7UAaoeEvRzN/xipM21sa4Hm
 1eVP51EMuNTaLzn4jgZV5Ad+ruSnj8Ua3TxK5qc2wYIuJvULnVzsSQxKcQopn2tk
 LqIEwo0wvPVkb865z3D0KAOxqkftJHa4RdwNAyDy6xCQH/RlHR06NshKEN1g3j0I
 0yKd/LBW7Z5yBMFpf1EOv3euF0URu9cmDP4o+n5Tyx71UbZ++NqiPFAAV2zWZq42
 3sCjgLgPm+JIT4aMFEFlJn05xAwJEs3bpglGf1BTN2lFyIZZKD+lBpaAIA7AJtfi
 oc/pJtmy9qAaACVe+l6s4q9YDZjcz9YZ1KZJuAUaVSc/L7AaZRC64CmzoLealZ1x
 bKHvJ0FEKbwVC/JSF7ILKsb0qzyiQFAjuNfnqAX0KKrxOC42IukQy90eDC5JeWhd
 yYAaUjtjeIKVn6wZYyki8dwmK2LPlAbathAQxNDYdYLyteTOeTyuZS7X8V0aLMDN
 VfbOylp9SasZX80vHjhI+gZv523ljDj/Ec0o1kHaswpAELU+oVbdp91vuSvi4lPn
 MmrNS2o3QZMlGMvLupSqmp+nQHZFmhoTWbxa84gYptPxwk2vipI=
 =oIVA
 -----END PGP SIGNATURE-----

Merge tag 'kgdb-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux

Pull kgdb updates from Daniel Thompson:
 "It has been a quiet dev cycle for kgdb. There has been some good stuff
  for kdb on the mailing list but unfortunately the patches caused a
  couple of problems with the kdb pager so I had to drop those and they
  will have to wait for next time!

  That just leaves us with just a couple of very tiny clean ups for now:

   - Fix a broken comment

   - Use str_has_prefix() for the grep "pipe" in kdb"

* tag 'kgdb-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
  kgdb: fix comment regarding static function
  kdb: Replace strncmp with str_has_prefix
2019-09-20 10:31:19 -07:00
Linus Torvalds
d7b0827f28 Kbuild updates for v5.4
- add modpost warn exported symbols marked as 'static' because 'static'
    and EXPORT_SYMBOL is an odd combination
 
  - break the build early if gold linker is used
 
  - optimize the Bison rule to produce .c and .h files by a single
    pattern rule
 
  - handle PREEMPT_RT in the module vermagic and UTS_VERSION
 
  - warn CONFIG options leaked to the user-space except existing ones
 
  - make single targets work properly
 
  - rebuild modules when module linker scripts are updated
 
  - split the module final link stage into scripts/Makefile.modfinal
 
  - fix the missed error code in merge_config.sh
 
  - improve the error message displayed on the attempt of the O= build
    in unclean source tree
 
  - remove 'clean-dirs' syntax
 
  - disable -Wimplicit-fallthrough warning for Clang
 
  - add CONFIG_CC_OPTIMIZE_FOR_SIZE_O3 for ARC
 
  - remove ARCH_{CPP,A,C}FLAGS variables
 
  - add $(BASH) to run bash scripts
 
  - change *CFLAGS_<basetarget>.o to take the relative path to $(obj)
    instead of the basename
 
  - stop suppressing Clang's -Wunused-function warnings when W=1
 
  - fix linux/export.h to avoid genksyms calculating CRC of trimmed
    exported symbols
 
  - misc cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQJSBAABCgA8FiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl1+OnoeHHlhbWFkYS5t
 YXNhaGlyb0Bzb2Npb25leHQuY29tAAoJED2LAQed4NsGoKEQAKcid9lDacMe5KWT
 4Ic93hANMFKZ9Qy8WoxivnOr1a93NcloZ0Bhka96QUt7hYUkLmDCs99eMbxKuMfP
 m/ViHepojOBPzq+VtAGWOiIyPMCA7XDrTPph4wcPDKeOURTreK1PZ20fxDoAR4to
 +qaqKZJGdRcNf2DpJN1yIosz8Wj0Sa2LQrRi9jgUHi3bzgvLfL7P9WM2xyZMggAc
 GaSktCEFL0UzMFlMpYyDrKh2EV6ryOnN8+bVAKbmWP89tuU3njutycKdWOoL+bsj
 tH2kjFThxQyIcZGNHS1VzNunYAFE2q5nj2q47O1EDN6sjTYUoRn5cHwPam6x3Kly
 NH88xDEtJ7sUUc9GZEIXADWWD0f08QIhAH5x+jxFg3529lNgyrNHRSQ2XceYNAnG
 i/GnMJ0EhODOFKusXw7sNlWFKtukep+8/pwnvfTXWQu6plEm5EQ3a3RL5SESubVo
 mHzXsQDFCE0x/UrsJxEAww+3YO3pQEelfVi74W9z0cckpbRF8FuUq/69ltOT15l4
 X+gCz80lXMWBKw/kNoR4GQoAJo3KboMEociawwoj72HXEHTPLJnCdUOsAf3n+opj
 xuz/UPZ4WYSgKdnbmmDbJ+1POA1NqtARZZXpMVyKVVCOiLafbJkLQYwLKEpE2mOO
 TP9igzP1i3/jPWec8cJ6Fa8UwuGh
 =VGqV
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild updates from Masahiro Yamada:

 - add modpost warn exported symbols marked as 'static' because 'static'
   and EXPORT_SYMBOL is an odd combination

 - break the build early if gold linker is used

 - optimize the Bison rule to produce .c and .h files by a single
   pattern rule

 - handle PREEMPT_RT in the module vermagic and UTS_VERSION

 - warn CONFIG options leaked to the user-space except existing ones

 - make single targets work properly

 - rebuild modules when module linker scripts are updated

 - split the module final link stage into scripts/Makefile.modfinal

 - fix the missed error code in merge_config.sh

 - improve the error message displayed on the attempt of the O= build in
   unclean source tree

 - remove 'clean-dirs' syntax

 - disable -Wimplicit-fallthrough warning for Clang

 - add CONFIG_CC_OPTIMIZE_FOR_SIZE_O3 for ARC

 - remove ARCH_{CPP,A,C}FLAGS variables

 - add $(BASH) to run bash scripts

 - change *CFLAGS_<basetarget>.o to take the relative path to $(obj)
   instead of the basename

 - stop suppressing Clang's -Wunused-function warnings when W=1

 - fix linux/export.h to avoid genksyms calculating CRC of trimmed
   exported symbols

 - misc cleanups

* tag 'kbuild-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (63 commits)
  genksyms: convert to SPDX License Identifier for lex.l and parse.y
  modpost: use __section in the output to *.mod.c
  modpost: use MODULE_INFO() for __module_depends
  export.h, genksyms: do not make genksyms calculate CRC of trimmed symbols
  export.h: remove defined(__KERNEL__), which is no longer needed
  kbuild: allow Clang to find unused static inline functions for W=1 build
  kbuild: rename KBUILD_ENABLE_EXTRA_GCC_CHECKS to KBUILD_EXTRA_WARN
  kbuild: refactor scripts/Makefile.extrawarn
  merge_config.sh: ignore unwanted grep errors
  kbuild: change *FLAGS_<basetarget>.o to take the path relative to $(obj)
  modpost: add NOFAIL to strndup
  modpost: add guid_t type definition
  kbuild: add $(BASH) to run scripts with bash-extension
  kbuild: remove ARCH_{CPP,A,C}FLAGS
  kbuild,arc: add CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE_O3 for ARC
  kbuild: Do not enable -Wimplicit-fallthrough for clang for now
  kbuild: clean up subdir-ymn calculation in Makefile.clean
  kbuild: remove unneeded '+' marker from cmd_clean
  kbuild: remove clean-dirs syntax
  kbuild: check clean srctree even earlier
  ...
2019-09-20 08:36:47 -07:00
Linus Torvalds
671df18953 dma-mapping updates for 5.4:
- add dma-mapping and block layer helpers to take care of IOMMU
    merging for mmc plus subsequent fixups (Yoshihiro Shimoda)
  - rework handling of the pgprot bits for remapping (me)
  - take care of the dma direct infrastructure for swiotlb-xen (me)
  - improve the dma noncoherent remapping infrastructure (me)
  - better defaults for ->mmap, ->get_sgtable and ->get_required_mask (me)
  - cleanup mmaping of coherent DMA allocations (me)
  - various misc cleanups (Andy Shevchenko, me)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl2CSucLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYPfrhAAgXZA/EdFPvkkCoDrmgtf3XkudX9gajeCd9g4NZy6
 ZBQElTVvm4S0sQj7IXgALnMumDMbbTibW5SQLX5GwQDe+XXBpZ8ajpAnJAXc8a5T
 qaFQ4SInr4CgBZf9nZKDkbSBZ1Tu3AQm1c0QI8riRCkrVTuX4L06xpCef4Yh4mgO
 rwWEjIioYpQiKZMmu98riXh3ZNfFG3mVJRhKt8B6XJbBgnUnjDOPYGgaUwp6CU20
 tFBKL2GaaV0vdLJ5wYhIGXT4DJ8tp9T5n3IYGZv1Ux889RaZEHlCrMxzelYeDbCT
 KhZbhcSECGnddsh73t/UX7/KhytuqnfKa9n+Xo6AWuA47xO4c36quOOcTk9M0vE5
 TfGDmewgL6WIv4lzokpRn5EkfDhyL33j8eYJrJ8e0ldcOhSQIFk4ciXnf2stWi6O
 JrlzzzSid+zXxu48iTfoPdnMr7psTpiMvvRvKfEeMp2FX9Fg6EdMzJYLTEl+COHB
 0WwNacZmY3P01+b5EZXEgqKEZevIIdmPKbyM9rPtTjz8BjBwkABHTpN3fWbVBf7/
 Ax6OPYyW40xp1fnJuzn89m3pdOxn88FpDdOaeLz892Zd+Qpnro1ayulnFspVtqGM
 mGbzA9whILvXNRpWBSQrvr2IjqMRjbBxX3BVACl3MMpOChgkpp5iANNfSDjCftSF
 Zu8=
 =/wGv
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.4' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping updates from Christoph Hellwig:

 - add dma-mapping and block layer helpers to take care of IOMMU merging
   for mmc plus subsequent fixups (Yoshihiro Shimoda)

 - rework handling of the pgprot bits for remapping (me)

 - take care of the dma direct infrastructure for swiotlb-xen (me)

 - improve the dma noncoherent remapping infrastructure (me)

 - better defaults for ->mmap, ->get_sgtable and ->get_required_mask
   (me)

 - cleanup mmaping of coherent DMA allocations (me)

 - various misc cleanups (Andy Shevchenko, me)

* tag 'dma-mapping-5.4' of git://git.infradead.org/users/hch/dma-mapping: (41 commits)
  mmc: renesas_sdhi_internal_dmac: Add MMC_CAP2_MERGE_CAPABLE
  mmc: queue: Fix bigger segments usage
  arm64: use asm-generic/dma-mapping.h
  swiotlb-xen: merge xen_unmap_single into xen_swiotlb_unmap_page
  swiotlb-xen: simplify cache maintainance
  swiotlb-xen: use the same foreign page check everywhere
  swiotlb-xen: remove xen_swiotlb_dma_mmap and xen_swiotlb_dma_get_sgtable
  xen: remove the exports for xen_{create,destroy}_contiguous_region
  xen/arm: remove xen_dma_ops
  xen/arm: simplify dma_cache_maint
  xen/arm: use dev_is_dma_coherent
  xen/arm: consolidate page-coherent.h
  xen/arm: use dma-noncoherent.h calls for xen-swiotlb cache maintainance
  arm: remove wrappers for the generic dma remap helpers
  dma-mapping: introduce a dma_common_find_pages helper
  dma-mapping: always use VM_DMA_COHERENT for generic DMA remap
  vmalloc: lift the arm flag for coherent mappings to common code
  dma-mapping: provide a better default ->get_required_mask
  dma-mapping: remove the dma_declare_coherent_memory export
  remoteproc: don't allow modular build
  ...
2019-09-19 13:27:23 -07:00
Li RongQing
e430d802d6 timer: Read jiffies once when forwarding base clk
The timer delayed for more than 3 seconds warning was triggered during
testing.

  Workqueue: events_unbound sched_tick_remote
  RIP: 0010:sched_tick_remote+0xee/0x100
  ...
  Call Trace:
   process_one_work+0x18c/0x3a0
   worker_thread+0x30/0x380
   kthread+0x113/0x130
   ret_from_fork+0x22/0x40

The reason is that the code in collect_expired_timers() uses jiffies
unprotected:

    if (next_event > jiffies)
        base->clk = jiffies;

As the compiler is allowed to reload the value base->clk can advance
between the check and the store and in the worst case advance farther than
next event. That causes the timer expiry to be delayed until the wheel
pointer wraps around.

Convert the code to use READ_ONCE()

Fixes: 236968383c ("timers: Optimize collect_expired_timers() for NOHZ")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Liang ZhiCheng <liangzhicheng@baidu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1568894687-14499-1-git-send-email-lirongqing@baidu.com
2019-09-19 17:50:11 +02:00
Masami Hiramatsu
fe60b0ce8e tracing/probe: Reject exactly same probe event
Reject exactly same probe events as existing probes.

Multiprobe allows user to define multiple probes on same
event. If user appends a probe which exactly same definition
(same probe address and same arguments) on existing event,
the event will record same probe information twice.
That can be confusing users, so reject it.

Link: http://lkml.kernel.org/r/156879694602.31056.5533024778165036763.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-09-19 11:09:16 -04:00
Masami Hiramatsu
44d00dc7ce tracing/probe: Fix to allow user to enable events on unloaded modules
Fix to allow user to enable probe events on unloaded modules.

This operations was allowed before commit 60d53e2c3b ("tracing/probe:
Split trace_event related data from trace_probe"), because if users
need to probe module init functions, they have to enable those probe
events before loading module.

Link: http://lkml.kernel.org/r/156879693733.31056.9331322616994665167.stgit@devnote2

Cc: stable@vger.kernel.org
Fixes: 60d53e2c3b ("tracing/probe: Split trace_event related data from trace_probe")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-09-19 09:55:42 -04:00
Alexei Starovoitov
9eea984979 bpf: fix BTF verification of enums
vmlinux BTF has enums that are 8 byte and 1 byte in size.
2 byte enum is a valid construct as well.
Fix BTF enum verification to accept those sizes.

Fixes: 69b693f0ae ("bpf: btf: Introduce BPF Type Format (BTF)")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-09-19 14:22:44 +02:00
David Howells
d2935de7e4 vfs: Convert bpf to use the new mount API
Convert the bpf filesystem to the new internal mount API as the old
one will be obsoleted and removed.  This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.

See Documentation/filesystems/mount_api.txt for more information.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Alexei Starovoitov <ast@kernel.org>
cc: Daniel Borkmann <daniel@iogearbox.net>
cc: Martin KaFai Lau <kafai@fb.com>
cc: Song Liu <songliubraving@fb.com>
cc: Yonghong Song <yhs@fb.com>
cc: netdev@vger.kernel.org
cc: bpf@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-09-18 22:35:31 -04:00
Tejun Heo
def98c84b6 workqueue: Fix spurious sanity check failures in destroy_workqueue()
Before actually destrying a workqueue, destroy_workqueue() checks
whether it's actually idle.  If it isn't, it prints out a bunch of
warning messages and leaves the workqueue dangling.  It unfortunately
has a couple issues.

* Mayday list queueing increments pwq's refcnts which gets detected as
  busy and fails the sanity checks.  However, because mayday list
  queueing is asynchronous, this condition can happen without any
  actual work items left in the workqueue.

* Sanity check failure leaves the sysfs interface behind too which can
  lead to init failure of newer instances of the workqueue.

This patch fixes the above two by

* If a workqueue has a rescuer, disable and kill the rescuer before
  sanity checks.  Disabling and killing is guaranteed to flush the
  existing mayday list.

* Remove sysfs interface before sanity checks.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Marcin Pawlowski <mpawlowski@fb.com>
Reported-by: "Williams, Gerald S" <gerald.s.williams@intel.com>
Cc: stable@vger.kernel.org
2019-09-18 18:45:23 -07:00
Linus Torvalds
81160dda9a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:

 1) Support IPV6 RA Captive Portal Identifier, from Maciej Żenczykowski.

 2) Use bio_vec in the networking instead of custom skb_frag_t, from
    Matthew Wilcox.

 3) Make use of xmit_more in r8169 driver, from Heiner Kallweit.

 4) Add devmap_hash to xdp, from Toke Høiland-Jørgensen.

 5) Support all variants of 5750X bnxt_en chips, from Michael Chan.

 6) More RTNL avoidance work in the core and mlx5 driver, from Vlad
    Buslov.

 7) Add TCP syn cookies bpf helper, from Petar Penkov.

 8) Add 'nettest' to selftests and use it, from David Ahern.

 9) Add extack support to drop_monitor, add packet alert mode and
    support for HW drops, from Ido Schimmel.

10) Add VLAN offload to stmmac, from Jose Abreu.

11) Lots of devm_platform_ioremap_resource() conversions, from
    YueHaibing.

12) Add IONIC driver, from Shannon Nelson.

13) Several kTLS cleanups, from Jakub Kicinski.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1930 commits)
  mlxsw: spectrum_buffers: Add the ability to query the CPU port's shared buffer
  mlxsw: spectrum: Register CPU port with devlink
  mlxsw: spectrum_buffers: Prevent changing CPU port's configuration
  net: ena: fix incorrect update of intr_delay_resolution
  net: ena: fix retrieval of nonadaptive interrupt moderation intervals
  net: ena: fix update of interrupt moderation register
  net: ena: remove all old adaptive rx interrupt moderation code from ena_com
  net: ena: remove ena_restore_ethtool_params() and relevant fields
  net: ena: remove old adaptive interrupt moderation code from ena_netdev
  net: ena: remove code duplication in ena_com_update_nonadaptive_moderation_interval _*()
  net: ena: enable the interrupt_moderation in driver_supported_features
  net: ena: reimplement set/get_coalesce()
  net: ena: switch to dim algorithm for rx adaptive interrupt moderation
  net: ena: add intr_moder_rx_interval to struct ena_com_dev and use it
  net: phy: adin: implement Energy Detect Powerdown mode via phy-tunable
  ethtool: implement Energy Detect Powerdown support via phy-tunable
  xen-netfront: do not assume sk_buff_head list is empty in error handling
  s390/ctcm: Delete unnecessary checks before the macro call “dev_kfree_skb”
  net: ena: don't wake up tx queue when down
  drop_monitor: Better sanitize notified packets
  ...
2019-09-18 12:34:53 -07:00
Linus Torvalds
8b53c76533 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
 "API:
   - Add the ability to abort a skcipher walk.

  Algorithms:
   - Fix XTS to actually do the stealing.
   - Add library helpers for AES and DES for single-block users.
   - Add library helpers for SHA256.
   - Add new DES key verification helper.
   - Add surrounding bits for ESSIV generator.
   - Add accelerations for aegis128.
   - Add test vectors for lzo-rle.

  Drivers:
   - Add i.MX8MQ support to caam.
   - Add gcm/ccm/cfb/ofb aes support in inside-secure.
   - Add ofb/cfb aes support in media-tek.
   - Add HiSilicon ZIP accelerator support.

  Others:
   - Fix potential race condition in padata.
   - Use unbound workqueues in padata"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (311 commits)
  crypto: caam - Cast to long first before pointer conversion
  crypto: ccree - enable CTS support in AES-XTS
  crypto: inside-secure - Probe transform record cache RAM sizes
  crypto: inside-secure - Base RD fetchcount on actual RD FIFO size
  crypto: inside-secure - Base CD fetchcount on actual CD FIFO size
  crypto: inside-secure - Enable extended algorithms on newer HW
  crypto: inside-secure: Corrected configuration of EIP96_TOKEN_CTRL
  crypto: inside-secure - Add EIP97/EIP197 and endianness detection
  padata: remove cpu_index from the parallel_queue
  padata: unbind parallel jobs from specific CPUs
  padata: use separate workqueues for parallel and serial work
  padata, pcrypt: take CPU hotplug lock internally in padata_alloc_possible
  crypto: pcrypt - remove padata cpumask notifier
  padata: make padata_do_parallel find alternate callback CPU
  workqueue: require CPU hotplug read exclusion for apply_workqueue_attrs
  workqueue: unconfine alloc/apply/free_workqueue_attrs()
  padata: allocate workqueue internally
  arm64: dts: imx8mq: Add CAAM node
  random: Use wait_event_freezable() in add_hwgenerator_randomness()
  crypto: ux500 - Fix COMPILE_TEST warnings
  ...
2019-09-18 12:11:14 -07:00
Naveen N. Rao
a3db31ff6c ftrace: Look up the address of return_to_handler() using helpers
This ensures that we use the right address on architectures that use
function descriptors.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/8f6f14d192a994008ac370ce14036bbe67224c7d.1567707399.git.naveen.n.rao@linux.vnet.ibm.com
2019-09-18 12:24:47 +10:00
Linus Torvalds
77dcfe2b9e Power management updates for 5.4-rc1
- Rework the main suspend-to-idle control flow to avoid repeating
    "noirq" device resume and suspend operations in case of spurious
    wakeups from the ACPI EC and decouple the ACPI EC wakeups support
    from the LPS0 _DSM support (Rafael Wysocki).
 
  - Extend the wakeup sources framework to expose wakeup sources as
    device objects in sysfs (Tri Vo, Stephen Boyd).
 
  - Expose system suspend statistics in sysfs (Kalesh Singh).
 
  - Introduce a new haltpoll cpuidle driver and a new matching
    governor for virtualized guests wanting to do guest-side polling
    in the idle loop (Marcelo Tosatti, Joao Martins, Wanpeng Li,
    Stephen Rothwell).
 
  - Fix the menu and teo cpuidle governors to allow the scheduler tick
    to be stopped if PM QoS is used to limit the CPU idle state exit
    latency in some cases (Rafael Wysocki).
 
  - Increase the resolution of the play_idle() argument to microseconds
    for more fine-grained injection of CPU idle cycles (Daniel Lezcano).
 
  - Switch over some users of cpuidle notifiers to the new QoS-based
    frequency limits and drop the CPUFREQ_ADJUST and CPUFREQ_NOTIFY
    policy notifier events (Viresh Kumar).
 
  - Add new cpufreq driver based on nvmem for sun50i (Yangtao Li).
 
  - Add support for MT8183 and MT8516 to the mediatek cpufreq driver
    (Andrew-sh.Cheng, Fabien Parent).
 
  - Add i.MX8MN support to the imx-cpufreq-dt cpufreq driver (Anson
    Huang).
 
  - Add qcs404 to cpufreq-dt-platdev blacklist (Jorge Ramirez-Ortiz).
 
  - Update the qcom cpufreq driver (among other things, to make it
    easier to extend and to use kryo cpufreq for other nvmem-based
    SoCs) and add qcs404 support to it  (Niklas Cassel, Douglas
    RAILLARD, Sibi Sankar, Sricharan R).
 
  - Fix assorted issues and make assorted minor improvements in the
    cpufreq code (Colin Ian King, Douglas RAILLARD, Florian Fainelli,
    Gustavo Silva, Hariprasad Kelam).
 
  - Add new devfreq driver for NVidia Tegra20 (Dmitry Osipenko, Arnd
    Bergmann).
 
  - Add new Exynos PPMU events to devfreq events and extend that
    mechanism (Lukasz Luba).
 
  - Fix and clean up the exynos-bus devfreq driver (Kamil Konieczny).
 
  - Improve devfreq documentation and governor code, fix spelling
    typos in devfreq (Ezequiel Garcia, Krzysztof Kozlowski, Leonard
    Crestez, MyungJoo Ham, Gaël PORTAY).
 
  - Add regulators enable and disable to the OPP (operating performance
    points) framework (Kamil Konieczny).
 
  - Update the OPP framework to support multiple opp-suspend properties
    (Anson Huang).
 
  - Fix assorted issues and make assorted minor improvements in the OPP
    code (Niklas Cassel, Viresh Kumar, Yue Hu).
 
  - Clean up the generic power domains (genpd) framework (Ulf Hansson).
 
  - Clean up assorted pieces of power management code and documentation
    (Akinobu Mita, Amit Kucheria, Chuhong Yuan).
 
  - Update the pm-graph tool to version 5.5 including multiple fixes
    and improvements (Todd Brandt).
 
  - Update the cpupower utility (Benjamin Weis, Geert Uytterhoeven,
    Sébastien Szymanski).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl2ArZ4SHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxgfYQAK80hs43vWQDmp7XKrN4pQe8+qYULAGO
 fBfrFl+NG9y/cnuqnt3NtA8MoyNsMMkMLkpkEDMfSbYqqH5ehEzX5+uGJWiWx8+Y
 oH5KU8MH7Tj/utYaalGzDt0AHfHZDIGC0NCUNQJVtE/4mOANFabwsCwscp4MrD5Q
 WjFN8U4BrsmWgJdZ/U9QIWcDZ0I+1etCF+rZG2yxSv31FMq2Zk/Qm4YyobqCvQFl
 TR9rxl08wqUmIYIz5cDjt/3AKH7NLLDqOTstbCL7cmufM5XPFc1yox69xc89UrIa
 4AMgmDp7SMwFG/gdUPof0WQNmx7qxmiRAPleAOYBOZW/8jPNZk2y+RhM5NeF72m7
 AFqYiuxqatkSb4IsT8fLzH9IUZOdYr8uSmoMQECw+MHdApaKFjFV8Lb/qx5+AwkD
 y7pwys8dZSamAjAf62eUzJDWcEwkNrujIisGrIXrVHb7ISbweskMOmdAYn9p4KgP
 dfRzpJBJ45IaMIdbaVXNpg3rP7Apfs7X1X+/ZhG6f+zHH3zYwr8Y81WPqX8WaZJ4
 qoVCyxiVWzMYjY2/1lzjaAdqWojPWHQ3or3eBaK52DouyG3jY6hCDTLwU7iuqcCX
 jzAtrnqrNIKufvaObEmqcmYlIIOFT7QaJCtGUSRFQLfSon8fsVSR7LLeXoAMUJKT
 JWQenuNaJngK
 =TBDQ
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "These include a rework of the main suspend-to-idle code flow (related
  to the handling of spurious wakeups), a switch over of several users
  of cpufreq notifiers to QoS-based limits, a new devfreq driver for
  Tegra20, a new cpuidle driver and governor for virtualized guests, an
  extension of the wakeup sources framework to expose wakeup sources as
  device objects in sysfs, and more.

  Specifics:

   - Rework the main suspend-to-idle control flow to avoid repeating
     "noirq" device resume and suspend operations in case of spurious
     wakeups from the ACPI EC and decouple the ACPI EC wakeups support
     from the LPS0 _DSM support (Rafael Wysocki).

   - Extend the wakeup sources framework to expose wakeup sources as
     device objects in sysfs (Tri Vo, Stephen Boyd).

   - Expose system suspend statistics in sysfs (Kalesh Singh).

   - Introduce a new haltpoll cpuidle driver and a new matching governor
     for virtualized guests wanting to do guest-side polling in the idle
     loop (Marcelo Tosatti, Joao Martins, Wanpeng Li, Stephen Rothwell).

   - Fix the menu and teo cpuidle governors to allow the scheduler tick
     to be stopped if PM QoS is used to limit the CPU idle state exit
     latency in some cases (Rafael Wysocki).

   - Increase the resolution of the play_idle() argument to microseconds
     for more fine-grained injection of CPU idle cycles (Daniel
     Lezcano).

   - Switch over some users of cpuidle notifiers to the new QoS-based
     frequency limits and drop the CPUFREQ_ADJUST and CPUFREQ_NOTIFY
     policy notifier events (Viresh Kumar).

   - Add new cpufreq driver based on nvmem for sun50i (Yangtao Li).

   - Add support for MT8183 and MT8516 to the mediatek cpufreq driver
     (Andrew-sh.Cheng, Fabien Parent).

   - Add i.MX8MN support to the imx-cpufreq-dt cpufreq driver (Anson
     Huang).

   - Add qcs404 to cpufreq-dt-platdev blacklist (Jorge Ramirez-Ortiz).

   - Update the qcom cpufreq driver (among other things, to make it
     easier to extend and to use kryo cpufreq for other nvmem-based
     SoCs) and add qcs404 support to it (Niklas Cassel, Douglas
     RAILLARD, Sibi Sankar, Sricharan R).

   - Fix assorted issues and make assorted minor improvements in the
     cpufreq code (Colin Ian King, Douglas RAILLARD, Florian Fainelli,
     Gustavo Silva, Hariprasad Kelam).

   - Add new devfreq driver for NVidia Tegra20 (Dmitry Osipenko, Arnd
     Bergmann).

   - Add new Exynos PPMU events to devfreq events and extend that
     mechanism (Lukasz Luba).

   - Fix and clean up the exynos-bus devfreq driver (Kamil Konieczny).

   - Improve devfreq documentation and governor code, fix spelling typos
     in devfreq (Ezequiel Garcia, Krzysztof Kozlowski, Leonard Crestez,
     MyungJoo Ham, Gaël PORTAY).

   - Add regulators enable and disable to the OPP (operating performance
     points) framework (Kamil Konieczny).

   - Update the OPP framework to support multiple opp-suspend properties
     (Anson Huang).

   - Fix assorted issues and make assorted minor improvements in the OPP
     code (Niklas Cassel, Viresh Kumar, Yue Hu).

   - Clean up the generic power domains (genpd) framework (Ulf Hansson).

   - Clean up assorted pieces of power management code and documentation
     (Akinobu Mita, Amit Kucheria, Chuhong Yuan).

   - Update the pm-graph tool to version 5.5 including multiple fixes
     and improvements (Todd Brandt).

   - Update the cpupower utility (Benjamin Weis, Geert Uytterhoeven,
     Sébastien Szymanski)"

* tag 'pm-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (126 commits)
  cpuidle-haltpoll: Enable kvm guest polling when dedicated physical CPUs are available
  cpuidle-haltpoll: do not set an owner to allow modunload
  cpuidle-haltpoll: return -ENODEV on modinit failure
  cpuidle-haltpoll: set haltpoll as preferred governor
  cpuidle: allow governor switch on cpuidle_register_driver()
  PM: runtime: Documentation: add runtime_status ABI document
  pm-graph: make setVal unbuffered again for python2 and python3
  powercap: idle_inject: Use higher resolution for idle injection
  cpuidle: play_idle: Increase the resolution to usec
  cpuidle-haltpoll: vcpu hotplug support
  cpufreq: Add qcs404 to cpufreq-dt-platdev blacklist
  cpufreq: qcom: Add support for qcs404 on nvmem driver
  cpufreq: qcom: Refactor the driver to make it easier to extend
  cpufreq: qcom: Re-organise kryo cpufreq to use it for other nvmem based qcom socs
  dt-bindings: opp: Add qcom-opp bindings with properties needed for CPR
  dt-bindings: opp: qcom-nvmem: Support pstates provided by a power domain
  Documentation: cpufreq: Update policy notifier documentation
  cpufreq: Remove CPUFREQ_ADJUST and CPUFREQ_NOTIFY policy notifier events
  PM / Domains: Verify PM domain type in dev_pm_genpd_set_performance_state()
  PM / Domains: Simplify genpd_lookup_dev()
  ...
2019-09-17 19:15:14 -07:00
Linus Torvalds
3ee8d6c592 Merge branch 'for-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "Three minor cleanup patches"

* 'for-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  Use kvmalloc in cgroups-v1
  cgroup: minor tweak for logic to get cgroup css
  cgroup: Replace a seq_printf() call by seq_puts() in cgroup_print_ss_mask()
2019-09-17 15:57:22 -07:00
Linus Torvalds
7f2444d38f Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core timer updates from Thomas Gleixner:
 "Timers and timekeeping updates:

   - A large overhaul of the posix CPU timer code which is a preparation
     for moving the CPU timer expiry out into task work so it can be
     properly accounted on the task/process.

     An update to the bogus permission checks will come later during the
     merge window as feedback was not complete before heading of for
     travel.

   - Switch the timerqueue code to use cached rbtrees and get rid of the
     homebrewn caching of the leftmost node.

   - Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a
     single function

   - Implement the separation of hrtimers to be forced to expire in hard
     interrupt context even when PREEMPT_RT is enabled and mark the
     affected timers accordingly.

   - Implement a mechanism for hrtimers and the timer wheel to protect
     RT against priority inversion and live lock issues when a (hr)timer
     which should be canceled is currently executing the callback.
     Instead of infinitely spinning, the task which tries to cancel the
     timer blocks on a per cpu base expiry lock which is held and
     released by the (hr)timer expiry code.

   - Enable the Hyper-V TSC page based sched_clock for Hyper-V guests
     resulting in faster access to timekeeping functions.

   - Updates to various clocksource/clockevent drivers and their device
     tree bindings.

   - The usual small improvements all over the place"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
  posix-cpu-timers: Fix permission check regression
  posix-cpu-timers: Always clear head pointer on dequeue
  hrtimer: Add a missing bracket and hide `migration_base' on !SMP
  posix-cpu-timers: Make expiry_active check actually work correctly
  posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build
  tick: Mark sched_timer to expire in hard interrupt context
  hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD
  x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n
  posix-cpu-timers: Utilize timerqueue for storage
  posix-cpu-timers: Move state tracking to struct posix_cputimers
  posix-cpu-timers: Deduplicate rlimit handling
  posix-cpu-timers: Remove pointless comparisons
  posix-cpu-timers: Get rid of 64bit divisions
  posix-cpu-timers: Consolidate timer expiry further
  posix-cpu-timers: Get rid of zero checks
  rlimit: Rewrite non-sensical RLIMIT_CPU comment
  posix-cpu-timers: Respect INFINITY for hard RTTIME limit
  posix-cpu-timers: Switch thread group sampling to array
  posix-cpu-timers: Restructure expiry array
  posix-cpu-timers: Remove cputime_expires
  ...
2019-09-17 12:35:15 -07:00
Linus Torvalds
a572ba6329 Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core irq updates from Thomas Gleixner:
 "Updates from the irq departement:

   - Update the interrupt spreading code so it handles numa node with
     different CPU counts properly.

   - A large overhaul of the ARM GiCv3 driver to support new PPI and SPI
     ranges.

   - Conversion of all alloc_fwnode() users to use physical addresses
     instead of virtual addresses so the virtual addresses are not
     leaked. The physical address is sufficient to identify the
     associated interrupt chip.

   - Add support for Marvel MMP3, Amlogic Meson SM1 interrupt chips.

   - Enforce interrupt threading at compile time if RT is enabled.

   - Small updates and improvements all over the place"

* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
  irqchip/gic-v3-its: Fix LPI release for Multi-MSI devices
  irqchip/uniphier-aidet: Use devm_platform_ioremap_resource()
  irqdomain: Add the missing assignment of domain->fwnode for named fwnode
  irqchip/mmp: Coexist with GIC root IRQ controller
  irqchip/mmp: Mask off interrupts from other cores
  irqchip/mmp: Add missing chained_irq_{enter,exit}()
  irqchip/mmp: Do not use of_address_to_resource() to get mux regs
  irqchip/meson-gpio: Add support for meson sm1 SoCs
  dt-bindings: interrupt-controller: New binding for the meson sm1 SoCs
  genirq/affinity: Remove const qualifier from node_to_cpumask argument
  genirq/affinity: Spread vectors on node according to nr_cpu ratio
  genirq/affinity: Improve __irq_build_affinity_masks()
  irqchip: Remove dev_err() usage after platform_get_irq()
  irqchip: Add include guard to irq-partition-percpu.h
  irqchip/mmp: Do not call irq_set_default_host() on DT platforms
  irqchip/gic-v3-its: Remove the redundant set_bit for lpi_map
  irqchip/gic-v3: Add quirks for HIP06/07 invalid GICD_TYPER erratum 161010803
  irqchip/gic: Skip DT quirks when evaluating IIDR-based quirks
  irqchip/gic-v3: Warn about inconsistent implementations of extended ranges
  irqchip/gic-v3: Add EPPI range support
  ...
2019-09-17 11:42:15 -07:00
Linus Torvalds
3cd0462230 Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug updates from Thomas Gleixner:
 "A small update for the SMP hotplug code code:

   - Track "booted once" CPUs in a cpumask so the x86 APIC code has an
     easy way to decide whether broadcast IPIs are safe to use or not.

   - Implement a cpumask_or_equal() helper for the IPI broadcast
     evaluation.

     The above two changes have been also pulled into the x86/apic
     branch for implementing the conditional IPI broadcast feature.

   - Cache the number of online CPUs instead of reevaluating it over and
     over. num_online_cpus() is an unreliable snapshot anyway except
     when it is used outside a cpu hotplug locked region. The cached
     access is not changing this, but it's definitely faster than
     calculating the bitmap wheight especially in hot paths"

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu/hotplug: Cache number of online CPUs
  cpumask: Implement cpumask_or_equal()
  smp/hotplug: Track booted once CPUs in a cpumask
2019-09-17 10:32:05 -07:00
Linus Torvalds
16208cd6c3 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
 "A single fix to prevent the alarm timer code from returning ENOTSUPP
  to user space.

  ENOTSUPP is a purely kernel internal error code related to NFSv3 and
  should never be handed back to user space. The risk for ABI breakage
  is low as the number of systems which do not have a working RTC is
  very limited"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  alarmtimer: Use EOPNOTSUPP instead of ENOTSUPP
2019-09-17 10:12:41 -07:00
Masami Hiramatsu
d59fae6fea tracing/kprobe: Fix NULL pointer access in trace_porbe_unlink()
Fix NULL pointer access in trace_probe_unlink() by initializing
trace_probe.list correctly in trace_probe_init().

In the error case of trace_probe_init(), it can call trace_probe_unlink()
before initializing trace_probe.list member. This causes NULL pointer
dereference at list_del_init() in trace_probe_unlink().

Syzbot reported :

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 8633 Comm: syz-executor797 Not tainted 5.3.0-rc8-next-20190915
#0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:__list_del_entry_valid+0x85/0xf5 lib/list_debug.c:51
Code: 0f 84 e1 00 00 00 48 b8 22 01 00 00 00 00 ad de 49 39 c4 0f 84 e2 00
00 00 48 b8 00 00 00 00 00 fc ff df 4c 89 e2 48 c1 ea 03 <80> 3c 02 00 75
53 49 8b 14 24 4c 39 f2 0f 85 99 00 00 00 49 8d 7d
RSP: 0018:ffff888090a7f9d8 EFLAGS: 00010246
RAX: dffffc0000000000 RBX: ffff88809b6f90c0 RCX: ffffffff817c0ca9
RDX: 0000000000000000 RSI: ffffffff817c0a73 RDI: ffff88809b6f90c8
RBP: ffff888090a7f9f0 R08: ffff88809a04e600 R09: ffffed1015d26aed
R10: ffffed1015d26aec R11: ffff8880ae935763 R12: 0000000000000000
R13: 0000000000000000 R14: ffff88809b6f90c0 R15: ffff88809b6f90d0
FS:  0000555556f99880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000006cc090 CR3: 00000000962b2000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  __list_del_entry include/linux/list.h:131 [inline]
  list_del_init include/linux/list.h:190 [inline]
  trace_probe_unlink+0x1f/0x200 kernel/trace/trace_probe.c:959
  trace_probe_cleanup+0xd3/0x110 kernel/trace/trace_probe.c:973
  trace_probe_init+0x3f2/0x510 kernel/trace/trace_probe.c:1011
  alloc_trace_uprobe+0x5e/0x250 kernel/trace/trace_uprobe.c:353
  create_local_trace_uprobe+0x109/0x4a0 kernel/trace/trace_uprobe.c:1508
  perf_uprobe_init+0x131/0x210 kernel/trace/trace_event_perf.c:314
  perf_uprobe_event_init+0x106/0x1a0 kernel/events/core.c:8898
  perf_try_init_event+0x135/0x590 kernel/events/core.c:10184
  perf_init_event kernel/events/core.c:10228 [inline]
  perf_event_alloc.part.0+0x1b89/0x33d0 kernel/events/core.c:10505
  perf_event_alloc kernel/events/core.c:10887 [inline]
  __do_sys_perf_event_open+0xa2d/0x2d00 kernel/events/core.c:10989
  __se_sys_perf_event_open kernel/events/core.c:10871 [inline]
  __x64_sys_perf_event_open+0xbe/0x150 kernel/events/core.c:10871
  do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Link: http://lkml.kernel.org/r/156869709721.22406.5153754822203046939.stgit@devnote2

Reported-by: syzbot+2f807f4d3a2a4e87f18f@syzkaller.appspotmail.com
Fixes: ca89bc071d ("tracing/kprobe: Add multi-probe per event support")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-09-17 11:21:29 -04:00
Tom Zanussi
17f8607a16 tracing: Make sure variable reference alias has correct var_ref_idx
Original changelog from Steve Rostedt (except last sentence which
explains the problem, and the Fixes: tag):

I performed a three way histogram with the following commands:

echo 'irq_lat u64 lat pid_t pid' > synthetic_events
echo 'wake_lat u64 lat u64 irqlat pid_t pid' >> synthetic_events
echo 'hist:keys=common_pid:irqts=common_timestamp.usecs if function == 0xffffffff81200580' > events/timer/hrtimer_start/trigger
echo 'hist:keys=common_pid:lat=common_timestamp.usecs-$irqts:onmatch(timer.hrtimer_start).irq_lat($lat,pid) if common_flags & 1' > events/sched/sched_waking/trigger
echo 'hist:keys=pid:wakets=common_timestamp.usecs,irqlat=lat' > events/synthetic/irq_lat/trigger
echo 'hist:keys=next_pid:lat=common_timestamp.usecs-$wakets,irqlat=$irqlat:onmatch(synthetic.irq_lat).wake_lat($lat,$irqlat,next_pid)' > events/sched/sched_switch/trigger
echo 1 > events/synthetic/wake_lat/enable

Basically I wanted to see:

 hrtimer_start (calling function tick_sched_timer)

Note:

  # grep tick_sched_timer /proc/kallsyms
ffffffff81200580 t tick_sched_timer

And save the time of that, and then record sched_waking if it is called
in interrupt context and with the same pid as the hrtimer_start, it
will record the latency between that and the waking event.

I then look at when the task that is woken is scheduled in, and record
the latency between the wakeup and the task running.

At the end, the wake_lat synthetic event will show the wakeup to
scheduled latency, as well as the irq latency in from hritmer_start to
the wakeup. The problem is that I found this:

          <idle>-0     [007] d...   190.485261: wake_lat: lat=27 irqlat=190485230 pid=698
          <idle>-0     [005] d...   190.485283: wake_lat: lat=40 irqlat=190485239 pid=10
          <idle>-0     [002] d...   190.488327: wake_lat: lat=56 irqlat=190488266 pid=335
          <idle>-0     [005] d...   190.489330: wake_lat: lat=64 irqlat=190489262 pid=10
          <idle>-0     [003] d...   190.490312: wake_lat: lat=43 irqlat=190490265 pid=77
          <idle>-0     [005] d...   190.493322: wake_lat: lat=54 irqlat=190493262 pid=10
          <idle>-0     [005] d...   190.497305: wake_lat: lat=35 irqlat=190497267 pid=10
          <idle>-0     [005] d...   190.501319: wake_lat: lat=50 irqlat=190501264 pid=10

The irqlat seemed quite large! Investigating this further, if I had
enabled the irq_lat synthetic event, I noticed this:

          <idle>-0     [002] d.s.   249.429308: irq_lat: lat=164968 pid=335
          <idle>-0     [002] d...   249.429369: wake_lat: lat=55 irqlat=249429308 pid=335

Notice that the timestamp of the irq_lat "249.429308" is awfully
similar to the reported irqlat variable. In fact, all instances were
like this. It appeared that:

  irqlat=$irqlat

Wasn't assigning the old $irqlat to the new irqlat variable, but
instead was assigning the $irqts to it.

The issue is that assigning the old $irqlat to the new irqlat variable
creates a variable reference alias, but the alias creation code
forgets to make sure the alias uses the same var_ref_idx to access the
reference.

Link: http://lkml.kernel.org/r/1567375321.5282.12.camel@kernel.org

Cc: Linux Trace Devel <linux-trace-devel@vger.kernel.org>
Cc: linux-rt-users <linux-rt-users@vger.kernel.org>
Cc: stable@vger.kernel.org
Fixes: 7e8b88a30b ("tracing: Add hist trigger support for variable reference aliases")
Reported-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-09-17 11:21:29 -04:00
Andy Shevchenko
119cdbdb95 tracing: Be more clever when dumping hex in __print_hex()
Hex dump as many as 16 bytes at once in trace_print_hex_seq()
instead of byte-by-byte approach.

Link: http://lkml.kernel.org/r/20190806151543.86061-1-andriy.shevchenko@linux.intel.com

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-09-17 11:21:28 -04:00
Changbin Du
08468754c1 ftrace: Simplify ftrace hash lookup code in clear_func_from_hash()
Function ftrace_lookup_ip() will check empty hash table. So we don't
need extra check outside.

Link: http://lkml.kernel.org/r/20190910143336.13472-1-changbin.du@gmail.com

Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-09-17 11:21:20 -04:00
Qian Cai
dac9f027b1 sched/fair: Remove unused cfs_rq_clock_task() function
cfs_rq_clock_task() was first introduced and used in:

  f1b17280ef ("sched: Maintain runnable averages across throttled periods")

Over time its use has been graduately removed by the following commits:

  d31b1a66cb ("sched/fair: Factorize PELT update")
  2312729688 ("sched/fair: Update scale invariance of PELT")

Today, there is no single user left, so it can be safely removed.

Found via the -Wunused-function build warning.

Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/1568668775-2127-1-git-send-email-cai@lca.pw
[ Rewrote the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-17 09:55:02 +02:00
Rafael J. Wysocki
fc6763a2d7 Merge branches 'pm-opp', 'pm-qos', 'acpi-pm', 'pm-domains' and 'pm-tools'
* pm-opp:
  PM / OPP: Correct Documentation about library location
  opp: of: Support multiple suspend OPPs defined in DT
  dt-bindings: opp: Support multiple opp-suspend properties
  opp: core: add regulators enable and disable
  opp: Don't decrement uninitialized list_kref

* pm-qos:
  PM: QoS: Get rid of unused flags

* acpi-pm:
  ACPI: PM: Print debug messages on device power state changes

* pm-domains:
  PM / Domains: Verify PM domain type in dev_pm_genpd_set_performance_state()
  PM / Domains: Simplify genpd_lookup_dev()
  PM / Domains: Align in-parameter names for some genpd functions

* pm-tools:
  pm-graph: make setVal unbuffered again for python2 and python3
  cpupower: update German translation
  tools/power/cpupower: fix 64bit detection when cross-compiling
  cpupower: Add missing newline at end of file
  pm-graph v5.5
2019-09-17 09:49:19 +02:00
Rafael J. Wysocki
ca61a72ac3 Merge branch 'pm-cpufreq'
* pm-cpufreq: (36 commits)
  cpufreq: Add qcs404 to cpufreq-dt-platdev blacklist
  cpufreq: qcom: Add support for qcs404 on nvmem driver
  cpufreq: qcom: Refactor the driver to make it easier to extend
  cpufreq: qcom: Re-organise kryo cpufreq to use it for other nvmem based qcom socs
  dt-bindings: opp: Add qcom-opp bindings with properties needed for CPR
  dt-bindings: opp: qcom-nvmem: Support pstates provided by a power domain
  Documentation: cpufreq: Update policy notifier documentation
  cpufreq: Remove CPUFREQ_ADJUST and CPUFREQ_NOTIFY policy notifier events
  sched/cpufreq: Align trace event behavior of fast switching
  ACPI: cpufreq: Switch to QoS requests instead of cpufreq notifier
  video: pxafb: Remove cpufreq policy notifier
  video: sa1100fb: Remove cpufreq policy notifier
  arch_topology: Use CPUFREQ_CREATE_POLICY instead of CPUFREQ_NOTIFY
  cpufreq: powerpc_cbe: Switch to QoS requests for freq limits
  cpufreq: powerpc: macintosh: Switch to QoS requests for freq limits
  cpufreq: Print driver name if cpufreq_suspend() fails
  cpufreq: mediatek: Add support for mt8183
  cpufreq: mediatek: change to regulator_get_optional
  cpufreq: imx-cpufreq-dt: Add i.MX8MN support
  cpufreq: Use imx-cpufreq-dt for i.MX8MN's speed grading
  ...
2019-09-17 09:44:29 +02:00
Rafael J. Wysocki
2cdd5cc703 Merge branch 'pm-cpuidle'
* pm-cpuidle:
  cpuidle-haltpoll: Enable kvm guest polling when dedicated physical CPUs are available
  cpuidle-haltpoll: do not set an owner to allow modunload
  cpuidle-haltpoll: return -ENODEV on modinit failure
  cpuidle-haltpoll: set haltpoll as preferred governor
  cpuidle: allow governor switch on cpuidle_register_driver()
  powercap: idle_inject: Use higher resolution for idle injection
  cpuidle: play_idle: Increase the resolution to usec
  cpuidle-haltpoll: vcpu hotplug support
  cpuidle: teo: Get rid of redundant check in teo_update()
  cpuidle: teo: Allow tick to be stopped if PM QoS is used
  cpuidle: menu: Allow tick to be stopped if PM QoS is used
  cpuidle: header file stubs must be "static inline"
  cpuidle-haltpoll: disable host side polling when kvm virtualized
  cpuidle: add haltpoll governor
  governors: unify last_state_idx
  cpuidle: add poll_limit_ns to cpuidle_device structure
  add cpuidle-haltpoll driver
2019-09-17 09:41:26 +02:00
Rafael J. Wysocki
d281706369 Merge branch 'pm-sleep'
* pm-sleep: (29 commits)
  ACPI: PM: s2idle: Always set up EC GPE for system wakeup
  ACPI: PM: s2idle: Avoid rearming SCI for wakeup unnecessarily
  PM / wakeup: Unexport wakeup_source_sysfs_{add,remove}()
  PM / wakeup: Register wakeup class kobj after device is added
  PM / wakeup: Fix sysfs registration error path
  PM / wakeup: Show wakeup sources stats in sysfs
  PM / wakeup: Use wakeup_source_register() in wakelock.c
  PM / wakeup: Drop wakeup_source_init(), wakeup_source_prepare()
  PM: sleep: Replace strncmp() with str_has_prefix()
  PM: suspend: Fix platform_suspend_prepare_noirq()
  intel-hid: Disable button array during suspend-to-idle
  intel-hid: intel-vbtn: Avoid leaking wakeup_mode set
  ACPI: PM: s2idle: Execute LPS0 _DSM functions with suspended devices
  ACPI: EC: PM: Make acpi_ec_dispatch_gpe() print debug message
  ACPI: EC: PM: Consolidate some code depending on PM_SLEEP
  ACPI: PM: s2idle: Eliminate acpi_sleep_no_ec_events()
  ACPI: PM: s2idle: Switch EC over to polling during "noirq" suspend
  ACPI: PM: s2idle: Add acpi.sleep_no_lps0 module parameter
  ACPI: PM: s2idle: Rearrange lps0_device_attach()
  PM/sleep: Expose suspend stats in sysfs
  ...
2019-09-17 09:36:34 +02:00
Rafael J. Wysocki
1b531e55c5 Merge suspend-to-idle rework material for v5.4.
* pm-s2idle-rework: (21 commits)
  ACPI: PM: s2idle: Always set up EC GPE for system wakeup
  ACPI: PM: s2idle: Avoid rearming SCI for wakeup unnecessarily
  PM: suspend: Fix platform_suspend_prepare_noirq()
  intel-hid: Disable button array during suspend-to-idle
  intel-hid: intel-vbtn: Avoid leaking wakeup_mode set
  ACPI: PM: s2idle: Execute LPS0 _DSM functions with suspended devices
  ACPI: EC: PM: Make acpi_ec_dispatch_gpe() print debug message
  ACPI: EC: PM: Consolidate some code depending on PM_SLEEP
  ACPI: PM: s2idle: Eliminate acpi_sleep_no_ec_events()
  ACPI: PM: s2idle: Switch EC over to polling during "noirq" suspend
  ACPI: PM: s2idle: Add acpi.sleep_no_lps0 module parameter
  ACPI: PM: s2idle: Rearrange lps0_device_attach()
  ACPI: PM: Set up EC GPE for system wakeup from drivers that need it
  PM: sleep: Drop dpm_noirq_begin() and dpm_noirq_end()
  PM: sleep: Integrate suspend-to-idle with generig suspend flow
  PM: sleep: Simplify suspend-to-idle control flow
  ACPI: PM: Set s2idle_wakeup earlier and clear it later
  PM: sleep: Fix possible overflow in pm_system_cancel_wakeup()
  ACPI: EC: Return bool from acpi_ec_dispatch_gpe()
  ACPICA: Return u32 from acpi_dispatch_gpe()
  ...
2019-09-17 09:35:35 +02:00
Linus Torvalds
22331f8952 Merge branch 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpu-feature updates from Ingo Molnar:

 - Rework the Intel model names symbols/macros, which were decades of
   ad-hoc extensions and added random noise. It's now a coherent, easy
   to follow nomenclature.

 - Add new Intel CPU model IDs:
    - "Tiger Lake" desktop and mobile models
    - "Elkhart Lake" model ID
    - and the "Lightning Mountain" variant of Airmont, plus support code

 - Add the new AVX512_VP2INTERSECT instruction to cpufeatures

 - Remove Intel MPX user-visible APIs and the self-tests, because the
   toolchain (gcc) is not supporting it going forward. This is the
   first, lowest-risk phase of MPX removal.

 - Remove X86_FEATURE_MFENCE_RDTSC

 - Various smaller cleanups and fixes

* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
  x86/cpu: Update init data for new Airmont CPU model
  x86/cpu: Add new Airmont variant to Intel family
  x86/cpu: Add Elkhart Lake to Intel family
  x86/cpu: Add Tiger Lake to Intel family
  x86: Correct misc typos
  x86/intel: Add common OPTDIFFs
  x86/intel: Aggregate microserver naming
  x86/intel: Aggregate big core graphics naming
  x86/intel: Aggregate big core mobile naming
  x86/intel: Aggregate big core client naming
  x86/cpufeature: Explain the macro duplication
  x86/ftrace: Remove mcount() declaration
  x86/PCI: Remove superfluous returns from void functions
  x86/msr-index: Move AMD MSRs where they belong
  x86/cpu: Use constant definitions for CPU models
  lib: Remove redundant ftrace flag removal
  x86/crash: Remove unnecessary comparison
  x86/bitops: Use __builtin_constant_p() directly instead of IS_IMMEDIATE()
  x86: Remove X86_FEATURE_MFENCE_RDTSC
  x86/mpx: Remove MPX APIs
  ...
2019-09-16 18:47:53 -07:00
Linus Torvalds
7e67a85999 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:

 - MAINTAINERS: Add Mark Rutland as perf submaintainer, Juri Lelli and
   Vincent Guittot as scheduler submaintainers. Add Dietmar Eggemann,
   Steven Rostedt, Ben Segall and Mel Gorman as scheduler reviewers.

   As perf and the scheduler is getting bigger and more complex,
   document the status quo of current responsibilities and interests,
   and spread the review pain^H^H^H^H fun via an increase in the Cc:
   linecount generated by scripts/get_maintainer.pl. :-)

 - Add another series of patches that brings the -rt (PREEMPT_RT) tree
   closer to mainline: split the monolithic CONFIG_PREEMPT dependencies
   into a new CONFIG_PREEMPTION category that will allow the eventual
   introduction of CONFIG_PREEMPT_RT. Still a few more hundred patches
   to go though.

 - Extend the CPU cgroup controller with uclamp.min and uclamp.max to
   allow the finer shaping of CPU bandwidth usage.

 - Micro-optimize energy-aware wake-ups from O(CPUS^2) to O(CPUS).

 - Improve the behavior of high CPU count, high thread count
   applications running under cpu.cfs_quota_us constraints.

 - Improve balancing with SCHED_IDLE (SCHED_BATCH) tasks present.

 - Improve CPU isolation housekeeping CPU allocation NUMA locality.

 - Fix deadline scheduler bandwidth calculations and logic when cpusets
   rebuilds the topology, or when it gets deadline-throttled while it's
   being offlined.

 - Convert the cpuset_mutex to percpu_rwsem, to allow it to be used from
   setscheduler() system calls without creating global serialization.
   Add new synchronization between cpuset topology-changing events and
   the deadline acceptance tests in setscheduler(), which were broken
   before.

 - Rework the active_mm state machine to be less confusing and more
   optimal.

 - Rework (simplify) the pick_next_task() slowpath.

 - Improve load-balancing on AMD EPYC systems.

 - ... and misc cleanups, smaller fixes and improvements - please see
   the Git log for more details.

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
  sched/psi: Correct overly pessimistic size calculation
  sched/fair: Speed-up energy-aware wake-ups
  sched/uclamp: Always use 'enum uclamp_id' for clamp_id values
  sched/uclamp: Update CPU's refcount on TG's clamp changes
  sched/uclamp: Use TG's clamps to restrict TASK's clamps
  sched/uclamp: Propagate system defaults to the root group
  sched/uclamp: Propagate parent clamps
  sched/uclamp: Extend CPU's cgroup controller
  sched/topology: Improve load balancing on AMD EPYC systems
  arch, ia64: Make NUMA select SMP
  sched, perf: MAINTAINERS update, add submaintainers and reviewers
  sched/fair: Use rq_lock/unlock in online_fair_sched_group
  cpufreq: schedutil: fix equation in comment
  sched: Rework pick_next_task() slow-path
  sched: Allow put_prev_task() to drop rq->lock
  sched/fair: Expose newidle_balance()
  sched: Add task_struct pointer to sched_class::set_curr_task
  sched: Rework CPU hotplug task selection
  sched/{rt,deadline}: Fix set_next_task vs pick_next_task
  sched: Fix kerneldoc comment for ia64_set_curr_task
  ...
2019-09-16 17:25:49 -07:00
Linus Torvalds
772c1d06bd Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "Kernel side changes:

   - Improved kbprobes robustness

   - Intel PEBS support for PT hardware tracing

   - Other Intel PT improvements: high order pages memory footprint
     reduction and various related cleanups

   - Misc cleanups

  The perf tooling side has been very busy in this cycle, with over 300
  commits. This is an incomplete high-level summary of the many
  improvements done by over 30 developers:

   - Lots of updates to the following tools:

      'perf c2c'
      'perf config'
      'perf record'
      'perf report'
      'perf script'
      'perf test'
      'perf top'
      'perf trace'

   - Updates to libperf and libtraceevent, and a consolidation of the
     proliferation of x86 instruction decoder libraries.

   - Vendor event updates for Intel and PowerPC CPUs,

   - Updates to hardware tracing tooling for ARM and Intel CPUs,

   - ... and lots of other changes and cleanups - see the shortlog and
     Git log for details"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (322 commits)
  kprobes: Prohibit probing on BUG() and WARN() address
  perf/x86: Make more stuff static
  x86, perf: Fix the dependency of the x86 insn decoder selftest
  objtool: Ignore intentional differences for the x86 insn decoder
  objtool: Update sync-check.sh from perf's check-headers.sh
  perf build: Ignore intentional differences for the x86 insn decoder
  perf intel-pt: Use shared x86 insn decoder
  perf intel-pt: Remove inat.c from build dependency list
  perf: Update .gitignore file
  objtool: Move x86 insn decoder to a common location
  perf metricgroup: Support multiple events for metricgroup
  perf metricgroup: Scale the metric result
  perf pmu: Change convert_scale from static to global
  perf symbols: Move mem_info and branch_info out of symbol.h
  perf auxtrace: Uninline functions that touch perf_session
  perf tools: Remove needless evlist.h include directives
  perf tools: Remove needless evlist.h include directives
  perf tools: Remove needless thread_map.h include directives
  perf tools: Remove needless thread.h include directives
  perf tools: Remove needless map.h include directives
  ...
2019-09-16 17:06:21 -07:00
Linus Torvalds
c7eba51cfd Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:

 - improve rwsem scalability

 - add uninitialized rwsem debugging check

 - reduce lockdep's stacktrace memory usage and add diagnostics

 - misc cleanups, code consolidation and constification

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  mutex: Fix up mutex_waiter usage
  locking/mutex: Use mutex flags macro instead of hard code
  locking/mutex: Make __mutex_owner static to mutex.c
  locking/qspinlock,x86: Clarify virt_spin_lock_key
  locking/rwsem: Check for operations on an uninitialized rwsem
  locking/rwsem: Make handoff writer optimistically spin on owner
  locking/lockdep: Report more stack trace statistics
  locking/lockdep: Reduce space occupied by stack traces
  stacktrace: Constify 'entries' arguments
  locking/lockdep: Make it clear that what lock_class::key points at is not modified
2019-09-16 16:49:55 -07:00
Linus Torvalds
94d18ee934 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "This cycle's RCU changes were:

   - A few more RCU flavor consolidation cleanups.

   - Updates to RCU's list-traversal macros improving lockdep usability.

   - Forward-progress improvements for no-CBs CPUs: Avoid ignoring
     incoming callbacks during grace-period waits.

   - Forward-progress improvements for no-CBs CPUs: Use ->cblist
     structure to take advantage of others' grace periods.

   - Also added a small commit that avoids needlessly inflicting
     scheduler-clock ticks on callback-offloaded CPUs.

   - Forward-progress improvements for no-CBs CPUs: Reduce contention on
     ->nocb_lock guarding ->cblist.

   - Forward-progress improvements for no-CBs CPUs: Add ->nocb_bypass
     list to further reduce contention on ->nocb_lock guarding ->cblist.

   - Miscellaneous fixes.

   - Torture-test updates.

   - minor LKMM updates"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (86 commits)
  MAINTAINERS: Update from paulmck@linux.ibm.com to paulmck@kernel.org
  rcu: Don't include <linux/ktime.h> in rcutiny.h
  rcu: Allow rcu_do_batch() to dynamically adjust batch sizes
  rcu/nocb: Don't wake no-CBs GP kthread if timer posted under overload
  rcu/nocb: Reduce __call_rcu_nocb_wake() leaf rcu_node ->lock contention
  rcu/nocb: Reduce nocb_cb_wait() leaf rcu_node ->lock contention
  rcu/nocb: Advance CBs after merge in rcutree_migrate_callbacks()
  rcu/nocb: Avoid synchronous wakeup in __call_rcu_nocb_wake()
  rcu/nocb: Print no-CBs diagnostics when rcutorture writer unduly delayed
  rcu/nocb: EXP Check use and usefulness of ->nocb_lock_contended
  rcu/nocb: Add bypass callback queueing
  rcu/nocb: Atomic ->len field in rcu_segcblist structure
  rcu/nocb: Unconditionally advance and wake for excessive CBs
  rcu/nocb: Reduce ->nocb_lock contention with separate ->nocb_gp_lock
  rcu/nocb: Reduce contention at no-CBs invocation-done time
  rcu/nocb: Reduce contention at no-CBs registry-time CB advancement
  rcu/nocb: Round down for number of no-CBs grace-period kthreads
  rcu/nocb: Avoid ->nocb_lock capture by corresponding CPU
  rcu/nocb: Avoid needless wakeups of no-CBs grace-period kthread
  rcu/nocb: Make __call_rcu_nocb_wake() safe for many callbacks
  ...
2019-09-16 16:28:19 -07:00
Linus Torvalds
d0a16fe934 Merge branch 'parisc-5.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc updates from Helge Deller:

 - Make the powerpc implementation to read elf files available as a
   public kexec interface so it can be re-used on other architectures
   (Sven)

 - Implement kexec on parisc (Sven)

 - Add kprobes on ftrace on parisc (Sven)

 - Fix kernel crash with HSC-PCI cards based on card-mode Dino

 - Add assembly implementations for memset, strlen, strcpy, strncpy and
   strcat

 - Some cleanups, documentation updates, warning fixes, ...

* 'parisc-5.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux: (25 commits)
  parisc: Have git ignore generated real2.S and firmware.c
  parisc: Disable HP HSC-PCI Cards to prevent kernel crash
  parisc: add support for kexec_file_load() syscall
  parisc: wire up kexec_file_load syscall
  parisc: add kexec syscall support
  parisc: add __pdc_cpu_rendezvous()
  kprobes/parisc: remove arch_kprobe_on_func_entry()
  kexec_elf: support 32 bit ELF files
  kexec_elf: remove unused variable in kexec_elf_load()
  kexec_elf: remove Elf_Rel macro
  kexec_elf: remove PURGATORY_STACK_SIZE
  kexec_elf: remove parsing of section headers
  kexec_elf: change order of elf_*_to_cpu() functions
  kexec: add KEXEC_ELF
  parisc: Save some bytes in dino driver
  parisc: Drop comments which are already in pci.h
  parisc: Convert eisa_enumerator to use pr_cont()
  parisc: Avoid warning when loading hppb driver
  parisc: speed up flush_tlb_all_local with qemu
  parisc: Add ALTERNATIVE_CODE() and ALT_COND_RUN_ON_QEMU
  ...
2019-09-16 15:38:31 -07:00
Linus Torvalds
76f0f227cf ia64 for v5.4 - big change here is removal of support for SGI Altix
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJdf64MAAoJEKurIx+X31iBB20P/07o93sBT92SiA2/ety9sLqV
 BGJmEdw7gyb9WVbUip6s71FIEKZw4foCGkqDiX+lr5Fw2A9tiK7LmFgTLi4LLwg+
 syhYZ1y5/mwBI4FLlJudKjQdFZjr/n7DNlz4H67woE2kK+FyRsOKEaFUhuR8+0rC
 mKJBKtIGnoIOPG06PT1k5qfdpzlreCFoWdIhjO55LfDgZnnDiMaX5h0vcBQ9xgCp
 xGV0n/f7+qn4pzB4hGvNV209Sdgv2V4t77bHNvyXlJrM5Hqzafo5MzFgEJv+fRqJ
 2RnkWVhwctfbid/2ggf2aAsYnMK3GigEaOCsYW2oWJESVUQhxIi3ndF/Jt9fraZv
 ZouD7G/s64P5lUQuCT9JnKGzJrSgxvkd37049AZ4pFVc2MzLC6o6dyyP8pu5ARe8
 T0shFik3+gsml2US/vSUzxvrg1saRQjl9E/AJ0RTZ8oyP4FNnFmkJf38qj3a0L0k
 ILFYscM5q7WPggoDA/m6F96tLGhdK/sKjDzrADjEh2dIvn4woqoEJSDn+rXuP+Gm
 UOj1v8mILZCqvOAmc9IkGCkPUlbrmNV/1FYh5+GWudtillEaD82vjSqm+jnVbfXD
 REvHlR/kxCSj1gg/+nk+NFdZCkW3xETOcTZohhDkR7du2mHjTwBMZ2YRPrqoX4c8
 VZA57Mrqm5Uk5601qYRl
 =L5e+
 -----END PGP SIGNATURE-----

Merge tag 'please-pull-ia64_for_5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux

Pull ia64 updates from Tony Luck:
 "The big change here is removal of support for SGI Altix"

* tag 'please-pull-ia64_for_5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux: (33 commits)
  genirq: remove the is_affinity_mask_valid hook
  ia64: remove CONFIG_SWIOTLB ifdefs
  ia64: remove support for machvecs
  ia64: move the screen_info setup to common code
  ia64: move the ROOT_DEV setup to common code
  ia64: rework iommu probing
  ia64: remove the unused sn_coherency_id symbol
  ia64: remove the SGI UV simulator support
  ia64: remove the zx1 swiotlb machvec
  ia64: remove CONFIG_ACPI ifdefs
  ia64: remove CONFIG_PCI ifdefs
  ia64: remove the hpsim platform
  ia64: remove now unused machvec indirections
  ia64: remove support for the SGI SN2 platform
  drivers: remove the SGI SN2 IOC4 base support
  drivers: remove the SGI SN2 IOC3 base support
  qla2xxx: remove SGI SN2 support
  qla1280: remove SGI SN2 support
  misc/sgi-xp: remove SGI SN2 support
  char/mspec: remove SGI SN2 support
  ...
2019-09-16 15:32:01 -07:00
Linus Torvalds
e77fafe9af arm64 updates for 5.4:
- 52-bit virtual addressing in the kernel
 
 - New ABI to allow tagged user pointers to be dereferenced by syscalls
 
 - Early RNG seeding by the bootloader
 
 - Improve robustness of SMP boot
 
 - Fix TLB invalidation in light of recent architectural clarifications
 
 - Support for i.MX8 DDR PMU
 
 - Remove direct LSE instruction patching in favour of static keys
 
 - Function error injection using kprobes
 
 - Support for the PPTT "thread" flag introduced by ACPI 6.3
 
 - Move PSCI idle code into proper cpuidle driver
 
 - Relaxation of implicit I/O memory barriers
 
 - Build with RELR relocations when toolchain supports them
 
 - Numerous cleanups and non-critical fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAl1yYREQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNAM3CAChqDFQkryXoHwdeEcaukMRVNxtxOi4pM4g
 5xqkb7PoqRJssIblsuhaXjrSD97yWCgaqCmFe6rKoes++lP4bFcTe22KXPPyPBED
 A+tK4nTuKKcZfVbEanUjI+ihXaHJmKZ/kwAxWsEBYZ4WCOe3voCiJVNO2fHxqg1M
 8TskZ2BoayTbWMXih0eJg2MCy/xApBq4b3nZG4bKI7Z9UpXiKN1NYtDh98ZEBK4V
 d/oNoHsJ2ZvIQsztoBJMsvr09DTCazCijWZiECadm6l41WEPFizngrACiSJLLtYo
 0qu4qxgg9zgFlvBCRQmIYSggTuv35RgXSfcOwChmW5DUjHG+f9GK
 =Ru4B
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "Although there isn't tonnes of code in terms of line count, there are
  a fair few headline features which I've noted both in the tag and also
  in the merge commits when I pulled everything together.

  The part I'm most pleased with is that we had 35 contributors this
  time around, which feels like a big jump from the usual small group of
  core arm64 arch developers. Hopefully they all enjoyed it so much that
  they'll continue to contribute, but we'll see.

  It's probably worth highlighting that we've pulled in a branch from
  the risc-v folks which moves our CPU topology code out to where it can
  be shared with others.

  Summary:

   - 52-bit virtual addressing in the kernel

   - New ABI to allow tagged user pointers to be dereferenced by
     syscalls

   - Early RNG seeding by the bootloader

   - Improve robustness of SMP boot

   - Fix TLB invalidation in light of recent architectural
     clarifications

   - Support for i.MX8 DDR PMU

   - Remove direct LSE instruction patching in favour of static keys

   - Function error injection using kprobes

   - Support for the PPTT "thread" flag introduced by ACPI 6.3

   - Move PSCI idle code into proper cpuidle driver

   - Relaxation of implicit I/O memory barriers

   - Build with RELR relocations when toolchain supports them

   - Numerous cleanups and non-critical fixes"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (114 commits)
  arm64: remove __iounmap
  arm64: atomics: Use K constraint when toolchain appears to support it
  arm64: atomics: Undefine internal macros after use
  arm64: lse: Make ARM64_LSE_ATOMICS depend on JUMP_LABEL
  arm64: asm: Kill 'asm/atomic_arch.h'
  arm64: lse: Remove unused 'alt_lse' assembly macro
  arm64: atomics: Remove atomic_ll_sc compilation unit
  arm64: avoid using hard-coded registers for LSE atomics
  arm64: atomics: avoid out-of-line ll/sc atomics
  arm64: Use correct ll/sc atomic constraints
  jump_label: Don't warn on __exit jump entries
  docs/perf: Add documentation for the i.MX8 DDR PMU
  perf/imx_ddr: Add support for AXI ID filtering
  arm64: kpti: ensure patched kernel text is fetched from PoU
  arm64: fix fixmap copy for 16K pages and 48-bit VA
  perf/smmuv3: Validate groups for global filtering
  perf/smmuv3: Validate group size
  arm64: Relax Documentation/arm64/tagged-pointers.rst
  arm64: kvm: Replace hardcoded '1' with SYS_PAR_EL1_F
  arm64: mm: Ignore spurious translation faults taken from the kernel
  ...
2019-09-16 14:31:40 -07:00
Linus Torvalds
52a5525214 IOMMU Updates for Linux v5.4:
Including:
 
 	- Batched unmap support for the IOMMU-API
 
 	- Support for unlocked command queueing in the ARM-SMMU driver
 
 	- Rework the ATS support in the ARM-SMMU driver
 
 	- More refactoring in the ARM-SMMU driver to support hardware
 	  implemention specific quirks and errata
 
 	- Bounce buffering DMA-API implementatation in the Intel VT-d driver
 	  for untrusted devices (like Thunderbolt devices)
 
 	- Fixes for runtime PM support in the OMAP iommu driver
 
 	- MT8183 IOMMU support in the Mediatek IOMMU driver
 
 	- Rework of the way the IOMMU core sets the default domain type for
 	  groups. Changing the default domain type on x86 does not require two
 	  kernel parameters anymore.
 
 	- More smaller fixes and cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAl1/pdoACgkQK/BELZcB
 GuMvCw/+K1GPyyZbPWAuXcnclraSZTHXS1lV0yilBXXyT2omFRQpRJYZGN/8NTbE
 SqD2FtzTKGuGSy2jA0drd3RcMKK/zZsFYnJShiM3FHLXatZdaFrnkK7vRHuzKlHf
 dvOlH7gHKtjIPPXodUEb0xd/oRAEIVsKjJyq1fBMARPPAluhU7mIFUI/xbGvX17g
 LM00hIxEhVNsSPemU2kTVISNBPVneecNVLlKXySjp0YPW/+sf8R7tTvwlSXX6h3I
 JM6wOU479O8mBvIcpAjfZlanHCHtqLk0ybaPx666DjdgYx6cUBHbDCF0P57XnGJA
 HNeVGtBwGQb8VWgbPLJKrStSOzYudDG8ndctqfKYN7uiPDjYM2/sqXcwQSVXR9vX
 AjT2s0GFEWT/AJhgBSeg9PJilEX1hPtomGKcQhKfR0wRGycixeZJFbwToQqzJrZN
 7XoORbZPH1S5W6sjXsXH3eVPW3EGnKipulJSPGDqFLa2aIUG+PXuu/A+iJS6sADh
 mqzVfcEs3/NYsro40eA/iQc0t99ftJXgpX18KxYprjyL6VWcwC/xeHcT/Zw9abxz
 r7dYDGTR0z6RIew0GOaeZVdZJh/J6yraKCNDS0ARZgol6JPaU7HGHhh6Ohdmmu8L
 Gtsgdxp4NeLEgp4mQiRvvpQ5pPJ/YR+oCOx3v+PPnKRLhMTxymQ=
 =UF08
 -----END PGP SIGNATURE-----

Merge tag 'iommu-updates-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu

Pull iommu updates from Joerg Roedel:

 - batched unmap support for the IOMMU-API

 - support for unlocked command queueing in the ARM-SMMU driver

 - rework the ATS support in the ARM-SMMU driver

 - more refactoring in the ARM-SMMU driver to support hardware
   implemention specific quirks and errata

 - bounce buffering DMA-API implementatation in the Intel VT-d driver
   for untrusted devices (like Thunderbolt devices)

 - fixes for runtime PM support in the OMAP iommu driver

 - MT8183 IOMMU support in the Mediatek IOMMU driver

 - rework of the way the IOMMU core sets the default domain type for
   groups. Changing the default domain type on x86 does not require two
   kernel parameters anymore.

 - more smaller fixes and cleanups

* tag 'iommu-updates-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (113 commits)
  iommu/vt-d: Declare Broadwell igfx dmar support snafu
  iommu/vt-d: Add Scalable Mode fault information
  iommu/vt-d: Use bounce buffer for untrusted devices
  iommu/vt-d: Add trace events for device dma map/unmap
  iommu/vt-d: Don't switch off swiotlb if bounce page is used
  iommu/vt-d: Check whether device requires bounce buffer
  swiotlb: Split size parameter to map/unmap APIs
  iommu/omap: Mark pm functions __maybe_unused
  iommu/ipmmu-vmsa: Disable cache snoop transactions on R-Car Gen3
  iommu/ipmmu-vmsa: Move IMTTBCR_SL0_TWOBIT_* to restore sort order
  iommu: Don't use sme_active() in generic code
  iommu/arm-smmu-v3: Fix build error without CONFIG_PCI_ATS
  iommu/qcom: Use struct_size() helper
  iommu: Remove wrong default domain comments
  iommu/dma: Fix for dereferencing before null checking
  iommu/mediatek: Clean up struct mtk_smi_iommu
  memory: mtk-smi: Get rid of need_larbid
  iommu/mediatek: Fix VLD_PA_RNG register backup when suspend
  memory: mtk-smi: Add bus_sel for mt8183
  memory: mtk-smi: Invoke pm runtime_callback to enable clocks
  ...
2019-09-16 14:14:40 -07:00
Linus Torvalds
c17112a5c4 core-process-v5.4
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXXe8mQAKCRCRxhvAZXjc
 ou7oAQCszihkNfpjORSSSOqenMDrxxDW++A7TIOLuq7UyZQl8QD+LM1wvT/xypfJ
 ORD9XX8+Wrv07AQn85fZBEFXGrnengk=
 =o+VL
 -----END PGP SIGNATURE-----

Merge tag 'core-process-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull pidfd/waitid updates from Christian Brauner:
 "This contains two features and various tests.

  First, it adds support for waiting on process through pidfds by adding
  the P_PIDFD type to the waitid() syscall. This completes the basic
  functionality of the pidfd api (cf. [1]). In the meantime we also have
  a new adition to the userspace projects that make use of the pidfd
  api. The qt project was nice enough to send a mail pointing out that
  they have a pr up to switch to the pidfd api (cf. [2]).

  Second, this tag contains an extension to the waitid() syscall to make
  it possible to wait on the current process group in a race free manner
  (even though the actual problem is very unlikely) by specifing 0
  together with the P_PGID type. This extension traces back to a
  discussion on the glibc development mailing list.

  There are also a range of tests for the features above. Additionally,
  the test-suite which detected the pidfd-polling race we fixed in [3]
  is included in this tag"

[1] https://lwn.net/Articles/794707/
[2] https://codereview.qt-project.org/c/qt/qtbase/+/108456
[3] commit b191d6491b ("pidfd: fix a poll race when setting exit_state")

* tag 'core-process-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  waitid: Add support for waiting for the current process group
  tests: add pidfd poll tests
  tests: move common definitions and functions into pidfd.h
  pidfd: add pidfd_wait tests
  pidfd: add P_PIDFD to waitid()
2019-09-16 09:28:19 -07:00
David S. Miller
28f2c362db Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2019-09-16

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Now that initial BPF backend for gcc has been merged upstream, enable
   BPF kselftest suite for bpf-gcc. Also fix a BE issue with access to
   bpf_sysctl.file_pos, from Ilya.

2) Follow-up fix for link-vmlinux.sh to remove bash-specific extensions
   related to recent work on exposing BTF info through sysfs, from Andrii.

3) AF_XDP zero copy fixes for i40e and ixgbe driver which caused umem
   headroom to be added twice, from Ciara.

4) Refactoring work to convert sock opt tests into test_progs framework
   in BPF kselftests, from Stanislav.

5) Fix a general protection fault in dev_map_hash_update_elem(), from Toke.

6) Cleanup to use BPF_PROG_RUN() macro in KCM, from Sami.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-16 16:02:03 +02:00
Ingo Molnar
563c4f85f9 Merge branch 'sched/rt' into sched/core, to pick up -rt changes
Pick up the first couple of patches working towards PREEMPT_RT.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-16 14:05:04 +02:00
Petr Mladek
ae88de56a1 Merge branch 'for-5.4' into for-linus 2019-09-16 12:54:25 +02:00
Ilya Leoshkevich
d895a0f16f bpf: fix accessing bpf_sysctl.file_pos on s390
"ctx:file_pos sysctl:read write ok" fails on s390 with "Read value  !=
nux". This is because verifier rewrites a complete 32-bit
bpf_sysctl.file_pos update to a partial update of the first 32 bits of
64-bit *bpf_sysctl_kern.ppos, which is not correct on big-endian
systems.

Fix by using an offset on big-endian systems.

Ditto for bpf_sysctl.file_pos reads. Currently the test does not detect
a problem there, since it expects to see 0, which it gets with high
probability in error cases, so change it to seek to offset 3 and expect
3 in bpf_sysctl.file_pos.

Fixes: e1550bfe0d ("bpf: Add file_pos field to bpf_sysctl ctx")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20190816105300.49035-1-iii@linux.ibm.com/
2019-09-16 11:44:05 +02:00
Toke Høiland-Jørgensen
af58e7ee6a xdp: Fix race in dev_map_hash_update_elem() when replacing element
syzbot found a crash in dev_map_hash_update_elem(), when replacing an
element with a new one. Jesper correctly identified the cause of the crash
as a race condition between the initial lookup in the map (which is done
before taking the lock), and the removal of the old element.

Rather than just add a second lookup into the hashmap after taking the
lock, fix this by reworking the function logic to take the lock before the
initial lookup.

Fixes: 6f9d451ab1 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
Reported-and-tested-by: syzbot+4e7a85b1432052e8d6f8@syzkaller.appspotmail.com
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-09-16 10:19:51 +02:00
Johannes Berg
786b2384bf um: Enable CONFIG_CONSTRUCTORS
We do need to call the constructors for *modules*, and
at least for KASAN in the future, we must call even the
kernel constructors only later when the kernel has been
initialized.

Instead of relying on libc to call them, emit an empty
section for libc and let the kernel's CONSTRUCTORS code
do the rest of the job.

Tested that it indeed doesn't work in modules, and does
work after the fixes in both, with a few functions with
__attribute__((constructor)) in both dynamic and static
builds.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2019-09-15 21:37:13 +02:00
David S. Miller
aa2eaa8c27 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Minor overlapping changes in the btusb and ixgbe drivers.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-15 14:17:27 +02:00
Linus Torvalds
36024fcf8d Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Don't corrupt xfrm_interface parms before validation, from Nicolas
    Dichtel.

 2) Revert use of usb-wakeup in btusb, from Mario Limonciello.

 3) Block ipv6 packets in bridge netfilter if ipv6 is disabled, from
    Leonardo Bras.

 4) IPS_OFFLOAD not honored in ctnetlink, from Pablo Neira Ayuso.

 5) Missing ULP check in sock_map, from John Fastabend.

 6) Fix receive statistic handling in forcedeth, from Zhu Yanjun.

 7) Fix length of SKB allocated in 6pack driver, from Christophe
    JAILLET.

 8) ip6_route_info_create() returns an error pointer, not NULL. From
    Maciej Żenczykowski.

 9) Only add RDS sock to the hashes after rs_transport is set, from
    Ka-Cheong Poon.

10) Don't double clean TX descriptors in ixgbe, from Ilya Maximets.

11) Presence of transmit IPSEC offload in an SKB is not tested for
    correctly in ixgbe and ixgbevf. From Steffen Klassert and Jeff
    Kirsher.

12) Need rcu_barrier() when register_netdevice() takes one of the
    notifier based failure paths, from Subash Abhinov Kasiviswanathan.

13) Fix leak in sctp_do_bind(), from Mao Wenan.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (72 commits)
  cdc_ether: fix rndis support for Mediatek based smartphones
  sctp: destroy bucket if failed to bind addr
  sctp: remove redundant assignment when call sctp_get_port_local
  sctp: change return type of sctp_get_port_local
  ixgbevf: Fix secpath usage for IPsec Tx offload
  sctp: Fix the link time qualifier of 'sctp_ctrlsock_exit()'
  ixgbe: Fix secpath usage for IPsec TX offload.
  net: qrtr: fix memort leak in qrtr_tun_write_iter
  net: Fix null de-reference of device refcount
  ipv6: Fix the link time qualifier of 'ping_v6_proc_exit_net()'
  tun: fix use-after-free when register netdev failed
  tcp: fix tcp_ecn_withdraw_cwr() to clear TCP_ECN_QUEUE_CWR
  ixgbe: fix double clean of Tx descriptors with xdp
  ixgbe: Prevent u8 wrapping of ITR value to something less than 10us
  mlx4: fix spelling mistake "veify" -> "verify"
  net: hns3: fix spelling mistake "undeflow" -> "underflow"
  net: lmc: fix spelling mistake "runnin" -> "running"
  NFC: st95hf: fix spelling mistake "receieve" -> "receive"
  net/rds: An rds_sock is added too early to the hash table
  mac80211: Do not send Layer 2 Update frame before authorization
  ...
2019-09-14 12:20:38 -07:00
Daniel Jordan
c51636a306 padata: remove cpu_index from the parallel_queue
With the removal of the ENODATA case from padata_get_next, the cpu_index
field is no longer useful, so it can go away.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-09-13 21:15:41 +10:00
Daniel Jordan
bfde23ce20 padata: unbind parallel jobs from specific CPUs
Padata binds the parallel part of a job to a single CPU and round-robins
over all CPUs in the system for each successive job.  Though the serial
parts rely on per-CPU queues for correct ordering, they're not necessary
for parallel work, and it improves performance to run the job locally on
NUMA machines and let the scheduler pick the CPU within a node on a busy
system.

So, make the parallel workqueue unbound.

Update the parallel workqueue's cpumask when the instance's parallel
cpumask changes.

Now that parallel jobs no longer run on max_active=1 workqueues, two or
more parallel works that hash to the same CPU may run simultaneously,
finish out of order, and so be serialized out of order.  Prevent this by
keeping the works sorted on the reorder list by sequence number and
checking that in the reordering logic.

padata_get_next becomes padata_find_next so it can be reused for the end
of padata_reorder, where it's used to avoid uselessly queueing work when
the next job by sequence number isn't finished yet but a later job that
hashed to the same CPU has.

The ENODATA case in padata_find_next no longer makes sense because
parallel jobs aren't bound to specific CPUs.  The EINPROGRESS case takes
care of the scenario where a parallel job is potentially running on the
same CPU as padata_find_next, and with only one error code left, just
use NULL instead.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-09-13 21:15:40 +10:00
Daniel Jordan
45d153c08b padata: use separate workqueues for parallel and serial work
padata currently uses one per-CPU workqueue per instance for all work.

Prepare for running parallel jobs on an unbound workqueue by introducing
dedicated workqueues for parallel and serial work.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-09-13 21:15:40 +10:00
Daniel Jordan
cc491d8e64 padata, pcrypt: take CPU hotplug lock internally in padata_alloc_possible
With pcrypt's cpumask no longer used, take the CPU hotplug lock inside
padata_alloc_possible.

Useful later in the series for avoiding nested acquisition of the CPU
hotplug lock in padata when padata_alloc_possible is allocating an
unbound workqueue.

Without this patch, this nested acquisition would happen later in the
series:

      pcrypt_init_padata
        get_online_cpus
        alloc_padata_possible
          alloc_padata
            alloc_workqueue(WQ_UNBOUND)   // later in the series
              alloc_and_link_pwqs
                apply_wqattrs_lock
                  get_online_cpus         // recursive rwsem acquisition

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-09-13 21:15:40 +10:00
Daniel Jordan
e6ce0e0807 padata: make padata_do_parallel find alternate callback CPU
padata_do_parallel currently returns -EINVAL if the callback CPU isn't
in the callback cpumask.

pcrypt tries to prevent this situation by keeping its own callback
cpumask in sync with padata's and checks that the callback CPU it passes
to padata is valid.  Make padata handle this instead.

padata_do_parallel now takes a pointer to the callback CPU and updates
it for the caller if an alternate CPU is used.  Overall behavior in
terms of which callback CPUs are chosen stays the same.

Prepares for removal of the padata cpumask notifier in pcrypt, which
will fix a lockdep complaint about nested acquisition of the CPU hotplug
lock later in the series.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-09-13 21:15:40 +10:00
Daniel Jordan
509b320489 workqueue: require CPU hotplug read exclusion for apply_workqueue_attrs
Change the calling convention for apply_workqueue_attrs to require CPU
hotplug read exclusion.

Avoids lockdep complaints about nested calls to get_online_cpus in a
future patch where padata calls apply_workqueue_attrs when changing
other CPU-hotplug-sensitive data structures with the CPU read lock
already held.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-09-13 21:15:40 +10:00
Daniel Jordan
513c98d086 workqueue: unconfine alloc/apply/free_workqueue_attrs()
padata will use these these interfaces in a later patch, so unconfine them.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-09-13 21:15:39 +10:00
Daniel Jordan
b128a30409 padata: allocate workqueue internally
Move workqueue allocation inside of padata to prepare for further
changes to how padata uses workqueues.

Guarantees the workqueue is created with max_active=1, which padata
relies on to work correctly.  No functional change.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-crypto@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-09-13 21:15:39 +10:00
Miles Chen
4adcdcea71 sched/psi: Correct overly pessimistic size calculation
When passing a equal or more then 32 bytes long string to psi_write(),
psi_write() copies 31 bytes to its buf and overwrites buf[30]
with '\0'. Which makes the input string 1 byte shorter than
it should be.

Fix it by copying sizeof(buf) bytes when nbytes >= sizeof(buf).

This does not cause problems in normal use case like:
"some 500000 10000000" or "full 500000 10000000" because they
are less than 32 bytes in length.

	/* assuming nbytes == 35 */
	char buf[32];

	buf_size = min(nbytes, (sizeof(buf) - 1)); /* buf_size = 31 */
	if (copy_from_user(buf, user_buf, buf_size))
		return -EFAULT;

	buf[buf_size - 1] = '\0'; /* buf[30] = '\0' */

Before:

 %cd /proc/pressure/
 %echo "123456789|123456789|123456789|1234" > memory
 [   22.473497] nbytes=35,buf_size=31
 [   22.473775] 123456789|123456789|123456789| (print 30 chars)
 %sh: write error: Invalid argument

 %echo "123456789|123456789|123456789|1" > memory
 [   64.916162] nbytes=32,buf_size=31
 [   64.916331] 123456789|123456789|123456789| (print 30 chars)
 %sh: write error: Invalid argument

After:

 %cd /proc/pressure/
 %echo "123456789|123456789|123456789|1234" > memory
 [  254.837863] nbytes=35,buf_size=32
 [  254.838541] 123456789|123456789|123456789|1 (print 31 chars)
 %sh: write error: Invalid argument

 %echo "123456789|123456789|123456789|1" > memory
 [ 9965.714935] nbytes=32,buf_size=32
 [ 9965.715096] 123456789|123456789|123456789|1 (print 31 chars)
 %sh: write error: Invalid argument

Also remove the superfluous parentheses.

Signed-off-by: Miles Chen <miles.chen@mediatek.com>
Cc: <linux-mediatek@lists.infradead.org>
Cc: <wsd_upstream@mediatek.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190912103452.13281-1-miles.chen@mediatek.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-13 07:49:28 +02:00
Quentin Perret
eb92692b25 sched/fair: Speed-up energy-aware wake-ups
EAS computes the energy impact of migrating a waking task when deciding
on which CPU it should run. However, the current approach is known to
have a high algorithmic complexity, which can result in prohibitively
high wake-up latencies on systems with complex energy models, such as
systems with per-CPU DVFS. On such systems, the algorithm complexity is
in O(n^2) (ignoring the cost of searching for performance states in the
EM) with 'n' the number of CPUs.

To address this, re-factor the EAS wake-up path to compute the energy
'delta' (with and without the task) on a per-performance domain basis,
rather than system-wide, which brings the complexity down to O(n).

No functional changes intended.

Test results
~~~~~~~~~~~~

* Setup: Tested on a Google Pixel 3, with a Snapdragon 845 (4+4 CPUs,
  A55/A75). Base kernel is 5.3-rc5 + Pixel3 specific patches. Android
  userspace, no graphics.

* Test case:  Run a periodic rt-app task, with 16ms period, ramping down
  from 70% to 10%, in 5% steps of 500 ms each (json avail. at [1]).
  Frequencies of all CPUs are pinned to max (using scaling_min_freq
  CPUFreq sysfs entries) to reduce variability. The time to run
  select_task_rq_fair() is measured using the function profiler
  (/sys/kernel/debug/tracing/trace_stat/function*). See the test script
  for more details [2].

Test 1:

I hacked the DT to 'fake' per-CPU DVFS. That is, we end up with one
CPUFreq policy per CPU (8 policies in total). Since all frequencies are
pinned to max for the test, this should have no impact on the actual
frequency selection, but it does in the EAS calculation.

      +---------------------------+----------------------------------+
      | Without patch             | With patch                       |
+-----+-----+----------+----------+-----+-----------------+----------+
| CPU | Hit | Avg (us) | s^2 (us) | Hit | Avg (us)        | s^2 (us) |
|-----+-----+----------+----------+-----+-----------------+----------+
|  0  | 274 | 38.303   | 1750.239 | 401 | 14.126 (-63.1%) | 146.625  |
|  1  | 197 | 49.529   | 1695.852 | 314 | 16.135 (-67.4%) | 167.525  |
|  2  | 142 | 34.296   | 1758.665 | 302 | 14.133 (-58.8%) | 130.071  |
|  3  | 172 | 31.734   | 1490.975 | 641 | 14.637 (-53.9%) | 139.189  |
|  4  | 316 | 7.834    | 178.217  | 425 | 5.413  (-30.9%) | 20.803   |
|  5  | 447 | 8.424    | 144.638  | 556 | 5.929  (-29.6%) | 27.301   |
|  6  | 581 | 14.886   | 346.793  | 456 | 5.711  (-61.6%) | 23.124   |
|  7  | 456 | 10.005   | 211.187  | 997 | 4.708  (-52.9%) | 21.144   |
+-----+-----+----------+----------+-----+-----------------+----------+
             * Hit, Avg and s^2 are as reported by the function profiler

Test 2:
I also ran the same test with a normal DT, with 2 CPUFreq policies, to
see if this causes regressions in the most common case.

      +---------------------------+----------------------------------+
      | Without patch             | With patch                       |
+-----+-----+----------+----------+-----+-----------------+----------+
| CPU | Hit | Avg (us) | s^2 (us) | Hit | Avg (us)        | s^2 (us) |
|-----+-----+----------+----------+-----+-----------------+----------+
|  0  | 345 | 22.184   | 215.321  | 580 | 18.635 (-16.0%) | 146.892  |
|  1  | 358 | 18.597   | 200.596  | 438 | 12.934 (-30.5%) | 104.604  |
|  2  | 359 | 25.566   | 200.217  | 397 | 10.826 (-57.7%) | 74.021   |
|  3  | 362 | 16.881   | 200.291  | 718 | 11.455 (-32.1%) | 102.280  |
|  4  | 457 | 3.822    | 9.895    | 757 | 4.616  (+20.8%) | 13.369   |
|  5  | 344 | 4.301    | 7.121    | 594 | 5.320  (+23.7%) | 18.798   |
|  6  | 472 | 4.326    | 7.849    | 464 | 5.648  (+30.6%) | 22.022   |
|  7  | 331 | 4.630    | 13.937   | 408 | 5.299  (+14.4%) | 18.273   |
+-----+-----+----------+----------+-----+-----------------+----------+
             * Hit, Avg and s^2 are as reported by the function profiler

In addition to these two tests, I also ran 50 iterations of the Lisa
EAS functional test suite [3] with this patch applied on Arm Juno r0,
Arm Juno r2, Arm TC2 and Hikey960, and could not see any regressions
(all EAS functional tests are passing).

 [1] https://paste.debian.net/1100055/
 [2] https://paste.debian.net/1100057/
 [3] https://github.com/ARM-software/lisa/blob/master/lisa/tests/scheduler/eas_behaviour.py

Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: morten.rasmussen@arm.com
Cc: qais.yousef@arm.com
Cc: qperret@qperret.net
Cc: rjw@rjwysocki.net
Cc: tkjos@google.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: https://lkml.kernel.org/r/20190912094404.13802-1-qperret@qperret.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-13 07:45:17 +02:00
Roman Gushchin
97a6136983 cgroup: freezer: fix frozen state inheritance
If a new child cgroup is created in the frozen cgroup hierarchy
(one or more of ancestor cgroups is frozen), the CGRP_FREEZE cgroup
flag should be set. Otherwise if a process will be attached to the
child cgroup, it won't become frozen.

The problem can be reproduced with the test_cgfreezer_mkdir test.

This is the output before this patch:
  ~/test_freezer
  ok 1 test_cgfreezer_simple
  ok 2 test_cgfreezer_tree
  ok 3 test_cgfreezer_forkbomb
  Cgroup /sys/fs/cgroup/cg_test_mkdir_A/cg_test_mkdir_B isn't frozen
  not ok 4 test_cgfreezer_mkdir
  ok 5 test_cgfreezer_rmdir
  ok 6 test_cgfreezer_migrate
  ok 7 test_cgfreezer_ptrace
  ok 8 test_cgfreezer_stopped
  ok 9 test_cgfreezer_ptraced
  ok 10 test_cgfreezer_vfork

And with this patch:
  ~/test_freezer
  ok 1 test_cgfreezer_simple
  ok 2 test_cgfreezer_tree
  ok 3 test_cgfreezer_forkbomb
  ok 4 test_cgfreezer_mkdir
  ok 5 test_cgfreezer_rmdir
  ok 6 test_cgfreezer_migrate
  ok 7 test_cgfreezer_ptrace
  ok 8 test_cgfreezer_stopped
  ok 9 test_cgfreezer_ptraced
  ok 10 test_cgfreezer_vfork

Reported-by: Mark Crossen <mcrossen@fb.com>
Signed-off-by: Roman Gushchin <guro@fb.com>
Fixes: 76f969e894 ("cgroup: cgroup v2 freezer")
Cc: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-09-12 14:04:45 -07:00
Linus Torvalds
98dcb386e5 for-linus-20190912
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXXpIKgAKCRCRxhvAZXjc
 om8yAQCIPJp2HWNsJRPRl9KVKRmR6MxItG1Hpj0MvgwzLEjufwD/SF9VAPgl2AmD
 oaAXrOYH4yhr9aaFlVuMpe2aWAZdxgU=
 =Tjvf
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190912' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux

Pull clone3 fix from Christian Brauner:
 "This is a last-minute bugfix for clone3() that should go in before we
  release 5.3 with clone3().

  clone3() did not verify that the exit_signal argument was set to a
  valid signal. This can be used to cause a crash by specifying a signal
  greater than NSIG. e.g. -1.

  The commit from Eugene adds a check to copy_clone_args_from_user() to
  verify that the exit signal is limited by CSIGNAL as with legacy
  clone() and that the signal is valid. With this we don't get the
  legacy clone behavior were an invalid signal could be handed down and
  would only be detected and then ignored in do_notify_parent(). Users
  of clone3() will now get a proper error right when they pass an
  invalid exit signal. Note, that this is not a change in user-visible
  behavior since no kernel with clone3() has been released yet"

* tag 'for-linus-20190912' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux:
  fork: block invalid exit signals with clone3()
2019-09-12 14:50:14 +01:00
Eugene Syromiatnikov
a0eb9abd8a
fork: block invalid exit signals with clone3()
Previously, higher 32 bits of exit_signal fields were lost when copied
to the kernel args structure (that uses int as a type for the respective
field). Moreover, as Oleg has noted, exit_signal is used unchecked, so
it has to be checked for sanity before use; for the legacy syscalls,
applying CSIGNAL mask guarantees that it is at least non-negative;
however, there's no such thing is done in clone3() code path, and that
can break at least thread_group_leader.

This commit adds a check to copy_clone_args_from_user() to verify that
the exit signal is limited by CSIGNAL as with legacy clone() and that
the signal is valid. With this we don't get the legacy clone behavior
were an invalid signal could be handed down and would only be detected
and ignored in do_notify_parent(). Users of clone3() will now get a
proper error when they pass an invalid exit signal. Note, that this is
not user-visible behavior since no kernel with clone3() has been
released yet.

The following program will cause a splat on a non-fixed clone3() version
and will fail correctly on a fixed version:

 #define _GNU_SOURCE
 #include <linux/sched.h>
 #include <linux/types.h>
 #include <sched.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <sys/syscall.h>
 #include <sys/wait.h>
 #include <unistd.h>

 int main(int argc, char *argv[])
 {
        pid_t pid = -1;
        struct clone_args args = {0};
        args.exit_signal = -1;

        pid = syscall(__NR_clone3, &args, sizeof(struct clone_args));
        if (pid < 0)
                exit(EXIT_FAILURE);

        if (pid == 0)
                exit(EXIT_SUCCESS);

        wait(NULL);

        exit(EXIT_SUCCESS);
 }

Fixes: 7f192e3cd3 ("fork: add clone3")
Reported-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Dmitry V. Levin <ldv@altlinux.org>
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Link: https://lore.kernel.org/r/4b38fa4ce420b119a4c6345f42fe3cec2de9b0b5.1568223594.git.esyr@redhat.com
[christian.brauner@ubuntu.com: simplify check and rework commit message]
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2019-09-12 14:56:33 +02:00
Linus Torvalds
6dcf6a4eb9 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Ingo Molnar:
 "Fix an initialization bug in the hw-breakpoints, which triggered on
  the ARM platform"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/hw_breakpoint: Fix arch_hw_breakpoint use-before-initialization
2019-09-12 11:04:50 +01:00
Linus Torvalds
95779fe850 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Ingo Molnar:
 "Fix a race in the IRQ resend mechanism, which can result in a NULL
  dereference crash"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Prevent NULL pointer dereference in resend_irqs()
2019-09-12 11:02:00 +01:00
Masahiro Yamada
b605be6581 module: remove unneeded casts in cmp_name()
You can pass opaque pointers directly.

I also renamed 'va' and 'vb' into more meaningful arguments.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2019-09-11 21:41:56 +02:00
Will Deacon
069e1c07c1 module: Fix link failure due to invalid relocation on namespace offset
Commit 8651ec01da ("module: add support for symbol namespaces.")
broke linking for arm64 defconfig:

  | lib/crypto/arc4.o: In function `__ksymtab_arc4_setkey':
  | arc4.c:(___ksymtab+arc4_setkey+0x8): undefined reference to `no symbol'
  | lib/crypto/arc4.o: In function `__ksymtab_arc4_crypt':
  | arc4.c:(___ksymtab+arc4_crypt+0x8): undefined reference to `no symbol'

This is because the dummy initialisation of the 'namespace_offset' field
in 'struct kernel_symbol' when using EXPORT_SYMBOL on architectures with
support for PREL32 locations uses an offset from an absolute address (0)
in an effort to trick 'offset_to_pointer' into behaving as a NOP,
allowing non-namespaced symbols to be treated in the same way as those
belonging to a namespace.

Unfortunately, place-relative relocations require a symbol reference
rather than an absolute value and, although x86 appears to get away with
this due to placing the kernel text at the top of the address space, it
almost certainly results in a runtime failure if the kernel is relocated
dynamically as a result of KASLR.

Rework 'namespace_offset' so that a value of 0, which cannot occur for a
valid namespaced symbol, indicates that the corresponding symbol does
not belong to a namespace.

Cc: Matthias Maennich <maennich@google.com>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Fixes: 8651ec01da ("module: add support for symbol namespaces.")
Reported-by: kbuild test robot <lkp@intel.com>
Tested-by: Matthias Maennich <maennich@google.com>
Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Matthias Maennich <maennich@google.com>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2019-09-11 18:53:30 +02:00
Joerg Roedel
e95adb9add Merge branches 'arm/omap', 'arm/exynos', 'arm/smmu', 'arm/mediatek', 'arm/qcom', 'arm/renesas', 'x86/amd', 'x86/vt-d' and 'core' into next 2019-09-11 12:39:19 +02:00
Lu Baolu
3fc1ca0065 swiotlb: Split size parameter to map/unmap APIs
This splits the size parameter to swiotlb_tbl_map_single() and
swiotlb_tbl_unmap_single() into an alloc_size and a mapping_size
parameter, where the latter one is rounded up to the iommu page
size.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2019-09-11 12:34:29 +02:00
Eric W. Biederman
821cc7b0b2
waitid: Add support for waiting for the current process group
It was recently discovered that the linux version of waitid is not a
superset of the other wait functions because it does not include support
for waiting for the current process group. This has two downsides:
1. An extra system call is needed to get the current process group.
2. After the current process group is received and before it is passed
   to waitid a signal could arrive causing the current process group to change.
   Inherent race-conditions as these make it impossible for userspace to
   emulate this functionaly and thus violate async-signal safety
   requirements for waitpid.

Arguments can be made for using a different choice of idtype and id
for this case but the BSDs already use this P_PGID and 0 to indicate
waiting for the current process's process group.  So be nice to user
space programmers and don't introduce an unnecessary incompatibility.

Some people have noted that the posix description is that
waitpid will wait for the current process group, and that in
the presence of pthreads that process group can change.  To get
clarity on this issue I looked at XNU, FreeBSD, and Luminos.  All of
those flavors of unix waited for the current process group at the
time of call and as written could not adapt to the process group
changing after the call.

At one point Linux did adapt to the current process group changing but
that stopped in 161550d74c ("pid: sys_wait... fixes").  It has been
over 11 years since Linux has that behavior, no programs that fail
with the change in behavior have been reported, and I could not
find any other unix that does this.  So I think it is safe to clarify
the definition of current process group, to current process group
at the time of the wait function.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Alistair Francis <alistair23@gmail.com>
Cc: Zong Li <zongbox@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Cc: GNU C Library <libc-alpha@sourceware.org>
Link: https://lore.kernel.org/r/20190814154400.6371-2-christian.brauner@ubuntu.com
2019-09-10 17:05:46 +02:00
Thomas Gleixner
77b4b54204 posix-cpu-timers: Fix permission check regression
The recent consolidation of the three permission checks introduced a subtle
regression. For timer_create() with a process wide timer it returns the
current task if the lookup through the PID which is encoded into the
clockid results in returning current.

That's broken because it does not validate whether the current task is the
group leader.

That was caused by the two different variants of permission checks:

  - posix_cpu_timer_get() allowed access to the process wide clock when the
    looked up task is current. That's not an issue because the process wide
    clock is in the shared sighand.

  - posix_cpu_timer_create() made sure that the looked up task is the group
    leader.

Restore the previous state.

Note, that these permission checks are more than questionable, but that's
subject to follow up changes.

Fixes: 6ae40e3fdc ("posix-cpu-timers: Provide task validation functions")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1909052314110.1902@nanos.tec.linutronix.de
2019-09-10 12:13:07 +01:00
Matthias Maennich
3d52ec5e5d module: add config option MODULE_ALLOW_MISSING_NAMESPACE_IMPORTS
If MODULE_ALLOW_MISSING_NAMESPACE_IMPORTS is enabled (default=n), the
requirement for modules to import all namespaces that are used by
the module is relaxed.

Enabling this option effectively allows (invalid) modules to be loaded
while only a warning is emitted.

Disabling this option keeps the enforcement at module loading time and
loading is denied if the module's imports are not satisfactory.

Reviewed-by: Martijn Coenen <maco@android.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Matthias Maennich <maennich@google.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2019-09-10 10:30:27 +02:00
Matthias Maennich
8651ec01da module: add support for symbol namespaces.
The EXPORT_SYMBOL_NS() and EXPORT_SYMBOL_NS_GPL() macros can be used to
export a symbol to a specific namespace.  There are no _GPL_FUTURE and
_UNUSED variants because these are currently unused, and I'm not sure
they are necessary.

I didn't add EXPORT_SYMBOL_NS() for ASM exports; this patch sets the
namespace of ASM exports to NULL by default. In case of relative
references, it will be relocatable to NULL. If there's a need, this
should be pretty easy to add.

A module that wants to use a symbol exported to a namespace must add a
MODULE_IMPORT_NS() statement to their module code; otherwise, modpost
will complain when building the module, and the kernel module loader
will emit an error and fail when loading the module.

MODULE_IMPORT_NS() adds a modinfo tag 'import_ns' to the module. That
tag can be observed by the modinfo command, modpost and kernel/module.c
at the time of loading the module.

The ELF symbols are renamed to include the namespace with an asm label;
for example, symbol 'usb_stor_suspend' in namespace USB_STORAGE becomes
'usb_stor_suspend.USB_STORAGE'.  This allows modpost to do namespace
checking, without having to go through all the effort of parsing ELF and
relocation records just to get to the struct kernel_symbols.

On x86_64 I saw no difference in binary size (compression), but at
runtime this will require a word of memory per export to hold the
namespace. An alternative could be to store namespaced symbols in their
own section and use a separate 'struct namespaced_kernel_symbol' for
that section, at the cost of making the module loader more complex.

Co-developed-by: Martijn Coenen <maco@android.com>
Signed-off-by: Martijn Coenen <maco@android.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Matthias Maennich <maennich@google.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2019-09-10 10:30:17 +02:00
Matthias Maennich
c5e4a062fe module: support reading multiple values per modinfo tag
Similar to modpost's get_next_modinfo(), introduce get_next_modinfo() in
kernel/module.c to acquire any further values associated with the same
modinfo tag name. That is useful for any tags that have multiple
occurrences (such as 'alias'), but is in particular introduced here as
part of the symbol namespaces patch series to read the (potentially)
multiple namespaces a module is importing.

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Martijn Coenen <maco@android.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Matthias Maennich <maennich@google.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2019-09-10 10:30:02 +02:00
Daniel Vetter
312364f353 kernel.h: Add non_block_start/end()
In some special cases we must not block, but there's not a spinlock,
preempt-off, irqs-off or similar critical section already that arms the
might_sleep() debug checks. Add a non_block_start/end() pair to annotate
these.

This will be used in the oom paths of mmu-notifiers, where blocking is not
allowed to make sure there's forward progress. Quoting Michal:

"The notifier is called from quite a restricted context - oom_reaper -
which shouldn't depend on any locks or sleepable conditionals. The code
should be swift as well but we mostly do care about it to make a forward
progress. Checking for sleepable context is the best thing we could come
up with that would describe these demands at least partially."

Peter also asked whether we want to catch spinlocks on top, but Michal
said those are less of a problem because spinlocks can't have an indirect
dependency upon the page allocator and hence close the loop with the oom
reaper.

Suggested by Michal Hocko.

Link: https://lore.kernel.org/r/20190826201425.17547-4-daniel.vetter@ffwll.ch
Acked-by: Christian König <christian.koenig@amd.com> (v1)
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-09-07 04:28:05 -03:00
Sven Schnelle
ea46a13ebf kexec_elf: support 32 bit ELF files
The powerpc version only supported 64 bit. Add some
code to switch decoding of fields during runtime so
we can kexec a 32 bit kernel from a 64 bit kernel and
vice versa.

Signed-off-by: Sven Schnelle <svens@stackframe.org>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Helge Deller <deller@gmx.de>
2019-09-06 23:58:44 +02:00
Sven Schnelle
571ceb7d96 kexec_elf: remove unused variable in kexec_elf_load()
base was never assigned, so we can remove it.

Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@stackframe.org>
Signed-off-by: Helge Deller <deller@gmx.de>
2019-09-06 23:58:44 +02:00
Sven Schnelle
3bd9c3366e kexec_elf: remove Elf_Rel macro
It wasn't used anywhere, so lets drop it.

Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@stackframe.org>
Signed-off-by: Helge Deller <deller@gmx.de>
2019-09-06 23:58:43 +02:00
Sven Schnelle
10ba459f87 kexec_elf: remove PURGATORY_STACK_SIZE
It's not used anywhere so just drop it.

Signed-off-by: Sven Schnelle <svens@stackframe.org>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Helge Deller <deller@gmx.de>
2019-09-06 23:58:43 +02:00
Sven Schnelle
5f71d97720 kexec_elf: remove parsing of section headers
We're not using them, so we can drop the parsing.

Signed-off-by: Sven Schnelle <svens@stackframe.org>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Helge Deller <deller@gmx.de>
2019-09-06 23:58:43 +02:00
Sven Schnelle
d34e0ad3ea kexec_elf: change order of elf_*_to_cpu() functions
Change the order to have a 64/32/16 order, no functional change.

Signed-off-by: Sven Schnelle <svens@stackframe.org>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Helge Deller <deller@gmx.de>
2019-09-06 23:58:43 +02:00
Sven Schnelle
175fca3bf9 kexec: add KEXEC_ELF
Right now powerpc provides an implementation to read elf files
with the kexec_file_load() syscall. Make that available as a public
kexec interface so it can be re-used on other architectures.

Signed-off-by: Sven Schnelle <svens@stackframe.org>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Helge Deller <deller@gmx.de>
2019-09-06 23:58:43 +02:00
David S. Miller
1e46c09ec1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Add the ability to use unaligned chunks in the AF_XDP umem. By
   relaxing where the chunks can be placed, it allows to use an
   arbitrary buffer size and place whenever there is a free
   address in the umem. Helps more seamless DPDK AF_XDP driver
   integration. Support for i40e, ixgbe and mlx5e, from Kevin and
   Maxim.

2) Addition of a wakeup flag for AF_XDP tx and fill rings so the
   application can wake up the kernel for rx/tx processing which
   avoids busy-spinning of the latter, useful when app and driver
   is located on the same core. Support for i40e, ixgbe and mlx5e,
   from Magnus and Maxim.

3) bpftool fixes for printf()-like functions so compiler can actually
   enforce checks, bpftool build system improvements for custom output
   directories, and addition of 'bpftool map freeze' command, from Quentin.

4) Support attaching/detaching XDP programs from 'bpftool net' command,
   from Daniel.

5) Automatic xskmap cleanup when AF_XDP socket is released, and several
   barrier/{read,write}_once fixes in AF_XDP code, from Björn.

6) Relicense of bpf_helpers.h/bpf_endian.h for future libbpf
   inclusion as well as libbpf versioning improvements, from Andrii.

7) Several new BPF kselftests for verifier precision tracking, from Alexei.

8) Several BPF kselftest fixes wrt endianess to run on s390x, from Ilya.

9) And more BPF kselftest improvements all over the place, from Stanislav.

10) Add simple BPF map op cache for nfp driver to batch dumps, from Jakub.

11) AF_XDP socket umem mapping improvements for 32bit archs, from Ivan.

12) Add BPF-to-BPF call and BTF line info support for s390x JIT, from Yauheni.

13) Small optimization in arm64 JIT to spare 1 insns for BPF_MOD, from Jerin.

14) Fix an error check in bpf_tcp_gen_syncookie() helper, from Petar.

15) Various minor fixes and cleanups, from Nathan, Masahiro, Masanari,
    Peter, Wei, Yue.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-06 16:49:17 +02:00
Thomas Gleixner
9cc5b7fba5 irqchip updates for Linux 5.4
- Large GICv3 updates to support new PPI and SPI ranges
 - Conver all alloc_fwnode() users to use PAs instead of VAs
 - Add support for Marvell's MMP3 irqchip
 - Add support for Amlogic Meson SM1
 - Various cleanups and fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAl1yLdwPHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpDVHgP+wTAtNzxF4nRsJO3ZlbKe7n2ut7toZvqip7U
 97Ll652m/24stioU7b4/Z92E1fWIQe6Bv7u5detNQ4bIg2uwD6ZDYMjLLOXsJXrQ
 XH7pEAVe2eO0rjZudodUteyPGZQAS1/okY/4qsYNDiWKHi5xQPwMBNJ6fVwOVXUk
 HG7ClMny1i7SSRm3CGmW3Hxb279DrHOxCGPO6aEX5UQ/uZErCTJN5oFdLfAKvfaU
 qRpzetq4QOO+L7UBer6+jmtAvA/kuYwpGSnhJXhFOLMSgrGLQbINvYeAyv2vKGvF
 l6HQNuEf+OQjR5RcUZi6qbAO/WEh2xzmxpqta8COwjBFDcwsITyc5wNRNEfmNqO9
 Yrh3/a/J0eWX6X4rmmJkMDEBQvNbyA1//9d5NIjPhIVERZYPa4Z2lb220P4+58r7
 xnCqZnxZO78BhUd7HXeCluuTJs83iHxs03nAHysU9YZWkiF7jDXE0d3kla8cGcdY
 aOaSID0Q52flRxLymWJs29kyXZU8rpiPTbxvi86Ggkvgrj2pSwb8PF8LL4WPkzgh
 BcuF2ca9oa22OlA3oLV/sKwFRS77en+bpzLeU0NjeAcu5b8AY2nSSYXiuvtxIzi3
 FbQCsBBl7i5F5AQT9XGugjeNsCRUsUvIOP4f4w2Ej6HmqJb/SJVMBedMBsjGoEFG
 sj93Nc7f
 =PsGr
 -----END PGP SIGNATURE-----

Merge tag 'irqchip-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core

Pull irqchip updates for Linux 5.4 from Marc Zyngier:

 - Large GICv3 updates to support new PPI and SPI ranges
 - Conver all alloc_fwnode() users to use PAs instead of VAs
 - Add support for Marvell's MMP3 irqchip
 - Add support for Amlogic Meson SM1
 - Various cleanups and fixes
2019-09-06 13:03:39 +02:00
Mark-PK Tsai
310aa0a25b perf/hw_breakpoint: Fix arch_hw_breakpoint use-before-initialization
If we disable the compiler's auto-initialization feature, if
-fplugin-arg-structleak_plugin-byref or -ftrivial-auto-var-init=pattern
are disabled, arch_hw_breakpoint may be used before initialization after:

  9a4903dde2 ("perf/hw_breakpoint: Split attribute parse and commit")

On our ARM platform, the struct step_ctrl in arch_hw_breakpoint, which
used to be zero-initialized by kzalloc(), may be used in
arch_install_hw_breakpoint() without initialization.

Signed-off-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alix Wu <alix.wu@mediatek.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: YJ Chiang <yj.chiang@mediatek.com>
Link: https://lkml.kernel.org/r/20190906060115.9460-1-mark-pk.tsai@mediatek.com
[ Minor edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-06 08:24:01 +02:00
Ingo Molnar
9326011edf Merge branch 'x86/cleanups' into x86/cpu, to pick up dependent changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-06 07:30:23 +02:00
Yunfeng Ye
eddf3e9c7c genirq: Prevent NULL pointer dereference in resend_irqs()
The following crash was observed:

  Unable to handle kernel NULL pointer dereference at 0000000000000158
  Internal error: Oops: 96000004 [#1] SMP
  pc : resend_irqs+0x68/0xb0
  lr : resend_irqs+0x64/0xb0
  ...
  Call trace:
   resend_irqs+0x68/0xb0
   tasklet_action_common.isra.6+0x84/0x138
   tasklet_action+0x2c/0x38
   __do_softirq+0x120/0x324
   run_ksoftirqd+0x44/0x60
   smpboot_thread_fn+0x1ac/0x1e8
   kthread+0x134/0x138
   ret_from_fork+0x10/0x18

The reason for this is that the interrupt resend mechanism happens in soft
interrupt context, which is a asynchronous mechanism versus other
operations on interrupts. free_irq() does not take resend handling into
account. Thus, the irq descriptor might be already freed before the resend
tasklet is executed. resend_irqs() does not check the return value of the
interrupt descriptor lookup and derefences the return value
unconditionally.

  1):
  __setup_irq
    irq_startup
      check_irq_resend  // activate softirq to handle resend irq
  2):
  irq_domain_free_irqs
    irq_free_descs
      free_desc
        call_rcu(&desc->rcu, delayed_free_desc)
  3):
  __do_softirq
    tasklet_action
      resend_irqs
        desc = irq_to_desc(irq)
        desc->handle_irq(desc)  // desc is NULL --> Ooops

Fix this by adding a NULL pointer check in resend_irqs() before derefencing
the irq descriptor.

Fixes: a4633adcdb ("[PATCH] genirq: add genirq sw IRQ-retrigger")
Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1630ae13-5c8e-901e-de09-e740b6a426a7@huawei.com
2019-09-05 21:31:14 +02:00
Thadeu Lima de Souza Cascardo
f18ddc13af alarmtimer: Use EOPNOTSUPP instead of ENOTSUPP
ENOTSUPP is not supposed to be returned to userspace. This was found on an
OpenPower machine, where the RTC does not support set_alarm.

On that system, a clock_nanosleep(CLOCK_REALTIME_ALARM, ...) results in
"524 Unknown error 524"

Replace it with EOPNOTSUPP which results in the expected "95 Operation not
supported" error.

Fixes: 1c6b39ad3f (alarmtimers: Return -ENOTSUPP if no RTC device is present)
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20190903171802.28314-1-cascardo@canonical.com
2019-09-05 21:19:26 +02:00
Zhengjun Xing
ac68154626 tracing: Add "gfp_t" support in synthetic_events
Add "gfp_t" support in synthetic_events, then the "gfp_t" type
parameter in some functions can be traced.

Prints the gfp flags as hex in addition to the human-readable flag
string.  Example output:

  whoopsie-630 [000] ...1 78.969452: testevent: bar=b20 (GFP_ATOMIC|__GFP_ZERO)
    rcuc/0-11  [000] ...1 81.097555: testevent: bar=a20 (GFP_ATOMIC)
    rcuc/0-11  [000] ...1 81.583123: testevent: bar=a20 (GFP_ATOMIC)

Link: http://lkml.kernel.org/r/20190712015308.9908-1-zhengjun.xing@linux.intel.com

Signed-off-by: Zhengjun Xing <zhengjun.xing@linux.intel.com>
[ Added printing of flag names ]
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-09-05 11:35:14 -04:00
Alexei Starovoitov
2339cd6cd0 bpf: fix precision tracking of stack slots
The problem can be seen in the following two tests:
0: (bf) r3 = r10
1: (55) if r3 != 0x7b goto pc+0
2: (7a) *(u64 *)(r3 -8) = 0
3: (79) r4 = *(u64 *)(r10 -8)
..
0: (85) call bpf_get_prandom_u32#7
1: (bf) r3 = r10
2: (55) if r3 != 0x7b goto pc+0
3: (7b) *(u64 *)(r3 -8) = r0
4: (79) r4 = *(u64 *)(r10 -8)

When backtracking need to mark R4 it will mark slot fp-8.
But ST or STX into fp-8 could belong to the same block of instructions.
When backtracing is done the parent state may have fp-8 slot
as "unallocated stack". Which will cause verifier to warn
and incorrectly reject such programs.

Writes into stack via non-R10 register are rare. llvm always
generates canonical stack spill/fill.
For such pathological case fall back to conservative precision
tracking instead of rejecting.

Reported-by: syzbot+c8d66267fd2b5955287e@syzkaller.appspotmail.com
Fixes: b5dc0163d8 ("bpf: precise scalar_value tracking")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-09-05 14:06:58 +02:00
Sebastian Andrzej Siewior
5d2295f3a9 hrtimer: Add a missing bracket and hide `migration_base' on !SMP
The recent change to avoid taking the expiry lock when a timer is currently
migrated missed to add a bracket at the end of the if statement leading to
compile errors.  Since that commit the variable `migration_base' is always
used but it is only available on SMP configuration thus leading to another
compile error.  The changelog says "The timer base and base->cpu_base
cannot be NULL in the code path", so it is safe to limit this check to SMP
configurations only.

Add the missing bracket to the if statement and hide `migration_base'
behind CONFIG_SMP bars.

[ tglx: Mark the functions inline ... ]

Fixes: 68b2c8c1e4 ("hrtimer: Don't take expiry_lock when timer is currently migrated")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190904145527.eah7z56ntwobqm6j@linutronix.de
2019-09-05 10:39:06 +02:00
Masami Hiramatsu
e336b40277 kprobes: Prohibit probing on BUG() and WARN() address
Since BUG() and WARN() may use a trap (e.g. UD2 on x86) to
get the address where the BUG() has occurred, kprobes can not
do single-step out-of-line that instruction. So prohibit
probing on such address.

Without this fix, if someone put a kprobe on WARN(), the
kernel will crash with invalid opcode error instead of
outputing warning message, because kernel can not find
correct bug address.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S . Miller <davem@davemloft.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naveen N . Rao <naveen.n.rao@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/156750890133.19112.3393666300746167111.stgit@devnote2
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-05 10:15:16 +02:00
Ingo Molnar
1251201c0d sched/core: Fix uclamp ABI bug, clean up and robustify sched_read_attr() ABI logic and code
Thadeu Lima de Souza Cascardo reported that 'chrt' broke on recent kernels:

  $ chrt -p $$
  chrt: failed to get pid 26306's policy: Argument list too long

and he has root-caused the bug to the following commit increasing sched_attr
size and breaking sched_read_attr() into returning -EFBIG:

  a509a7cd79 ("sched/uclamp: Extend sched_setattr() to support utilization clamping")

The other, bigger bug is that the whole sched_getattr() and sched_read_attr()
logic of checking non-zero bits in new ABI components is arguably broken,
and pretty much any extension of the ABI will spuriously break the ABI.
That's way too fragile.

Instead implement the perf syscall's extensible ABI instead, which we
already implement on the sched_setattr() side:

 - if user-attributes have the same size as kernel attributes then the
   logic is unchanged.

 - if user-attributes are larger than the kernel knows about then simply
   skip the extra bits, but set attr->size to the (smaller) kernel size
   so that tooling can (in principle) handle older kernel as well.

 - if user-attributes are smaller than the kernel knows about then just
   copy whatever user-space can accept.

Also clean up the whole logic:

 - Simplify the code flow - there's no need for 'ret' for example.

 - Standardize on 'kattr/uattr' and 'ksize/usize' naming to make sure we
   always know which side we are dealing with.

 - Why is it called 'read' when what it does is to copy to user? This
   code is so far away from VFS read() semantics that the naming is
   actively confusing. Name it sched_attr_copy_to_user() instead, which
   mirrors other copy_to_user() functionality.

 - Move the attr->size assignment from the head of sched_getattr() to the
   sched_attr_copy_to_user() function. Nothing else within the kernel
   should care about the size of the structure.

With these fixes the sched_getattr() syscall now nicely supports an
extensible ABI in both a forward and backward compatible fashion, and
will also fix the chrt bug.

As an added bonus the bogus -EFBIG return is removed as well, which as
Thadeu noted should have been -E2BIG to begin with.

Reported-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Acked-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: a509a7cd79 ("sched/uclamp: Extend sched_setattr() to support utilization clamping")
Link: https://lkml.kernel.org/r/20190904075532.GA26751@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-04 19:51:30 +02:00
Masahiro Yamada
858805b336 kbuild: add $(BASH) to run scripts with bash-extension
CONFIG_SHELL falls back to sh when bash is not installed on the system,
but nobody is testing such a case since bash is usually installed.
So, shell scripts invoked by CONFIG_SHELL are only tested with bash.

It makes it difficult to test whether the hashbang #!/bin/sh is real.
For example, #!/bin/sh in arch/powerpc/kernel/prom_init_check.sh is
false. (I fixed it up)

Besides, some shell scripts invoked by CONFIG_SHELL use bash-extension
and #!/bin/bash is specified as the hashbang, while CONFIG_SHELL may
not always be set to bash.

Probably, the right thing to do is to introduce BASH, which is bash by
default, and always set CONFIG_SHELL to sh. Replace $(CONFIG_SHELL)
with $(BASH) for bash scripts.

If somebody tries to add bash-extension to a #!/bin/sh script, it will
be caught in testing because /bin/sh is a symlink to dash on some major
distributions.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-09-04 22:54:13 +09:00
Christoph Hellwig
5cf4537975 dma-mapping: introduce a dma_common_find_pages helper
A helper to find the backing page array based on a virtual address.
This also ensures we do the same vm_flags check everywhere instead
of slightly different or missing ones in a few places.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04 11:13:20 +02:00
Christoph Hellwig
512317401f dma-mapping: always use VM_DMA_COHERENT for generic DMA remap
Currently the generic dma remap allocator gets a vm_flags passed by
the caller that is a little confusing.  We just introduced a generic
vmalloc-level flag to identify the dma coherent allocations, so use
that everywhere and remove the now pointless argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04 11:13:20 +02:00
Christoph Hellwig
249baa5479 dma-mapping: provide a better default ->get_required_mask
Most dma_map_ops instances are IOMMUs that work perfectly fine in 32-bits
of IOVA space, and the generic direct mapping code already provides its
own routines that is intelligent based on the amount of memory actually
present.  Wire up the dma-direct routine for the ARM direct mapping code
as well, and otherwise default to the constant 32-bit mask.  This way
we only need to override it for the occasional odd IOMMU that requires
64-bit IOVA support, or IOMMU drivers that are more efficient if they
can fall back to the direct mapping.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04 11:13:19 +02:00
Christoph Hellwig
d9295532d5 dma-mapping: remove the dma_declare_coherent_memory export
dma_declare_coherent_memory is something that the platform setup code
(which pretty much means the device tree these days) need to do so that
drivers can use the memory as declared by the platform.  Drivers
themselves have no business calling this function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04 11:13:19 +02:00
Christoph Hellwig
7a01ee4220 dma-mapping: remove the dma_mmap_from_dev_coherent export
dma_mmap_from_dev_coherent is only used by dma_map_ops instances,
none of which is modular.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04 11:13:19 +02:00
Christoph Hellwig
1fa0682448 dma-mapping: remove dma_release_declared_memory
This function is entirely unused given that declared memory is
generally provided by platform setup code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04 11:13:19 +02:00
Christoph Hellwig
62fcee9a3b dma-mapping: remove CONFIG_ARCH_NO_COHERENT_DMA_MMAP
CONFIG_ARCH_NO_COHERENT_DMA_MMAP is now functionally identical to
!CONFIG_MMU, so remove the separate symbol.  The only difference is that
arm did not set it for !CONFIG_MMU, but arm uses a separate dma mapping
implementation including its own mmap method, which is handled by moving
the CONFIG_MMU check in dma_can_mmap so that is only applies to the
dma-direct case, just as the other ifdefs for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	# m68k
2019-09-04 11:13:18 +02:00
Christoph Hellwig
e29ccc188f dma-mapping: add a dma_can_mmap helper
Add a helper to check if DMA allocations for a specific device can be
mapped to userspace using dma_mmap_*.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04 11:13:18 +02:00
Christoph Hellwig
f9f3232a7d dma-mapping: explicitly wire up ->mmap and ->get_sgtable
While the default ->mmap and ->get_sgtable implementations work for the
majority of our dma_map_ops impementations they are inherently safe
for others that don't use the page allocator or CMA and/or use their
own way of remapping not covered by the common code.  So remove the
defaults if these methods are not wired up, but instead wire up the
default implementations for all safe instances.

Fixes: e1c7e32453 ("dma-mapping: always provide the dma_map_ops based implementation")
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04 11:13:18 +02:00
Christoph Hellwig
1445146701 dma-mapping: move the dma_get_sgtable API comments from arm to common code
The comments are spot on and should be near the central API, not just
near a single implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04 11:13:17 +02:00
Nadav Amit
d8a050f5a3 kgdb: fix comment regarding static function
The comment that says that module_event() is not static is clearly
wrong.

Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2019-09-03 11:19:41 +01:00
Chuhong Yuan
635714312e kdb: Replace strncmp with str_has_prefix
strncmp(str, const, len) is error-prone.
We had better use newly introduced
str_has_prefix() instead of it.

Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2019-09-03 11:19:31 +01:00
Daniel Lezcano
82e430a6df cpuidle: play_idle: Increase the resolution to usec
The play_idle resolution is 1ms. The intel_powerclamp bases the idle
duration on jiffies. The idle injection API is also using msec based
duration but has no user yet.

Unfortunately, msec based time does not fit well when we want to
inject idle cycle precisely with shallow idle state.

In order to set the scene for the incoming idle injection user, move
the precision up to usec when calling play_idle.

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-09-03 11:33:29 +02:00
Dexuan Cui
711419e504 irqdomain: Add the missing assignment of domain->fwnode for named fwnode
Recently device pass-through stops working for Linux VM running on Hyper-V.

git-bisect shows the regression is caused by the recent commit
467a3bb974 ("PCI: hv: Allocate a named fwnode ..."), but the root cause
is that the commit d59f6617ee forgets to set the domain->fwnode for
IRQCHIP_FWNODE_NAMED*, and as a result:

1. The domain->fwnode remains to be NULL.

2. irq_find_matching_fwspec() returns NULL since "h->fwnode == fwnode" is
false, and pci_set_bus_msi_domain() sets the Hyper-V PCI root bus's
msi_domain to NULL.

3. When the device is added onto the root bus, the device's dev->msi_domain
is set to NULL in pci_set_msi_domain().

4. When a device driver tries to enable MSI-X, pci_msi_setup_msi_irqs()
calls arch_setup_msi_irqs(), which uses the native MSI chip (i.e.
arch/x86/kernel/apic/msi.c: pci_msi_controller) to set up the irqs, but
actually pci_msi_setup_msi_irqs() is supposed to call
msi_domain_alloc_irqs() with the hbus->irq_domain, which is created in
hv_pcie_init_irq_domain() and is associated with the Hyper-V chip
hv_msi_irq_chip. Consequently, the irq line is not properly set up, and
the device driver can not receive any interrupt.

Fixes: d59f6617ee ("genirq: Allow fwnode to carry name information only")
Fixes: 467a3bb974 ("PCI: hv: Allocate a named fwnode instead of an address-based one")
Reported-by: Lili Deng <v-lide@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/PU1P153MB01694D9AF625AC335C600C5FBFBE0@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM
2019-09-03 09:16:50 +01:00
Patrick Bellasi
0413d7f33e sched/uclamp: Always use 'enum uclamp_id' for clamp_id values
The supported clamp indexes are defined in 'enum clamp_id', however, because
of the code logic in some of the first utilization clamping series version,
sometimes we needed to use 'unsigned int' to represent indices.

This is not more required since the final version of the uclamp_* APIs can
always use the proper enum uclamp_id type.

Fix it with a bulk rename now that we have all the bits merged.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutny <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190822132811.31294-7-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-03 09:17:40 +02:00
Patrick Bellasi
babbe170e0 sched/uclamp: Update CPU's refcount on TG's clamp changes
On updates of task group (TG) clamp values, ensure that these new values
are enforced on all RUNNABLE tasks of the task group, i.e. all RUNNABLE
tasks are immediately boosted and/or capped as requested.

Do that each time we update effective clamps from cpu_util_update_eff().
Use the *cgroup_subsys_state (css) to walk the list of tasks in each
affected TG and update their RUNNABLE tasks.
Update each task by using the same mechanism used for cpu affinity masks
updates, i.e. by taking the rq lock.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutny <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190822132811.31294-6-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-03 09:17:40 +02:00
Patrick Bellasi
3eac870a32 sched/uclamp: Use TG's clamps to restrict TASK's clamps
When a task specific clamp value is configured via sched_setattr(2), this
value is accounted in the corresponding clamp bucket every time the task is
{en,de}qeued. However, when cgroups are also in use, the task specific
clamp values could be restricted by the task_group (TG) clamp values.

Update uclamp_cpu_inc() to aggregate task and TG clamp values. Every time a
task is enqueued, it's accounted in the clamp bucket tracking the smaller
clamp between the task specific value and its TG effective value. This
allows to:

1. ensure cgroup clamps are always used to restrict task specific requests,
   i.e. boosted not more than its TG effective protection and capped at
   least as its TG effective limit.

2. implement a "nice-like" policy, where tasks are still allowed to request
   less than what enforced by their TG effective limits and protections

Do this by exploiting the concept of "effective" clamp, which is already
used by a TG to track parent enforced restrictions.

Apply task group clamp restrictions only to tasks belonging to a child
group. While, for tasks in the root group or in an autogroup, system
defaults are still enforced.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutny <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190822132811.31294-5-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-03 09:17:39 +02:00
Patrick Bellasi
7274a5c1bb sched/uclamp: Propagate system defaults to the root group
The clamp values are not tunable at the level of the root task group.
That's for two main reasons:

 - the root group represents "system resources" which are always
   entirely available from the cgroup standpoint.

 - when tuning/restricting "system resources" makes sense, tuning must
   be done using a system wide API which should also be available when
   control groups are not.

When a system wide restriction is available, cgroups should be aware of
its value in order to know exactly how much "system resources" are
available for the subgroups.

Utilization clamping supports already the concepts of:

 - system defaults: which define the maximum possible clamp values
   usable by tasks.

 - effective clamps: which allows a parent cgroup to constraint (maybe
   temporarily) its descendants without losing the information related
   to the values "requested" from them.

Exploit these two concepts and bind them together in such a way that,
whenever system default are tuned, the new values are propagated to
(possibly) restrict or relax the "effective" value of nested cgroups.

When cgroups are in use, force an update of all the RUNNABLE tasks.
Otherwise, keep things simple and do just a lazy update next time each
task will be enqueued.
Do that since we assume a more strict resource control is required when
cgroups are in use. This allows also to keep "effective" clamp values
updated in case we need to expose them to user-space.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutny <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190822132811.31294-4-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-03 09:17:38 +02:00
Patrick Bellasi
0b60ba2dd3 sched/uclamp: Propagate parent clamps
In order to properly support hierarchical resources control, the cgroup
delegation model requires that attribute writes from a child group never
fail but still are locally consistent and constrained based on parent's
assigned resources. This requires to properly propagate and aggregate
parent attributes down to its descendants.

Implement this mechanism by adding a new "effective" clamp value for each
task group. The effective clamp value is defined as the smaller value
between the clamp value of a group and the effective clamp value of its
parent. This is the actual clamp value enforced on tasks in a task group.

Since it's possible for a cpu.uclamp.min value to be bigger than the
cpu.uclamp.max value, ensure local consistency by restricting each
"protection" (i.e. min utilization) with the corresponding "limit"
(i.e. max utilization).

Do that at effective clamps propagation to ensure all user-space write
never fails while still always tracking the most restrictive values.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutny <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190822132811.31294-3-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-03 09:17:38 +02:00
Patrick Bellasi
2480c09313 sched/uclamp: Extend CPU's cgroup controller
The cgroup CPU bandwidth controller allows to assign a specified
(maximum) bandwidth to the tasks of a group. However this bandwidth is
defined and enforced only on a temporal base, without considering the
actual frequency a CPU is running on. Thus, the amount of computation
completed by a task within an allocated bandwidth can be very different
depending on the actual frequency the CPU is running that task.
The amount of computation can be affected also by the specific CPU a
task is running on, especially when running on asymmetric capacity
systems like Arm's big.LITTLE.

With the availability of schedutil, the scheduler is now able
to drive frequency selections based on actual task utilization.
Moreover, the utilization clamping support provides a mechanism to
bias the frequency selection operated by schedutil depending on
constraints assigned to the tasks currently RUNNABLE on a CPU.

Giving the mechanisms described above, it is now possible to extend the
cpu controller to specify the minimum (or maximum) utilization which
should be considered for tasks RUNNABLE on a cpu.
This makes it possible to better defined the actual computational
power assigned to task groups, thus improving the cgroup CPU bandwidth
controller which is currently based just on time constraints.

Extend the CPU controller with a couple of new attributes uclamp.{min,max}
which allow to enforce utilization boosting and capping for all the
tasks in a group.

Specifically:

- uclamp.min: defines the minimum utilization which should be considered
	      i.e. the RUNNABLE tasks of this group will run at least at a
	      minimum frequency which corresponds to the uclamp.min
	      utilization

- uclamp.max: defines the maximum utilization which should be considered
	      i.e. the RUNNABLE tasks of this group will run up to a
	      maximum frequency which corresponds to the uclamp.max
	      utilization

These attributes:

a) are available only for non-root nodes, both on default and legacy
   hierarchies, while system wide clamps are defined by a generic
   interface which does not depends on cgroups. This system wide
   interface enforces constraints on tasks in the root node.

b) enforce effective constraints at each level of the hierarchy which
   are a restriction of the group requests considering its parent's
   effective constraints. Root group effective constraints are defined
   by the system wide interface.
   This mechanism allows each (non-root) level of the hierarchy to:
   - request whatever clamp values it would like to get
   - effectively get only up to the maximum amount allowed by its parent

c) have higher priority than task-specific clamps, defined via
   sched_setattr(), thus allowing to control and restrict task requests.

Add two new attributes to the cpu controller to collect "requested"
clamp values. Allow that at each non-root level of the hierarchy.
Keep it simple by not caring now about "effective" values computation
and propagation along the hierarchy.

Update sysctl_sched_uclamp_handler() to use the newly introduced
uclamp_mutex so that we serialize system default updates with cgroup
relate updates.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutny <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190822132811.31294-2-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-03 09:17:37 +02:00
Matt Fleming
a55c7454a8 sched/topology: Improve load balancing on AMD EPYC systems
SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
for any sched domains with a NUMA distance greater than 2 hops
(RECLAIM_DISTANCE). The idea being that it's expensive to balance
across domains that far apart.

However, as is rather unfortunately explained in:

  commit 32e45ff43e ("mm: increase RECLAIM_DISTANCE to 30")

the value for RECLAIM_DISTANCE is based on node distance tables from
2011-era hardware.

Current AMD EPYC machines have the following NUMA node distances:

 node distances:
 node   0   1   2   3   4   5   6   7
   0:  10  16  16  16  32  32  32  32
   1:  16  10  16  16  32  32  32  32
   2:  16  16  10  16  32  32  32  32
   3:  16  16  16  10  32  32  32  32
   4:  32  32  32  32  10  16  16  16
   5:  32  32  32  32  16  10  16  16
   6:  32  32  32  32  16  16  10  16
   7:  32  32  32  32  16  16  16  10

where 2 hops is 32.

The result is that the scheduler fails to load balance properly across
NUMA nodes on different sockets -- 2 hops apart.

For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
(CPUs 32-39) like so,

  $ numactl -C 0-7,32-39 ./spinner 16

causes all threads to fork and remain on node 0 until the active
balancer kicks in after a few seconds and forcibly moves some threads
to node 4.

Override node_reclaim_distance for AMD Zen.

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Suravee.Suthikulpanit@amd.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas.Lendacky@amd.com
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20190808195301.13222-3-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-03 09:17:37 +02:00
Liangyan
5e2d2cc258 sched/fair: Don't assign runtime for throttled cfs_rq
do_sched_cfs_period_timer() will refill cfs_b runtime and call
distribute_cfs_runtime to unthrottle cfs_rq, sometimes cfs_b->runtime
will allocate all quota to one cfs_rq incorrectly, then other cfs_rqs
attached to this cfs_b can't get runtime and will be throttled.

We find that one throttled cfs_rq has non-negative
cfs_rq->runtime_remaining and cause an unexpetced cast from s64 to u64
in snippet:

  distribute_cfs_runtime() {
    runtime = -cfs_rq->runtime_remaining + 1;
  }

The runtime here will change to a large number and consume all
cfs_b->runtime in this cfs_b period.

According to Ben Segall, the throttled cfs_rq can have
account_cfs_rq_runtime called on it because it is throttled before
idle_balance, and the idle_balance calls update_rq_clock to add time
that is accounted to the task.

This commit prevents cfs_rq to be assgined new runtime if it has been
throttled until that distribute_cfs_runtime is called.

Signed-off-by: Liangyan <liangyan.peng@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: shanpeic@linux.alibaba.com
Cc: stable@vger.kernel.org
Cc: xlpang@linux.alibaba.com
Fixes: d3d9dc3302 ("sched: Throttle entities exceeding their allowed bandwidth")
Link: https://lkml.kernel.org/r/20190826121633.6538-1-liangyan.peng@linux.alibaba.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-03 08:55:07 +02:00
Yoshihiro Shimoda
6ba99411b8 dma-mapping: introduce dma_get_merge_boundary()
This patch adds a new DMA API "dma_get_merge_boundary". This function
returns the DMA merge boundary if the DMA layer can merge the segments.
This patch also adds the implementation for a new dma_map_ops pointer.

Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Reviewed-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-03 08:33:01 +02:00
David S. Miller
765b7590c9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
r8152 conflicts are the NAPI fixes in 'net' overlapping with
some tasklet stuff in net-next

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-02 11:20:17 -07:00
Ingo Molnar
e98db89489 Merge branch 'linus' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-02 09:12:21 +02:00
Linus Torvalds
345464fb76 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Fix some length checks during OGM processing in batman-adv, from
    Sven Eckelmann.

 2) Fix regression that caused netfilter conntrack sysctls to not be
    per-netns any more. From Florian Westphal.

 3) Use after free in netpoll, from Feng Sun.

 4) Guard destruction of pfifo_fast per-cpu qdisc stats with
    qdisc_is_percpu_stats(), from Davide Caratti. Similar bug is fixed
    in pfifo_fast_enqueue().

 5) Fix memory leak in mld_del_delrec(), from Eric Dumazet.

 6) Handle neigh events on internal ports correctly in nfp, from John
    Hurley.

 7) Clear SKB timestamp in NF flow table code so that it does not
    confuse fq scheduler. From Florian Westphal.

 8) taprio destroy can crash if it is invoked in a failure path of
    taprio_init(), because the list head isn't setup properly yet and
    the list del is unconditional. Perform the list add earlier to
    address this. From Vladimir Oltean.

 9) Make sure to reapply vlan filters on device up, in aquantia driver.
    From Dmitry Bogdanov.

10) sgiseeq driver releases DMA memory using free_page() instead of
    dma_free_attrs(). From Christophe JAILLET.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (58 commits)
  net: seeq: Fix the function used to release some memory in an error handling path
  enetc: Add missing call to 'pci_free_irq_vectors()' in probe and remove functions
  net: bcmgenet: use ethtool_op_get_ts_info()
  tc-testing: don't hardcode 'ip' in nsPlugin.py
  net: dsa: microchip: add KSZ8563 compatibility string
  dt-bindings: net: dsa: document additional Microchip KSZ8563 switch
  net: aquantia: fix out of memory condition on rx side
  net: aquantia: linkstate irq should be oneshot
  net: aquantia: reapply vlan filters on up
  net: aquantia: fix limit of vlan filters
  net: aquantia: fix removal of vlan 0
  net/sched: cbs: Set default link speed to 10 Mbps in cbs_set_port_rate
  taprio: Set default link speed to 10 Mbps in taprio_set_picos_per_byte
  taprio: Fix kernel panic in taprio_destroy
  net: dsa: microchip: fill regmap_config name
  rxrpc: Fix lack of conn cleanup when local endpoint is cleaned up [ver #2]
  net: stmmac: dwmac-rk: Don't fail if phy regulator is absent
  amd-xgbe: Fix error path in xgbe_mod_init()
  netfilter: nft_meta_bridge: Fix get NFT_META_BRI_IIFVPROTO in network byteorder
  mac80211: Correctly set noencrypt for PAE frames
  ...
2019-09-01 18:45:28 -07:00
Steven Rostedt (VMware)
a47b53e95a tracing: Rename tracing_reset() to tracing_reset_cpu()
The name tracing_reset() was a misnomer, as it really only reset a single
CPU buffer. Rename it to tracing_reset_cpu() and also make it static and
remove the prototype from trace.h, as it is only used in a single function.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 12:19:40 -04:00
Steven Rostedt (VMware)
58fe7a87db tracing: Document the stack trace algorithm in the comments
As the max stack tracer algorithm is not that easy to understand from the
code, add comments that explain the algorithm and mentions how
ARCH_FTRACE_SHIFT_STACK_TRACER affects it.

Link: http://lkml.kernel.org/r/20190806123455.487ac02b@gandalf.local.home

Suggested-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 12:19:40 -04:00
Steven Rostedt (VMware)
f7edb451fa tracing/arm64: Have max stack tracer handle the case of return address after data
Most archs (well at least x86) store the function call return address on the
stack before storing the local variables for the function. The max stack
tracer depends on this in its algorithm to display the stack size of each
function it finds in the back trace.

Some archs (arm64), may store the return address (from its link register)
just before calling a nested function. There's no reason to save the link
register on leaf functions, as it wont be updated. This breaks the algorithm
of the max stack tracer.

Add a new define ARCH_FTRACE_SHIFT_STACK_TRACER that an architecture may set
if it stores the return address (link register) after it stores the
function's local variables, and have the stack trace shift the values of the
mapped stack size to the appropriate functions.

Link: 20190802094103.163576-1-jiping.ma2@windriver.com

Reported-by: Jiping Ma <jiping.ma2@windriver.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 12:19:40 -04:00
Masami Hiramatsu
a42e3c4de9 tracing/probe: Add immediate string parameter support
Add immediate string parameter (\"string") support to
probe events. This allows you to specify an immediate
(or dummy) parameter instead of fetching a string from
memory.

This feature looks odd, but imagine that you put a probe
on a code to trace some string data. If the code is
compiled into 2 instructions and 1 instruction has a
string on memory but other has no string since it is
optimized out. In that case, you can not fold those into
one event, even if ftrace supported multiple probes on
one event. With this feature, you can set a dummy string
like foo=\"(optimized)":string instead of something
like foo=+0(+0(%bp)):string.

Link: http://lkml.kernel.org/r/156095691687.28024.13372712423865047991.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 12:19:39 -04:00
Masami Hiramatsu
6218bf9f4d tracing/probe: Add immediate parameter support
Add immediate value parameter (\1234) support to
probe events. This allows you to specify an immediate
(or dummy) parameter instead of fetching from memory
or register.

This feature looks odd, but imagine when you put a probe
on a code to trace some data. If the code is compiled into
2 instructions and 1 instruction has a value but other has
nothing since it is optimized out.
In that case, you can not fold those into one event, even
if ftrace supported multiple probes on one event.
With this feature, you can set a dummy value like
foo=\deadbeef instead of something like foo=%di.

Link: http://lkml.kernel.org/r/156095690733.28024.13258186548822649469.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 12:19:39 -04:00
Masami Hiramatsu
ab10d69eb7 tracing/uprobe: Add per-probe delete from event
Add per-probe delete method from one event passing the head of
definition. In other words, the events which match the head
N parameters are deleted.

Link: http://lkml.kernel.org/r/156095689811.28024.221706761151739433.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 12:19:39 -04:00
Masami Hiramatsu
eb5bf81330 tracing/kprobe: Add per-probe delete from event
Allow user to delete a probe from event. This is done by head
match. For example, if we have 2 probes on an event

$ cat kprobe_events
p:kprobes/testprobe _do_fork r1=%ax r2=%dx
p:kprobes/testprobe idle_fork r1=%ax r2=%cx

Then you can remove one of them by passing the head of definition
which identify the probe.

$ echo "-:kprobes/testprobe idle_fork" >> kprobe_events

Link: http://lkml.kernel.org/r/156095688848.28024.15798690082378432435.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 12:19:38 -04:00
Masami Hiramatsu
41af3cf587 tracing/uprobe: Add multi-probe per uprobe event support
Allow user to define several probes on one uprobe event.
Note that this only support appending method. So deleting
event will delete all probes on the event.

Link: http://lkml.kernel.org/r/156095687876.28024.13840331032234992863.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 12:19:38 -04:00
Masami Hiramatsu
ca89bc071d tracing/kprobe: Add multi-probe per event support
Add multi-probe per one event support to kprobe events.
User can define several different probes on one trace event
if those events have same "event signature",
e.g.

  # echo p:testevent _do_fork > kprobe_events
  # echo p:testevent fork_idle >> kprobe_events
  # kprobe_events
  p:kprobes/testevent _do_fork
  p:kprobes/testevent fork_idle

The event signature is defined by kprobe type (retprobe or not),
the number of args, argument names, and argument types.

Note that this only support appending method. Delete event
operation will delete all probes on the event.

Link: http://lkml.kernel.org/r/156095686913.28024.9357292202316540742.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 12:19:38 -04:00
Masami Hiramatsu
30199137c8 tracing/dynevent: Pass extra arguments to match operation
Pass extra arguments to match operation for checking
exact match. If the event doesn't support exact match,
it will be ignored.

Link: http://lkml.kernel.org/r/156095685930.28024.10405547027475590975.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 12:19:38 -04:00
Masami Hiramatsu
cb8e7a8d55 tracing/dynevent: Delete all matched events
When user gives an event name to delete, delete all
matched events instead of the first one.

This means if there are several events which have same
name but different group (subsystem) name, those are
removed if user passed only the event name, e.g.

  # cat kprobe_events
  p:group1/testevent _do_fork
  p:group2/testevent fork_idle
  # echo -:testevent >> kprobe_events
  # cat kprobe_events
  #

Link: http://lkml.kernel.org/r/156095684958.28024.16597826267117453638.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 12:19:38 -04:00
Masami Hiramatsu
60d53e2c3b tracing/probe: Split trace_event related data from trace_probe
Split the trace_event related data from trace_probe data structure
and introduce trace_probe_event data structure for its folder.
This trace_probe_event data structure can have multiple trace_probe.

Link: http://lkml.kernel.org/r/156095683995.28024.7552150340561557873.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 12:19:38 -04:00
Masami Hiramatsu
0bc11ed5ab kprobes: Allow kprobes coexist with livepatch
Allow kprobes which do not modify regs->ip, coexist with livepatch
by dropping FTRACE_OPS_FL_IPMODIFY from ftrace_ops.

User who wants to modify regs->ip (e.g. function fault injection)
must set a dummy post_handler to its kprobes when registering.
However, if such regs->ip modifying kprobes is set on a function,
that function can not be livepatched.

Link: http://lkml.kernel.org/r/156403587671.30117.5233558741694155985.stgit@devnote2

Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 12:19:10 -04:00
Linus Torvalds
95381debd9 Small fixes and minor cleanups for Tracing
- Make exported ftrace function not static
  - Fix NULL pointer dereference in reading probes as they are created
  - Fix NULL pointer dereference in k/uprobe clean up path
  - Various documentation fixes
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXWpTNRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qpXtAPsGoHDHkgPIyl9bnV0oZfwLrAl4qEyg
 RpVp9ZMcG4UtMwEAp/SXRFzvL+EUiKyd1U3FZy2jhVec3+hX7SzIGqgONA4=
 =ee8V
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Small fixes and minor cleanups for tracing:

   - Make exported ftrace function not static

   - Fix NULL pointer dereference in reading probes as they are created

   - Fix NULL pointer dereference in k/uprobe clean up path

   - Various documentation fixes"

* tag 'trace-v5.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Correct kdoc formats
  ftrace/x86: Remove mcount() declaration
  tracing/probe: Fix null pointer dereference
  tracing: Make exported ftrace_set_clr_event non-static
  ftrace: Check for successful allocation of hash
  ftrace: Check for empty hash and comment the race with registering probes
  ftrace: Fix NULL pointer dereference in t_probe_next()
2019-08-31 09:15:25 -07:00
Jakub Kicinski
c68c9ec1c5 tracing: Correct kdoc formats
Fix the following kdoc warnings:

kernel/trace/trace.c:1579: warning: Function parameter or member 'tr' not described in 'update_max_tr_single'
kernel/trace/trace.c:1579: warning: Function parameter or member 'tsk' not described in 'update_max_tr_single'
kernel/trace/trace.c:1579: warning: Function parameter or member 'cpu' not described in 'update_max_tr_single'
kernel/trace/trace.c:1776: warning: Function parameter or member 'type' not described in 'register_tracer'
kernel/trace/trace.c:2239: warning: Function parameter or member 'task' not described in 'tracing_record_taskinfo'
kernel/trace/trace.c:2239: warning: Function parameter or member 'flags' not described in 'tracing_record_taskinfo'
kernel/trace/trace.c:2269: warning: Function parameter or member 'prev' not described in 'tracing_record_taskinfo_sched_switch'
kernel/trace/trace.c:2269: warning: Function parameter or member 'next' not described in 'tracing_record_taskinfo_sched_switch'
kernel/trace/trace.c:2269: warning: Function parameter or member 'flags' not described in 'tracing_record_taskinfo_sched_switch'
kernel/trace/trace.c:3078: warning: Function parameter or member 'ip' not described in 'trace_vbprintk'
kernel/trace/trace.c:3078: warning: Function parameter or member 'fmt' not described in 'trace_vbprintk'
kernel/trace/trace.c:3078: warning: Function parameter or member 'args' not described in 'trace_vbprintk'

Link: http://lkml.kernel.org/r/20190828052549.2472-2-jakub.kicinski@netronome.com

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 06:51:56 -04:00
Xinpeng Liu
19a58ce1dc tracing/probe: Fix null pointer dereference
BUG: KASAN: null-ptr-deref in trace_probe_cleanup+0x8d/0xd0
Read of size 8 at addr 0000000000000000 by task syz-executor.0/9746
trace_probe_cleanup+0x8d/0xd0
free_trace_kprobe.part.14+0x15/0x50
alloc_trace_kprobe+0x23e/0x250

Link: http://lkml.kernel.org/r/1565220563-980-1-git-send-email-danielliu861@gmail.com

Fixes: e3dc9f898e ("tracing/probe: Add trace_event_call accesses APIs")
Signed-off-by: Xinpeng Liu <danielliu861@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 06:51:55 -04:00
Denis Efremov
595a438c78 tracing: Make exported ftrace_set_clr_event non-static
The function ftrace_set_clr_event is declared static and marked
EXPORT_SYMBOL_GPL(), which is at best an odd combination. Because the
function was decided to be a part of API, this commit removes the static
attribute and adds the declaration to the header.

Link: http://lkml.kernel.org/r/20190704172110.27041-1-efremov@linux.com

Fixes: f45d1225ad ("tracing: Kernel access to Ftrace instances")
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Denis Efremov <efremov@linux.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 06:51:49 -04:00
David S. Miller
94880a5b2e Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2019-08-31

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix 32-bit zero-extension during constant blinding which
   has been causing a regression on ppc64, from Naveen.

2) Fix a latency bug in nfp driver when updating stack index
   register, from Jiong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-30 17:39:37 -07:00
Naveen N. Rao
5b0022dd32 ftrace: Check for successful allocation of hash
In register_ftrace_function_probe(), we are not checking the return
value of alloc_and_copy_ftrace_hash(). The subsequent call to
ftrace_match_records() may end up dereferencing the same. Add a check to
ensure this doesn't happen.

Link: http://lkml.kernel.org/r/26e92574f25ad23e7cafa3cf5f7a819de1832cbe.1562249521.git.naveen.n.rao@linux.vnet.ibm.com

Cc: stable@vger.kernel.org
Fixes: 1ec3a81a0c ("ftrace: Have each function probe use its own ftrace_ops")
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-30 16:49:07 -04:00
Steven Rostedt (VMware)
372e0d01da ftrace: Check for empty hash and comment the race with registering probes
The race between adding a function probe and reading the probes that exist
is very subtle. It needs a comment. Also, the issue can also happen if the
probe has has the EMPTY_HASH as its func_hash.

Cc: stable@vger.kernel.org
Fixes: 7b60f3d876 ("ftrace: Dynamically create the probe ftrace_ops for the trace_array")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-30 16:30:01 -04:00
Naveen N. Rao
7bd46644ea ftrace: Fix NULL pointer dereference in t_probe_next()
LTP testsuite on powerpc results in the below crash:

  Unable to handle kernel paging request for data at address 0x00000000
  Faulting instruction address: 0xc00000000029d800
  Oops: Kernel access of bad area, sig: 11 [#1]
  LE SMP NR_CPUS=2048 NUMA PowerNV
  ...
  CPU: 68 PID: 96584 Comm: cat Kdump: loaded Tainted: G        W
  NIP:  c00000000029d800 LR: c00000000029dac4 CTR: c0000000001e6ad0
  REGS: c0002017fae8ba10 TRAP: 0300   Tainted: G        W
  MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 28022422  XER: 20040000
  CFAR: c00000000029d90c DAR: 0000000000000000 DSISR: 40000000 IRQMASK: 0
  ...
  NIP [c00000000029d800] t_probe_next+0x60/0x180
  LR [c00000000029dac4] t_mod_start+0x1a4/0x1f0
  Call Trace:
  [c0002017fae8bc90] [c000000000cdbc40] _cond_resched+0x10/0xb0 (unreliable)
  [c0002017fae8bce0] [c0000000002a15b0] t_start+0xf0/0x1c0
  [c0002017fae8bd30] [c0000000004ec2b4] seq_read+0x184/0x640
  [c0002017fae8bdd0] [c0000000004a57bc] sys_read+0x10c/0x300
  [c0002017fae8be30] [c00000000000b388] system_call+0x5c/0x70

The test (ftrace_set_ftrace_filter.sh) is part of ftrace stress tests
and the crash happens when the test does 'cat
$TRACING_PATH/set_ftrace_filter'.

The address points to the second line below, in t_probe_next(), where
filter_hash is dereferenced:
  hash = iter->probe->ops.func_hash->filter_hash;
  size = 1 << hash->size_bits;

This happens due to a race with register_ftrace_function_probe(). A new
ftrace_func_probe is created and added into the func_probes list in
trace_array under ftrace_lock. However, before initializing the filter,
we drop ftrace_lock, and re-acquire it after acquiring regex_lock. If
another process is trying to read set_ftrace_filter, it will be able to
acquire ftrace_lock during this window and it will end up seeing a NULL
filter_hash.

Fix this by just checking for a NULL filter_hash in t_probe_next(). If
the filter_hash is NULL, then this probe is just being added and we can
simply return from here.

Link: http://lkml.kernel.org/r/05e021f757625cbbb006fad41380323dbe4e3b43.1562249521.git.naveen.n.rao@linux.vnet.ibm.com

Cc: stable@vger.kernel.org
Fixes: 7b60f3d876 ("ftrace: Dynamically create the probe ftrace_ops for the trace_array")
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-30 16:23:47 -04:00
Will Deacon
61b7cddfe8 Merge branch 'for-next/atomics' into for-next/core
* for-next/atomics: (10 commits)
  Rework LSE instruction selection to use static keys instead of alternatives
2019-08-30 12:55:39 +01:00
Michael Ellerman
07aa1e786d Merge branch 'topic/mem-encrypt' into next
This branch has some cross-arch patches that are a prequisite for the
SVM work. They're in a topic branch in case any of the other arch
maintainers want to merge them to resolve conflicts.
2019-08-30 09:49:28 +10:00
Christoph Hellwig
8e3a68fb55 dma-mapping: make dma_atomic_pool_init self-contained
The memory allocated for the atomic pool needs to have the same
mapping attributes that we use for remapping, so use
pgprot_dmacoherent instead of open coding it.  Also deduct a
suitable zone to allocate the memory from based on the presence
of the DMA zones.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-08-29 16:43:33 +02:00
Christoph Hellwig
419e2f1838 dma-mapping: remove arch_dma_mmap_pgprot
arch_dma_mmap_pgprot is used for two things:

 1) to override the "normal" uncached page attributes for mapping
    memory coherent to devices that can't snoop the CPU caches
 2) to provide the special DMA_ATTR_WRITE_COMBINE semantics on older
    arm systems and some mips platforms

Replace one with the pgprot_dmacoherent macro that is already provided
by arm and much simpler to use, and lift the DMA_ATTR_WRITE_COMBINE
handling to common code with an explicit arch opt-in.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	# m68k
Acked-by: Paul Burton <paul.burton@mips.com>		# mips
2019-08-29 16:43:22 +02:00
Andrew Murray
8f35eaa5f2 jump_label: Don't warn on __exit jump entries
On architectures that discard .exit.* sections at runtime, a
warning is printed for each jump label that is used within an
in-kernel __exit annotated function:

can't patch jump_label at ehci_hcd_cleanup+0x8/0x3c
WARNING: CPU: 0 PID: 1 at kernel/jump_label.c:410 __jump_label_update+0x12c/0x138

As these functions will never get executed (they are free'd along
with the rest of initmem) - we do not need to patch them and should
not display any warnings.

The warning is displayed because the test required to satisfy
jump_entry_is_init is based on init_section_contains (__init_begin to
__init_end) whereas the test in __jump_label_update is based on
init_kernel_text (_sinittext to _einittext) via kernel_text_address).

Fixes: 1948367768 ("jump_label: Annotate entries that operate on __init code earlier")
Signed-off-by: Andrew Murray <andrew.murray@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
2019-08-29 15:10:10 +01:00
Thomas Gleixner
a2ed4fd685 posix-cpu-timers: Make expiry_active check actually work correctly
The state tracking changes broke the expiry active check by not writing to
it and instead sitting timers_active, which is already set.

That's not a big issue as the actual expiry is protected by sighand lock,
so concurrent handling is not possible. That means that the second task
which invokes that function executes the expiry code for nothing.

Write to the proper flag.

Also add a check whether the flag is set into check_process_timers(). That
check had been missing in the code before the rework already. The check for
another task handling the expiry of process wide timers was only done in
the fastpath check. If the fastpath check returns true because a per task
timer expired, then the checking of process wide timers was done in
parallel which is as explained above just a waste of cycles.

Fixes: 244d49e306 ("posix-cpu-timers: Move state tracking to struct posix_cputimers")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <frederic@kernel.org>
2019-08-29 12:52:28 +02:00
Linus Torvalds
9cf6b756cd arm64 fixes for -rc7
- Fix GICv2 emulation bug (KVM)
 
 - Fix deadlock in virtual GIC interrupt injection code (KVM)
 
 - Fix kprobes blacklist init failure due to broken kallsyms lookup
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAl1mW/0ACgkQt6xw3ITB
 YzQWEQgAnG+20c1uzPqIEME3o0bCXsci0I1w4yPYPH/NVrfmJdAjfway/cLZlMCh
 JYKBTc392kfYXgzAvMqvjNe+9O/qg9MnMxV7Afjfk9e0k+41Kcw/7U39/B6q4yOg
 zIz4YW6SdbbbUfY9VgR8SjsZ3HwJoP8YVNEv3W259funtFHkKDO35h1iG1f57GWL
 UELjEnag9dmCH35ykt7+SC4woBGjOQnic2XLSS7VZuU/26TfnUyo2HCdDcLcViCZ
 DHDQgaHAscb+tgJjeNtNEeRu+GPBG9LOhPw4EpBjGHeoJla7z74GOJPMEt4Ufoqd
 qG9TPfOFAFM30/PAdlWF3SjfzqK3PA==
 =WYzY
 -----END PGP SIGNATURE-----

Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 fixes from Will Deacon:
 "Hot on the heels of our last set of fixes are a few more for -rc7.

  Two of them are fixing issues with our virtual interrupt controller
  implementation in KVM/arm, while the other is a longstanding but
  straightforward kallsyms fix which was been acked by Masami and
  resolves an initialisation failure in kprobes observed on arm64.

   - Fix GICv2 emulation bug (KVM)

   - Fix deadlock in virtual GIC interrupt injection code (KVM)

   - Fix kprobes blacklist init failure due to broken kallsyms lookup"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  KVM: arm/arm64: vgic-v2: Handle SGI bits in GICD_I{S,C}PENDR0 as WI
  KVM: arm/arm64: vgic: Fix potential deadlock when ap_list is long
  kallsyms: Don't let kallsyms_lookup_size_offset() fail on retrieving the first symbol
2019-08-28 10:37:21 -07:00
Sebastian Andrzej Siewior
71fed982d6 tick: Mark sched_timer to expire in hard interrupt context
sched_timer must be initialized with the _HARD mode suffix to ensure expiry
in hard interrupt context on RT.

The previous conversion to HARD expiry mode missed on one instance in
tick_nohz_switch_to_nohz(). Fix it up.

Fixes: 902a9f9c50 ("tick: Mark tick related hrtimers to expiry in hard interrupt context")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190823113845.12125-3-bigeasy@linutronix.de
2019-08-28 13:01:26 +02:00
Ming Lei
101f85b56d genirq/affinity: Remove const qualifier from node_to_cpumask argument
When CONFIG_CPUMASK_OFFSTACK isn't enabled, 'cpumask_var_t' is as

'typedef struct cpumask cpumask_var_t[1]',

so the argument 'node_to_cpumask' alloc_nodes_vectors() can't be declared
as 'const cpumask_var_t *'

Fixes the following warning:

   kernel/irq/affinity.c: In function '__irq_build_affinity_masks':
     alloc_nodes_vectors(numvecs, node_to_cpumask, cpu_mask,
                                  ^
   kernel/irq/affinity.c:128:13: note: expected 'const struct cpumask (*)[1]' but argument is of type 'struct cpumask (*)[1]'
    static void alloc_nodes_vectors(unsigned int numvecs,
                ^
Fixes: b1a5a73e64 ("genirq/affinity: Spread vectors on node according to nr_cpu ratio")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190828085815.19931-1-ming.lei@redhat.com
2019-08-28 12:20:43 +02:00
Thomas Gleixner
60bda037f1 posix-cpu-timers: Utilize timerqueue for storage
Using a linear O(N) search for timer insertion affects execution time and
D-cache footprint badly with a larger number of timers.

Switch the storage to a timerqueue which is already used for hrtimers and
alarmtimers. It does not affect the size of struct k_itimer as it.alarm is
still larger.

The extra list head for the expiry list will go away later once the expiry
is moved into task work context.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1908272129220.1939@nanos.tec.linutronix.de
2019-08-28 11:50:43 +02:00
Thomas Gleixner
244d49e306 posix-cpu-timers: Move state tracking to struct posix_cputimers
Put it where it belongs and clean up the ifdeffery in fork completely.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190821192922.743229404@linutronix.de
2019-08-28 11:50:42 +02:00
Thomas Gleixner
8991afe264 posix-cpu-timers: Deduplicate rlimit handling
Both thread and process expiry functions have the same functionality for
sending signals for soft and hard RLIMITs duplicated in 4 different
ways.

Split it out into a common function and cleanup the callsites.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192922.653276779@linutronix.de
2019-08-28 11:50:42 +02:00
Thomas Gleixner
dd67022413 posix-cpu-timers: Remove pointless comparisons
The soft RLIMIT expiry code checks whether the soft limit is greater than
the hard limit. That's pointless because if the soft RLIMIT is greater than
the hard RLIMIT then that code cannot be reached as the hard RLIMIT check
is before that and already killed the process.

Remove it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192922.548747613@linutronix.de
2019-08-28 11:50:42 +02:00
Thomas Gleixner
8ea1de90a5 posix-cpu-timers: Get rid of 64bit divisions
Instead of dividing A to match the units of B it's more efficient to
multiply B to match the units of A.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192922.458286860@linutronix.de
2019-08-28 11:50:41 +02:00
Thomas Gleixner
1cd07c0b94 posix-cpu-timers: Consolidate timer expiry further
With the array based samples and expiry cache, the expiry function can use
a loop to collect timers from the clock specific lists.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192922.365469982@linutronix.de
2019-08-28 11:50:41 +02:00
Thomas Gleixner
2bbdbdae05 posix-cpu-timers: Get rid of zero checks
Deactivation of the expiry cache is done by setting all clock caches to
0. That requires to have a check for zero in all places which update the
expiry cache:

	if (cache == 0 || new < cache)
		cache = new;

Use U64_MAX as the deactivated value, which allows to remove the zero
checks when updating the cache and reduces it to the obvious check:

	if (new < cache)
		cache = new;

This also removes the weird workaround in do_prlimit() which was required
to convert a RLIMIT_CPU value of 0 (immediate expiry) to 1 because handing
in 0 to the posix CPU timer code would have effectively disarmed it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192922.275086128@linutronix.de
2019-08-28 11:50:40 +02:00
Thomas Gleixner
24db4dd90d rlimit: Rewrite non-sensical RLIMIT_CPU comment
The comment above the function which arms RLIMIT_CPU in the posix CPU timer
code makes no sense at all. It claims that the kernel does not return an
error code when it rejected the attempt to set RLIMIT_CPU. That's clearly
bogus as the code does an error check and the rlimit is only set and
activated when the permission checks are ok. In case of a rejection an
appropriate error code is returned.

This is a historical and outdated comment which got dragged along even when
the rlimit handling code was rewritten.

Replace it with an explanation why the setup function is not called when
the rlimit value is RLIM_INFINITY and how the 'disarming' is handled.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192922.185511287@linutronix.de
2019-08-28 11:50:40 +02:00
Thomas Gleixner
fe0517f893 posix-cpu-timers: Respect INFINITY for hard RTTIME limit
The RTIME limit expiry code does not check the hard RTTIME limit for
INFINITY, i.e. being disabled.  Add it.

While this could be considered an ABI breakage if something would depend on
this behaviour. Though it's highly unlikely to have an effect because
RLIM_INFINITY is at minimum INT_MAX and the RTTIME limit is in seconds, so
the timer would fire after ~68 years.

Adding this obvious correct limit check also allows further consolidation
of that code and is a prerequisite for cleaning up the 0 based checks and
the rlimit setter code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192922.078293002@linutronix.de
2019-08-28 11:50:39 +02:00
Thomas Gleixner
b7be4ef136 posix-cpu-timers: Switch thread group sampling to array
That allows more simplifications in various places.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192921.988426956@linutronix.de
2019-08-28 11:50:39 +02:00
Thomas Gleixner
87dc64480f posix-cpu-timers: Restructure expiry array
Now that the abused struct task_cputime is gone, it's more natural to
bundle the expiry cache and the list head of each clock into a struct and
have an array of those structs.

Follow the hrtimer naming convention of 'bases' and rename the expiry cache
to 'nextevt' and adapt all usage sites.

Generates also better code .text size shrinks by 80 bytes.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1908262021140.1939@nanos.tec.linutronix.de
2019-08-28 11:50:39 +02:00
Thomas Gleixner
46b883995c posix-cpu-timers: Remove cputime_expires
The last users of the magic struct cputime based expiry cache are
gone. Remove the leftovers.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192921.790209622@linutronix.de
2019-08-28 11:50:38 +02:00
Thomas Gleixner
001f797143 posix-cpu-timers: Make expiry checks array based
The expiry cache is an array indexed by clock ids. The new sample functions
allow to retrieve a corresponding array of samples.

Convert the fastpath expiry checks to make use of the new sample functions
and do the comparisons on the sample and the expiry array.

Make the check for the expiry array being zero array based as well.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192921.695481430@linutronix.de
2019-08-28 11:50:38 +02:00
Thomas Gleixner
b0d524f779 posix-cpu-timers: Provide array based sample functions
Instead of using task_cputime and doing the addition of utime and stime at
all call sites, it's way simpler to have a sample array which allows
indexed based checks against the expiry cache array.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192921.590362974@linutronix.de
2019-08-28 11:50:38 +02:00
Thomas Gleixner
c02b078e63 posix-cpu-timers: Switch check_*_timers() to array cache
Use the array based expiry cache in check_thread_timers() and convert the
store in check_process_timers() for consistency.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192921.408222378@linutronix.de
2019-08-28 11:50:36 +02:00
Thomas Gleixner
1b0dd96d0f posix-cpu-timers: Simplify set_process_cpu_timer()
The expiry cache can now be accessed as an array. Replace the per clock
checks with a simple comparison of the clock indexed array member.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192921.303316423@linutronix.de
2019-08-28 11:50:36 +02:00
Thomas Gleixner
3b495b22d0 posix-cpu-timers: Simplify timer queueing
Now that the expiry cache can be accessed as an array, the per clock
checking can be reduced to just comparing the corresponding array elements.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192921.212129449@linutronix.de
2019-08-28 11:50:36 +02:00
Thomas Gleixner
11b8462f7e posix-cpu-timers: Provide array based access to expiry cache
Using struct task_cputime for the expiry cache is a pretty odd choice and
comes with magic defines to rename the fields for usage in the expiry
cache.

struct task_cputime is basically a u64 array with 3 members, but it has
distinct members.

The expiry cache content is different than the content of task_cputime
because

  expiry[PROF]  = task_cputime.stime + task_cputime.utime
  expiry[VIRT]  = task_cputime.utime
  expiry[SCHED] = task_cputime.sum_exec_runtime

So there is no direct mapping between task_cputime and the expiry cache and
the #define based remapping is just a horrible hack.

Having the expiry cache array based allows further simplification of the
expiry code.

To avoid an all in one cleanup which is hard to review add a temporary
anonymous union into struct task_cputime which allows array based access to
it. That requires to reorder the members. Add a build time sanity check to
validate that the members are at the same place.

The union and the build time checks will be removed after conversion.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192921.105793824@linutronix.de
2019-08-28 11:50:35 +02:00
Thomas Gleixner
3a245c0f11 posix-cpu-timers: Move expiry cache into struct posix_cputimers
The expiry cache belongs into the posix_cputimers container where the other
cpu timers information is.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192921.014444012@linutronix.de
2019-08-28 11:50:35 +02:00
Thomas Gleixner
2b69942f90 posix-cpu-timers: Create a container struct
Per task/process data of posix CPU timers is all over the place which
makes the code hard to follow and requires ifdeffery.

Create a container to hold all this information in one place, so data is
consolidated and the ifdeffery can be confined to the posix timer header
file and removed from places like fork.

As a first step, move the cpu_timers list head array into the new struct
and clean up the initializers and simplify fork. The remaining #ifdef in
fork will be removed later.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192920.819418976@linutronix.de
2019-08-28 11:50:33 +02:00
Thomas Gleixner
ab693c5a5e posix-cpu-timers: Move prof/virt_ticks into caller
The functions have only one caller left. No point in having them.

Move the almost duplicated code into the caller and simplify it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192920.729298382@linutronix.de
2019-08-28 11:50:33 +02:00
Thomas Gleixner
0476ff2c15 posix-cpu-timers: Sample task times once in expiry check
Sampling the task times twice does not make sense. Do it once.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192920.639878168@linutronix.de
2019-08-28 11:50:32 +02:00
Thomas Gleixner
8c2d74f037 posix-cpu-timers: Get rid of pointer indirection
Now that the sample functions have no return value anymore, the result can
simply be returned instead of using pointer indirection.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192920.535079278@linutronix.de
2019-08-28 11:50:32 +02:00
Thomas Gleixner
2092c1d4fe posix-cpu-timers: Simplify sample functions
All callers hand in a valdiated clock id. Remove the return value which was
unchecked in most places anyway.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192920.430475832@linutronix.de
2019-08-28 11:50:31 +02:00
Thomas Gleixner
5405d0051f posix-cpu-timers: Remove pointless return value check
set_process_cpu_timer() checks already whether the clock id is valid. No
point in checking the return value of the sample function. That allows to
simplify the sample function later.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192920.339725769@linutronix.de
2019-08-28 11:50:30 +02:00
Thomas Gleixner
da020ce406 posix-cpu-timers: Use clock ID in posix_cpu_timer_rearm()
Extract the clock ID (PROF/VIRT/SCHED) from the clock selector and use it
as argument to the sample functions. That allows to simplify them once all
callers are fixed.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192920.245357769@linutronix.de
2019-08-28 11:50:29 +02:00
Thomas Gleixner
99093c5b81 posix-cpu-timers: Use clock ID in posix_cpu_timer_get()
Extract the clock ID (PROF/VIRT/SCHED) from the clock selector and use it
as argument to the sample functions. That allows to simplify them once all
callers are fixed.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192920.155487201@linutronix.de
2019-08-28 11:50:29 +02:00
Thomas Gleixner
c7a37c6f4c posix-cpu-timers: Use clock ID in posix_cpu_timer_set()
Extract the clock ID (PROF/VIRT/SCHED) from the clock selector and use it
as argument to the sample functions. That allows to simplify them once all
callers are fixed.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192920.050770464@linutronix.de
2019-08-28 11:50:28 +02:00
Thomas Gleixner
24ab7f5a7b posix-cpu-timers: Consolidate thread group sample code
cpu_clock_sample_group() and cpu_timer_sample_group() are almost the
same. Before the rename one called thread_group_cputimer() and the other
thread_group_cputime(). Really intuitive function names.

Consolidate the functions and also avoid the thread traversal when
the thread group's accounting is already active.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192919.960966884@linutronix.de
2019-08-28 11:50:28 +02:00
Thomas Gleixner
c506bef424 posix-cpu-timers: Rename thread_group_cputimer() and make it static
thread_group_cputimer() is a complete misnomer. The function does two things:

 - For arming process wide timers it makes sure that the atomic time
   storage is up to date. If no cpu timer is armed yet, then the atomic
   time storage is not updated by the scheduler for performance reasons.

   In that case a full summing up of all threads needs to be done and the
   update needs to be enabled.

- Samples the current time into the caller supplied storage.

Rename it to thread_group_start_cputime(), make it static and fixup the
callsite.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192919.869350319@linutronix.de
2019-08-28 11:50:27 +02:00
Thomas Gleixner
a324956fae posix-cpu-timers: Sample directly in timer check
The thread group accounting is active, otherwise the expiry function would
not be running. Sample the thread group time directly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192919.780348088@linutronix.de
2019-08-28 11:50:27 +02:00
Thomas Gleixner
a34360d424 itimers: Use quick sample function
get_itimer() locks sighand lock and checks whether the timer is already
expired. If it is not expired then the thread group cputime accounting is
already enabled. Use the sampling function not the one which is meant for
starting a timer.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192919.689713638@linutronix.de
2019-08-28 11:50:26 +02:00
Thomas Gleixner
19298fbf45 posix-cpu-timers: Provide quick sample function for itimer
get_itimer() needs a sample of the current thread group cputime. It invokes
thread_group_cputimer() - which is a misnomer. That function also starts
eventually the group cputime accouting which is bogus because the
accounting is already active when a timer is armed.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192919.599658199@linutronix.de
2019-08-28 11:50:26 +02:00
Thomas Gleixner
e5a8b65b4c posix-cpu-timers: Use common permission check in posix_cpu_timer_create()
Yet another copy of the same thing gone...

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192919.505833418@linutronix.de
2019-08-28 11:50:25 +02:00
Thomas Gleixner
bfcf3e92c6 posix-cpu-timers: Use common permission check in posix_cpu_clock_get()
Replace the next slightly different copy of permission checks. That also
removes the necessarity to check the return value of the sample functions
because the clock id is already validated.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192919.414813172@linutronix.de
2019-08-28 11:50:25 +02:00
Thomas Gleixner
6ae40e3fdc posix-cpu-timers: Provide task validation functions
The code contains three slightly different copies of validating whether a
given clock resolves to a valid task and whether the current caller has
permissions to access it.

Create central functions. Replace check_clock() as a first step and rename
it to something sensible.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190821192919.326097175@linutronix.de
2019-08-28 11:50:24 +02:00
Alexander Shishkin
ab43762ef0 perf: Allow normal events to output AUX data
In some cases, ordinary (non-AUX) events can generate data for AUX events.
For example, PEBS events can come out as records in the Intel PT stream
instead of their usual DS records, if configured to do so.

One requirement for such events is to consistently schedule together, to
ensure that the data from the "AUX output" events isn't lost while their
corresponding AUX event is not scheduled. We use grouping to provide this
guarantee: an "AUX output" event can be added to a group where an AUX event
is a group leader, and provided that the former supports writing to the
latter.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kan.liang@linux.intel.com
Link: https://lkml.kernel.org/r/20190806084606.4021-2-alexander.shishkin@linux.intel.com
2019-08-28 11:29:38 +02:00
Douglas RAILLARD
77c84dd188 sched/cpufreq: Align trace event behavior of fast switching
Fast switching path only emits an event for the CPU of interest, whereas the
regular path emits an event for all the CPUs that had their frequency changed,
i.e. all the CPUs sharing the same policy.

With the current behavior, looking at cpu_frequency event for a given CPU that
is using the fast switching path will not give the correct frequency signal.

Signed-off-by: Douglas RAILLARD <douglas.raillard@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-08-28 11:26:46 +02:00
Alexei Starovoitov
10d274e880 bpf: introduce verifier internal test flag
Introduce BPF_F_TEST_STATE_FREQ flag to stress test parentage chain
and state pruning.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-28 00:30:11 +02:00
David S. Miller
68aaf44595 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Minor conflict in r8169, bug fix had two versions in net
and net-next, take the net-next hunks.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-27 14:23:31 -07:00
Linus Torvalds
452a04441b Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Use 32-bit index for tails calls in s390 bpf JIT, from Ilya
    Leoshkevich.

 2) Fix missed EPOLLOUT events in TCP, from Eric Dumazet. Same fix for
    SMC from Jason Baron.

 3) ipv6_mc_may_pull() should return 0 for malformed packets, not
    -EINVAL. From Stefano Brivio.

 4) Don't forget to unpin umem xdp pages in error path of
    xdp_umem_reg(). From Ivan Khoronzhuk.

 5) Fix sta object leak in mac80211, from Johannes Berg.

 6) Fix regression by not configuring PHYLINK on CPU port of bcm_sf2
    switches. From Florian Fainelli.

 7) Revert DMA sync removal from r8169 which was causing regressions on
    some MIPS Loongson platforms. From Heiner Kallweit.

 8) Use after free in flow dissector, from Jakub Sitnicki.

 9) Fix NULL derefs of net devices during ICMP processing across
    collect_md tunnels, from Hangbin Liu.

10) proto_register() memory leaks, from Zhang Lin.

11) Set NLM_F_MULTI flag in multipart netlink messages consistently,
    from John Fastabend.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (66 commits)
  r8152: Set memory to all 0xFFs on failed reg reads
  openvswitch: Fix conntrack cache with timeout
  ipv4: mpls: fix mpls_xmit for iptunnel
  nexthop: Fix nexthop_num_path for blackhole nexthops
  net: rds: add service level support in rds-info
  net: route dump netlink NLM_F_MULTI flag missing
  s390/qeth: reject oversized SNMP requests
  sock: fix potential memory leak in proto_register()
  MAINTAINERS: Add phylink keyword to SFF/SFP/SFP+ MODULE SUPPORT
  xfrm/xfrm_policy: fix dst dev null pointer dereference in collect_md mode
  ipv4/icmp: fix rt dst dev null pointer dereference
  openvswitch: Fix log message in ovs conntrack
  bpf: allow narrow loads of some sk_reuseport_md fields with offset > 0
  bpf: fix use after free in prog symbol exposure
  bpf: fix precision tracking in presence of bpf2bpf calls
  flow_dissector: Fix potential use-after-free on BPF_PROG_DETACH
  Revert "r8169: remove not needed call to dma_sync_single_for_device"
  ipv6: propagate ipv6_add_dev's error returns out of ipv6_find_idev
  net/ncsi: Fix the payload copying for the request coming from Netlink
  qed: Add cleanup in qed_slowpath_start()
  ...
2019-08-27 10:12:48 -07:00
Marc Zyngier
2a1a3fa0f2 kallsyms: Don't let kallsyms_lookup_size_offset() fail on retrieving the first symbol
An arm64 kernel configured with

  CONFIG_KPROBES=y
  CONFIG_KALLSYMS=y
  # CONFIG_KALLSYMS_ALL is not set
  CONFIG_KALLSYMS_BASE_RELATIVE=y

reports the following kprobe failure:

  [    0.032677] kprobes: failed to populate blacklist: -22
  [    0.033376] Please take care of using kprobes.

It appears that kprobe fails to retrieve the symbol at address
0xffff000010081000, despite this symbol being in System.map:

  ffff000010081000 T __exception_text_start

This symbol is part of the first group of aliases in the
kallsyms_offsets array (symbol names generated using ugly hacks in
scripts/kallsyms.c):

  kallsyms_offsets:
          .long   0x1000 // do_undefinstr
          .long   0x1000 // efi_header_end
          .long   0x1000 // _stext
          .long   0x1000 // __exception_text_start
          .long   0x12b0 // do_cp15instr

Looking at the implementation of get_symbol_pos(), it returns the
lowest index for aliasing symbols. In this case, it return 0.

But kallsyms_lookup_size_offset() considers 0 as a failure, which
is obviously wrong (there is definitely a valid symbol living there).
In turn, the kprobe blacklisting stops abruptly, hence the original
error.

A CONFIG_KALLSYMS_ALL kernel wouldn't fail as there is always
some random symbols at the beginning of this array, which are never
looked up via kallsyms_lookup_size_offset.

Fix it by considering that get_symbol_pos() is always successful
(which is consistent with the other uses of this function).

Fixes: ffc5089196 ("[PATCH] Create kallsyms_lookup_size_offset()")
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
2019-08-27 16:19:56 +01:00
Ming Lei
b1a5a73e64 genirq/affinity: Spread vectors on node according to nr_cpu ratio
Now __irq_build_affinity_masks() spreads vectors evenly per node, but there
is a case that not all vectors have been spread when each numa node has a
different number of CPUs which triggers the warning in the spreading code.

Improve the spreading algorithm by

 - assigning vectors according to the ratio of the number of CPUs on a node
   to the number of remaining CPUs.

 - running the assignment from smaller nodes to bigger nodes to guarantee
   that every active node gets allocated at least one vector.

This ensures that all vectors are spread out. Asided of that the spread
becomes more fair if the nodes have different number of CPUs.

For example, on the following machine:
	CPU(s):              16
	On-line CPU(s) list: 0-15
	Thread(s) per core:  1
	Core(s) per socket:  8
	Socket(s):           2
	NUMA node(s):        2
	...
	NUMA node0 CPU(s):   0,1,3,5-9,11,13-15
	NUMA node1 CPU(s):   2,4,10,12

When a driver requests to allocate 8 vectors, the following spread results:

	irq 31, cpu list 2,4
	irq 32, cpu list 10,12
	irq 33, cpu list 0-1
	irq 34, cpu list 3,5
	irq 35, cpu list 6-7
	irq 36, cpu list 8-9
	irq 37, cpu list 11,13
	irq 38, cpu list 14-15

So Node 0 has now 6 and Node 1 has 2 vectors assigned. The original
algorithm assigned 4 vectors on each node which was unfair versus Node 0.

[ tglx: Massaged changelog ]

Reported-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Link: https://lkml.kernel.org/r/20190816022849.14075-3-ming.lei@redhat.com
2019-08-27 16:31:17 +02:00
Ming Lei
53c1788b7d genirq/affinity: Improve __irq_build_affinity_masks()
One invariant of __irq_build_affinity_masks() is that all CPUs in the
specified masks (cpu_mask AND node_to_cpumask for each node) should be
covered during the spread. Even though all requested vectors have been
reached, it's still required to spread vectors among remained CPUs. A
similar policy has been taken in case of 'numvecs <= nodes' already.

So remove the following check inside the loop:

	if (done >= numvecs)
		break;

Meantime assign at least 1 vector for remaining nodes if 'numvecs' vectors
have been handled already.

Also, if the specified cpumask for one numa node is empty, simply do not
spread vectors on this node.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190816022849.14075-2-ming.lei@redhat.com
2019-08-27 16:31:17 +02:00
Naveen N. Rao
ede7c460b1 bpf: handle 32-bit zext during constant blinding
Since BPF constant blinding is performed after the verifier pass, the
ALU32 instructions inserted for doubleword immediate loads don't have a
corresponding zext instruction. This is causing a kernel oops on powerpc
and can be reproduced by running 'test_cgroup_storage' with
bpf_jit_harden=2.

Fix this by emitting BPF_ZEXT during constant blinding if
prog->aux->verifier_zext is set.

Fixes: a4b1d3c1dd ("bpf: verifier: insert zero extension according to analysis result")
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-26 23:05:01 +02:00
Linus Torvalds
5a13fc3d8b Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timekeeping fix from Thomas Gleixner:
 "A single fix for a regression caused by the generic VDSO
  implementation where a math overflow causes CLOCK_BOOTTIME to become a
  random number generator"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping/vsyscall: Prevent math overflow in BOOTTIME update
2019-08-25 10:08:01 -07:00
Linus Torvalds
8a04c2ee62 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Thomas Gleixner:
 "Handle the worker management in situations where a task is scheduled
  out on a PI lock contention correctly and schedule a new worker if
  possible"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Schedule new worker even if PI-blocked
2019-08-25 10:06:12 -07:00
Linus Torvalds
05bbb9360a Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
 "Two small fixes for kprobes and perf:

   - Prevent a deadlock in kprobe_optimizer() causes by reverse lock
     ordering

   - Fix a comment typo"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kprobes: Fix potential deadlock in kprobe_optimizer()
  perf/x86: Fix typo in comment
2019-08-25 10:03:32 -07:00
Linus Torvalds
44c471e436 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Thomas Gleixner:
 "A single fix for a imbalanced kobject operation in the irq decriptor
  code which was unearthed by the new warnings in the kobject code"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Properly pair kobject_del() with kobject_add()
2019-08-25 10:00:21 -07:00
Linus Torvalds
f47edb59bb Merge branch 'akpm' (patches from Andrew)
Mergr misc fixes from Andrew Morton:
 "11 fixes"

Mostly VM fixes, one psi polling fix, and one parisc build fix.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm/kasan: fix false positive invalid-free reports with CONFIG_KASAN_SW_TAGS=y
  mm/zsmalloc.c: fix race condition in zs_destroy_pool
  mm/zsmalloc.c: migration can leave pages in ZS_EMPTY indefinitely
  mm, page_owner: handle THP splits correctly
  userfaultfd_release: always remove uffd flags and clear vm_userfaultfd_ctx
  psi: get poll_work to run when calling poll syscall next time
  mm: memcontrol: flush percpu vmevents before releasing memcg
  mm: memcontrol: flush percpu vmstats before releasing memcg
  parisc: fix compilation errrors
  mm, page_alloc: move_freepages should not examine struct page of reserved memory
  mm/z3fold.c: fix race between migration and destruction
2019-08-25 09:56:27 -07:00
Linus Torvalds
e67095fd2f dma-mapping fixes for 5.3-rc
Two fixes for regressions in this merge window:
 
  - select the Kconfig symbols for the noncoherent dma arch helpers
    on arm if swiotlb is selected, not just for LPAE to not break then
    Xen build, that uses swiotlb indirectly through swiotlb-xen
  - fix the page allocator fallback in dma_alloc_contiguous if the CMA
    allocation fails
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl1hvn4LHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYON4w//Recfoy5T2Q4Gfjp1xVKGbr2sP7J93Vs7VCyQNZmX
 PrtzhmNKs4gxCEXVgHm+GVA+IJwQFqDtSFaPb8q3GQ+qM9NUDF4ScMFpfrLZsFr1
 dorm5kC1xcwrQtWjS1CQS/Gj0VBtWiMQOoUcAESMqgBIUo4ssj3Ny+vnh8hWgAOs
 oVDgOM4wt35bW0Pv/iY44uQzOq7xcYJUUYtPIiP9vMDrhPsxe6D1DgFQ4HZKJWix
 uS3BjZnsZDnLltXM/0CKdRV9wLF+jHYP/wJTztksRlr/A5V3FJ8lJIvgphxG1v3J
 tDfQs4BNuGWBjqdg+Qo6qOPEL9krvVYYVVql93DXwtPK/cJW1Z+0glgC2rbbHmIy
 ew35DFnYm9v0sFLZnbpuoHd6sQ9G59nTZstkqt/Z/hldBvKotwBpeuILAcMC9Nlw
 3iYW6Sz5L7cmkifC8OvopKKJWVoW5rVtMrVQw5niBiZVERtWbY825r/7ju2xYhZC
 iSAaUHT5wNtXsXQOTrFQ5LzTDBtgGyXRXgvNagEHhBf120jBQfOhvOCVT2HHOxdy
 5vx7xeeRS0M2HpxIsmd3XQjIUQEY9x1to4FKiYczGM1kcKeyWWBMFOXfLxe2Rmhg
 h14lbfsAxIEWdFkJAVFhjyjzC6IzxyVGtHCxw1iw0VgGzYATO/K6Oo8T2hG3HagR
 abQ=
 =DXk9
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.3-5' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fixes from Christoph Hellwig:
 "Two fixes for regressions in this merge window:

   - select the Kconfig symbols for the noncoherent dma arch helpers on
     arm if swiotlb is selected, not just for LPAE to not break then Xen
     build, that uses swiotlb indirectly through swiotlb-xen

   - fix the page allocator fallback in dma_alloc_contiguous if the CMA
     allocation fails"

* tag 'dma-mapping-5.3-5' of git://git.infradead.org/users/hch/dma-mapping:
  dma-direct: fix zone selection after an unaddressable CMA allocation
  arm: select the dma-noncoherent symbols for all swiotlb builds
2019-08-24 20:00:11 -07:00
Jason Xing
7b2b55da1d psi: get poll_work to run when calling poll syscall next time
Only when calling the poll syscall the first time can user receive
POLLPRI correctly.  After that, user always fails to acquire the event
signal.

Reproduce case:
 1. Get the monitor code in Documentation/accounting/psi.txt
 2. Run it, and wait for the event triggered.
 3. Kill and restart the process.

The question is why we can end up with poll_scheduled = 1 but the work
not running (which would reset it to 0).  And the answer is because the
scheduling side sees group->poll_kworker under RCU protection and then
schedules it, but here we cancel the work and destroy the worker.  The
cancel needs to pair with resetting the poll_scheduled flag.

Link: http://lkml.kernel.org/r/1566357985-97781-1-git-send-email-joseph.qi@linux.alibaba.com
Signed-off-by: Jason Xing <kerneljasonxing@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24 19:48:42 -07:00
David S. Miller
211c462452 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2019-08-24

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix verifier precision tracking with BPF-to-BPF calls, from Alexei.

2) Fix a use-after-free in prog symbol exposure, from Daniel.

3) Several s390x JIT fixes plus BE related fixes in BPF kselftests, from Ilya.

4) Fix memory leak by unpinning XDP umem pages in error path, from Ivan.

5) Fix a potential use-after-free on flow dissector detach, from Jakub.

6) Fix bpftool to close prog fd after showing metadata, from Quentin.

7) BPF kselftest config and TEST_PROGS_EXTENDED fixes, from Anders.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-23 17:34:11 -07:00
Daniel Borkmann
c751798aa2 bpf: fix use after free in prog symbol exposure
syzkaller managed to trigger the warning in bpf_jit_free() which checks via
bpf_prog_kallsyms_verify_off() for potentially unlinked JITed BPF progs
in kallsyms, and subsequently trips over GPF when walking kallsyms entries:

  [...]
  8021q: adding VLAN 0 to HW filter on device batadv0
  8021q: adding VLAN 0 to HW filter on device batadv0
  WARNING: CPU: 0 PID: 9869 at kernel/bpf/core.c:810 bpf_jit_free+0x1e8/0x2a0
  Kernel panic - not syncing: panic_on_warn set ...
  CPU: 0 PID: 9869 Comm: kworker/0:7 Not tainted 5.0.0-rc8+ #1
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  Workqueue: events bpf_prog_free_deferred
  Call Trace:
   __dump_stack lib/dump_stack.c:77 [inline]
   dump_stack+0x113/0x167 lib/dump_stack.c:113
   panic+0x212/0x40b kernel/panic.c:214
   __warn.cold.8+0x1b/0x38 kernel/panic.c:571
   report_bug+0x1a4/0x200 lib/bug.c:186
   fixup_bug arch/x86/kernel/traps.c:178 [inline]
   do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:271
   do_invalid_op+0x36/0x40 arch/x86/kernel/traps.c:290
   invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:973
  RIP: 0010:bpf_jit_free+0x1e8/0x2a0
  Code: 02 4c 89 e2 83 e2 07 38 d0 7f 08 84 c0 0f 85 86 00 00 00 48 ba 00 02 00 00 00 00 ad de 0f b6 43 02 49 39 d6 0f 84 5f fe ff ff <0f> 0b e9 58 fe ff ff 48 b8 00 00 00 00 00 fc ff df 4c 89 e2 48 c1
  RSP: 0018:ffff888092f67cd8 EFLAGS: 00010202
  RAX: 0000000000000007 RBX: ffffc90001947000 RCX: ffffffff816e9d88
  RDX: dead000000000200 RSI: 0000000000000008 RDI: ffff88808769f7f0
  RBP: ffff888092f67d00 R08: fffffbfff1394059 R09: fffffbfff1394058
  R10: fffffbfff1394058 R11: ffffffff89ca02c7 R12: ffffc90001947002
  R13: ffffc90001947020 R14: ffffffff881eca80 R15: ffff88808769f7e8
  BUG: unable to handle kernel paging request at fffffbfff400d000
  #PF error: [normal kernel read fault]
  PGD 21ffee067 P4D 21ffee067 PUD 21ffed067 PMD 9f942067 PTE 0
  Oops: 0000 [#1] PREEMPT SMP KASAN
  CPU: 0 PID: 9869 Comm: kworker/0:7 Not tainted 5.0.0-rc8+ #1
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  Workqueue: events bpf_prog_free_deferred
  RIP: 0010:bpf_get_prog_addr_region kernel/bpf/core.c:495 [inline]
  RIP: 0010:bpf_tree_comp kernel/bpf/core.c:558 [inline]
  RIP: 0010:__lt_find include/linux/rbtree_latch.h:115 [inline]
  RIP: 0010:latch_tree_find include/linux/rbtree_latch.h:208 [inline]
  RIP: 0010:bpf_prog_kallsyms_find+0x107/0x2e0 kernel/bpf/core.c:632
  Code: 00 f0 ff ff 44 38 c8 7f 08 84 c0 0f 85 fa 00 00 00 41 f6 45 02 01 75 02 0f 0b 48 39 da 0f 82 92 00 00 00 48 89 d8 48 c1 e8 03 <42> 0f b6 04 30 84 c0 74 08 3c 03 0f 8e 45 01 00 00 8b 03 48 c1 e0
  [...]

Upon further debugging, it turns out that whenever we trigger this
issue, the kallsyms removal in bpf_prog_ksym_node_del() was /skipped/
but yet bpf_jit_free() reported that the entry is /in use/.

Problem is that symbol exposure via bpf_prog_kallsyms_add() but also
perf_event_bpf_event() were done /after/ bpf_prog_new_fd(). Once the
fd is exposed to the public, a parallel close request came in right
before we attempted to do the bpf_prog_kallsyms_add().

Given at this time the prog reference count is one, we start to rip
everything underneath us via bpf_prog_release() -> bpf_prog_put().
The memory is eventually released via deferred free, so we're seeing
that bpf_jit_free() has a kallsym entry because we added it from
bpf_prog_load() but /after/ bpf_prog_put() from the remote CPU.

Therefore, move both notifications /before/ we install the fd. The
issue was never seen between bpf_prog_alloc_id() and bpf_prog_new_fd()
because upon bpf_prog_get_fd_by_id() we'll take another reference to
the BPF prog, so we're still holding the original reference from the
bpf_prog_load().

Fixes: 6ee52e2a3f ("perf, bpf: Introduce PERF_RECORD_BPF_EVENT")
Fixes: 74451e66d5 ("bpf: make jited programs visible in traces")
Reported-by: syzbot+bd3bba6ff3fcea7a6ec6@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Song Liu <songliubraving@fb.com>
2019-08-24 01:17:47 +02:00
Alexei Starovoitov
6754172c20 bpf: fix precision tracking in presence of bpf2bpf calls
While adding extra tests for precision tracking and extra infra
to adjust verifier heuristics the existing test
"calls: cross frame pruning - liveness propagation" started to fail.
The root cause is the same as described in verifer.c comment:

 * Also if parent's curframe > frame where backtracking started,
 * the verifier need to mark registers in both frames, otherwise callees
 * may incorrectly prune callers. This is similar to
 * commit 7640ead939 ("bpf: verifier: make sure callees don't prune with caller differences")
 * For now backtracking falls back into conservative marking.

Turned out though that returning -ENOTSUPP from backtrack_insn() and
doing mark_all_scalars_precise() in the current parentage chain is not enough.
Depending on how is_state_visited() heuristic is creating parentage chain
it's possible that callee will incorrectly prune caller.
Fix the issue by setting precise=true earlier and more aggressively.
Before this fix the precision tracking _within_ functions that don't do
bpf2bpf calls would still work. Whereas now precision tracking is completely
disabled when bpf2bpf calls are present anywhere in the program.

No difference in cilium tests (they don't have bpf2bpf calls).
No difference in test_progs though some of them have bpf2bpf calls,
but precision tracking wasn't effective there.

Fixes: b5dc0163d8 ("bpf: precise scalar_value tracking")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-24 01:17:12 +02:00
Linus Torvalds
e3fb13b7e4 Modules fixes for v5.3-rc6
- Fix BUG_ON() being triggered in frob_text() due to non-page-aligned module
   sections
 
 Signed-off-by: Jessica Yu <jeyu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJdX/7kAAoJEMBFfjjOO8Fy/l0P/j7S2HY31eD+ol5vsohp2rWv
 CDDNiDlDWKkmCNzQy1giynbH5EVOZCZyoLcQgVTMJKHoRVeS/Mo9BU9FU4ZW2WWK
 DxoCUN4WrCvhCrt5MLiZEkrAgIyYIoGeiHTRyBYSv7PrxFNxSWuQQQZ7IRssWHSQ
 1sbf1rt9xX6uddQyCaZlCTzPrc2M6v2YVMFR2Wq5Ts8qazbtf6ZoUXfg6vYMc/bm
 k55aJPKmq/HEfifMZxrwRPIPD+v2MddZ8R6ti3QSWMbzSSNiUR8rlJPIa7GQy5ia
 Xbut/a9Q8/2qR81D2Mkcw1xlff7QI0i55JAmJ8OgkXA55Xyv+TSzW/RJRmCanYAo
 gj4TCncbtAj8GYL+R309rhPjG222/GA42/Gosugdfa15ojn/LI9g1XjOdyPpT38I
 RpnhL4kxzpCbUnWX++fB8tmMk0pQdlsorhTcA+y8zyFAifdf6KC7Fw7SEpKWtPjH
 gbOkjgjpFo6JOS9ywAeoMet5BQ0lngGlS3iLdQA/zOr0q3JWTI3BmXa3DrREnJSs
 6R+alYUAKAbwolB4yLvlavfcjIJTUvpP++VZHR+xJo51t7onkv8a1T8844Y6xVaq
 WItpudredat7m+dZEd4cYmbMmEzn8Z6hvbLgzJOOv2ZSFeURuZlK2Xsbz8Yxdfeu
 sGjDYiKQOaV0VmI/l8fY
 =oXjz
 -----END PGP SIGNATURE-----

Merge tag 'modules-for-v5.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux

Pull modules fixes from Jessica Yu:
 "Fix BUG_ON() being triggered in frob_text() due to non-page-aligned
  module sections"

* tag 'modules-for-v5.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
  modules: page-align module section allocations only for arches supporting strict module rwx
  modules: always page-align module section allocations
2019-08-23 09:22:00 -07:00
Thomas Gleixner
b99328a60a timekeeping/vsyscall: Prevent math overflow in BOOTTIME update
The VDSO update for CLOCK_BOOTTIME has a overflow issue as it shifts the
nanoseconds based boot time offset left by the clocksource shift. That
overflows once the boot time offset becomes large enough. As a consequence
CLOCK_BOOTTIME in the VDSO becomes a random number causing applications to
misbehave.

Fix it by storing a timespec64 representation of the offset when boot time
is adjusted and add that to the MONOTONIC base time value in the vdso data
page. Using the timespec64 representation avoids a 64bit division in the
update code.

Fixes: 44f57d788e ("timekeeping: Provide a generic update_vsyscall() implementation")
Reported-by: Chris Clayton <chris2553@googlemail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Chris Clayton <chris2553@googlemail.com>
Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1908221257580.1983@nanos.tec.linutronix.de
2019-08-23 02:12:11 +02:00
Ingo Molnar
6c06b66e95 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU and LKMM changes from Paul E. McKenney:

 - A few more RCU flavor consolidation cleanups.

 - Miscellaneous fixes.

 - Updates to RCU's list-traversal macros improving lockdep usability.

 - Torture-test updates.

 - Forward-progress improvements for no-CBs CPUs: Avoid ignoring
   incoming callbacks during grace-period waits.

 - Forward-progress improvements for no-CBs CPUs: Use ->cblist
   structure to take advantage of others' grace periods.

 - Also added a small commit that avoids needlessly inflicting
   scheduler-clock ticks on callback-offloaded CPUs.

 - Forward-progress improvements for no-CBs CPUs: Reduce contention
   on ->nocb_lock guarding ->cblist.

 - Forward-progress improvements for no-CBs CPUs: Add ->nocb_bypass
   list to further reduce contention on ->nocb_lock guarding ->cblist.

 - LKMM updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-08-22 20:52:04 +02:00
Jason Gunthorpe
daa138a58c Merge branch 'odp_fixes' into hmm.git
From rdma.git

Jason Gunthorpe says:

====================
This is a collection of general cleanups for ODP to clarify some of the
flows around umem creation and use of the interval tree.
====================

The branch is based on v5.3-rc5 due to dependencies, and is being taken
into hmm.git due to dependencies in the next patches.

* odp_fixes:
  RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
  RDMA/mlx5: Use ib_umem_start instead of umem.address
  RDMA/core: Make invalidate_range a device operation
  RDMA/odp: Use kvcalloc for the dma_list and page_list
  RDMA/odp: Check for overflow when computing the umem_odp end
  RDMA/odp: Provide ib_umem_odp_release() to undo the allocs
  RDMA/odp: Split creating a umem_odp from ib_umem_get
  RDMA/odp: Make the three ways to create a umem_odp clear
  RMDA/odp: Consolidate umem_odp initialization
  RDMA/odp: Make it clearer when a umem is an implicit ODP umem
  RDMA/odp: Iterate over the whole rbtree directly
  RDMA/odp: Use the common interval tree library instead of generic
  RDMA/mlx5: Fix MR npages calculation for IB_ACCESS_HUGETLB

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-21 20:58:18 -03:00
Thomas Gleixner
dce3e8fd03 posix-cpu-timers: Remove tsk argument from run_posix_cpu_timers()
It's always current. Don't give people wrong ideas.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190819143801.945469967@linutronix.de
2019-08-21 20:27:16 +02:00
Thomas Gleixner
692117c1f7 posix-cpu-timers: Sanitize bogus WARNONS
Warning when p == NULL and then proceeding and dereferencing p does not
make any sense as the kernel will crash with a NULL pointer dereference
right away.

Bailing out when p == NULL and returning an error code does not cure the
underlying problem which caused p to be NULL. Though it might allow to
do proper debugging.

Same applies to the clock id check in set_process_cpu_timer().

Clean them up and make them return without trying to do further damage.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190819143801.846497772@linutronix.de
2019-08-21 20:27:15 +02:00
Peter Wu
5cbd22c179 bpf: clarify description for CONFIG_BPF_EVENTS
PERF_EVENT_IOC_SET_BPF supports uprobes since v4.3, and tracepoints
since v4.7 via commit 04a22fae4c ("tracing, perf: Implement BPF
programs attached to uprobes"), and commit 98b5c2c65c ("perf, bpf:
allow bpf programs attach to tracepoints") respectively.

Signed-off-by: Peter Wu <peter@lekensteyn.nl>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-08-21 10:17:24 -07:00
Julien Grall
68b2c8c1e4 hrtimer: Don't take expiry_lock when timer is currently migrated
migration_base is used as a placeholder when an hrtimer is migrated to a
different CPU. In the case that hrtimer_cancel_wait_running() hits a timer
which is currently migrated it would pointlessly acquire the expiry lock of
the migration base, which is even not initialized.

Surely it could be initialized, but there is absolutely no point in
acquiring this lock because the timer is guaranteed not to run it's
callback for which the caller waits to finish on that base. So it would
just do the inc/lock/dec/unlock dance for nothing.

As the base switch is short and non-preemptible, there is no issue when the
wait function returns immediately.

The timer base and base->cpu_base cannot be NULL in the code path which is
invoking that, so just replace those checks with a check whether base is
migration base.

[ tglx: Updated from RT patch. Massaged changelog. Added comment. ]

Signed-off-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190821092409.13225-4-julien.grall@arm.com
2019-08-21 16:10:01 +02:00
Julien Grall
dd2261ed45 hrtimer: Protect lockless access to timer->base
The update to timer->base is protected by the base->cpu_base->lock().
However, hrtimer_cancel_wait_running() does access it lockless.  So the
compiler is allowed to refetch timer->base which can cause havoc when the
timer base is changed concurrently.

Use READ_ONCE() to prevent this.

[ tglx: Adapted from a RT patch ]

Signed-off-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190821092409.13225-2-julien.grall@arm.com
2019-08-21 16:10:01 +02:00
Santosh Sivaraj
49ec9177b8 extable: Add function to search only kernel exception table
Certain architecture specific operating modes (e.g., in powerpc machine
check handler that is unable to access vmalloc memory), the
search_exception_tables cannot be called because it also searches the
module exception tables if entry is not found in the kernel exception
table.

Signed-off-by: Santosh Sivaraj <santosh@fossix.org>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820081352.8641-5-santosh@fossix.org
2019-08-21 22:23:48 +10:00
He Zhe
3b5be16c7e modules: page-align module section allocations only for arches supporting strict module rwx
We should keep the case of "#define debug_align(X) (X)" for all arches
without CONFIG_HAS_STRICT_MODULE_RWX ability, which would save people, who
are sensitive to system size, a lot of memory when using modules,
especially for embedded systems. This is also the intention of the
original #ifdef... statement and still valid for now.

Note that this still keeps the effect of the fix of the following commit,
38f054d549 ("modules: always page-align module section allocations"),
since when CONFIG_ARCH_HAS_STRICT_MODULE_RWX is enabled, module pages are
aligned.

Signed-off-by: He Zhe <zhe.he@windriver.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2019-08-21 10:43:56 +02:00
Amit Kucheria
c3082a674f PM: QoS: Get rid of unused flags
The network_latency and network_throughput flags for PM-QoS have not
found much use in drivers or in userspace since they were introduced.

Commit 4a733ef1be ("mac80211: remove PM-QoS listener") removed the
only user PM_QOS_NETWORK_LATENCY in the kernel a while ago and there
don't seem to be any userspace tools using the character device files
either.

PM_QOS_MEMORY_BANDWIDTH was never even added to the trace events.

Remove all the flags except cpu_dma_latency.

Signed-off-by: Amit Kucheria <amit.kucheria@linaro.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-08-21 00:38:54 +02:00
Tri Vo
c8377adfa7 PM / wakeup: Show wakeup sources stats in sysfs
Add an ID and a device pointer to 'struct wakeup_source'. Use them to to
expose wakeup sources statistics in sysfs under
/sys/class/wakeup/wakeup<ID>/*.

Co-developed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Co-developed-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Tri Vo <trong@android.com>
Tested-by: Kalesh Singh <kaleshsingh@google.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-08-21 00:20:40 +02:00
Tri Vo
2434aea58e PM / wakeup: Use wakeup_source_register() in wakelock.c
kernel/power/wakelock.c duplicates wakeup source creation and
registration code from drivers/base/power/wakeup.c.

Change struct wakelock's wakeup source to a pointer and use
wakeup_source_register() function to create and register said wakeup
source. Use wakeup_source_unregister() on cleanup path.

Signed-off-by: Tri Vo <trong@android.com>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-08-21 00:20:40 +02:00
Christoph Hellwig
90ae409f9e dma-direct: fix zone selection after an unaddressable CMA allocation
The new dma_alloc_contiguous hides if we allocate CMA or regular
pages, and thus fails to retry a ZONE_NORMAL allocation if the CMA
allocation succeeds but isn't addressable.  That means we either fail
outright or dip into a small zone that might not succeed either.

Thanks to Hillf Danton for debugging this issue.

Fixes: b1d2dc009d ("dma-contiguous: add dma_{alloc,free}_contiguous() helpers")
Reported-by: Tobias Klausmann <tobias.johannes.klausmann@mni.thm.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Tobias Klausmann <tobias.johannes.klausmann@mni.thm.de>
2019-08-21 07:14:10 +09:00
Thomas Gleixner
7cb9a94c15 posix-cpu-timers: Fixup stale comment
The comment above cleanup_timers() is outdated. The timers are only removed
from the task/process list heads but not modified in any other way.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190819143801.747233612@linutronix.de
2019-08-20 22:09:53 +02:00
Frederic Weisbecker
0bee3b601b hrtimer: Improve comments on handling priority inversion against softirq kthread
The handling of a priority inversion between timer cancelling and a a not
well defined possible preemption of softirq kthread is not very clear.

Especially in the posix timers side it's unclear why there is a specific RT
wait callback.

All the nice explanations can be found in the initial changelog of
f61eff83ce (hrtimer: Prepare support for PREEMPT_RT").

Extract the detailed informations from there and put it into comments.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190820132656.GC2093@lenoir
2019-08-20 22:05:46 +02:00
Thomas Gleixner
ec8f954a40 posix-timers: Use a callback for cancel synchronization on PREEMPT_RT
Posix timer delete retry loops are affected by the same priority inversion
and live lock issues as the other timers.
    
Provide a RT specific synchronization function which keeps a reference to
the timer by holding rcu read lock to prevent the timer from being freed,
dropping the timer lock and invoking the timer specific wait function via a
new callback.
    
This does not yet cover posix CPU timers because they need more special
treatment on PREEMPT_RT.

[ This is folded into the original attempt which did not use a callback. ]

Originally-by: Anna-Maria Gleixenr <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190819143801.656864506@linutronix.de
2019-08-20 22:05:46 +02:00
Catalin Marinas
3e91ec89f5 arm64: Tighten the PR_{SET, GET}_TAGGED_ADDR_CTRL prctl() unused arguments
Require that arg{3,4,5} of the PR_{SET,GET}_TAGGED_ADDR_CTRL prctl and
arg2 of the PR_GET_TAGGED_ADDR_CTRL prctl() are zero rather than ignored
for future extensions.

Acked-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
2019-08-20 18:17:55 +01:00
Quentin Monnet
1b9ed84ecf bpf: add new BPF_BTF_GET_NEXT_ID syscall command
Add a new command for the bpf() system call: BPF_BTF_GET_NEXT_ID is used
to cycle through all BTF objects loaded on the system.

The motivation is to be able to inspect (list) all BTF objects presents
on the system.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-08-20 09:51:06 -07:00
Quentin Monnet
3481e64bbe bpf: add BTF ids in procfs for file descriptors to BTF objects
Implement the show_fdinfo hook for BTF FDs file operations, and make it
print the id of the BTF object. This allows for a quick retrieval of the
BTF id from its FD; or it can help understanding what type of object
(BTF) the file descriptor points to.

v2:
- Do not expose data_size, only btf_id, in FD info.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-20 16:19:12 +02:00
YueHaibing
ede6bc88d6 bpf: Use PTR_ERR_OR_ZERO in xsk_map_inc()
Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-20 16:03:52 +02:00
Christoph Hellwig
6869b7b206 memremap: provide a not device managed memremap_pages
The kvmppc ultravisor code wants a device private memory pool that is
system wide and not attached to a device.  Instead of faking up one
provide a low-level memremap_pages for it.

Link: https://lore.kernel.org/r/20190818090557.17853-5-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20 09:41:35 -03:00
Christoph Hellwig
6f42193fd8 memremap: don't use a separate devm action for devmap_managed_enable_get
Just clean up for early failures and then piggy back on
devm_memremap_pages_release.  This helps with a pending not device managed
version of devm_memremap_pages.

Link: https://lore.kernel.org/r/20190818090557.17853-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20 09:41:35 -03:00
Christoph Hellwig
fdc029b19d memremap: remove the dev field in struct dev_pagemap
The dev field in struct dev_pagemap is only used to print dev_name in two
places, which are at best nice to have.  Just remove the field and thus
the name in those two messages.

Link: https://lore.kernel.org/r/20190818090557.17853-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20 09:41:35 -03:00
Christoph Hellwig
0c38519039 resource: add a not device managed request_free_mem_region variant
Factor out the guts of devm_request_free_mem_region so that we can
implement both a device managed and a manually release version as tiny
wrappers around it.

Link: https://lore.kernel.org/r/20190818090557.17853-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20 09:39:41 -03:00
Jason Gunthorpe
c7d8b7824f hmm: use mmu_notifier_get/put for 'struct hmm'
This is a significant simplification, it eliminates all the remaining
'hmm' stuff in mm_struct, eliminates krefing along the critical notifier
paths, and takes away all the ugly locking and abuse of page_table_lock.

mmu_notifier_get() provides the single struct hmm per struct mm which
eliminates mm->hmm.

It also directly guarantees that no mmu_notifier op callback is callable
while concurrent free is possible, this eliminates all the krefs inside
the mmu_notifier callbacks.

The remaining krefs in the range code were overly cautious, drivers are
already not permitted to free the mirror while a range exists.

Link: https://lore.kernel.org/r/20190806231548.25242-6-jgg@ziepe.ca
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20 09:35:02 -03:00
Matthew Garrett
29d3c1c8df kexec: Allow kexec_file() with appropriate IMA policy when locked down
Systems in lockdown mode should block the kexec of untrusted kernels.
For x86 and ARM we can ensure that a kernel is trustworthy by validating
a PE signature, but this isn't possible on other architectures. On those
platforms we can use IMA digital signatures instead. Add a function to
determine whether IMA has or will verify signatures for a given event type,
and if so permit kexec_file() even if the kernel is otherwise locked down.
This is restricted to cases where CONFIG_INTEGRITY_TRUSTED_KEYRING is set
in order to prevent an attacker from loading additional keys at runtime.

Signed-off-by: Matthew Garrett <mjg59@google.com>
Acked-by: Mimi Zohar <zohar@linux.ibm.com>
Cc: Dmitry Kasatkin <dmitry.kasatkin@gmail.com>
Cc: linux-integrity@vger.kernel.org
Signed-off-by: James Morris <jmorris@namei.org>
2019-08-19 21:54:16 -07:00
David Howells
b0c8fdc7fd lockdown: Lock down perf when in confidentiality mode
Disallow the use of certain perf facilities that might allow userspace to
access kernel data.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Matthew Garrett <mjg59@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Signed-off-by: James Morris <jmorris@namei.org>
2019-08-19 21:54:16 -07:00
David Howells
9d1f8be5cf bpf: Restrict bpf when kernel lockdown is in confidentiality mode
bpf_read() and bpf_read_str() could potentially be abused to (eg) allow
private keys in kernel memory to be leaked. Disable them if the kernel
has been locked down in confidentiality mode.

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Matthew Garrett <mjg59@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
cc: netdev@vger.kernel.org
cc: Chun-Yi Lee <jlee@suse.com>
cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: James Morris <jmorris@namei.org>
2019-08-19 21:54:16 -07:00
David Howells
a94549dd87 lockdown: Lock down tracing and perf kprobes when in confidentiality mode
Disallow the creation of perf and ftrace kprobes when the kernel is
locked down in confidentiality mode by preventing their registration.
This prevents kprobes from being used to access kernel memory to steal
crypto data, but continues to allow the use of kprobes from signed
modules.

Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Matthew Garrett <mjg59@google.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: davem@davemloft.net
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: James Morris <jmorris@namei.org>
2019-08-19 21:54:16 -07:00
David Howells
20657f66ef lockdown: Lock down module params that specify hardware parameters (eg. ioport)
Provided an annotation for module parameters that specify hardware
parameters (such as io ports, iomem addresses, irqs, dma channels, fixed
dma buffers and other types).

Suggested-by: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Matthew Garrett <mjg59@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Jessica Yu <jeyu@kernel.org>
Signed-off-by: James Morris <jmorris@namei.org>
2019-08-19 21:54:16 -07:00
Josh Boyer
38bd94b8a1 hibernate: Disable when the kernel is locked down
There is currently no way to verify the resume image when returning
from hibernate.  This might compromise the signed modules trust model,
so until we can work with signed hibernate images we disable it when the
kernel is locked down.

Signed-off-by: Josh Boyer <jwboyer@fedoraproject.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Matthew Garrett <mjg59@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: rjw@rjwysocki.net
Cc: pavel@ucw.cz
cc: linux-pm@vger.kernel.org
Signed-off-by: James Morris <jmorris@namei.org>
2019-08-19 21:54:15 -07:00
Jiri Bohac
155bdd30af kexec_file: Restrict at runtime if the kernel is locked down
When KEXEC_SIG is not enabled, kernel should not load images through
kexec_file systemcall if the kernel is locked down.

[Modified by David Howells to fit with modifications to the previous patch
 and to return -EPERM if the kernel is locked down for consistency with
 other lockdowns. Modified by Matthew Garrett to remove the IMA
 integration, which will be replaced by integrating with the IMA
 architecture policy patches.]

Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Matthew Garrett <mjg59@google.com>
cc: kexec@lists.infradead.org
Signed-off-by: James Morris <jmorris@namei.org>
2019-08-19 21:54:15 -07:00
Jiri Bohac
99d5cadfde kexec_file: split KEXEC_VERIFY_SIG into KEXEC_SIG and KEXEC_SIG_FORCE
This is a preparatory patch for kexec_file_load() lockdown.  A locked down
kernel needs to prevent unsigned kernel images from being loaded with
kexec_file_load().  Currently, the only way to force the signature
verification is compiling with KEXEC_VERIFY_SIG.  This prevents loading
usigned images even when the kernel is not locked down at runtime.

This patch splits KEXEC_VERIFY_SIG into KEXEC_SIG and KEXEC_SIG_FORCE.
Analogous to the MODULE_SIG and MODULE_SIG_FORCE for modules, KEXEC_SIG
turns on the signature verification but allows unsigned images to be
loaded.  KEXEC_SIG_FORCE disallows images without a valid signature.

Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Matthew Garrett <mjg59@google.com>
cc: kexec@lists.infradead.org
Signed-off-by: James Morris <jmorris@namei.org>
2019-08-19 21:54:15 -07:00
Matthew Garrett
7d31f4602f kexec_load: Disable at runtime if the kernel is locked down
The kexec_load() syscall permits the loading and execution of arbitrary
code in ring 0, which is something that lock-down is meant to prevent. It
makes sense to disable kexec_load() in this situation.

This does not affect kexec_file_load() syscall which can check for a
signature on the image to be booted.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Matthew Garrett <mjg59@google.com>
Acked-by: Dave Young <dyoung@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
cc: kexec@lists.infradead.org
Signed-off-by: James Morris <jmorris@namei.org>
2019-08-19 21:54:15 -07:00
David Howells
49fcf732bd lockdown: Enforce module signatures if the kernel is locked down
If the kernel is locked down, require that all modules have valid
signatures that we can verify.

I have adjusted the errors generated:

 (1) If there's no signature (ENODATA) or we can't check it (ENOPKG,
     ENOKEY), then:

     (a) If signatures are enforced then EKEYREJECTED is returned.

     (b) If there's no signature or we can't check it, but the kernel is
	 locked down then EPERM is returned (this is then consistent with
	 other lockdown cases).

 (2) If the signature is unparseable (EBADMSG, EINVAL), the signature fails
     the check (EKEYREJECTED) or a system error occurs (eg. ENOMEM), we
     return the error we got.

Note that the X.509 code doesn't check for key expiry as the RTC might not
be valid or might not have been transferred to the kernel's clock yet.

 [Modified by Matthew Garrett to remove the IMA integration. This will
  be replaced with integration with the IMA architecture policy
  patchset.]

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Matthew Garrett <matthewgarrett@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Jessica Yu <jeyu@kernel.org>
Signed-off-by: James Morris <jmorris@namei.org>
2019-08-19 21:54:15 -07:00
Linus Torvalds
287c55ed7d Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull kernel thread signal handling fix from Eric Biederman:
 "I overlooked the fact that kernel threads are created with all signals
  set to SIG_IGN, and accidentally caused a regression in cifs and drbd
  when replacing force_sig with send_sig.

  This is my fix for that regression. I add a new function
  allow_kernel_signal which allows kernel threads to receive signals
  sent from the kernel, but continues to ignore all signals sent from
  userspace. This ensures the user space interface for cifs and drbd
  remain the same.

  These kernel threads depend on blocking networking calls which block
  until something is received or a signal is pending. Making receiving
  of signals somewhat necessary for these kernel threads.

  Perhaps someday we can cleanup those interfaces and remove
  allow_kernel_signal. If not allow_kernel_signal is pretty trivial and
  clearly documents what is going on so I don't think we will mind
  carrying it"

* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  signal: Allow cifs and drbd to receive their terminating signals
2019-08-19 16:17:59 -07:00
Michael Kelley
d0ff14fdc9 genirq: Properly pair kobject_del() with kobject_add()
If alloc_descs() fails before irq_sysfs_init() has run, free_desc() in the
cleanup path will call kobject_del() even though the kobject has not been
added with kobject_add().

Fix this by making the call to kobject_del() conditional on whether
irq_sysfs_init() has run.

This problem surfaced because commit aa30f47cf6 ("kobject: Add support
for default attribute groups to kobj_type") makes kobject_del() stricter
about pairing with kobject_add(). If the pairing is incorrrect, a WARNING
and backtrace occur in sysfs_remove_group() because there is no parent.

[ tglx: Add a comment to the code and make it work with CONFIG_SYSFS=n ]

Fixes: ecb3f394c5 ("genirq: Expose interrupt information through sysfs")
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1564703564-4116-1-git-send-email-mikelley@microsoft.com
2019-08-19 21:41:19 +02:00
David S. Miller
446bf64b61 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Merge conflict of mlx5 resolved using instructions in merge
commit 9566e650bf.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-19 11:54:03 -07:00
Thomas Gleixner
b6a32bbd87 genirq: Force interrupt threading on RT
Switch force_irqthreads from a boot time modifiable variable to a compile
time constant when CONFIG_PREEMPT_RT is enabled.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190816160923.12855-1-bigeasy@linutronix.de
2019-08-19 15:45:48 +02:00
Eric W. Biederman
33da8e7c81 signal: Allow cifs and drbd to receive their terminating signals
My recent to change to only use force_sig for a synchronous events
wound up breaking signal reception cifs and drbd.  I had overlooked
the fact that by default kthreads start out with all signals set to
SIG_IGN.  So a change I thought was safe turned out to have made it
impossible for those kernel thread to catch their signals.

Reverting the work on force_sig is a bad idea because what the code
was doing was very much a misuse of force_sig.  As the way force_sig
ultimately allowed the signal to happen was to change the signal
handler to SIG_DFL.  Which after the first signal will allow userspace
to send signals to these kernel threads.  At least for
wake_ack_receiver in drbd that does not appear actively wrong.

So correct this problem by adding allow_kernel_signal that will allow
signals whose siginfo reports they were sent by the kernel through,
but will not allow userspace generated signals, and update cifs and
drbd to call allow_kernel_signal in an appropriate place so that their
thread can receive this signal.

Fixing things this way ensures that userspace won't be able to send
signals and cause problems, that it is clear which signals the
threads are expecting to receive, and it guarantees that nothing
else in the system will be affected.

This change was partly inspired by similar cifs and drbd patches that
added allow_signal.

Reported-by: ronnie sahlberg <ronniesahlberg@gmail.com>
Reported-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Tested-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Cc: Steve French <smfrench@gmail.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Fixes: 247bc9470b ("cifs: fix rmmod regression in cifs.ko caused by force_sig changes")
Fixes: 72abe3bcf0 ("signal/cifs: Fix cifs_put_tcp_session to call send_sig instead of force_sig")
Fixes: fee109901f ("signal/drbd: Use send_sig not force_sig")
Fixes: 3cf5d076fb ("signal: Remove task parameter from force_sig")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-08-19 06:34:13 -05:00
Miroslav Benes
4ff96fb52c livepatch: Nullify obj->mod in klp_module_coming()'s error path
klp_module_coming() is called for every module appearing in the system.
It sets obj->mod to a patched module for klp_object obj. Unfortunately
it leaves it set even if an error happens later in the function and the
patched module is not allowed to be loaded.

klp_is_object_loaded() uses obj->mod variable and could currently give a
wrong return value. The bug is probably harmless as of now.

Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-08-19 13:03:37 +02:00
Andrea Righi
f1c6ece237 kprobes: Fix potential deadlock in kprobe_optimizer()
lockdep reports the following deadlock scenario:

 WARNING: possible circular locking dependency detected

 kworker/1:1/48 is trying to acquire lock:
 000000008d7a62b2 (text_mutex){+.+.}, at: kprobe_optimizer+0x163/0x290

 but task is already holding lock:
 00000000850b5e2d (module_mutex){+.+.}, at: kprobe_optimizer+0x31/0x290

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #1 (module_mutex){+.+.}:
        __mutex_lock+0xac/0x9f0
        mutex_lock_nested+0x1b/0x20
        set_all_modules_text_rw+0x22/0x90
        ftrace_arch_code_modify_prepare+0x1c/0x20
        ftrace_run_update_code+0xe/0x30
        ftrace_startup_enable+0x2e/0x50
        ftrace_startup+0xa7/0x100
        register_ftrace_function+0x27/0x70
        arm_kprobe+0xb3/0x130
        enable_kprobe+0x83/0xa0
        enable_trace_kprobe.part.0+0x2e/0x80
        kprobe_register+0x6f/0xc0
        perf_trace_event_init+0x16b/0x270
        perf_kprobe_init+0xa7/0xe0
        perf_kprobe_event_init+0x3e/0x70
        perf_try_init_event+0x4a/0x140
        perf_event_alloc+0x93a/0xde0
        __do_sys_perf_event_open+0x19f/0xf30
        __x64_sys_perf_event_open+0x20/0x30
        do_syscall_64+0x65/0x1d0
        entry_SYSCALL_64_after_hwframe+0x49/0xbe

 -> #0 (text_mutex){+.+.}:
        __lock_acquire+0xfcb/0x1b60
        lock_acquire+0xca/0x1d0
        __mutex_lock+0xac/0x9f0
        mutex_lock_nested+0x1b/0x20
        kprobe_optimizer+0x163/0x290
        process_one_work+0x22b/0x560
        worker_thread+0x50/0x3c0
        kthread+0x112/0x150
        ret_from_fork+0x3a/0x50

 other info that might help us debug this:

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(module_mutex);
                                lock(text_mutex);
                                lock(module_mutex);
   lock(text_mutex);

  *** DEADLOCK ***

As a reproducer I've been using bcc's funccount.py
(https://github.com/iovisor/bcc/blob/master/tools/funccount.py),
for example:

 # ./funccount.py '*interrupt*'

That immediately triggers the lockdep splat.

Fix by acquiring text_mutex before module_mutex in kprobe_optimizer().

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: d5b844a2cf ("ftrace/x86: Remove possible deadlock between register_kprobe() and ftrace_run_update_code()")
Link: http://lkml.kernel.org/r/20190812184302.GA7010@xps-13
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-08-19 12:22:19 +02:00
Sebastian Andrzej Siewior
b0fdc01354 sched/core: Schedule new worker even if PI-blocked
If a task is PI-blocked (blocking on sleeping spinlock) then we don't want to
schedule a new kworker if we schedule out due to lock contention because !RT
does not do that as well. A spinning spinlock disables preemption and a worker
does not schedule out on lock contention (but spin).

On RT the RW-semaphore implementation uses an rtmutex so
tsk_is_pi_blocked() will return true if a task blocks on it. In this case we
will now start a new worker which may deadlock if one worker is waiting on
progress from another worker. Since a RW-semaphore starts a new worker on !RT,
we should do the same on RT.

XFS is able to trigger this deadlock.

Allow to schedule new worker if the current worker is PI-blocked.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20190816160626.12742-1-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-08-19 10:57:26 +02:00
Linus Torvalds
5bba5c9c86 SPDX fixes for 5.3-rc5
Here are 4 small SPDX fixes for 5.3-rc5.  A few style fixes for some
 SPDX comments, added an SPDX tag for one file, and fix up some GPL
 boilerplate for another file.
 
 All of these have been in linux-next for a few weeks with no reported
 issues (they are comment changes only, so that's to be expected...)
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXVkVSg8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+yneygCfdBxdIl98qXA2SRDLeKl/PkSJH1gAoLwnkoKq
 WK/gN0IMFf25UrItBsGe
 =b31n
 -----END PGP SIGNATURE-----

Merge tag 'spdx-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx

Pull SPDX fixes from Greg KH:
 "Here are four small SPDX fixes for 5.3-rc5.

  A few style fixes for some SPDX comments, added an SPDX tag for one
  file, and fix up some GPL boilerplate for another file.

  All of these have been in linux-next for a few weeks with no reported
  issues (they are comment changes only, so that's to be expected...)"

* tag 'spdx-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx:
  i2c: stm32: Use the correct style for SPDX License Identifier
  intel_th: Use the correct style for SPDX License Identifier
  coccinelle: api/atomic_as_refcounter: add SPDX License Identifier
  kernel/configs: Replace GPL boilerplate code with SPDX identifier
2019-08-18 09:26:16 -07:00
Björn Töpel
36cc34358c xsk: support BPF_EXIST and BPF_NOEXIST flags in XSKMAP
The XSKMAP did not honor the BPF_EXIST/BPF_NOEXIST flags when updating
an entry. This patch addresses that.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-17 23:24:45 +02:00
Björn Töpel
0402acd683 xsk: remove AF_XDP socket from map when the socket is released
When an AF_XDP socket is released/closed the XSKMAP still holds a
reference to the socket in a "released" state. The socket will still
use the netdev queue resource, and block newly created sockets from
attaching to that queue, but no user application can access the
fill/complete/rx/tx queues. This results in that all applications need
to explicitly clear the map entry from the old "zombie state"
socket. This should be done automatically.

In this patch, the sockets tracks, and have a reference to, which maps
it resides in. When the socket is released, it will remove itself from
all maps.

Suggested-by: Bruce Richardson <bruce.richardson@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-17 23:24:45 +02:00
Stanislav Fomichev
b0e4701ce1 bpf: export bpf_map_inc_not_zero
Rename existing bpf_map_inc_not_zero to __bpf_map_inc_not_zero to
indicate that it's caller's responsibility to do proper locking.
Create and export bpf_map_inc_not_zero wrapper that properly
locks map_idr_lock. Will be used in the next commit to
hold a map while cloning a socket.

Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-17 23:18:54 +02:00
Christoph Hellwig
0d3d343560 genirq: remove the is_affinity_mask_valid hook
This override was only used by the ia64 SGI SN2 platform, which is
gone now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lkml.kernel.org/r/20190813072514.23299-29-hch@lst.de
Signed-off-by: Tony Luck <tony.luck@intel.com>
2019-08-16 14:32:26 -07:00
Linus Torvalds
2d63ba3e41 Power management fixes for 5.3-rc5
- Disable NVMe power optimization related to suspend-to-idle added
    recently on systems where PCIe ASPM is not able to put PCIe links
    into low-power states to prevent excess power from being drawn by
    the system while suspended (Rafael Wysocki).
 
  - Make the schedutil governor handle frequency limits changes
    properly in all cases (Viresh Kumar).
 
  - Prevent the cpufreq core from treating positive values returned
    by dev_pm_qos_update_request() as errors (Viresh Kumar).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl1WqLMSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxCkAP/146AuGXj8tOdHxkpl6DVgm0WVRNxCtL
 Z9Y+1xBRBSYVZkeDsjzox995z8Ha/0tnMp6EPcnxebkFpRx3fyldXKUKxqJARPPi
 n2jGhqCPNcAHK2UPdGH8EvHOI2uWBMBa2jW2Qw9m0V/+9Zy58ZvKqso/+myFkz2S
 YRekJPADsI3GZW1SZ3dY4/12jcKsQt32TWaGOLqKx3R1J1BnpyxduXfqJ6FUrH9b
 P/F9cVb2UEbawh5QpNmfMsfBb/DsE08NQhPWe91m0VgcLd6IZsoNux0Rd8HJOvRM
 +5vh6qPTABnNN1+7blFw64/hCu1N2hq8KLl6DzPeKohysKiDkmLh3QGB+ISRpj+H
 5GKF8gnQFvN0fPJF8NU+eIZ0IaOryrooSu4TeCcAWAozJ0ln2mjNoC2h6U1B8Y29
 UH+e2z+6kVTHwjiTjPacjQ0wnkUctoiT71kMxQ8Q+GFG3fQcz3GFFM17eITnAI/Q
 ws1bPHn1ovxl1GmdQwQK3KnT1cK5/fApaVKQLJiRkUvZ1gCZ3ZcruPlh+qA5zpGf
 +RGPXn/Rm1LA1uCkS4j6REBp6vhcVJoVEVnEGzhovdtJcuJ9erlh5I2zz4UxURnn
 cHH48exFmwC+uBhIyQVuYOYgLU3naztBLFg1/l68sMQFonWjIQ/Hp1B9cIgigwbf
 5+BlT1llvIH3
 =eCcy
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These add a check to avoid recent suspend-to-idle power regression on
  systems with NVMe drives where the PCIe ASPM policy is "performance"
  (or when the kernel is built without ASPM support), fix an issue
  related to frequency limits in the schedutil cpufreq governor and fix
  a mistake related to the PM QoS usage in the cpufreq core introduced
  recently.

  Specifics:

   - Disable NVMe power optimization related to suspend-to-idle added
     recently on systems where PCIe ASPM is not able to put PCIe links
     into low-power states to prevent excess power from being drawn by
     the system while suspended (Rafael Wysocki).

   - Make the schedutil governor handle frequency limits changes
     properly in all cases (Viresh Kumar).

   - Prevent the cpufreq core from treating positive values returned by
     dev_pm_qos_update_request() as errors (Viresh Kumar)"

* tag 'pm-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled
  PCI/ASPM: Add pcie_aspm_enabled()
  cpufreq: schedutil: Don't skip freq update when limits change
  cpufreq: dev_pm_qos_update_request() can return 1 on success
2019-08-16 09:13:16 -07:00
Rafael J. Wysocki
a3ee2477c4 Merge branch 'pm-cpufreq'
* pm-cpufreq:
  cpufreq: schedutil: Don't skip freq update when limits change
  cpufreq: dev_pm_qos_update_request() can return 1 on success
2019-08-16 14:24:51 +02:00
Chuhong Yuan
d30bdfc0ec PM: sleep: Replace strncmp() with str_has_prefix()
Use str_has_prefix() instead of strncmp() which is less
straightforward in decode_state().

Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-08-16 14:18:53 +02:00
Chuhong Yuan
35c35493b0 printk: Replace strncmp() with str_has_prefix()
strncmp(str, const, len) is error-prone because len is easy to have typo.
An example is the hard-coded len has counting error or sizeof(const)
forgets - 1.

So we prefer using newly introduced str_has_prefix() to substitute
such strncmp() to make code better.

Link: http://lkml.kernel.org/r/20190809071034.17279-1-hslester96@gmail.com
Cc: "Steven Rostedt" <rostedt@goodmis.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Reviewed-by:  Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
[pmladek@suse.com: Slightly updated and reformatted the commit message.]
Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-08-16 09:54:08 +02:00
Wei Yongjun
e03250061b btf: fix return value check in btf_vmlinux_init()
In case of error, the function kobject_create_and_add() returns NULL
pointer not ERR_PTR(). The IS_ERR() test in the return value check
should be replaced with NULL test.

Fixes: 341dfcf8d7 ("btf: expose BTF info through sysfs")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-08-15 22:18:17 -07:00
Linus Torvalds
e83b009c5c dma-mapping fixes for 5.3-rc
- fix the handling of the bus_dma_mask in dma_get_required_mask, which
    caused a regression in this merge window (Lucas Stach)
  - fix a regression in the handling of DMA_ATTR_NO_KERNEL_MAPPING (me)
  - fix dma_mmap_coherent to not cause page attribute mismatches on
    coherent architectures like x86 (me)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl1UFhILHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYOjexAAjPKLo4WGBGO1nd0btwXcI9A7jQTQlXrokmorDVzx
 5++GmTUBeEgvUJath5D3qpQTRZXo9Wb9oGMdS5U6bWJB+SbWtErM304t905TJoDM
 Cs7xcB1ZQeG/5OrQ+qGPgQCo6WO1dOl9FpaIptjNm4dn+OYhyO/YA+dgrJDwgkiA
 140RYUWa+Zhq3df4YqP4M4EnezLN1c4uE80wUxVQKDcq59sxCJek0QT0pUAMbdmQ
 /cUd2XSU113o1llmIRUh0Oj6VSEhWKHb+bdb8JfGndLzxvDcXZKl60tikWe6xpy2
 Ue0kkHRk6OPVRIxWkRjt8D+mlrCyNqN6HWx6eBmVnRKHxZ4ia2hYOFuYN9FFLLK+
 kCUlu5P/HUabBedKIxk4rbWITUqcRSviPD2WdnH2RWblvXNSDoSAufYuJ/9IGSoL
 P6a43DVKFesVF/MxeH9Ko8bnxMUO9Zn97GHcQIUplRwaqrnrCEPlvLVf/teswSQG
 C13rTnouZ0FA4z/uV96G6HfGIj87MLe/RovmLCMTeiSKrDpbcO7szP037Km73M+V
 UBmatoYCioVLxBjw3NkxCRc9UpDPdRUu31uVHrAarh4tutUASEWLrb6s9vFlGyED
 zis9IHWtIAYP3VfFtkXdZ7oDlqC/3KdEErHZuT+z4PK3Wj/QtQVfQ8SB79xFMneD
 V2E=
 =Jzmo
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.3-4' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fixes from Christoph Hellwig:

 - fix the handling of the bus_dma_mask in dma_get_required_mask, which
   caused a regression in this merge window (Lucas Stach)

 - fix a regression in the handling of DMA_ATTR_NO_KERNEL_MAPPING (me)

 - fix dma_mmap_coherent to not cause page attribute mismatches on
   coherent architectures like x86 (me)

* tag 'dma-mapping-5.3-4' of git://git.infradead.org/users/hch/dma-mapping:
  dma-mapping: fix page attributes for dma_mmap_*
  dma-direct: don't truncate dma_required_mask to bus addressing capabilities
  dma-direct: fix DMA_ATTR_NO_KERNEL_MAPPING
2019-08-14 10:31:11 -07:00
Jakub Kicinski
708852dcac Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
The following pull-request contains BPF updates for your *net-next* tree.

There is a small merge conflict in libbpf (Cc Andrii so he's in the loop
as well):

        for (i = 1; i <= btf__get_nr_types(btf); i++) {
                t = (struct btf_type *)btf__type_by_id(btf, i);

                if (!has_datasec && btf_is_var(t)) {
                        /* replace VAR with INT */
                        t->info = BTF_INFO_ENC(BTF_KIND_INT, 0, 0);
  <<<<<<< HEAD
                        /*
                         * using size = 1 is the safest choice, 4 will be too
                         * big and cause kernel BTF validation failure if
                         * original variable took less than 4 bytes
                         */
                        t->size = 1;
                        *(int *)(t+1) = BTF_INT_ENC(0, 0, 8);
                } else if (!has_datasec && kind == BTF_KIND_DATASEC) {
  =======
                        t->size = sizeof(int);
                        *(int *)(t + 1) = BTF_INT_ENC(0, 0, 32);
                } else if (!has_datasec && btf_is_datasec(t)) {
  >>>>>>> 72ef80b5ee
                        /* replace DATASEC with STRUCT */

Conflict is between the two commits 1d4126c4e1 ("libbpf: sanitize VAR to
conservative 1-byte INT") and b03bc6853c ("libbpf: convert libbpf code to
use new btf helpers"), so we need to pick the sanitation fixup as well as
use the new btf_is_datasec() helper and the whitespace cleanup. Looks like
the following:

  [...]
                if (!has_datasec && btf_is_var(t)) {
                        /* replace VAR with INT */
                        t->info = BTF_INFO_ENC(BTF_KIND_INT, 0, 0);
                        /*
                         * using size = 1 is the safest choice, 4 will be too
                         * big and cause kernel BTF validation failure if
                         * original variable took less than 4 bytes
                         */
                        t->size = 1;
                        *(int *)(t + 1) = BTF_INT_ENC(0, 0, 8);
                } else if (!has_datasec && btf_is_datasec(t)) {
                        /* replace DATASEC with STRUCT */
  [...]

The main changes are:

1) Addition of core parts of compile once - run everywhere (co-re) effort,
   that is, relocation of fields offsets in libbpf as well as exposure of
   kernel's own BTF via sysfs and loading through libbpf, from Andrii.

   More info on co-re: http://vger.kernel.org/bpfconf2019.html#session-2
   and http://vger.kernel.org/lpc-bpf2018.html#session-2

2) Enable passing input flags to the BPF flow dissector to customize parsing
   and allowing it to stop early similar to the C based one, from Stanislav.

3) Add a BPF helper function that allows generating SYN cookies from XDP and
   tc BPF, from Petar.

4) Add devmap hash-based map type for more flexibility in device lookup for
   redirects, from Toke.

5) Improvements to XDP forwarding sample code now utilizing recently enabled
   devmap lookups, from Jesper.

6) Add support for reporting the effective cgroup progs in bpftool, from Jakub
   and Takshak.

7) Fix reading kernel config from bpftool via /proc/config.gz, from Peter.

8) Fix AF_XDP umem pages mapping for 32 bit architectures, from Ivan.

9) Follow-up to add two more BPF loop tests for the selftest suite, from Alexei.

10) Add perf event output helper also for other skb-based program types, from Allan.

11) Fix a co-re related compilation error in selftests, from Yonghong.
====================

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-08-13 16:24:57 -07:00
Eric Dumazet
cfcdef5e30 rcu: Allow rcu_do_batch() to dynamically adjust batch sizes
Bimodal behavior of rcu_do_batch() is not really suited to Google
applications like gfe servers.

When a process with millions of sockets exits, closing all files
queues two rcu callbacks per socket.

This eventually reaches the point where RCU enters an emergency
mode, where rcu_do_batch() do not return until whole queue is flushed.

Each rcu callback lasts at least 70 nsec, so with millions of
elements, we easily spend more than 100 msec without rescheduling.

Goal of this patch is to avoid the infamous message like following
"need_resched set for > 51999388 ns (52 ticks) without schedule"

We dynamically adjust the number of elements we process, instead
of 10 / INFINITE choices, we use a floor of ~1 % of current entries.

If the number is above 1000, we switch to a time based limit of 3 msec
per batch, adjustable with /sys/module/rcutree/parameters/rcu_resched_ns

Signed-off-by: Eric Dumazet <edumazet@google.com>
[ paulmck: Forward-port and remove debug statements. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:38:24 -07:00
Paul E. McKenney
f48fe4c586 rcu/nocb: Don't wake no-CBs GP kthread if timer posted under overload
When under overload conditions, __call_rcu_nocb_wake() will wake the
no-CBs GP kthread any time the no-CBs CB kthread is asleep or there
are no ready-to-invoke callbacks, but only after a timer delay.  If the
no-CBs GP kthread has a ->nocb_bypass_timer pending, the deferred wakeup
from __call_rcu_nocb_wake() is redundant.  This commit therefore makes
__call_rcu_nocb_wake() avoid posting the redundant deferred wakeup if
->nocb_bypass_timer is pending.  This requires adding a bit of ordering
of timer actions.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:38:24 -07:00
Paul E. McKenney
296181d78d rcu/nocb: Reduce __call_rcu_nocb_wake() leaf rcu_node ->lock contention
Currently, __call_rcu_nocb_wake() advances callbacks each time that it
detects excessive numbers of callbacks, though only if it succeeds in
conditionally acquiring its leaf rcu_node structure's ->lock.  Despite
the conditional acquisition of ->lock, this does increase contention.
This commit therefore avoids advancing callbacks unless there are
callbacks in ->cblist whose grace period has completed and advancing
has not yet been done during this jiffy.

Note that this decision does not take the presence of new callbacks
into account.  That is because on this code path, there will always be
at least one new callback, namely the one we just enqueued.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:38:24 -07:00
Paul E. McKenney
1d5a81c18d rcu/nocb: Reduce nocb_cb_wait() leaf rcu_node ->lock contention
Currently, nocb_cb_wait() advances callbacks on each pass through its
loop, though only if it succeeds in conditionally acquiring its leaf
rcu_node structure's ->lock.  Despite the conditional acquisition of
->lock, this does increase contention.  This commit therefore avoids
advancing callbacks unless there are callbacks in ->cblist whose grace
period has completed.

Note that nocb_cb_wait() doesn't worry about callbacks that have not
yet been assigned a grace period.  The idea is that the only reason for
nocb_cb_wait() to advance callbacks is to allow it to continue invoking
callbacks.  Time will tell whether this is the correct choice.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:38:24 -07:00
Paul E. McKenney
23651d9b96 rcu/nocb: Advance CBs after merge in rcutree_migrate_callbacks()
The rcutree_migrate_callbacks() invokes rcu_advance_cbs() on both the
offlined CPU's ->cblist and that of the surviving CPU, then merges
them.  However, after the merge, and of the offlined CPU's callbacks
that were not ready to be invoked will no longer be associated with a
grace-period number.  This commit therefore invokes rcu_advance_cbs()
one more time on the merged ->cblist in order to assign a grace-period
number to these callbacks.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:38:24 -07:00
Paul E. McKenney
273f034065 rcu/nocb: Avoid synchronous wakeup in __call_rcu_nocb_wake()
When callbacks are in full flow, the common case is waiting for a
grace period, and this grace period will normally take a few jiffies to
complete.  It therefore isn't all that helpful for __call_rcu_nocb_wake()
to do a synchronous wakeup in this case.  This commit therefore turns this
into a timer-based deferred wakeup of the no-CBs grace-period kthread.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:38:24 -07:00
Paul E. McKenney
f7a81b12d6 rcu/nocb: Print no-CBs diagnostics when rcutorture writer unduly delayed
This commit causes locking, sleeping, and callback state to be printed
for no-CBs CPUs when the rcutorture writer is delayed sufficiently for
rcutorture to complain.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:38:24 -07:00
Paul E. McKenney
6aacd88d17 rcu/nocb: EXP Check use and usefulness of ->nocb_lock_contended
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:38:24 -07:00
Paul E. McKenney
d1b222c6be rcu/nocb: Add bypass callback queueing
Use of the rcu_data structure's segmented ->cblist for no-CBs CPUs
takes advantage of unrelated grace periods, thus reducing the memory
footprint in the face of floods of call_rcu() invocations.  However,
the ->cblist field is a more-complex rcu_segcblist structure which must
be protected via locking.  Even though there are only three entities
which can acquire this lock (the CPU invoking call_rcu(), the no-CBs
grace-period kthread, and the no-CBs callbacks kthread), the contention
on this lock is excessive under heavy stress.

This commit therefore greatly reduces contention by provisioning
an rcu_cblist structure field named ->nocb_bypass within the
rcu_data structure.  Each no-CBs CPU is permitted only a limited
number of enqueues onto the ->cblist per jiffy, controlled by a new
nocb_nobypass_lim_per_jiffy kernel boot parameter that defaults to
about 16 enqueues per millisecond (16 * 1000 / HZ).  When that limit is
exceeded, the CPU instead enqueues onto the new ->nocb_bypass.

The ->nocb_bypass is flushed into the ->cblist every jiffy or when
the number of callbacks on ->nocb_bypass exceeds qhimark, whichever
happens first.  During call_rcu() floods, this flushing is carried out
by the CPU during the course of its call_rcu() invocations.  However,
a CPU could simply stop invoking call_rcu() at any time.  The no-CBs
grace-period kthread therefore carries out less-aggressive flushing
(every few jiffies or when the number of callbacks on ->nocb_bypass
exceeds (2 * qhimark), whichever comes first).  This means that the
no-CBs grace-period kthread cannot be permitted to do unbounded waits
while there are callbacks on ->nocb_bypass.  A ->nocb_bypass_timer is
used to provide the needed wakeups.

[ paulmck: Apply Coverity feedback reported by Colin Ian King. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:37:32 -07:00
Paul E. McKenney
eda669a6a2 rcu/nocb: Atomic ->len field in rcu_segcblist structure
Upcoming ->nocb_lock contention-reduction work requires that the
rcu_segcblist structure's ->len field be concurrently manipulated,
but only if there are no-CBs CPUs in the kernel.  This commit
therefore makes this ->len field be an atomic_long_t, but only
in CONFIG_RCU_NOCB_CPU=y kernels.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
faca5c2509 rcu/nocb: Unconditionally advance and wake for excessive CBs
When there are excessive numbers of callbacks, and when either the
corresponding no-CBs callback kthread is asleep or there is no more
ready-to-invoke callbacks, and when least one callback is pending,
__call_rcu_nocb_wake() will advance the callbacks, but refrain from
awakening the corresponding no-CBs grace-period kthread.  However,
because rcu_advance_cbs_nowake() is used, it is possible (if a bit
unlikely) that the needed advancement could not happen due to a grace
period not being in progress.  Plus there will always be at least one
pending callback due to one having just now been enqueued.

This commit therefore attempts to advance callbacks and awakens the
no-CBs grace-period kthread when there are excessive numbers of callbacks
posted and when the no-CBs callback kthread is not in a position to do
anything helpful.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
4fd8c5f153 rcu/nocb: Reduce ->nocb_lock contention with separate ->nocb_gp_lock
The sleep/wakeup of the no-CBs grace-period kthreads is synchronized
using the ->nocb_lock of the first CPU corresponding to that kthread.
This commit provides a separate ->nocb_gp_lock for this purpose, thus
reducing contention on ->nocb_lock.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
523bddd553 rcu/nocb: Reduce contention at no-CBs invocation-done time
Currently, nocb_cb_wait() unconditionally acquires the leaf rcu_node
->lock to advance callbacks when done invoking the previous batch.
It does this while holding ->nocb_lock, which means that contention on
the leaf rcu_node ->lock visits itself on the ->nocb_lock.  This commit
therefore makes this lock acquisition conditional, forgoing callback
advancement when the leaf rcu_node ->lock is not immediately available.
(In this case, the no-CBs grace-period kthread will eventually do any
needed callback advancement.)

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
6608c3a027 rcu/nocb: Reduce contention at no-CBs registry-time CB advancement
Currently, __call_rcu_nocb_wake() conditionally acquires the leaf rcu_node
structure's ->lock, and only afterwards does rcu_advance_cbs_nowake()
check to see if it is possible to advance callbacks without potentially
needing to awaken the grace-period kthread.  Given that the no-awaken
check can be done locklessly, this commit reverses the order, so that
rcu_advance_cbs_nowake() is invoked without holding the leaf rcu_node
structure's ->lock and rcu_advance_cbs_nowake() checks the grace-period
state before conditionally acquiring that lock, thus reducing the number
of needless acquistions of the leaf rcu_node structure's ->lock.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
9fcb09bddd rcu/nocb: Round down for number of no-CBs grace-period kthreads
Currently, when the square root of the number of CPUs is rounded down
by int_sqrt(), this round-down is applied to the number of callback
kthreads per grace-period kthreads.  This makes almost no difference
for large systems, but results in oddities such as three no-CBs
grace-period kthreads for a five-CPU system, which is a bit excessive.
This commit therefore causes the round-down to apply to the number of
no-CBs grace-period kthreads, so that systems with from four to eight
CPUs have only two no-CBs grace period kthreads.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
81c0b3d724 rcu/nocb: Avoid ->nocb_lock capture by corresponding CPU
A given rcu_data structure's ->nocb_lock can be acquired very frequently
by the corresponding CPU and occasionally by the corresponding no-CBs
grace-period and callbacks kthreads.  In particular, these two kthreads
will have frequent gaps between ->nocb_lock acquisitions that are roughly
a grace period in duration.  This means that any excessive ->nocb_lock
contention will be due to the CPU's acquisitions, and this in turn
enables a very naive contention-avoidance strategy to be quite effective.

This commit therefore modifies rcu_nocb_lock() to first
attempt a raw_spin_trylock(), and to atomically increment a
separate ->nocb_lock_contended across a raw_spin_lock().  This new
->nocb_lock_contended field is checked in __call_rcu_nocb_wake() when
interrupts are enabled, with a spin-wait for contending acquisitions
to complete, thus allowing the kthreads a chance to acquire the lock.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
7f36ef82e5 rcu/nocb: Avoid needless wakeups of no-CBs grace-period kthread
Currently, the code provides an extra wakeup for the no-CBs grace-period
kthread if one of its CPUs is generating excessive numbers of callbacks.
But satisfying though it is to wake something up when things are going
south, unless the thing being awakened can actually help solve the
problem, that extra wakeup does nothing but consume additional CPU time,
which is exactly what you don't want during a call_rcu() flood.

This commit therefore avoids doing anything if the corresponding
no-CBs callback kthread is going full tilt.  Otherwise, if advancing
callbacks immediately might help and if the leaf rcu_node structure's
lock is immediately available, this commit invokes a new variant of
rcu_advance_cbs() that advances callbacks only if doing so won't require
awakening the grace-period kthread (not to be confused with any of the
no-CBs grace-period kthreads).

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
ce0a825e40 rcu/nocb: Make __call_rcu_nocb_wake() safe for many callbacks
It might be hard to imagine having more than two billion callbacks
queued on a single CPU's ->cblist, but someone will do it sometime.
This commit therefore makes __call_rcu_nocb_wake() handle this situation
by upgrading local variable "len" from "int" to "long".

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
383e133283 rcu/nocb: Never downgrade ->nocb_defer_wakeup in wake_nocb_gp_defer()
Currently, wake_nocb_gp_defer() simply stores whatever waketype was
passed in, which can result in a RCU_NOCB_WAKE_FORCE being downgraded
to RCU_NOCB_WAKE, which could in turn delay callback processing.
This commit therefore adds a check so that wake_nocb_gp_defer() only
updates ->nocb_defer_wakeup when the update increases the forcefulness,
thus avoiding downgrades.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
aeeacd9d84 rcu/nocb: Enable re-awakening under high callback load
The __call_rcu_nocb_wake() function and its predecessors set
->qlen_last_fqs_check to zero for the first callback and to LONG_MAX / 2
for forced reawakenings.  The former can result in a too-quick reawakening
when there are many callbacks ready to invoke and the latter prevents a
second reawakening.  This commit therefore sets ->qlen_last_fqs_check
to the current number of callbacks in both cases.  While in the area,
this commit also moves both assignments under ->nocb_lock.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
0bd55c6936 rcu/nohz: Turn off tick for offloaded CPUs
Historically, no-CBs CPUs allowed the scheduler-clock tick to be
unconditionally disabled on any transition to idle or nohz_full userspace
execution (see the rcu_needs_cpu() implementations).  Unfortunately,
the checks used by rcu_needs_cpu() are defeated now that no-CBs CPUs
use ->cblist, which might make users of battery-powered devices rather
unhappy.  This commit therefore adds explicit rcu_segcblist_is_offloaded()
checks to return to the historical energy-efficient semantics.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
969974e5c5 rcu/nocb: Suppress uninitialized false-positive in nocb_gp_wait()
Some compilers complain that wait_gp_seq might be used uninitialized
in nocb_gp_wait().  This cannot actually happen because when wait_gp_seq
is uninitialized, needwait_gp must be false, which prevents wait_gp_seq
from being used.  But this analysis is apparently beyond some compilers,
so this commit adds a bogus initialization of wait_gp_seq for the sole
purpose of suppressing the false-positive warning.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
921bb5fad1 rcu/nocb: Use build-time no-CBs check in rcu_pending()
Currently, rcu_pending() invokes rcu_segcblist_is_offloaded() even
in CONFIG_RCU_NOCB_CPU=n kernels, which cannot possibly be offloaded.
Given that rcu_pending() is on a fastpath, it makes sense to check for
CONFIG_RCU_NOCB_CPU=y before invoking rcu_segcblist_is_offloaded().
This commit therefore makes this change.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
c1ab99d66e rcu/nocb: Use build-time no-CBs check in rcu_core()
Currently, rcu_core() invokes rcu_segcblist_is_offloaded() each time it
needs to know whether the current CPU is a no-CBs CPU.  Given that it is
not possible to change the no-CBs status of a CPU after boot, and given
that it is not possible to even have no-CBs CPUs in CONFIG_RCU_NOCB_CPU=n
kernels, this repeated runtime invocation wastes CPU.  This commit
therefore created a const on-stack variable to allow this check to be
done only once per rcu_core() invocation.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
ec5ef87bac rcu/nocb: Use build-time no-CBs check in rcu_do_batch()
Currently, rcu_do_batch() invokes rcu_segcblist_is_offloaded() each time
it needs to know whether the current CPU is a no-CBs CPU.  Given that it
is not possible to change the no-CBs status of a CPU after boot, and given
that it is not possible to even have no-CBs CPUs in CONFIG_RCU_NOCB_CPU=n
kernels, this per-callback invocation wastes CPU.  This commit therefore
created a const on-stack variable to allow this check to be done only
once per rcu_do_batch() invocation.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
4f9c1bc727 rcu/nocb: Remove obsolete nocb_gp_head and nocb_gp_tail fields
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
2a777de757 rcu/nocb: Remove obsolete nocb_cb_tail and nocb_cb_head fields
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
c035280f17 rcu/nocb: Remove obsolete nocb_q_count and nocb_q_count_lazy fields
This commit removes the obsolete nocb_q_count and nocb_q_count_lazy
fields, also removing rcu_get_n_cbs_nocb_cpu(), adjusting
rcu_get_n_cbs_cpu(), and making rcutree_migrate_callbacks() once again
disable the ->cblist fields of offline CPUs.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
e7f4c5b399 rcu/nocb: Remove obsolete nocb_head and nocb_tail fields
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
5d6742b377 rcu/nocb: Use rcu_segcblist for no-CBs CPUs
Currently the RCU callbacks for no-CBs CPUs are queued on a series of
ad-hoc linked lists, which means that these callbacks cannot benefit
from "drive-by" grace periods, thus suffering needless delays prior
to invocation.  In addition, the no-CBs grace-period kthreads first
wait for callbacks to appear and later wait for a new grace period,
which means that callbacks appearing during a grace-period wait can
be delayed.  These delays increase memory footprint, and could even
result in an out-of-memory condition.

This commit therefore enqueues RCU callbacks from no-CBs CPUs on the
rcu_segcblist structure that is already used by non-no-CBs CPUs.  It also
restructures the no-CBs grace-period kthread to be checking for incoming
callbacks while waiting for grace periods.  Also, instead of waiting
for a new grace period, it waits for the closest grace period that will
cause some of the callbacks to be safe to invoke.  All of these changes
reduce callback latency and thus the number of outstanding callbacks,
in turn reducing the probability of an out-of-memory condition.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
e83e73f5b0 rcu/nocb: Leave ->cblist enabled for no-CBs CPUs
As a first step towards making no-CBs CPUs use the ->cblist, this commit
leaves the ->cblist enabled for these CPUs.  The main reason to make
no-CBs CPUs use ->cblist is to take advantage of callback numbering,
which will reduce the effects of missed grace periods which in turn will
reduce forward-progress problems for no-CBs CPUs.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
e6060b41c9 rcu/nocb: Allow lockless use of rcu_segcblist_empty()
Currently, rcu_segcblist_empty() assumes that the callback list is not
being changed by other CPUs, but upcoming changes will require it to
operate locklessly.  This commit therefore adds the needed READ_ONCE()
call, along with the WRITE_ONCE() calls when updating the callback list's
->head field.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
76c6927c3e rcu/nocb: Allow lockless use of rcu_segcblist_restempty()
Currently, rcu_segcblist_restempty() assumes that the callback list
is not being changed by other CPUs, but upcoming changes will require
it to operate locklessly.  This commit therefore adds the needed
READ_ONCE() calls, along with the WRITE_ONCE() calls when updating
the callback list.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
ca5c825808 rcu/nocb: Remove deferred wakeup checks for extended quiescent states
The idea behind the checks for extended quiescent states at the end of
__call_rcu_nocb() is to handle cases where call_rcu() is invoked directly
from within an extended quiescent state, for example, from the idle loop.
However, this will result in a timer-mediated deferred wakeup, which
will cause the needed wakeup to happen within a jiffy or thereabouts.
There should be no forward-progress concerns, and if there are, the proper
response is to exit the extended quiescent state while executing the
endless blast of call_rcu() invocations, for example, using RCU_NONIDLE().
Given the more realistic case of an isolated call_rcu() invocation, there
should be no problem.

This commit therefore removes the checks for invoking call_rcu() within
an extended quiescent state for on no-CBs CPUs.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
85f69b3212 rcu/nocb: Check for deferred nocb wakeups before nohz_full early exit
In theory, a timer is used to defer wakeups of no-CBs grace-period
kthreads when the wakeup cannot be done safely directly from the
call_rcu().  In practice, the one-jiffy delay is not always consistent
with timely callback invocation under heavy call_rcu() loads.  Therefore,
there are a number of checks for a pending deferred wakeup, including
from the scheduling-clock interrupt.  Unfortunately, this check follows
the rcu_nohz_full_cpu() early exit, which renders it useless on such CPUs.

This commit therefore moves the check for the pending deferred no-CB
wakeup to precede the rcu_nohz_full_cpu() early exit.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
c00045be32 rcu/nocb: Make rcutree_migrate_callbacks() start at leaf rcu_node structure
Because rcutree_migrate_callbacks() is invoked infrequently and because
an exact snapshot of the grace-period state might save some callbacks a
second trip through a grace period, this function has used the root
rcu_node structure.  However, this safe-second-trip optimization
happens only if rcutree_migrate_callbacks() races with grace-period
initialization, so it is not worth the added mental load.  This commit
therefore makes rcutree_migrate_callbacks() start with the leaf rcu_node
structures, as is done elsewhere.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
750d7f6a43 rcu/nocb: Add checks for offloaded callback processing
This commit is a preparatory patch for offloaded callbacks using the
same ->cblist structure used by non-offloaded callbacks.  It therefore
adds rcu_segcblist_is_offloaded() calls where they will be needed when
!rcu_segcblist_is_enabled() no longer flags the offloaded case.  It also
adds checks in rcu_do_batch() to ensure that there are no missed checks:
Currently, it should not be possible for offloaded execution to reach
rcu_do_batch(), though this will change later in this series.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
ce5215c134 rcu/nocb: Use separate flag to indicate offloaded ->cblist
RCU callback processing currently uses rcu_is_nocb_cpu() to determine
whether or not the current CPU's callbacks are to be offloaded.
This works, but it is not so good for cache locality.  Plus use of
->cblist for offloaded callbacks will greatly increase the frequency
of these checks.  This commit therefore adds a ->offloaded flag to the
rcu_segcblist structure to provide a more flexible and cache-friendly
means of checking for callback offloading.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
1bb5f9b95a rcu/nocb: Use separate flag to indicate disabled ->cblist
NULLing the RCU_NEXT_TAIL pointer was a clever way to save a byte, but
forward-progress considerations would require that this pointer be both
NULL and non-NULL, which, absent a quantum-computer port of the Linux
kernel, simply won't happen.  This commit therefore creates as separate
->enabled flag to replace the current NULL checks.

[ paulmck: Add include files per 0day test robot and -next. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:34:50 -07:00
Paul E. McKenney
18cd8c93e6 rcu/nocb: Print gp/cb kthread hierarchy if dump_tree
This commit causes the no-CBs grace-period/callback hierarchy to be
printed to the console when the dump_tree kernel boot parameter is set.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:32:39 -07:00
Paul E. McKenney
f7c612b000 rcu/nocb: Rename rcu_nocb_leader_stride kernel boot parameter
This commit changes the name of the rcu_nocb_leader_stride kernel
boot parameter to rcu_nocb_gp_stride in order to account for the new
distinction between callback and grace-period no-CBs kthreads.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:32:39 -07:00
Paul E. McKenney
f7c9a9b664 rcu/nocb: Rename and document no-CB CB kthread sleep trace event
The nocb_cb_wait() function traces a "FollowerSleep" trace_rcu_nocb_wake()
event, which never was documented and is now misleading.  This commit
therefore changes "FollowerSleep" to "CBSleep", documents this, and
updates the documentation for "Sleep" as well.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:32:39 -07:00
Paul E. McKenney
0bdc33daef rcu/nocb: Rename rcu_organize_nocb_kthreads() local variable
This commit renames rdp_leader to rdp_gp in order to account for the
new distinction between callback and grace-period no-CBs kthreads.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:32:39 -07:00
Paul E. McKenney
0d52a6652f rcu/nocb: Rename wake_nocb_leader_defer() to wake_nocb_gp_defer()
This commit adjusts naming to account for the new distinction between
callback and grace-period no-CBs kthreads.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:32:39 -07:00
Paul E. McKenney
5f675ba6eb rcu/nocb: Rename __wake_nocb_leader() to __wake_nocb_gp()
This commit adjusts naming to account for the new distinction between
callback and grace-period no-CBs kthreads.  While in the area, it also
updates local variables.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:32:39 -07:00
Paul E. McKenney
5d62c08c5f rcu/nocb: Rename wake_nocb_leader() to wake_nocb_gp()
This commit adjusts naming to account for the new distinction between
callback and grace-period no-CBs kthreads.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:32:39 -07:00
Paul E. McKenney
9fa471a881 rcu/nocb: Rename nocb_follower_wait() to nocb_cb_wait()
This commit adjusts naming to account for the new distinction between
callback and grace-period no-CBs kthreads.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:32:39 -07:00
Paul E. McKenney
12f54c3a84 rcu/nocb: Provide separate no-CBs grace-period kthreads
Currently, there is one no-CBs rcuo kthread per CPU, and these kthreads
are divided into groups.  The first rcuo kthread to come online in a
given group is that group's leader, and the leader both waits for grace
periods and invokes its CPU's callbacks.  The non-leader rcuo kthreads
only invoke callbacks.

This works well in the real-time/embedded environments for which it was
intended because such environments tend not to generate all that many
callbacks.  However, given huge floods of callbacks, it is possible for
the leader kthread to be stuck invoking callbacks while its followers
wait helplessly while their callbacks pile up.  This is a good recipe
for an OOM, and rcutorture's new callback-flood capability does generate
such OOMs.

One strategy would be to wait until such OOMs start happening in
production, but similar OOMs have in fact happened starting in 2018.
It would therefore be wise to take a more proactive approach.

This commit therefore features per-CPU rcuo kthreads that do nothing
but invoke callbacks.  Instead of having one of these kthreads act as
leader, each group has a separate rcog kthread that handles grace periods
for its group.  Because these rcuog kthreads do not invoke callbacks,
callback floods on one CPU no longer block callbacks from reaching the
rcuc callback-invocation kthreads on other CPUs.

This change does introduce additional kthreads, however:

1.	The number of additional kthreads is about the square root of
	the number of CPUs, so that a 4096-CPU system would have only
	about 64 additional kthreads.  Note that recent changes
	decreased the number of rcuo kthreads by a factor of two
	(CONFIG_PREEMPT=n) or even three (CONFIG_PREEMPT=y), so
	this still represents a significant improvement on most systems.

2.	The leading "rcuo" of the rcuog kthreads should allow existing
	scripting to affinity these additional kthreads as needed, the
	same as for the rcuop and rcuos kthreads.  (There are no longer
	any rcuob kthreads.)

3.	A state-machine approach was considered and rejected.  Although
	this would allow the rcuo kthreads to continue their dual
	leader/follower roles, it complicates callback invocation
	and makes it more difficult to consolidate rcuo callback
	invocation with existing softirq callback invocation.

The introduction of rcuog kthreads should thus be acceptable.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:32:39 -07:00
Paul E. McKenney
6484fe54b5 rcu/nocb: Update comments to prepare for forward-progress work
This commit simply rewords comments to prepare for leader nocb kthreads
doing only grace-period work and callback shuffling.  This will mean
the addition of replacement kthreads to invoke callbacks.  The "leader"
and "follower" thus become less meaningful, so the commit changes no-CB
comments with these strings to "GP" and "CB", respectively.  (Give or
take the usual grammatical transformations.)

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:32:39 -07:00
Paul E. McKenney
58bf6f77c6 rcu/nocb: Rename rcu_data fields to prepare for forward-progress work
This commit simply renames rcu_data fields to prepare for leader
nocb kthreads doing only grace-period work and callback shuffling.
This will mean the addition of replacement kthreads to invoke callbacks.
The "leader" and "follower" thus become less meaningful, so the commit
changes no-CB fields with these strings to "gp" and "cb", respectively.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:32:39 -07:00
Paul E. McKenney
31da067023 Merge branches 'consolidate.2019.08.01b', 'fixes.2019.08.12a', 'lists.2019.08.13a' and 'torture.2019.08.01b' into HEAD
consolidate.2019.08.01b: Further consolidation cleanups
fixes.2019.08.12a: Miscellaneous fixes
lists.2019.08.13a: Optional lockdep arguments for RCU list macros
torture.2019.08.01b: Torture-test updates
2019-08-13 14:30:30 -07:00
Andrii Nakryiko
7fd785685e btf: rename /sys/kernel/btf/kernel into /sys/kernel/btf/vmlinux
Expose kernel's BTF under the name vmlinux to be more uniform with using
kernel module names as file names in the future.

Fixes: 341dfcf8d7 ("btf: expose BTF info through sysfs")
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-13 23:19:42 +02:00
Andrii Nakryiko
341dfcf8d7 btf: expose BTF info through sysfs
Make .BTF section allocated and expose its contents through sysfs.

/sys/kernel/btf directory is created to contain all the BTFs present
inside kernel. Currently there is only kernel's main BTF, represented as
/sys/kernel/btf/kernel file. Once kernel modules' BTFs are supported,
each module will expose its BTF as /sys/kernel/btf/<module-name> file.

Current approach relies on a few pieces coming together:
1. pahole is used to take almost final vmlinux image (modulo .BTF and
   kallsyms) and generate .BTF section by converting DWARF info into
   BTF. This section is not allocated and not mapped to any segment,
   though, so is not yet accessible from inside kernel at runtime.
2. objcopy dumps .BTF contents into binary file and subsequently
   convert binary file into linkable object file with automatically
   generated symbols _binary__btf_kernel_bin_start and
   _binary__btf_kernel_bin_end, pointing to start and end, respectively,
   of BTF raw data.
3. final vmlinux image is generated by linking this object file (and
   kallsyms, if necessary). sysfs_btf.c then creates
   /sys/kernel/btf/kernel file and exposes embedded BTF contents through
   it. This allows, e.g., libbpf and bpftool access BTF info at
   well-known location, without resorting to searching for vmlinux image
   on disk (location of which is not standardized and vmlinux image
   might not be even available in some scenarios, e.g., inside qemu
   during testing).

Alternative approach using .incbin assembler directive to embed BTF
contents directly was attempted but didn't work, because sysfs_proc.o is
not re-compiled during link-vmlinux.sh stage. This is required, though,
to update embedded BTF data (initially empty data is embedded, then
pahole generates BTF info and we need to regenerate sysfs_btf.o with
updated contents, but it's too late at that point).

If BTF couldn't be generated due to missing or too old pahole,
sysfs_btf.c handles that gracefully by detecting that
_binary__btf_kernel_bin_start (weak symbol) is 0 and not creating
/sys/kernel/btf at all.

v2->v3:
- added Documentation/ABI/testing/sysfs-kernel-btf (Greg K-H);
- created proper kobject (btf_kobj) for btf directory (Greg K-H);
- undo v2 change of reusing vmlinux, as it causes extra kallsyms pass
  due to initially missing  __binary__btf_kernel_bin_{start/end} symbols;

v1->v2:
- allow kallsyms stage to re-use vmlinux generated by gen_btf();

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-13 16:14:15 +02:00
Mukesh Ojha
511b44f759 rcu: Fix spelling mistake "greate"->"great"
This commit fixes a spelling mistake in file tree_exp.h.

Signed-off-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-12 11:25:06 -07:00
Paul E. McKenney
b823cafa75 rcu: Remove redundant "if" condition from rcu_gp_is_expedited()
Because rcu_expedited_nesting is initialized to 1 and not decremented
until just before init is spawned, rcu_expedited_nesting is guaranteed
to be non-zero whenever rcu_scheduler_active == RCU_SCHEDULER_INIT.
This commit therefore removes this redundant "if" equality test.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2019-08-12 11:25:06 -07:00
Peter Zijlstra
e78a7614f3 idle: Prevent late-arriving interrupts from disrupting offline
Scheduling-clock interrupts can arrive late in the CPU-offline process,
after idle entry and the subsequent call to cpuhp_report_idle_dead().
Once execution passes the call to rcu_report_dead(), RCU is ignoring
the CPU, which results in lockdep complaints when the interrupt handler
uses RCU:

------------------------------------------------------------------------

=============================
WARNING: suspicious RCU usage
5.2.0-rc1+ #681 Not tainted
-----------------------------
kernel/sched/fair.c:9542 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

RCU used illegally from offline CPU!
rcu_scheduler_active = 2, debug_locks = 1
no locks held by swapper/5/0.

stack backtrace:
CPU: 5 PID: 0 Comm: swapper/5 Not tainted 5.2.0-rc1+ #681
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
Call Trace:
 <IRQ>
 dump_stack+0x5e/0x8b
 trigger_load_balance+0xa8/0x390
 ? tick_sched_do_timer+0x60/0x60
 update_process_times+0x3b/0x50
 tick_sched_handle+0x2f/0x40
 tick_sched_timer+0x32/0x70
 __hrtimer_run_queues+0xd3/0x3b0
 hrtimer_interrupt+0x11d/0x270
 ? sched_clock_local+0xc/0x74
 smp_apic_timer_interrupt+0x79/0x200
 apic_timer_interrupt+0xf/0x20
 </IRQ>
RIP: 0010:delay_tsc+0x22/0x50
Code: ff 0f 1f 80 00 00 00 00 65 44 8b 05 18 a7 11 48 0f ae e8 0f 31 48 89 d6 48 c1 e6 20 48 09 c6 eb 0e f3 90 65 8b 05 fe a6 11 48 <41> 39 c0 75 18 0f ae e8 0f 31 48 c1 e2 20 48 09 c2 48 89 d0 48 29
RSP: 0000:ffff8f92c0157ed0 EFLAGS: 00000212 ORIG_RAX: ffffffffffffff13
RAX: 0000000000000005 RBX: ffff8c861f356400 RCX: ffff8f92c0157e64
RDX: 000000321214c8cc RSI: 00000032120daa7f RDI: 0000000000260f15
RBP: 0000000000000005 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
R13: 0000000000000000 R14: ffff8c861ee18000 R15: ffff8c861ee18000
 cpuhp_report_idle_dead+0x31/0x60
 do_idle+0x1d5/0x200
 ? _raw_spin_unlock_irqrestore+0x2d/0x40
 cpu_startup_entry+0x14/0x20
 start_secondary+0x151/0x170
 secondary_startup_64+0xa4/0xb0

------------------------------------------------------------------------

This happens rarely, but can be forced by happen more often by
placing delays in cpuhp_report_idle_dead() following the call to
rcu_report_dead().  With this in place, the following rcutorture
scenario reproduces the problem within a few minutes:

tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 8 --duration 5 --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y" --configs "TREE04"

This commit uses the crude but effective expedient of moving the disabling
of interrupts within the idle loop to precede the cpu_is_offline()
check.  It also invokes tick_nohz_idle_stop_tick() instead of
tick_nohz_idle_stop_tick_protected() to shut off the scheduling-clock
interrupt.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
[ paulmck: Revert tick_nohz_idle_stop_tick_protected() removal, new callers. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-12 11:23:56 -07:00
Christoph Hellwig
4189ff2348 kernel: only define task_struct_whitelist conditionally
If CONFIG_ARCH_TASK_STRUCT_ALLOCATOR is set task_struct_whitelist is
never called, and thus generates a compiler warning.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lkml.kernel.org/r/20190812065524.19959-5-hch@lst.de
Signed-off-by: Tony Luck <tony.luck@intel.com>
2019-08-12 09:53:28 -07:00
Phil Auld
a46d14eca7 sched/fair: Use rq_lock/unlock in online_fair_sched_group
Enabling WARN_DOUBLE_CLOCK in /sys/kernel/debug/sched_features causes
warning to fire in update_rq_clock. This seems to be caused by onlining
a new fair sched group not using the rq lock wrappers.

  [] rq->clock_update_flags & RQCF_UPDATED
  [] WARNING: CPU: 5 PID: 54385 at kernel/sched/core.c:210 update_rq_clock+0xec/0x150

  [] Call Trace:
  []  online_fair_sched_group+0x53/0x100
  []  cpu_cgroup_css_online+0x16/0x20
  []  online_css+0x1c/0x60
  []  cgroup_apply_control_enable+0x231/0x3b0
  []  cgroup_mkdir+0x41b/0x530
  []  kernfs_iop_mkdir+0x61/0xa0
  []  vfs_mkdir+0x108/0x1a0
  []  do_mkdirat+0x77/0xe0
  []  do_syscall_64+0x55/0x1d0
  []  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Using the wrappers in online_fair_sched_group instead of the raw locking
removes this warning.

[ tglx: Use rq_*lock_irq() ]

Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20190801133749.11033-1-pauld@redhat.com
2019-08-12 14:45:34 +02:00
Linus Torvalds
dcbb4a1539 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
 "Three fixlets for the scheduler:

   - Avoid double bandwidth accounting in the push & pull code

   - Use a sane FIFO priority for the Pressure Stall Information (PSI)
     thread.

   - Avoid permission checks when setting the scheduler params for the
     PSI thread"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/psi: Do not require setsched permission from the trigger creator
  sched/psi: Reduce psimon FIFO priority
  sched/deadline: Fix double accounting of rq/running bw in push & pull
2019-08-10 15:48:02 -07:00
Christoph Hellwig
33dcb37cef dma-mapping: fix page attributes for dma_mmap_*
All the way back to introducing dma_common_mmap we've defaulted to mark
the pages as uncached.  But this is wrong for DMA coherent devices.
Later on DMA_ATTR_WRITE_COMBINE also got incorrect treatment as that
flag is only treated special on the alloc side for non-coherent devices.

Introduce a new dma_pgprot helper that deals with the check for coherent
devices so that only the remapping cases ever reach arch_dma_mmap_pgprot
and we thus ensure no aliasing of page attributes happens, which makes
the powerpc version of arch_dma_mmap_pgprot obsolete and simplifies the
remaining ones.

Note that this means arch_dma_mmap_pgprot is a bit misnamed now, but
we'll phase it out soon.

Fixes: 64ccc9c033 ("common: dma-mapping: add support for generic dma_mmap_* calls")
Reported-by: Shawn Anastasio <shawn@anastas.io>
Reported-by: Gavin Li <git@thegavinli.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com> # arm64
2019-08-10 19:52:45 +02:00
Lucas Stach
d8ad55538a dma-direct: don't truncate dma_required_mask to bus addressing capabilities
The dma required_mask needs to reflect the actual addressing capabilities
needed to handle the whole system RAM. When truncated down to the bus
addressing capabilities dma_addressing_limited() will incorrectly signal
no limitations for devices which are restricted by the bus_dma_mask.

Fixes: b4ebe60632 (dma-direct: implement complete bus_dma_mask handling)
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Tested-by: Atish Patra <atish.patra@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-08-10 19:52:45 +02:00
Christoph Hellwig
cf14be0b41 dma-direct: fix DMA_ATTR_NO_KERNEL_MAPPING
The new DMA_ATTR_NO_KERNEL_MAPPING needs to actually assign
a dma_addr to work.  Also skip it if the architecture needs
forced decryption handling, as that needs a kernel virtual
address.

Fixes: d98849aff8 (dma-direct: handle DMA_ATTR_NO_KERNEL_MAPPING in common code)
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lucas Stach <l.stach@pengutronix.de>
2019-08-10 19:52:45 +02:00
Viresh Kumar
600f5badb7 cpufreq: schedutil: Don't skip freq update when limits change
To avoid reducing the frequency of a CPU prematurely, we skip reducing
the frequency if the CPU had been busy recently.

This should not be done when the limits of the policy are changed, for
example due to thermal throttling. We should always get the frequency
within the new limits as soon as possible.

Trying to fix this by using only one flag, i.e. need_freq_update, can
lead to a race condition where the flag gets cleared without forcing us
to change the frequency at least once. And so this patch introduces
another flag to avoid that race condition.

Fixes: ecd2884291 ("cpufreq: schedutil: Don't set next_freq to UINT_MAX")
Cc: v4.18+ <stable@vger.kernel.org> # v4.18+
Reported-by: Doug Smythies <dsmythies@telus.net>
Tested-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-08-10 13:53:19 +02:00
Rafael J. Wysocki
11f26633cc PM: suspend: Fix platform_suspend_prepare_noirq()
After commit ac9eafbe93 ("ACPI: PM: s2idle: Execute LPS0 _DSM
functions with suspended devices"), a NULL pointer may be dereferenced
if suspend-to-idle is attempted on a platform without "traditional"
suspend support due to invalid fall-through in
platform_suspend_prepare_noirq().

Fix that and while at it add missing braces in platform_resume_noirq().

Fixes: ac9eafbe93 ("ACPI: PM: s2idle: Execute LPS0 _DSM functions with suspended devices")
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-08-10 13:18:06 +02:00
Joel Fernandes (Google)
28875945ba rcu: Add support for consolidated-RCU reader checking
This commit adds RCU-reader checks to list_for_each_entry_rcu() and
hlist_for_each_entry_rcu().  These checks are optional, and are indicated
by a lockdep expression passed to a new optional argument to these two
macros.  If this optional lockdep expression is omitted, these two macros
act as before, checking for an RCU read-side critical section.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
[ paulmck: Update to eliminate return within macro and update comment. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-09 11:00:35 -07:00
Thiago Jung Bauermann
e740815a97 dma-mapping: Remove dma_check_mask()
sme_active() is an x86-specific function so it's better not to call it from
generic code. Christoph Hellwig mentioned that "There is no reason why we
should have a special debug printk just for one specific reason why there
is a requirement for a large DMA mask.", so just remove dma_check_mask().

Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190806044919.10622-4-bauerman@linux.ibm.com
2019-08-09 22:52:07 +10:00
Thiago Jung Bauermann
47e5d8f9ed swiotlb: Remove call to sme_active()
sme_active() is an x86-specific function so it's better not to call it from
generic code.

There's no need to mention which memory encryption feature is active, so
just use a more generic message. Besides, other architectures will have
different names for similar technology.

Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190806044919.10622-3-bauerman@linux.ibm.com
2019-08-09 22:52:06 +10:00
Daniel Jordan
ec9c7d1933 padata: initialize pd->cpu with effective cpumask
Exercising CPU hotplug on a 5.2 kernel with recent padata fixes from
cryptodev-2.6.git in an 8-CPU kvm guest...

    # modprobe tcrypt alg="pcrypt(rfc4106(gcm(aes)))" type=3
    # echo 0 > /sys/devices/system/cpu/cpu1/online
    # echo c > /sys/kernel/pcrypt/pencrypt/parallel_cpumask
    # modprobe tcrypt mode=215

...caused the following crash:

    BUG: kernel NULL pointer dereference, address: 0000000000000000
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 0 P4D 0
    Oops: 0000 [#1] SMP PTI
    CPU: 2 PID: 134 Comm: kworker/2:2 Not tainted 5.2.0-padata-base+ #7
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-<snip>
    Workqueue: pencrypt padata_parallel_worker
    RIP: 0010:padata_reorder+0xcb/0x180
    ...
    Call Trace:
     padata_do_serial+0x57/0x60
     pcrypt_aead_enc+0x3a/0x50 [pcrypt]
     padata_parallel_worker+0x9b/0xe0
     process_one_work+0x1b5/0x3f0
     worker_thread+0x4a/0x3c0
     ...

In padata_alloc_pd, pd->cpu is set using the user-supplied cpumask
instead of the effective cpumask, and in this case cpumask_first picked
an offline CPU.

The offline CPU's reorder->list.next is NULL in padata_reorder because
the list wasn't initialized in padata_init_pqueues, which only operates
on CPUs in the effective mask.

Fix by using the effective mask in padata_alloc_pd.

Fixes: 6fc4dbcf02 ("padata: Replace delayed timer with immediate workqueue in padata_reorder")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-08-09 15:13:52 +10:00
Rafael J. Wysocki
ac9eafbe93 ACPI: PM: s2idle: Execute LPS0 _DSM functions with suspended devices
According to Section 3.5 of the "Intel Low Power S0 Idle" document [1],
Function 5 of the LPS0 _DSM is expected to be invoked when the system
configuration matches the criteria for entering the target low-power
state of the platform.  In particular, this means that all devices
should be suspended and in low-power states already when that function
is invoked.

This is not the case currently, however, because Function 5 of the
LPS0 _DSM is invoked by it before the "noirq" phase of device suspend,
which means that some devices may not have been put into low-power
states yet at that point.  That is a consequence of the previous
design of the suspend-to-idle flow that allowed the "noirq" phase of
device suspend and the "noirq" phase of device resume to be carried
out for multiple times while "suspended" (if any spurious wakeup
events were detected) and the point of the LPS0 _DSM Function 5
invocation was chosen so as to call it (and LPS0 _DSM Function 6
analogously) once per suspend-resume cycle (regardless of how many
times the "noirq" phases of device suspend and resume were carried
out while "suspended").

Now that the suspend-to-idle flow has been redesigned to carry out
the "noirq" phases of device suspend and resume once in each cycle,
the code can be reordered to follow the specification that it is
based on more closely.

For this purpose, add ->prepare_late and ->restore_early platform
callbacks for suspend-to-idle, to be executed, respectively, after
the "noirq" phase of suspending devices and before the "noirq"
phase of resuming them and make ACPI use them for the invocation
of LPS0 _DSM functions as appropriate.

While at it, move the LPS0 entry requirements check to be made
before invoking Functions 3 and 5 of the LPS0 _DSM (also once
per cycle) as follows from the specification [1].

Link: https://uefi.org/sites/default/files/resources/Intel_ACPI_Low_Power_S0_Idle.pdf # [1]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
2019-08-08 11:26:01 +02:00
Qais Yousef
5c3ceef9ad cpufreq: schedutil: fix equation in comment
scale_irq_capacity() call in schedutil_cpu_util() does

	util *= (max - irq)
	util /= max

But the comment says

	util *= (1 - irq)
	util /= max

Fix the comment to match what the scaling function does.

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "Rafael J . Wysocki" <rjw@rjwysocki.net>
Link: https://lkml.kernel.org/r/20190802104628.8410-1-qais.yousef@arm.com
2019-08-08 09:09:31 +02:00
Peter Zijlstra
67692435c4 sched: Rework pick_next_task() slow-path
Avoid the RETRY_TASK case in the pick_next_task() slow path.

By doing the put_prev_task() early, we get the rt/deadline pull done,
and by testing rq->nr_running we know if we need newidle_balance().

This then gives a stable state to pick a task from.

Since the fast-path is fair only; it means the other classes will
always have pick_next_task(.prev=NULL, .rf=NULL) and we can simplify.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aaron Lu <aaron.lwe@gmail.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: mingo@kernel.org
Cc: Phil Auld <pauld@redhat.com>
Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
Link: https://lkml.kernel.org/r/aa34d24b36547139248f32a30138791ac6c02bd6.1559129225.git.vpillai@digitalocean.com
2019-08-08 09:09:31 +02:00
Peter Zijlstra
5f2a45fc9e sched: Allow put_prev_task() to drop rq->lock
Currently the pick_next_task() loop is convoluted and ugly because of
how it can drop the rq->lock and needs to restart the picking.

For the RT/Deadline classes, it is put_prev_task() where we do
balancing, and we could do this before the picking loop. Make this
possible.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Aaron Lu <aaron.lwe@gmail.com>
Cc: mingo@kernel.org
Cc: Phil Auld <pauld@redhat.com>
Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
Link: https://lkml.kernel.org/r/e4519f6850477ab7f3d257062796e6425ee4ba7c.1559129225.git.vpillai@digitalocean.com
2019-08-08 09:09:31 +02:00
Peter Zijlstra
5ba553eff0 sched/fair: Expose newidle_balance()
For pick_next_task_fair() it is the newidle balance that requires
dropping the rq->lock; provided we do put_prev_task() early, we can
also detect the condition for doing newidle early.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aaron Lu <aaron.lwe@gmail.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: mingo@kernel.org
Cc: Phil Auld <pauld@redhat.com>
Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
Link: https://lkml.kernel.org/r/9e3eb1859b946f03d7e500453a885725b68957ba.1559129225.git.vpillai@digitalocean.com
2019-08-08 09:09:31 +02:00
Peter Zijlstra
03b7fad167 sched: Add task_struct pointer to sched_class::set_curr_task
In preparation of further separating pick_next_task() and
set_curr_task() we have to pass the actual task into it, while there,
rename the thing to better pair with put_prev_task().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aaron Lu <aaron.lwe@gmail.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: mingo@kernel.org
Cc: Phil Auld <pauld@redhat.com>
Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
Link: https://lkml.kernel.org/r/a96d1bcdd716db4a4c5da2fece647a1456c0ed78.1559129225.git.vpillai@digitalocean.com
2019-08-08 09:09:31 +02:00
Peter Zijlstra
10e7071b2f sched: Rework CPU hotplug task selection
The CPU hotplug task selection is the only place where we used
put_prev_task() on a task that is not current. While looking at that,
it occured to me that we can simplify all that by by using a custom
pick loop.

Since we don't need to put current, we can do away with the fake task
too.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aaron Lu <aaron.lwe@gmail.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: mingo@kernel.org
Cc: Phil Auld <pauld@redhat.com>
Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
2019-08-08 09:09:30 +02:00
Peter Zijlstra
f95d4eaee6 sched/{rt,deadline}: Fix set_next_task vs pick_next_task
Because pick_next_task() implies set_curr_task() and some of the
details haven't mattered too much, some of what _should_ be in
set_curr_task() ended up in pick_next_task, correct this.

This prepares the way for a pick_next_task() variant that does not
affect the current state; allowing remote picking.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aaron Lu <aaron.lwe@gmail.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: mingo@kernel.org
Cc: Phil Auld <pauld@redhat.com>
Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
Link: https://lkml.kernel.org/r/38c61d5240553e043c27c5e00b9dd0d184dd6081.1559129225.git.vpillai@digitalocean.com
2019-08-08 09:09:30 +02:00
Peter Zijlstra
5feeb7837a sched: Fix kerneldoc comment for ia64_set_curr_task
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aaron Lu <aaron.lwe@gmail.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: mingo@kernel.org
Cc: Phil Auld <pauld@redhat.com>
Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
Link: https://lkml.kernel.org/r/fde3a65ea3091ec6b84dac3c19639f85f452c5d1.1559129225.git.vpillai@digitalocean.com
2019-08-08 09:09:30 +02:00
Peter Zijlstra
99d84bf8c6 stop_machine: Fix stop_cpus_in_progress ordering
Make sure the entire for loop has stop_cpus_in_progress set.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aaron Lu <aaron.lwe@gmail.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: mingo@kernel.org
Cc: Phil Auld <pauld@redhat.com>
Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
Link: https://lkml.kernel.org/r/0fd8fd4b99b9b9aa88d8b2dff897f7fd0d88f72c.1559129225.git.vpillai@digitalocean.com
2019-08-08 09:09:30 +02:00
Dave Chiluk
de53fd7aed sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
It has been observed, that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota. This use case is typical of user-interactive non-cpu
bound applications, such as those running in kubernetes or mesos when
run on multiple cpu cores.

This has been root caused to cpu-local run queue being allocated per cpu
bandwidth slices, and then not fully using that slice within the period.
At which point the slice and quota expires. This expiration of unused
slice results in applications not being able to utilize the quota for
which they are allocated.

The non-expiration of per-cpu slices was recently fixed by
'commit 512ac999d2 ("sched/fair: Fix bandwidth timer clock drift
condition")'. Prior to that it appears that this had been broken since
at least 'commit 51f2176d74 ("sched/fair: Fix unlocked reads of some
cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
added the following conditional which resulted in slices never being
expired.

if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
	/* extend local deadline, drift is bounded above by 2 ticks */
	cfs_rq->runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, and has recently been fixed
and is now being noticed by many users running kubernetes
(https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
that the mechanisms around expiring runtime should be removed
altogether.

This allows quota already allocated to per-cpu run-queues to live longer
than the period boundary. This allows threads on runqueues that do not
use much CPU to continue to use their remaining slice over a longer
period of time than cpu.cfs_period_us. However, this helps prevent the
above condition of hitting throttling while also not fully utilizing
your cpu quota.

This theoretically allows a machine to use slightly more than its
allotted quota in some periods. This overflow would be bounded by the
remaining quota left on each per-cpu runqueueu. This is typically no
more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
change nothing, as they should theoretically fully utilize all of their
quota in each period. For user-interactive tasks as described above this
provides a much better user/application experience as their cpu
utilization will more closely match the amount they requested when they
hit throttling. This means that cpu limits no longer strictly apply per
period for non-cpu bound applications, but that they are still accurate
over longer timeframes.

This greatly improves performance of high-thread-count, non-cpu bound
applications with low cfs_quota_us allocation on high-core-count
machines. In the case of an artificial testcase (10ms/100ms of quota on
80 CPU machine), this commit resulted in almost 30x performance
improvement, while still maintaining correct cpu quota restrictions.
That testcase is available at https://github.com/indeedeng/fibtest.

Fixes: 512ac999d2 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: John Hammond <jhammond@indeed.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kyle Anderson <kwa@yelp.com>
Cc: Gabriel Munos <gmunoz@netflix.com>
Cc: Peter Oskolkov <posk@posk.io>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Brendan Gregg <bgregg@netflix.com>
Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com
2019-08-08 09:09:30 +02:00
Peter Zijlstra
139d025cda sched: Clean up active_mm reference counting
The current active_mm reference counting is confusing and sub-optimal.

Rewrite the code to explicitly consider the 4 separate cases:

    user -> user

	When switching between two user tasks, all we need to consider
	is switch_mm().

    user -> kernel

	When switching from a user task to a kernel task (which
	doesn't have an associated mm) we retain the last mm in our
	active_mm. Increment a reference count on active_mm.

  kernel -> kernel

	When switching between kernel threads, all we need to do is
	pass along the active_mm reference.

  kernel -> user

	When switching between a kernel and user task, we must switch
	from the last active_mm to the next mm, hoping of course that
	these are the same. Decrement a reference on the active_mm.

The code keeps a different order, because as you'll note, both 'to
user' cases require switch_mm().

And where the old code would increment/decrement for the 'kernel ->
kernel' case, the new code observes this is a neutral operation and
avoids touching the reference count.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: luto@kernel.org
2019-08-08 09:09:30 +02:00
Peter Zijlstra
130d9c331b rcu/tree: Fix SCHED_FIFO params
A rather embarrasing mistake had us call sched_setscheduler() before
initializing the parameters passed to it.

Fixes: 1a763fd7c6 ("rcu/tree: Call setschedule() gp ktread to SCHED_FIFO outside of atomic region")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
2019-08-08 09:09:30 +02:00
Peter Zijlstra
e57d143091 mutex: Fix up mutex_waiter usage
The patch moving bits into mutex.c was a little too much; by also
moving struct mutex_waiter a few less common CONFIGs would no longer
build.

Fixes: 5f35d5a66b ("locking/mutex: Make __mutex_owner static to mutex.c")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2019-08-08 09:09:25 +02:00
Ming Lei
491beed3b1 genirq/affinity: Create affinity mask for single vector
Since commit c66d4bd110 ("genirq/affinity: Add new callback for
(re)calculating interrupt sets"), irq_create_affinity_masks() returns
NULL in case of single vector. This change has caused regression on some
drivers, such as lpfc.

The problem is that single vector requests can happen in some generic cases:

  1) kdump kernel

  2) irq vectors resource is close to exhaustion.

If in that situation the affinity mask for a single vector is not created,
every caller has to handle the special case.

There is no reason why the mask cannot be created, so remove the check for
a single vector and create the mask.

Fixes: c66d4bd110 ("genirq/affinity: Add new callback for (re)calculating interrupt sets")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20190805011906.5020-1-ming.lei@redhat.com
2019-08-08 08:47:55 +02:00
Marc Koderer
653a23ca7e Use kvmalloc in cgroups-v1
Instead of using its own logic for k-/vmalloc rely on
kvmalloc which is actually doing quite the same.

Signed-off-by: Marc Koderer <marc@koderer.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-08-07 11:37:58 -07:00
Marc Zyngier
b977fcf477 irqdomain/debugfs: Use PAs to generate fwnode names
Booting a large arm64 server (HiSi D05) leads to the following
shouting at boot time:

[   20.722132] debugfs: File 'irqchip@(____ptrval____)-3' in directory 'domains' already present!
[   20.730851] debugfs: File 'irqchip@(____ptrval____)-3' in directory 'domains' already present!
[   20.739560] debugfs: File 'irqchip@(____ptrval____)-3' in directory 'domains' already present!
[   20.748267] debugfs: File 'irqchip@(____ptrval____)-3' in directory 'domains' already present!
[   20.756975] debugfs: File 'irqchip@(____ptrval____)-3' in directory 'domains' already present!
[   20.765683] debugfs: File 'irqchip@(____ptrval____)-3' in directory 'domains' already present!
[   20.774391] debugfs: File 'irqchip@(____ptrval____)-3' in directory 'domains' already present!

and many more... Evidently, we expect something a bit more informative
than ____ptrval____, and certainly we want all of our domains, not just
the first one.

For that, turn the %p used to generate the fwnode name into something
that won't be repainted (%pa). Given that we've now fixed all users to
pass a pointer to a PA, it will actually do the right thing.

Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2019-08-07 14:24:54 +01:00
Linus Torvalds
33920f1ec5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:
 "Yeah I should have sent a pull request last week, so there is a lot
  more here than usual:

   1) Fix memory leak in ebtables compat code, from Wenwen Wang.

   2) Several kTLS bug fixes from Jakub Kicinski (circular close on
      disconnect etc.)

   3) Force slave speed check on link state recovery in bonding 802.3ad
      mode, from Thomas Falcon.

   4) Clear RX descriptor bits before assigning buffers to them in
      stmmac, from Jose Abreu.

   5) Several missing of_node_put() calls, mostly wrt. for_each_*() OF
      loops, from Nishka Dasgupta.

   6) Double kfree_skb() in peak_usb can driver, from Stephane Grosjean.

   7) Need to hold sock across skb->destructor invocation, from Cong
      Wang.

   8) IP header length needs to be validated in ipip tunnel xmit, from
      Haishuang Yan.

   9) Use after free in ip6 tunnel driver, also from Haishuang Yan.

  10) Do not use MSI interrupts on r8169 chips before RTL8168d, from
      Heiner Kallweit.

  11) Upon bridge device init failure, we need to delete the local fdb.
      From Nikolay Aleksandrov.

  12) Handle erros from of_get_mac_address() properly in stmmac, from
      Martin Blumenstingl.

  13) Handle concurrent rename vs. dump in netfilter ipset, from Jozsef
      Kadlecsik.

  14) Setting NETIF_F_LLTX on mac80211 causes complete breakage with
      some devices, so revert. From Johannes Berg.

  15) Fix deadlock in rxrpc, from David Howells.

  16) Fix Kconfig deps of enetc driver, we must have PHYLIB. From Yue
      Haibing.

  17) Fix mvpp2 crash on module removal, from Matteo Croce.

  18) Fix race in genphy_update_link, from Heiner Kallweit.

  19) bpf_xdp_adjust_head() stopped working with generic XDP when we
      fixes generic XDP to support stacked devices properly, fix from
      Jesper Dangaard Brouer.

  20) Unbalanced RCU locking in rt6_update_exception_stamp_rt(), from
      David Ahern.

  21) Several memory leaks in new sja1105 driver, from Vladimir Oltean"

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (214 commits)
  net: dsa: sja1105: Fix memory leak on meta state machine error path
  net: dsa: sja1105: Fix memory leak on meta state machine normal path
  net: dsa: sja1105: Really fix panic on unregistering PTP clock
  net: dsa: sja1105: Use the LOCKEDS bit for SJA1105 E/T as well
  net: dsa: sja1105: Fix broken learning with vlan_filtering disabled
  net: dsa: qca8k: Add of_node_put() in qca8k_setup_mdio_bus()
  net: sched: sample: allow accessing psample_group with rtnl
  net: sched: police: allow accessing police->params with rtnl
  net: hisilicon: Fix dma_map_single failed on arm64
  net: hisilicon: fix hip04-xmit never return TX_BUSY
  net: hisilicon: make hip04_tx_reclaim non-reentrant
  tc-testing: updated vlan action tests with batch create/delete
  net sched: update vlan action for batched events operations
  net: stmmac: tc: Do not return a fragment entry
  net: stmmac: Fix issues when number of Queues >= 4
  net: stmmac: xgmac: Fix XGMAC selftests
  be2net: disable bh with spin_lock in be_process_mcc
  net: cxgb3_main: Fix a resource leak in a error path in 'init_one()'
  net: ethernet: sun4i-emac: Support phy-handle property for finding PHYs
  net: bridge: move default pvid init/deinit to NETDEV_REGISTER/UNREGISTER
  ...
2019-08-06 17:11:59 -07:00
Catalin Marinas
63f0c60379 arm64: Introduce prctl() options to control the tagged user addresses ABI
It is not desirable to relax the ABI to allow tagged user addresses into
the kernel indiscriminately. This patch introduces a prctl() interface
for enabling or disabling the tagged ABI with a global sysctl control
for preventing applications from enabling the relaxed ABI (meant for
testing user-space prctl() return error checking without reconfiguring
the kernel). The ABI properties are inherited by threads of the same
application and fork()'ed children but cleared on execve(). A Kconfig
option allows the overall disabling of the relaxed ABI.

The PR_SET_TAGGED_ADDR_CTRL will be expanded in the future to handle
MTE-specific settings like imprecise vs precise exceptions.

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
2019-08-06 18:08:45 +01:00
Suren Baghdasaryan
04e048cf09 sched/psi: Do not require setsched permission from the trigger creator
When a process creates a new trigger by writing into /proc/pressure/*
files, permissions to write such a file should be used to determine whether
the process is allowed to do so or not. Current implementation would also
require such a process to have setsched capability. Setting of psi trigger
thread's scheduling policy is an implementation detail and should not be
exposed to the user level. Remove the permission check by using _nocheck
version of the function.

Suggested-by: Nick Kralevich <nnk@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: lizefan@huawei.com
Cc: mingo@redhat.com
Cc: akpm@linux-foundation.org
Cc: kernel-team@android.com
Cc: dennisszhou@gmail.com
Cc: dennis@kernel.org
Cc: hannes@cmpxchg.org
Cc: axboe@kernel.dk
Link: https://lkml.kernel.org/r/20190730013310.162367-1-surenb@google.com
2019-08-06 12:49:18 +02:00
Peter Zijlstra
14f5c7b46a sched/psi: Reduce psimon FIFO priority
PSI defaults to a FIFO-99 thread, reduce this to FIFO-1.

FIFO-99 is the very highest priority available to SCHED_FIFO and
it not a suitable default; it would indicate the psi work is the
most important work on the machine.

Since Real-Time tasks will have pre-allocated memory and locked it in
place, Real-Time tasks do not care about PSI. All it needs is to be
above OTHER.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
2019-08-06 12:49:18 +02:00
Dietmar Eggemann
f4904815f9 sched/deadline: Fix double accounting of rq/running bw in push & pull
{push,pull}_dl_task() always calls {de,}activate_task() with .flags=0
which sets p->on_rq=TASK_ON_RQ_MIGRATING.

{push,pull}_dl_task()->{de,}activate_task()->{de,en}queue_task()->
{de,en}queue_task_dl() calls {sub,add}_{running,rq}_bw() since
p->on_rq==TASK_ON_RQ_MIGRATING.
So {sub,add}_{running,rq}_bw() in {push,pull}_dl_task() is
double-accounting for that task.

Fix it by removing rq/running bw accounting in [push/pull]_dl_task().

Fixes: 7dd7788411 ("sched/core: Unify p->on_rq updates")
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Qais Yousef <qais.yousef@arm.com>
Link: https://lkml.kernel.org/r/20190802145945.18702-2-dietmar.eggemann@arm.com
2019-08-06 12:49:18 +02:00
Mukesh Ojha
a037d26922 locking/mutex: Use mutex flags macro instead of hard code
Use the mutex flag macro instead of hard code value inside
__mutex_owner().

Signed-off-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: mingo@redhat.com
Cc: will@kernel.org
Link: https://lkml.kernel.org/r/1564585504-3543-2-git-send-email-mojha@codeaurora.org
2019-08-06 12:49:16 +02:00
Mukesh Ojha
5f35d5a66b locking/mutex: Make __mutex_owner static to mutex.c
__mutex_owner() should only be used by the mutex api's.
So, to put this restiction let's move the __mutex_owner()
function definition from linux/mutex.h to mutex.c file.

There exist functions that uses __mutex_owner() like
mutex_is_locked() and mutex_trylock_recursive(), So
to keep legacy thing intact move them as well and
export them.

Move mutex_waiter structure also to keep it private to the
file.

Signed-off-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: mingo@redhat.com
Cc: will@kernel.org
Link: https://lkml.kernel.org/r/1564585504-3543-1-git-send-email-mojha@codeaurora.org
2019-08-06 12:49:16 +02:00
Davidlohr Bueso
fce45cd411 locking/rwsem: Check for operations on an uninitialized rwsem
Currently rwsems is the only locking primitive that lacks this
debug feature. Add it under CONFIG_DEBUG_RWSEMS and do the magic
checking in the locking fastpath (trylock) operation such that
we cover all cases. The unlocking part is pretty straightforward.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: mingo@kernel.org
Cc: Davidlohr Bueso <dave@stgolabs.net>
Link: https://lkml.kernel.org/r/20190729044735.9632-1-dave@stgolabs.net
2019-08-06 12:49:15 +02:00
Waiman Long
91d2a812df locking/rwsem: Make handoff writer optimistically spin on owner
When the handoff bit is set by a writer, no other tasks other than
the setting writer itself is allowed to acquire the lock. If the
to-be-handoff'ed writer goes to sleep, there will be a wakeup latency
period where the lock is free, but no one can acquire it. That is less
than ideal.

To reduce that latency, the handoff writer will now optimistically spin
on the owner if it happens to be a on-cpu writer. It will spin until
it releases the lock and the to-be-handoff'ed writer can then acquire
the lock immediately without any delay. Of course, if the owner is not
a on-cpu writer, the to-be-handoff'ed writer will have to sleep anyway.

The optimistic spinning code is also modified to not stop spinning
when the handoff bit is set. This will prevent an occasional setting of
handoff bit from causing a bunch of optimistic spinners from entering
into the wait queue causing significant reduction in throughput.

On a 1-socket 22-core 44-thread Skylake system, the AIM7 shared_memory
workload was run with 7000 users. The throughput (jobs/min) of the
following kernels were as follows:

 1) 5.2-rc6
    - 8,092,486
 2) 5.2-rc6 + tip's rwsem patches
    - 7,567,568
 3) 5.2-rc6 + tip's rwsem patches + this patch
    - 7,954,545

Using perf-record(1), the %cpu time used by rwsem_down_write_slowpath(),
rwsem_down_write_failed() and their callees for the 3 kernels were 1.70%,
5.46% and 2.08% respectively.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: x86@kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lkml.kernel.org/r/20190625143913.24154-1-longman@redhat.com
2019-08-06 12:49:15 +02:00
Thiago Jung Bauermann
c8424e776b MODSIGN: Export module signature definitions
IMA will use the module_signature format for append signatures, so export
the relevant definitions and factor out the code which verifies that the
appended signature trailer is valid.

Also, create a CONFIG_MODULE_SIG_FORMAT option so that IMA can select it
and be able to use mod_check_sig() without having to depend on either
CONFIG_MODULE_SIG or CONFIG_MODULES.

s390 duplicated the definition of struct module_signature so now they can
use the new <linux/module_signature.h> header instead.

Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Acked-by: Jessica Yu <jeyu@kernel.org>
Reviewed-by: Philipp Rudo <prudo@linux.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2019-08-05 18:39:56 -04:00
Kalesh Singh
2c8db5bef9 PM/sleep: Expose suspend stats in sysfs
Userspace can get suspend stats from the suspend stats debugfs node.
Since debugfs doesn't have stable ABI, expose suspend stats in
sysfs under /sys/power/suspend_stats.

Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-08-05 12:03:18 +02:00
Christoph Hellwig
14c5cebad5 memremap: move from kernel/ to mm/
memremap.c implements MM functionality for ZONE_DEVICE, so it really
should be in the mm/ directory, not the kernel/ one.

Link: http://lkml.kernel.org/r/20190722094143.18387-1-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-03 07:02:01 -07:00
Mauro Carvalho Chehab
68d8681e97 kernel/signal.c: fix a kernel-doc markup
The kernel-doc parser doesn't handle expressions with %foo*.  Instead,
when an asterisk should be part of a constant, it uses an alternative
notation: `foo*`.

Link: http://lkml.kernel.org/r/7f18c2e0b5e39e6b7eb55ddeb043b8b260b49f2d.1563361575.git.mchehab+samsung@kernel.org
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-03 07:02:00 -07:00
Linus Torvalds
234172f6bb add swiotlb support to arm
This fixes a cascade of regressions that originally started with
 the addition of the ia64 port, but only got fatal once we removed
 most uses of block layer bounce buffering in Linux 4.18.
 
 The reason is that while the original i386/PAE code that was the first
 architecture that supported > 4GB of memory without an iommu decided to
 leave bounce buffering to the subsystems, which in those days just mean
 block and networking as no one else consumer arbitrary userspace memory.
 
 Later with ia64, x86_64 and other ports we assumed that either an iommu
 or something that fakes it up ("software IOTLB" in beautiful Intel
 speak) is present and that subsystems can rely on that for dealing with
 addressing limitations in devices.   Except that the ARM LPAE scheme
 that added larger physical address to 32-bit ARM did not follow that
 scheme and thus only worked by chance and only for block and networking
 I/O directly to highmem.
 
 Long story, short fix - add swiotlb support to arm when build for LPAE
 platforms, which actuallys turns out to be pretty trivial with the
 modern dma-direct / swiotlb code to fix the Linux 4.18-ish regression.
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl1DFj8LHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYPFqg/+Oh62VCFCkIK07NAeTq6EmrfHI8I1Wm/SFWPOOB+a
 vm7nMcSG3C8K8PRHzGc6Zk3SC1+RrHghcyKw54yLT1Mhroakv6Um7p2y8S3M4tmZ
 uEg8yYbtzxvuaY9T42s2msZURbBCEELzA2bYbQzgQ1zczRI1zuMI07ssMr91IQ91
 HC1OjAUoxUkp/+2uU/X2k6DvPQLSJSyWvKgbi1bjNpE+FRCKJP+2a2K3psBQuDBe
 aJXiz/kD2L/JNvF/e4c414d5GnGXwtIYs1kbskmnj3LeToS+JjX+6ZcENorpScIP
 c20s/3H6nsb14TFy548rJUlAHdcd9kOdeTw+0oPUliNLCogGs6FKNU4N5gVAo+bC
 AWDP0wMHMWkrVz6lQL9PR78IHrHOxFYS5/uHsqqdKo5YTsgaHnwKEiPxX1aiKQ67
 ovUrOnGRo4R9Y4YwD+BbHY9qw9jFMqazBdLWMivK5NxqltsahOug8w2emTFfXzQn
 m4APJYa0RVJA4mkh3ejcci5qHyyzPOjslyIJn7eaJPV2rknkxRn9UngkgJLnzHfc
 +lKiD1zaRy82nV4auPjYRiOdAoQN40YFB/RT16OVkjkT+jJEE2UAMjqh2SRlRusp
 Ce8vK7pw6VpDNGJRQveQA+1n9OR/jl0Jf8R7GFRrf9c/bM1J8GErJ6xS/EwNPrgI
 5dE=
 =D6Uy
 -----END PGP SIGNATURE-----

Merge tag 'arm-swiotlb-5.3' of git://git.infradead.org/users/hch/dma-mapping

Pull arm swiotlb support from Christoph Hellwig:
 "This fixes a cascade of regressions that originally started with the
  addition of the ia64 port, but only got fatal once we removed most
  uses of block layer bounce buffering in Linux 4.18.

  The reason is that while the original i386/PAE code that was the first
  architecture that supported > 4GB of memory without an iommu decided
  to leave bounce buffering to the subsystems, which in those days just
  mean block and networking as no one else consumed arbitrary userspace
  memory.

  Later with ia64, x86_64 and other ports we assumed that either an
  iommu or something that fakes it up ("software IOTLB" in beautiful
  Intel speak) is present and that subsystems can rely on that for
  dealing with addressing limitations in devices. Except that the ARM
  LPAE scheme that added larger physical address to 32-bit ARM did not
  follow that scheme and thus only worked by chance and only for block
  and networking I/O directly to highmem.

  Long story, short fix - add swiotlb support to arm when build for LPAE
  platforms, which actuallys turns out to be pretty trivial with the
  modern dma-direct / swiotlb code to fix the Linux 4.18-ish regression"

* tag 'arm-swiotlb-5.3' of git://git.infradead.org/users/hch/dma-mapping:
  arm: use swiotlb for bounce buffering on LPAE configs
  dma-mapping: check pfn validity in dma_common_{mmap,get_sgtable}
2019-08-02 08:44:33 -07:00
Linus Torvalds
35fca9f8a9 dma-mapping regression fixes for 5.3
- fix alignment issues introduced in the CMA allocation rework
    (Nicolin Chen)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl1DE+YLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYOrkA//Y7YxsJ9MaI0DNu9gYbYHg9u4ORBWXJ4fN67g2AUe
 rdrHdEHyd4uYduy4Ggi65oknfMZms6xYHGaFr1iDforBmOk0CXfvovocHzJkcV/R
 gtHSkp0L+SVmsLrWFXjd7pbkVhBziLEVrFlw07FbtBlNIeS2VvcAnU+CUcTKpxNi
 7MHhCqxTjVuUZ/qRKyyAnKHGoLCdLqTvpaf1uJq9ca838I/9E5UqitaxL/72G7ab
 q/fe93d94Xj3QNk+ekim6xBSD82VPU+OnFUf+f5dELDwyhgI0LAtz6iL8gH+NnK1
 P9cIIs2sFyBLXRQEaRXF5KhA97sjlWLioXYWs/AxvCphDeb1Zk4u3uGn0bGt90fQ
 g8DryY++nVo6sKpFsaNN7RQ9w/LfxejIcf0hVbNfH6tP8KDO19ds/05kE4O2LUC2
 gLOmPMt+dIOJlBQY0fUNrZN/IH6u60LnULmCWDiy7iY7VBJOf+H3zXM0UAJ+XEbs
 l2OG5vxkQ4hnFZVD1csNRd9gKYyjhrqOA0VssopgdBS53/seYMNopSbQsMTdp8J7
 V3c7Rozz3f62pwxJ7Jd7AwCgpvw8zHOESb5WOzi5DEmxAqRaJQ80H2DdAvQc3orL
 x0SummHKX2mY0cJdrFFbXkGoYt3sjJ+J6P+0CP11UIEIqtw4Zfa2hyWmbs8Q9SFt
 KYg=
 =rM5q
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.3-3' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping regression fixes from Christoph Hellwig:
 "Two related regression fixes for changes from this merge window to fix
  alignment issues introduced in the CMA allocation rework (Nicolin
  Chen)"

* tag 'dma-mapping-5.3-3' of git://git.infradead.org/users/hch/dma-mapping:
  dma-contiguous: page-align the size in dma_free_contiguous()
  dma-contiguous: do not overwrite align in dma_alloc_contiguous()
2019-08-02 08:41:11 -07:00
Paul E. McKenney
60013d5d2b rcutorture: Aggressive forward-progress tests shouldn't block shutdown
The more aggressive forward-progress tests can interfere with rcutorture
shutdown, resulting in false-positive diagnostics.  This commit therefore
ends any such tests 30 seconds prior to shutdown.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:30:22 -07:00
Joel Fernandes (Google)
77e9752ce6 rcuperf: Make rcuperf kernel test more robust for !expedited mode
It is possible that the rcuperf kernel test runs concurrently with init
starting up.  During this time, the system is running all grace periods
as expedited.  However, rcuperf can also be run for normal GP tests.
Right now, it depends on a holdoff time before starting the test to
ensure grace periods start later. This works fine with the default
holdoff time however it is not robust in situations where init takes
greater than the holdoff time to finish running. Or, as in my case:

I modified the rcuperf test locally to also run a thread that did
preempt disable/enable in a loop. This had the effect of slowing down
init. The end result was that the "batches:" counter in rcuperf was 0
causing a division by 0 error in the results. This counter was 0 because
only expedited GPs seem to happen, not normal ones which led to the
rcu_state.gp_seq counter remaining constant across grace periods which
unexpectedly happen to be expedited. The system was running expedited
RCU all the time because rcu_unexpedited_gp() would not have run yet
from init.  In other words, the test would concurrently with init
booting in expedited GP mode.

To fix this properly, this commit waits until system_state is set to
SYSTEM_RUNNING before starting the test.  This change is made just
before kernel_init() invokes rcu_end_inkernel_boot(), and this latter
is what turns off boot-time expediting of RCU grace periods.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:30:22 -07:00
Denis Efremov
21f57546ce torture: Remove exporting of internal functions
The functions torture_onoff_cleanup() and torture_shuffle_cleanup()
are declared static and marked EXPORT_SYMBOL_GPL(), which is at best an
odd combination.  Because these functions are not used outside of the
kernel/torture.c file they are defined in, this commit removes their
EXPORT_SYMBOL_GPL() marking.

Fixes: cc47ae0830 ("rcutorture: Abstract torture-test cleanup")
Signed-off-by: Denis Efremov <efremov@linux.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:30:22 -07:00
Paul E. McKenney
bd1bfc51a3 rcutorture: Emulate userspace sojourn during call_rcu() floods
During an actual call_rcu() flood, there would be frequent trips to
userspace (in-kernel call_rcu() floods must be otherwise housebroken).
Userspace execution allows a great many things to interrupt execution,
and rcutorture needs to also allow such interruptions.  This commit
therefore causes call_rcu() floods to occasionally invoke schedule(),
thus preventing spurious rcutorture failures due to other parts of the
kernel becoming irate at the call_rcu() flood events.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:30:22 -07:00
Xiao Yang
b3f3886c59 rcuperf: Fix perf_type module-parameter description
The rcu_bh rcuperf type was removed by commit 620d246065cd("rcuperf:
Remove the "rcu_bh" and "sched" torture types"), but it lives on in the
MODULE_PARM_DESC() of perf_type.  This commit therefore changes that
module-parameter description to substitute srcu for rcu_bh.

Signed-off-by: Xiao Yang <ice_yangxiao@163.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:30:22 -07:00
Joel Fernandes (Google)
9147089bee rcu: Remove redundant debug_locks check in rcu_read_lock_sched_held()
The debug_locks flag can never be true at the end of
rcu_read_lock_sched_held() because it is already checked by the earlier
call todebug_lockdep_rcu_enabled().   This commit therefore removes this
redundant check.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:17:01 -07:00
Joel Fernandes (Google)
0a5b99f578 treewide: Rename rcu_dereference_raw_notrace() to _check()
The rcu_dereference_raw_notrace() API name is confusing.  It is equivalent
to rcu_dereference_raw() except that it also does sparse pointer checking.

There are only a few users of rcu_dereference_raw_notrace(). This patches
renames all of them to be rcu_dereference_raw_check() with the "_check()"
indicating sparse checking.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
[ paulmck: Fix checkpatch warnings about parentheses. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:16:21 -07:00
Byungchul Park
3545832fc2 rcu: Change return type of rcu_spawn_one_boost_kthread()
The return value of rcu_spawn_one_boost_kthread() is not used any longer.
This commit therefore changes its return type from int to void, and
removes the cast to void from its callers.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:05:51 -07:00
Paul E. McKenney
7e210a653e srcu: Avoid srcutorture security-based pointer obfuscation
Because pointer output is now obfuscated, and because what you really
want to know is whether or not the callback lists are empty, this commit
replaces the srcu_data structure's head callback pointer printout with
a single character that is "." is the callback list is empty or "C"
otherwise.

This is the only remaining user of rcu_segcblist_head(), so this
commit also removes this function's definition.  It also turns out that
rcu_segcblist_tail() no longer has any callers, so this commit removes
that function's definition while in the area.  They were both marked
"Interim", and their end has come.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:05:51 -07:00
Paul E. McKenney
fbad01af8c rcu: Add destroy_work_on_stack() to match INIT_WORK_ONSTACK()
The synchronize_rcu_expedited() function has an INIT_WORK_ONSTACK(),
but lacks the corresponding destroy_work_on_stack().  This commit
therefore adds destroy_work_on_stack().

Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
2019-08-01 14:05:51 -07:00
Paul E. McKenney
cdc694b235 rcu: Add kernel parameter to dump trace after RCU CPU stall warning
This commit adds a rcu_cpu_stall_ftrace_dump kernel boot parameter, that,
when set, causes the trace buffer to be dumped after an RCU CPU stall
warning is printed.  This kernel boot parameter is disabled by default,
maintaining compatibility with previous behavior.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:05:51 -07:00
Paul E. McKenney
1f3ebc8253 rcu: Restore barrier() to rcu_read_lock() and rcu_read_unlock()
Commit bb73c52bad ("rcu: Don't disable preemption for Tiny and Tree
RCU readers") removed the barrier() calls from rcu_read_lock() and
rcu_write_lock() in CONFIG_PREEMPT=n&&CONFIG_PREEMPT_COUNT=n kernels.
Within RCU, this commit was OK, but it failed to account for things like
get_user() that can pagefault and that can be reordered by the compiler.
Lack of the barrier() calls in rcu_read_lock() and rcu_read_unlock()
can cause these page faults to migrate into RCU read-side critical
sections, which in CONFIG_PREEMPT=n kernels could result in too-short
grace periods and arbitrary misbehavior.  Please see commit 386afc9114
("spinlocks and preemption points need to be at least compiler barriers")
and Linus's commit 66be4e66a7 ("rcu: locking and unlocking need to
always be at least barriers"), this last of which restores the barrier()
call to both rcu_read_lock() and rcu_read_unlock().

This commit removes barrier() calls that are no longer needed given that
the addition of them in Linus's commit noted above.  The combination of
this commit and Linus's commit effectively reverts commit bb73c52bad
("rcu: Don't disable preemption for Tiny and Tree RCU readers").

Reported-by: Herbert Xu <herbert@gondor.apana.org.au>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
[ paulmck: Fix embarrassing typo located by Alan Stern. ]
2019-08-01 14:05:51 -07:00
Paul E. McKenney
b55bd58555 time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
The TASKS03 and TREE04 rcutorture scenarios produce the following
lockdep complaint:

------------------------------------------------------------------------

================================
WARNING: inconsistent lock state
5.2.0-rc1+ #513 Not tainted
--------------------------------
inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
migration/1/14 [HC0[0]:SC0[0]:HE1:SE1] takes:
(____ptrval____) (tick_broadcast_lock){?...}, at: tick_broadcast_offline+0xf/0x70
{IN-HARDIRQ-W} state was registered at:
  lock_acquire+0xb0/0x1c0
  _raw_spin_lock_irqsave+0x3c/0x50
  tick_broadcast_switch_to_oneshot+0xd/0x40
  tick_switch_to_oneshot+0x4f/0xd0
  hrtimer_run_queues+0xf3/0x130
  run_local_timers+0x1c/0x50
  update_process_times+0x1c/0x50
  tick_periodic+0x26/0xc0
  tick_handle_periodic+0x1a/0x60
  smp_apic_timer_interrupt+0x80/0x2a0
  apic_timer_interrupt+0xf/0x20
  _raw_spin_unlock_irqrestore+0x4e/0x60
  rcu_nocb_gp_kthread+0x15d/0x590
  kthread+0xf3/0x130
  ret_from_fork+0x3a/0x50
irq event stamp: 171
hardirqs last  enabled at (171): [<ffffffff8a201a37>] trace_hardirqs_on_thunk+0x1a/0x1c
hardirqs last disabled at (170): [<ffffffff8a201a53>] trace_hardirqs_off_thunk+0x1a/0x1c
softirqs last  enabled at (0): [<ffffffff8a264ee0>] copy_process.part.56+0x650/0x1cb0
softirqs last disabled at (0): [<0000000000000000>] 0x0

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(tick_broadcast_lock);
  <Interrupt>
    lock(tick_broadcast_lock);

 *** DEADLOCK ***

1 lock held by migration/1/14:
 #0: (____ptrval____) (clockevents_lock){+.+.}, at: tick_offline_cpu+0xf/0x30

stack backtrace:
CPU: 1 PID: 14 Comm: migration/1 Not tainted 5.2.0-rc1+ #513
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
Call Trace:
 dump_stack+0x5e/0x8b
 print_usage_bug+0x1fc/0x216
 ? print_shortest_lock_dependencies+0x1b0/0x1b0
 mark_lock+0x1f2/0x280
 __lock_acquire+0x1e0/0x18f0
 ? __lock_acquire+0x21b/0x18f0
 ? _raw_spin_unlock_irqrestore+0x4e/0x60
 lock_acquire+0xb0/0x1c0
 ? tick_broadcast_offline+0xf/0x70
 _raw_spin_lock+0x33/0x40
 ? tick_broadcast_offline+0xf/0x70
 tick_broadcast_offline+0xf/0x70
 tick_offline_cpu+0x16/0x30
 take_cpu_down+0x7d/0xa0
 multi_cpu_stop+0xa2/0xe0
 ? cpu_stop_queue_work+0xc0/0xc0
 cpu_stopper_thread+0x6d/0x100
 smpboot_thread_fn+0x169/0x240
 kthread+0xf3/0x130
 ? sort_range+0x20/0x20
 ? kthread_cancel_delayed_work_sync+0x10/0x10
 ret_from_fork+0x3a/0x50

------------------------------------------------------------------------

To reproduce, run the following rcutorture test:

        tools/testing/selftests/rcutorture/bin/kvm.sh --duration 5 --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y" --configs "TASKS03 TREE04"

It turns out that tick_broadcast_offline() was an innocent bystander.
After all, interrupts are supposed to be disabled throughout
take_cpu_down(), and therefore should have been disabled upon entry to
tick_offline_cpu() and thus to tick_broadcast_offline().  This suggests
that one of the CPU-hotplug notifiers was incorrectly enabling interrupts,
and leaving them enabled on return.

Some debugging code showed that the culprit was sched_cpu_dying().
It had irqs enabled after return from sched_tick_stop().  Which in turn
had irqs enabled after return from cancel_delayed_work_sync().  Which is a
wrapper around __cancel_work_timer().  Which can sleep in the case where
something else is concurrently trying to cancel the same delayed work,
and as Thomas Gleixner pointed out on IRC, sleeping is a decidedly bad
idea when you are invoked from take_cpu_down(), regardless of the state
you leave interrupts in upon return.

Code inspection located no reason why the delayed work absolutely
needed to be canceled from sched_tick_stop():  The work is not
bound to the outgoing CPU by design, given that the whole point is
to collect statistics without disturbing the outgoing CPU.

This commit therefore simply drops the cancel_delayed_work_sync() from
sched_tick_stop().  Instead, a new ->state field is added to the tick_work
structure so that the delayed-work handler function sched_tick_remote()
can avoid reposting itself.  A cpu_is_offline() check is also added to
sched_tick_remote() to avoid mucking with the state of an offlined CPU
(though it does appear safe to do so).  The sched_tick_start() and
sched_tick_stop() functions also update ->state, and sched_tick_start()
also schedules the delayed work if ->state indicates that it is not
already in flight.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
[ paulmck: Apply Peter Zijlstra and Frederic Weisbecker atomics feedback. ]
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2019-08-01 14:05:51 -07:00
Paul E. McKenney
519248f36d lockdep: Make print_lock() address visible
Security is a wonderful thing, but so is the ability to debug based on
lockdep warnings.  This commit therefore makes lockdep lock addresses
visible in the clear.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:05:51 -07:00
Joel Fernandes (Google)
cb4dbbfaa1 rcu: Simplify rcu_note_context_switch exit from critical section
Because __rcu_read_unlock() can be preempted just before the call to
rcu_read_unlock_special(), it is possible for a task to be preempted just
before it would have fully exited its RCU read-side critical section.
This would result in a needless extension of that critical section until
that task was resumed, which might in turn result in a needlessly
long grace period, needless RCU priority boosting, and needless
force-quiescent-state actions.  Therefore, rcu_note_context_switch()
invokes __rcu_read_unlock() followed by rcu_preempt_deferred_qs() when
it detects this situation.  This action by rcu_note_context_switch()
ends the RCU read-side critical section immediately.

Of course, once the task resumes, it will invoke rcu_read_unlock_special()
redundantly.  This is harmless because the fact that a preemption
happened means that interrupts, preemption, and softirqs cannot
have been disabled, so there would be no deferred quiescent state.
While ->rcu_read_lock_nesting remains less than zero, none of the
->rcu_read_unlock_special.b bits can be set, and they were all zeroed by
the call to rcu_note_context_switch() at task-preemption time.  Therefore,
setting ->rcu_read_unlock_special.b.exp_hint to false has no effect.

Therefore, the extra call to rcu_preempt_deferred_qs_irqrestore()
would return immediately.  With one possible exception, which is
if an expedited grace period started just as the task was being
resumed, which could leave ->exp_deferred_qs set.  This will cause
rcu_preempt_deferred_qs_irqrestore() to invoke rcu_report_exp_rdp(),
reporting the quiescent state, just as it should.  (Such an expedited
grace period won't affect the preemption code path due to interrupts
having already been disabled.)

But when rcu_note_context_switch() invokes __rcu_read_unlock(), it
is doing so with preemption disabled, hence __rcu_read_unlock() will
unconditionally defer the quiescent state, only to immediately invoke
rcu_preempt_deferred_qs(), thus immediately reporting the deferred
quiescent state.  It turns out to be safe (and faster) to instead
just invoke rcu_preempt_deferred_qs() without the __rcu_read_unlock()
middleman.

Because this is the invocation during the preemption (as opposed to
the invocation just after the resume), at least one of the bits in
->rcu_read_unlock_special.b must be set and ->rcu_read_lock_nesting
must be negative.  This means that rcu_preempt_need_deferred_qs() must
return true, avoiding the early exit from rcu_preempt_deferred_qs().
Thus, rcu_preempt_deferred_qs_irqrestore() will be invoked immediately,
as required.

This commit therefore simplifies the CONFIG_PREEMPT=y version of
rcu_note_context_switch() by removing the "else if" branch of its
"if" statement.  This change means that all callers that would have
invoked rcu_read_unlock_special() followed by rcu_preempt_deferred_qs()
will now simply invoke rcu_preempt_deferred_qs(), thus avoiding the
rcu_read_unlock_special() middleman when __rcu_read_unlock() is preempted.

Cc: rcu@vger.kernel.org
Cc: kernel-team@android.com
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:04:20 -07:00
Paul E. McKenney
87446b4874 rcu: Make rcu_read_unlock_special() checks match raise_softirq_irqoff()
Threaded interrupts provide additional interesting interactions between
RCU and raise_softirq() that can result in self-deadlocks in v5.0-2 of
the Linux kernel.  These self-deadlocks can be provoked in susceptible
kernels within a few minutes using the following rcutorture command on
an 8-CPU system:

tools/testing/selftests/rcutorture/bin/kvm.sh --duration 5 --configs "TREE03" --bootargs "threadirqs"

Although post-v5.2 RCU commits have at least greatly reduced the
probability of these self-deadlocks, this was entirely by accident.
Although this sort of accident should be rowdily celebrated on those
rare occasions when it does occur, such celebrations should be quickly
followed by a principled patch, which is what this patch purports to be.

The key point behind this patch is that when in_interrupt() returns
true, __raise_softirq_irqoff() will never attempt a wakeup.  Therefore,
if in_interrupt(), calls to raise_softirq*() are both safe and
extremely cheap.

This commit therefore replaces the in_irq() calls in the "if" statement
in rcu_read_unlock_special() with in_interrupt() and simplifies the
"if" condition to the following:

	if (irqs_were_disabled && use_softirq &&
	    (in_interrupt() ||
	     (exp && !t->rcu_read_unlock_special.b.deferred_qs))) {
		raise_softirq_irqoff(RCU_SOFTIRQ);
	} else {
		/* Appeal to the scheduler. */
	}

The rationale behind the "if" condition is as follows:

1.	irqs_were_disabled:  If interrupts are enabled, we should
	instead appeal to the scheduler so as to let the upcoming
	irq_enable()/local_bh_enable() do the rescheduling for us.
2.	use_softirq: If this kernel isn't using softirq, then
	raise_softirq_irqoff() will be unhelpful.
3.	a.	in_interrupt(): If this returns true, the subsequent
		call to raise_softirq_irqoff() is guaranteed not to
		do a wakeup, so that call will be both very cheap and
		quite safe.
	b.	Otherwise, if !in_interrupt() the raise_softirq_irqoff()
		might do a wakeup, which is expensive and, in some
		contexts, unsafe.
		i.	The "exp" (an expedited RCU grace period is being
			blocked) says that the wakeup is worthwhile, and:
		ii.	The !.deferred_qs says that scheduler locks
			cannot be held, so the wakeup will be safe.

Backporting this requires considerable care, so no auto-backport, please!

Fixes: 05f415715c ("rcu: Speed up expedited GPs when interrupting RCU reader")
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:04:20 -07:00
Paul E. McKenney
d143b3d1cd rcu: Simplify rcu_read_unlock_special() deferred wakeups
In !use_softirq runs, we clearly cannot rely on raise_softirq() and
its lightweight bit setting, so we must instead do some form of wakeup.
In the absence of a self-IPI when interrupts are disabled, these wakeups
can be delayed until the next interrupt occurs.  This means that calling
invoke_rcu_core() doesn't actually do any expediting.

In this case, it is better to take the "else" clause, which sets the
current CPU's resched bits and, if there is an expedited grace period
in flight, uses IRQ-work to force the needed self-IPI.  This commit
therefore removes the "else if" clause that calls invoke_rcu_core().

Reported-by: Scott Wood <swood@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:04:20 -07:00
Stanislav Fomichev
9babe825da bpf: always allocate at least 16 bytes for setsockopt hook
Since we always allocate memory, allocate just a little bit more
for the BPF program in case it need to override user input with
bigger value. The canonical example is TCP_CONGESTION where
input string might be too small to override (nv -> bbr or cubic).

16 bytes are chosen to match the size of TCP_CA_NAME_MAX and can
be extended in the future if needed.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-08-01 13:55:52 -07:00
Christian Brauner
3695eae5fe
pidfd: add P_PIDFD to waitid()
This adds the P_PIDFD type to waitid().
One of the last remaining bits for the pidfd api is to make it possible
to wait on pidfds. With P_PIDFD added to waitid() the parts of userspace
that want to use the pidfd api to exclusively manage processes can do so
now.

One of the things this will unblock in the future is the ability to make
it possible to retrieve the exit status via waitid(P_PIDFD) for
non-parent processes if handed a _suitable_ pidfd that has this feature
set. This is similar to what you can do on FreeBSD with kqueue(). It
might even end up being possible to wait on a process as a non-parent if
an appropriate property is enabled on the pidfd.

With P_PIDFD no scoping of the process identified by the pidfd is
possible, i.e. it explicitly blocks things such as wait4(-1), wait4(0),
waitid(P_ALL), waitid(P_PGID) etc. It only allows for semantics
equivalent to wait4(pid), waitid(P_PID). Users that need scoping should
rely on pid-based wait*() syscalls for now.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20190727222229.6516-2-christian@brauner.io
2019-08-01 21:49:46 +02:00
Sebastian Andrzej Siewior
5d99b32a00 posix-timers: Move rcu_head out of it union
Timer deletion on PREEMPT_RT is prone to priority inversion and live
locks. The hrtimer code has a synchronization mechanism for this. Posix CPU
timers will grow one.

But that mechanism cannot be invoked while holding the k_itimer lock
because that can deadlock against the running timer callback. So the lock
must be dropped which allows the timer to be freed.

The timer free can be prevented by taking RCU readlock before dropping the
lock, but because the rcu_head is part of the 'it' union a concurrent free
will overwrite the hrtimer on which the task is trying to synchronize.

Move the rcu_head out of the union to prevent this.

[ tglx: Fixed up kernel-doc. Rewrote changelog ]

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190730223828.965541887@linutronix.de
2019-08-01 20:51:25 +02:00
Thomas Gleixner
6945e5c2ab posix-timers: Rework cancel retry loops
As a preparatory step for adding the PREEMPT RT specific synchronization
mechanism to wait for a running timer callback, rework the timer cancel
retry loops so they call a common function. This allows trivial
substitution in one place.

Originally-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190730223828.874901027@linutronix.de
2019-08-01 20:51:24 +02:00
Thomas Gleixner
21670ee44f posix-timers: Cleanup the flag/flags confusion
do_timer_settime() has a 'flags' argument and uses 'flag' for the interrupt
flags, which is confusing at best.

Rename the argument so 'flags' can be used for interrupt flags as usual.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190730223828.782664411@linutronix.de
2019-08-01 20:51:24 +02:00
Anna-Maria Gleixner
c7e6d704a0 itimers: Prepare for PREEMPT_RT
Use the hrtimer_cancel_wait_running() synchronization mechanism to prevent
priority inversion and live locks on PREEMPT_RT.

As a benefit the retry loop gains the missing cpu_relax() on !RT.

[ tglx: Split out of combo patch ]

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190730223828.690771827@linutronix.de
2019-08-01 20:51:24 +02:00
Anna-Maria Gleixner
51ae33092b alarmtimer: Prepare for PREEMPT_RT
Use the hrtimer_cancel_wait_running() synchronization mechanism to prevent
priority inversion and live locks on PREEMPT_RT.

[ tglx: Split out of combo patch ]

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190730223828.508744705@linutronix.de
2019-08-01 20:51:23 +02:00
Juri Lelli
850377a875 sched/deadline: Ensure inactive_timer runs in hardirq context
SCHED_DEADLINE inactive timer needs to run in hardirq context (as
dl_task_timer already does) on PREEMPT_RT

Change the mode to HRTIMER_MODE_REL_HARD.

[ tglx: Fixed up the start site, so mode debugging works ]

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190731103715.4047-1-juri.lelli@redhat.com
2019-08-01 20:51:22 +02:00
Anna-Maria Gleixner
030dcdd197 timers: Prepare support for PREEMPT_RT
When PREEMPT_RT is enabled, the soft interrupt thread can be preempted.  If
the soft interrupt thread is preempted in the middle of a timer callback,
then calling del_timer_sync() can lead to two issues:

  - If the caller is on a remote CPU then it has to spin wait for the timer
    handler to complete. This can result in unbound priority inversion.

  - If the caller originates from the task which preempted the timer
    handler on the same CPU, then spin waiting for the timer handler to
    complete is never going to end.

To avoid these issues, add a new lock to the timer base which is held
around the execution of the timer callbacks. If del_timer_sync() detects
that the timer callback is currently running, it blocks on the expiry
lock. When the callback is finished, the expiry lock is dropped by the
softirq thread which wakes up the waiter and the system makes progress.

This addresses both the priority inversion and the life lock issues.

This mechanism is not used for timers which are marked IRQSAFE as for those
preemption is disabled accross the callback and therefore this situation
cannot happen. The callbacks for such timers need to be individually
audited for RT compliance.

The same issue can happen in virtual machines when the vCPU which runs a
timer callback is scheduled out. If a second vCPU of the same guest calls
del_timer_sync() it will spin wait for the other vCPU to be scheduled back
in. The expiry lock mechanism would avoid that. It'd be trivial to enable
this when paravirt spinlocks are enabled in a guest, but it's not clear
whether this is an actual problem in the wild, so for now it's an RT only
mechanism.

As the softirq thread can be preempted with PREEMPT_RT=y, the SMP variant
of del_timer_sync() needs to be used on UP as well.

[ tglx: Refactored it for mainline ]

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185753.832418500@linutronix.de
2019-08-01 20:51:22 +02:00
Anna-Maria Gleixner
f61eff83ce hrtimer: Prepare support for PREEMPT_RT
When PREEMPT_RT is enabled, the soft interrupt thread can be preempted.  If
the soft interrupt thread is preempted in the middle of a timer callback,
then calling hrtimer_cancel() can lead to two issues:

  - If the caller is on a remote CPU then it has to spin wait for the timer
    handler to complete. This can result in unbound priority inversion.

  - If the caller originates from the task which preempted the timer
    handler on the same CPU, then spin waiting for the timer handler to
    complete is never going to end.

To avoid these issues, add a new lock to the timer base which is held
around the execution of the timer callbacks. If hrtimer_cancel() detects
that the timer callback is currently running, it blocks on the expiry
lock. When the callback is finished, the expiry lock is dropped by the
softirq thread which wakes up the waiter and the system makes progress.

This addresses both the priority inversion and the life lock issues.

The same issue can happen in virtual machines when the vCPU which runs a
timer callback is scheduled out. If a second vCPU of the same guest calls
hrtimer_cancel() it will spin wait for the other vCPU to be scheduled back
in. The expiry lock mechanism would avoid that. It'd be trivial to enable
this when paravirt spinlocks are enabled in a guest, but it's not clear
whether this is an actual problem in the wild, so for now it's an RT only
mechanism.

[ tglx: Refactored it for mainline ]

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185753.737767218@linutronix.de
2019-08-01 20:51:22 +02:00
Sebastian Andrzej Siewior
1842f5a427 hrtimer: Determine hard/soft expiry mode for hrtimer sleepers on RT
On PREEMPT_RT enabled kernels hrtimers which are not explicitely marked for
hard interrupt expiry mode are moved into soft interrupt context either for
latency reasons or because the hrtimer callback takes regular spinlocks or
invokes other functions which are not suitable for hard interrupt context
on PREEMPT_RT.

The hrtimer_sleeper callback is RT compatible in hard interrupt context,
but there is a latency concern: Untrusted userspace can spawn many threads
which arm timers for the same expiry time on the same CPU. On expiry that
causes a latency spike due to the wakeup of a gazillion threads.

OTOH, priviledged real-time user space applications rely on the low latency
of hard interrupt wakeups. These syscall related wakeups are all based on
hrtimer sleepers.

If the current task is in a real-time scheduling class, mark the mode for
hard interrupt expiry.

[ tglx: Split out of a larger combo patch. Added changelog ]

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185753.645792403@linutronix.de
2019-08-01 20:51:22 +02:00
Sebastian Andrzej Siewior
f5c2f0215e hrtimer: Move unmarked hrtimers to soft interrupt expiry on RT
On PREEMPT_RT not all hrtimers can be expired in hard interrupt context
even if that is perfectly fine on a PREEMPT_RT=n kernel, e.g. because they
take regular spinlocks. Also for latency reasons PREEMPT_RT tries to defer
most hrtimers' expiry into softirq context.

hrtimers marked with HRTIMER_MODE_HARD must be kept in hard interrupt
context expiry mode. Add the required logic.

No functional change for PREEMPT_RT=n kernels.

[ tglx: Split out of a larger combo patch. Added changelog ]

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185753.551967692@linutronix.de
2019-08-01 20:51:21 +02:00
Sebastian Andrzej Siewior
902a9f9c50 tick: Mark tick related hrtimers to expiry in hard interrupt context
The tick related hrtimers, which drive the scheduler tick and hrtimer based
broadcasting are required to expire in hard interrupt context for obvious
reasons.

Mark them so PREEMPT_RT kernels wont move them to soft interrupt expiry.

Make the horribly formatted RCU_NONIDLE bracket maze readable while at it.

No functional change, 

[ tglx: Split out from larger combo patch. Add changelog ]

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185753.459144407@linutronix.de
2019-08-01 20:51:21 +02:00
Sebastian Andrzej Siewior
d2ab4cf494 watchdog: Mark watchdog_hrtimer to expire in hard interrupt context
The watchdog hrtimer must expire in hard interrupt context even on
PREEMPT_RT=y kernels as otherwise the hard/softlockup detection logic would
not work.

No functional change.

[ tglx: Split out from larger combo patch. Added changelog ]

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185753.262895510@linutronix.de
2019-08-01 20:51:20 +02:00
Sebastian Andrzej Siewior
30f9028b6c perf/core: Mark hrtimers to expire in hard interrupt context
To guarantee that the multiplexing mechanism and the hrtimer driven events
work on PREEMPT_RT enabled kernels it's required that the related hrtimers
expire in hard interrupt context. Mark them so PREEMPT_RT kernels wont
defer them to soft interrupt context.

No functional change.

[ tglx: Split out of larger combo patch. Added changelog ]

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185753.169509224@linutronix.de
2019-08-01 20:51:20 +02:00
Sebastian Andrzej Siewior
d5096aa65a sched: Mark hrtimers to expire in hard interrupt context
The scheduler related hrtimers need to expire in hard interrupt context
even on PREEMPT_RT enabled kernels. Mark then as such.

No functional change.

[ tglx: Split out from larger combo patch. Add changelog. ]

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185753.077004842@linutronix.de
2019-08-01 20:51:19 +02:00
Thomas Gleixner
0ab6a3ddba hrtimer: Make enqueue mode check work on RT
hrtimer_start_range_ns() has a WARN_ONCE() which verifies that a timer
which is marker for softirq expiry is not queued in the hard interrupt base
and vice versa.

When PREEMPT_RT is enabled, timers which are not explicitely marked to
expire in hard interrupt context are deferrred to the soft interrupt. So
the regular check would trigger.

Change the check, so when PREEMPT_RT is enabled, it is verified that the
timers marked for hard interrupt expiry are not tried to be queued for soft
interrupt expiry or any of the unmarked and softirq marked is tried to be
expired in hard interrupt context.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2019-08-01 20:51:19 +02:00
Thomas Gleixner
9dd8813ed9 hrtimer/treewide: Use hrtimer_sleeper_start_expires()
hrtimer_sleepers will gain a scheduling class dependent treatment on
PREEMPT_RT. Use the new hrtimer_sleeper_start_expires() function to make
that possible.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2019-08-01 17:43:16 +02:00
Thomas Gleixner
01656464fc hrtimer: Provide hrtimer_sleeper_start_expires()
hrtimer_sleepers will gain a scheduling class dependent treatment on
PREEMPT_RT. Create a wrapper around hrtimer_start_expires() to make that
possible.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2019-08-01 17:43:15 +02:00
Sebastian Andrzej Siewior
dbc1625fc9 hrtimer: Consolidate hrtimer_init() + hrtimer_init_sleeper() calls
hrtimer_init_sleeper() calls require prior initialisation of the hrtimer
object which is embedded into the hrtimer_sleeper.

Combine the initialization and spare a function call. Fixup all call sites.

This is also a preparatory change for PREEMPT_RT to do hrtimer sleeper
specific initializations of the embedded hrtimer without modifying any of
the call sites.

No functional change.

[ anna-maria: Minor cleanups ]
[ tglx: Adopted to the removal of the task argument of
  	hrtimer_init_sleeper() and trivial polishing.
	Folded a fix from Stephen Rothwell for the vsoc code ]

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185752.887468908@linutronix.de
2019-08-01 17:43:15 +02:00
Linus Torvalds
d2eee9fca1 Two minor fixes:
- Fix trace event header include guards, as several did not match
    the #define to the #ifdef
 
  - Remove a redundant test to ftrace_graph_notrace_addr() that
    was accidentally added.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXUFythQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qsjjAP41jDPQogZMaQZrPW1qyp69DHhuZXI2
 j/d7A2LG76mCggD/WHPBk8P98IHk7pO5Ndl4yLKS3plMCYqTcgylpJLMXQI=
 =yEpO
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Two minor fixes:

   - Fix trace event header include guards, as several did not match the
     #define to the #ifdef

   - Remove a redundant test to ftrace_graph_notrace_addr() that was
     accidentally added"

* tag 'trace-v5.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  fgraph: Remove redundant ftrace_graph_notrace_addr() test
  tracing: Fix header include guards in trace event headers
2019-07-31 10:26:59 -07:00
Thomas Gleixner
9261660636 kprobes: Use CONFIG_PREEMPTION
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by
CONFIG_PREEMPT_RT. Both PREEMPT and PREEMPT_RT require the same
functionality which today depends on CONFIG_PREEMPT.

Switch kprobes conditional over to CONFIG_PREEMPTION.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20190726212124.516286187@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-31 19:03:35 +02:00
Thomas Gleixner
30c937043b tracing: Use CONFIG_PREEMPTION
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by
CONFIG_PREEMPT_RT. Both PREEMPT and PREEMPT_RT require the same
functionality which today depends on CONFIG_PREEMPT.

Switch the conditionals in the tracer over to CONFIG_PREEMPTION.

This is the first step to make the tracer work on RT. The other small
tweaks are submitted separately.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20190726212124.409766323@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-31 19:03:35 +02:00
Thomas Gleixner
01b1d88b09 rcu: Use CONFIG_PREEMPTION
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by
CONFIG_PREEMPT_RT. Both PREEMPT and PREEMPT_RT require the same
functionality which today depends on CONFIG_PREEMPT.

Switch the conditionals in RCU to use CONFIG_PREEMPTION.

That's the first step towards RCU on RT. The further tweaks are work in
progress. This neither touches the selftest bits which need a closer look
by Paul.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20190726212124.210156346@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-31 19:03:35 +02:00
Thomas Gleixner
c1a280b68d sched/preempt: Use CONFIG_PREEMPTION where appropriate
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by
CONFIG_PREEMPT_RT. Both PREEMPT and PREEMPT_RT require the same
functionality which today depends on CONFIG_PREEMPT.

Switch the preemption code, scheduler and init task over to use
CONFIG_PREEMPTION.

That's the first step towards RT in that area. The more complex changes are
coming separately.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20190726212124.117528401@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-31 19:03:34 +02:00
Changbin Du
6c77221df9 fgraph: Remove redundant ftrace_graph_notrace_addr() test
We already have tested it before. The second one should be removed.
With this change, the performance should have little improvement.

Link: http://lkml.kernel.org/r/20190730140850.7927-1-changbin.du@gmail.com

Cc: stable@vger.kernel.org
Fixes: 9cd2992f2d ("fgraph: Have set_graph_notrace only affect function_graph tracer")
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-30 21:50:03 -04:00
Thomas Gleixner
b744948725 hrtimer: Remove task argument from hrtimer_init_sleeper()
All callers hand in 'current' and that's the only task pointer which
actually makes sense. Remove the task argument and set current in the
function.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185752.791885290@linutronix.de
2019-07-30 23:57:51 +02:00
Christian Brauner
30b692d3b3
exit: make setting exit_state consistent
Since commit b191d6491b ("pidfd: fix a poll race when setting exit_state")
we unconditionally set exit_state to EXIT_ZOMBIE before calling into
do_notify_parent(). This was done to eliminate a race when querying
exit_state in do_notify_pidfd().
Back then we decided to do the absolute minimal thing to fix this and
not touch the rest of the exit_notify() function where exit_state is
set.
Since this fix has not caused any issues change the setting of
exit_state to EXIT_DEAD in the autoreap case to account for the fact hat
exit_state is set to EXIT_ZOMBIE unconditionally. This fix was planned
but also explicitly requested in [1] and makes the whole code more
consistent.

/* References */
[1]: https://lore.kernel.org/lkml/CAHk-=wigcxGFR2szue4wavJtH5cYTTeNES=toUBVGsmX0rzX+g@mail.gmail.com

Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-30 19:57:14 +02:00
Thomas Huth
2b089bf8d1 kernel/configs: Replace GPL boilerplate code with SPDX identifier
The FSF does not reside in "675 Mass Ave, Cambridge" anymore...
let's replace the old GPL boilerplate code with a proper SPDX
identifier instead.

Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-30 18:34:15 +02:00
Jessica Yu
38f054d549 modules: always page-align module section allocations
Some arches (e.g., arm64, x86) have moved towards non-executable
module_alloc() allocations for security hardening reasons. That means
that the module loader will need to set the text section of a module to
executable, regardless of whether or not CONFIG_STRICT_MODULE_RWX is set.

When CONFIG_STRICT_MODULE_RWX=y, module section allocations are always
page-aligned to handle memory rwx permissions. On some arches with
CONFIG_STRICT_MODULE_RWX=n however, when setting the module text to
executable, the BUG_ON() in frob_text() gets triggered since module
section allocations are not page-aligned when CONFIG_STRICT_MODULE_RWX=n.
Since the set_memory_* API works with pages, and since we need to call
set_memory_x() regardless of whether CONFIG_STRICT_MODULE_RWX is set, we
might as well page-align all module section allocations for ease of
managing rwx permissions of module sections (text, rodata, etc).

Fixes: 2eef1399a8 ("modules: fix BUG when load module with rodata=n")
Reported-by: Martin Kaiser <lists@kaiser.cx>
Reported-by: Bartosz Golaszewski <brgl@bgdev.pl>
Tested-by: David Lechner <david@lechnology.com>
Tested-by: Martin Kaiser <martin@kaiser.cx>
Tested-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2019-07-30 10:35:23 +02:00
Toke Høiland-Jørgensen
6f9d451ab1 xdp: Add devmap_hash map type for looking up devices by hashed index
A common pattern when using xdp_redirect_map() is to create a device map
where the lookup key is simply ifindex. Because device maps are arrays,
this leaves holes in the map, and the map has to be sized to fit the
largest ifindex, regardless of how many devices actually are actually
needed in the map.

This patch adds a second type of device map where the key is looked up
using a hashmap, instead of being used as an array index. This allows maps
to be densely packed, so they can be smaller.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-07-29 13:50:48 -07:00
Toke Høiland-Jørgensen
fca16e5107 xdp: Refactor devmap allocation code for reuse
The subsequent patch to add a new devmap sub-type can re-use much of the
initialisation and allocation code, so refactor it into separate functions.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-07-29 13:50:48 -07:00
Joel Fernandes (Google)
1caf7d50f4 pidfd: Add warning if exit_state is 0 during notification
Previously a condition got missed where the pidfd waiters are awakened
before the exit_state gets set. This can result in a missed notification
[1] and the polling thread waiting forever.

It is fixed now, however it would be nice to avoid this kind of issue
going unnoticed in the future. So just add a warning to catch it in the
future.

/* References */
[1]: https://lore.kernel.org/lkml/20190717172100.261204-1-joel@joelfernandes.org/

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lore.kernel.org/r/20190724164816.201099-1-joel@joelfernandes.org
Signed-off-by: Christian Brauner <christian@brauner.io>
2019-07-29 17:20:19 +02:00
Nicolin Chen
f46cc01525 dma-contiguous: page-align the size in dma_free_contiguous()
According to the original dma_direct_alloc_pages() code:
{
	unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT;

	if (!dma_release_from_contiguous(dev, page, count))
		__free_pages(page, get_order(size));
}

The count parameter for dma_release_from_contiguous() was page
aligned before the right-shifting operation, while the new API
dma_free_contiguous() forgets to have PAGE_ALIGN() at the size.

So this patch simply adds it to prevent any corner case.

Fixes: fdaeec198ada ("dma-contiguous: add dma_{alloc,free}_contiguous() helpers")
Signed-off-by: Nicolin Chen <nicoleotsuka@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-29 09:50:04 +03:00
Nicolin Chen
c6622a425a dma-contiguous: do not overwrite align in dma_alloc_contiguous()
The dma_alloc_contiguous() limits align at CONFIG_CMA_ALIGNMENT for
cma_alloc() however it does not restore it for the fallback routine.
This will result in a size mismatch between the allocation and free
when running into the fallback routines after cma_alloc() fails, if
the align is larger than CONFIG_CMA_ALIGNMENT.

This patch adds a cma_align to take care of cma_alloc() and prevent
the align from being overwritten.

Fixes: fdaeec198ada ("dma-contiguous: add dma_{alloc,free}_contiguous() helpers")
Reported-by: Dafna Hirschfeld <dafna.hirschfeld@collabora.com>
Signed-off-by: Nicolin Chen <nicoleotsuka@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-29 09:50:04 +03:00
Linus Torvalds
e24ce84e85 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
 "Two fixes for the fair scheduling class:

   - Prevent freeing memory which is accessible by concurrent readers

   - Make the RCU annotations for numa groups consistent"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Use RCU accessors consistently for ->numa_group
  sched/fair: Don't free p->numa_faults with concurrent readers
2019-07-27 21:22:33 -07:00
Linus Torvalds
750991f9af Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
 "A pile of perf related fixes:

  Kernel:
   - Fix SLOTS PEBS event constraints for Icelake CPUs

   - Add the missing mask bit to allow counting hardware generated
     prefetches on L3 for Icelake CPUs

   - Make the test for hypervisor platforms more accurate (as far as
     possible)

   - Handle PMUs correctly which override event->cpu

   - Yet another missing fallthrough annotation

  Tools:
     perf.data:
        - Fix loading of compressed data split across adjacent records
        - Fix buffer size setting for processing CPU topology perf.data
          header.

     perf stat:
        - Fix segfault for event group in repeat mode
        - Always separate "stalled cycles per insn" line, it was being
          appended to the "instructions" line.

     perf script:
        - Fix --max-blocks man page description.
        - Improve man page description of metrics.
        - Fix off by one in brstackinsn IPC computation.

     perf probe:
        - Avoid calling freeing routine multiple times for same pointer.

     perf build:
        - Do not use -Wshadow on gcc < 4.8, avoiding too strict warnings
          treated as errors, breaking the build"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel: Mark expected switch fall-throughs
  perf/core: Fix creating kernel counters for PMUs that override event->cpu
  perf/x86: Apply more accurate check on hypervisor platform
  perf/x86/intel: Fix invalid Bit 13 for Icelake MSR_OFFCORE_RSP_x register
  perf/x86/intel: Fix SLOTS PEBS event constraint
  perf build: Do not use -Wshadow on gcc < 4.8
  perf probe: Avoid calling freeing routine multiple times for same pointer
  perf probe: Set pev->nargs to zero after freeing pev->args entries
  perf session: Fix loading of compressed data split across adjacent records
  perf stat: Always separate stalled cycles per insn
  perf stat: Fix segfault for event group in repeat mode
  perf tools: Fix proper buffer size for feature processing
  perf script: Fix off by one in brstackinsn IPC computation
  perf script: Improve man page description of metrics
  perf script: Fix --max-blocks man page description
2019-07-27 21:17:56 -07:00
Linus Torvalds
431f288ed7 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Thomas Gleixner:
 "A set of locking fixes:

   - Address the fallout of the rwsem rework. Missing ACQUIREs and a
     sanity check to prevent a use-after-free

   - Add missing checks for unitialized mutexes when mutex debugging is
     enabled.

   - Remove the bogus code in the generic SMP variant of
     arch_futex_atomic_op_inuser()

   - Fixup the #ifdeffery in lockdep to prevent compile warnings"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/mutex: Test for initialized mutex
  locking/lockdep: Clean up #ifdef checks
  locking/lockdep: Hide unused 'class' variable
  locking/rwsem: Add ACQUIRE comments
  tty/ldsem, locking/rwsem: Add missing ACQUIRE to read_failed sleep loop
  lcoking/rwsem: Add missing ACQUIRE to read_slowpath sleep loop
  locking/rwsem: Add missing ACQUIRE to read_slowpath exit when queue is empty
  locking/rwsem: Don't call owner_on_cpu() on read-owner
  futex: Cleanup generic SMP variant of arch_futex_atomic_op_inuser()
2019-07-27 21:10:26 -07:00
Daniel Jordan
065cf57713 padata: purge get_cpu and reorder_via_wq from padata_do_serial
With the removal of the padata timer, padata_do_serial no longer
needs special CPU handling, so remove it.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-07-27 21:08:37 +10:00
Herbert Xu
6fc4dbcf02 padata: Replace delayed timer with immediate workqueue in padata_reorder
The function padata_reorder will use a timer when it cannot progress
while completed jobs are outstanding (pd->reorder_objects > 0).  This
is suboptimal as if we do end up using the timer then it would have
introduced a gratuitous delay of one second.

In fact we can easily distinguish between whether completed jobs
are outstanding and whether we can make progress.  All we have to
do is look at the next pqueue list.

This patch does that by replacing pd->processed with pd->cpu so
that the next pqueue is more accessible.

A work queue is used instead of the original try_again to avoid
hogging the CPU.

Note that we don't bother removing the work queue in
padata_flush_queues because the whole premise is broken.  You
cannot flush async crypto requests so it makes no sense to even
try.  A subsequent patch will fix it by replacing it with a ref
counting scheme.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-07-27 21:08:36 +10:00
David S. Miller
28ba934d28 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2019-07-25

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) fix segfault in libbpf, from Andrii.

2) fix gso_segs access, from Eric.

3) tls/sockmap fixes, from Jakub and John.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-25 17:35:03 -07:00
Linus Torvalds
a29a0a467e Merge branch 'access-creds'
The access() (and faccessat()) credentials change can cause an
unnecessary load on the RCU machinery because every access() call ends
up freeing the temporary access credential using RCU.

This isn't really noticeable on small machines, but if you have hundreds
of cores you can cause huge slowdowns due to RCU storms.

It's easy to avoid: the temporary access crededntials aren't actually
normally accessed using RCU at all, so we can avoid the whole issue by
just marking them as such.

* access-creds:
  access: avoid the RCU grace period for the temporary subjective credentials
2019-07-25 08:36:29 -07:00
Qian Cai
a1dc0446d6 sched/core: Silence a warning in sched_init()
Compiling a kernel with both FAIR_GROUP_SCHED=n and RT_GROUP_SCHED=n
will generate a compiler warning:

  kernel/sched/core.c: In function 'sched_init':
  kernel/sched/core.c:5906:32: warning: variable 'ptr' set but not used

It is unnecessary to have both "alloc_size" and "ptr", so just combine
them.

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: valentin.schneider@arm.com
Link: https://lkml.kernel.org/r/20190720012319.884-1-cai@lca.pw
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:55:05 +02:00
Juri Lelli
a07db5c086 sched/core: Fix CPU controller for !RT_GROUP_SCHED
On !CONFIG_RT_GROUP_SCHED configurations it is currently not possible to
move RT tasks between cgroups to which CPU controller has been attached;
but it is oddly possible to first move tasks around and then make them
RT (setschedule to FIFO/RR).

E.g.:

  # mkdir /sys/fs/cgroup/cpu,cpuacct/group1
  # chrt -fp 10 $$
  # echo $$ > /sys/fs/cgroup/cpu,cpuacct/group1/tasks
  bash: echo: write error: Invalid argument
  # chrt -op 0 $$
  # echo $$ > /sys/fs/cgroup/cpu,cpuacct/group1/tasks
  # chrt -fp 10 $$
  # cat /sys/fs/cgroup/cpu,cpuacct/group1/tasks
  2345
  2598
  # chrt -p 2345
  pid 2345's current scheduling policy: SCHED_FIFO
  pid 2345's current scheduling priority: 10

Also, as Michal noted, it is currently not possible to enable CPU
controller on unified hierarchy with !CONFIG_RT_GROUP_SCHED (if there
are any kernel RT threads in root cgroup, they can't be migrated to the
newly created CPU controller's root in cgroup_update_dfl_csses()).

Existing code comes with a comment saying the "we don't support RT-tasks
being in separate groups". Such comment is however stale and belongs to
pre-RT_GROUP_SCHED times. Also, it doesn't make much sense for
!RT_GROUP_ SCHED configurations, since checks related to RT bandwidth
are not performed at all in these cases.

Make moving RT tasks between CPU controller groups viable by removing
special case check for RT (and DEADLINE) tasks.

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: rostedt@goodmis.org
Link: https://lkml.kernel.org/r/20190719063455.27328-1-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:55:05 +02:00
Juri Lelli
710da3c8ea sched/core: Prevent race condition between cpuset and __sched_setscheduler()
No synchronisation mechanism exists between the cpuset subsystem and
calls to function __sched_setscheduler(). As such, it is possible that
new root domains are created on the cpuset side while a deadline
acceptance test is carried out in __sched_setscheduler(), leading to a
potential oversell of CPU bandwidth.

Grab cpuset_rwsem read lock from core scheduler, so to prevent
situations such as the one described above from happening.

The only exception is normalize_rt_tasks() which needs to work under
tasklist_lock and can't therefore grab cpuset_rwsem. We are fine with
this, as this function is only called by sysrq and, if that gets
triggered, DEADLINE guarantees are already gone out of the window
anyway.

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: mathieu.poirier@linaro.org
Cc: rostedt@goodmis.org
Cc: tj@kernel.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-9-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:55:04 +02:00
Juri Lelli
1a763fd7c6 rcu/tree: Call setschedule() gp ktread to SCHED_FIFO outside of atomic region
sched_setscheduler() needs to acquire cpuset_rwsem, but it is currently
called from an invalid (atomic) context by rcu_spawn_gp_kthread().

Fix that by simply moving sched_setscheduler_nocheck() call outside of
the atomic region, as it doesn't actually require to be guarded by
rcu_node lock.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: mathieu.poirier@linaro.org
Cc: rostedt@goodmis.org
Cc: tj@kernel.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-8-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:55:03 +02:00
Juri Lelli
d74b27d63a cgroup/cpuset: Change cpuset_rwsem and hotplug lock order
cpuset_rwsem is going to be acquired from sched_setscheduler() with a
following patch. There are however paths (e.g., spawn_ksoftirqd) in
which sched_scheduler() is eventually called while holding hotplug lock;
this creates a dependecy between hotplug lock (to be always acquired
first) and cpuset_rwsem (to be always acquired after hotplug lock).

Fix paths which currently take the two locks in the wrong order (after
a following patch is applied).

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: mathieu.poirier@linaro.org
Cc: rostedt@goodmis.org
Cc: tj@kernel.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-7-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:55:03 +02:00
Juri Lelli
1243dc518c cgroup/cpuset: Convert cpuset_mutex to percpu_rwsem
Holding cpuset_mutex means that cpusets are stable (only the holder can
make changes) and this is required for fixing a synchronization issue
between cpusets and scheduler core.  However, grabbing cpuset_mutex from
setscheduler() hotpath (as implemented in a later patch) is a no-go, as
it would create a bottleneck for tasks concurrently calling
setscheduler().

Convert cpuset_mutex to be a percpu_rwsem (cpuset_rwsem), so that
setscheduler() will then be able to read lock it and avoid concurrency
issues.

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: mathieu.poirier@linaro.org
Cc: rostedt@goodmis.org
Cc: tj@kernel.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-6-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:55:02 +02:00
Juri Lelli
59d06cea11 sched/deadline: Fix bandwidth accounting at all levels after offline migration
If a task happens to be throttled while the CPU it was running on gets
hotplugged off, the bandwidth associated with the task is not correctly
migrated with it when the replenishment timer fires (offline_migration).

Fix things up, for this_bw, running_bw and total_bw, when replenishment
timer fires and task is migrated (dl_task_offline_migration()).

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: mathieu.poirier@linaro.org
Cc: rostedt@goodmis.org
Cc: tj@kernel.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-5-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:55:02 +02:00
Mathieu Poirier
f9a25f776d cpusets: Rebuild root domain deadline accounting information
When the topology of root domains is modified by CPUset or CPUhotplug
operations information about the current deadline bandwidth held in the
root domain is lost.

This patch addresses the issue by recalculating the lost deadline
bandwidth information by circling through the deadline tasks held in
CPUsets and adding their current load to the root domain they are
associated with.

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
[ Various additional modifications. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: rostedt@goodmis.org
Cc: tj@kernel.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-4-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:55:01 +02:00
Mathieu Poirier
4b211f2b12 sched/core: Streamle calls to task_rq_unlock()
Calls to task_rq_unlock() are done several times in the
__sched_setscheduler() function.  This is fine when only the rq lock needs to be
handled but not so much when other locks come into play.

This patch streamlines the release of the rq lock so that only one
location need to be modified when dealing with more than one lock.

No change of functionality is introduced by this patch.

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-3-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:51:57 +02:00
Mathieu Poirier
c22645f4c8 sched/topology: Add partition_sched_domains_locked()
Introduce the partition_sched_domains_locked() function by taking
the mutex locking code out of the original function.  That way
the work done by partition_sched_domains_locked() can be reused
without dropping the mutex lock.

No change of functionality is introduced by this patch.

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: rostedt@goodmis.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-2-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:51:57 +02:00
Viresh Kumar
60e17f5cef sched/fair: Introduce fits_capacity()
The same formula to check utilization against capacity (after
considering capacity_margin) is already used at 5 different locations.

This patch creates a new macro, fits_capacity(), which can be used from
all these locations without exposing the details of it and hence
simplify code.

All the 5 code locations are updated as well to use it..

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/b477ac75a2b163048bdaeb37f57b4c3f04f75a31.1559631700.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:51:56 +02:00
Wanpeng Li
e0e8d4911e sched/isolation: Prefer housekeeping CPU in local node
In real product setup, there will be houseeking CPUs in each nodes, it
is prefer to do housekeeping from local node, fallback to global online
cpumask if failed to find houseeking CPU from local node.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1561711901-4755-2-git-send-email-wanpengli@tencent.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:51:55 +02:00
Yi Wang
65d74e9169 sched/stats: Fix unlikely() use of sched_info_on()
sched_info_on() is called with unlikely hint, however, the test
is to be a constant(1) on which compiler will do nothing when
make defconfig, so remove the hint.

Also, fix a lack of {}.

Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: up2wing@gmail.com
Cc: wang.liang82@zte.com.cn
Cc: xue.zhihong@zte.com.cn
Link: https://lkml.kernel.org/r/1562301307-43002-1-git-send-email-wang.yi59@zte.com.cn
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:51:55 +02:00
Matthew Wilcox (Oracle)
7b3c92b85a sched/core: Convert get_task_struct() to return the task
Returning the pointer that was passed in allows us to write
slightly more idiomatic code.  Convert a few users.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190704221323.24290-1-willy@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:51:54 +02:00
Viresh Kumar
3c29e651e1 sched/fair: Fall back to sched-idle CPU if idle CPU isn't found
We try to find an idle CPU to run the next task, but in case we don't
find an idle CPU it is better to pick a CPU which will run the task the
soonest, for performance reason.

A CPU which isn't idle but has only SCHED_IDLE activity queued on it
should be a good target based on this criteria as any normal fair task
will most likely preempt the currently running SCHED_IDLE task
immediately. In fact, choosing a SCHED_IDLE CPU over a fully idle one
shall give better results as it should be able to run the task sooner
than an idle CPU (which requires to be woken up from an idle state).

This patch updates both fast and slow paths with this optimization.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: chris.redpath@arm.com
Cc: quentin.perret@linaro.org
Cc: songliubraving@fb.com
Cc: steven.sistare@oracle.com
Cc: subhra.mazumdar@oracle.com
Cc: tkjos@google.com
Link: https://lkml.kernel.org/r/eeafa25fdeb6f6edd5b2da716bc8f0ba7708cbcf.1561523542.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:51:54 +02:00
Viresh Kumar
43e9f7f231 sched/fair: Start tracking SCHED_IDLE tasks count in cfs_rq
Track how many tasks are present with SCHED_IDLE policy in each cfs_rq.
This will be used by later commits.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: chris.redpath@arm.com
Cc: quentin.perret@linaro.org
Cc: songliubraving@fb.com
Cc: steven.sistare@oracle.com
Cc: subhra.mazumdar@oracle.com
Cc: tkjos@google.com
Link: https://lkml.kernel.org/r/0d3cdc427fc68808ad5bccc40e86ed0bf9da8bb4.1561523542.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:51:53 +02:00
Paul E. McKenney
84ec3a0787 time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint

The TASKS03 and TREE04 rcutorture scenarios produce the following
lockdep complaint:

	WARNING: inconsistent lock state
	5.2.0-rc1+ #513 Not tainted
	--------------------------------
	inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
	migration/1/14 [HC0[0]:SC0[0]:HE1:SE1] takes:
	(____ptrval____) (tick_broadcast_lock){?...}, at: tick_broadcast_offline+0xf/0x70
	{IN-HARDIRQ-W} state was registered at:
	  lock_acquire+0xb0/0x1c0
	  _raw_spin_lock_irqsave+0x3c/0x50
	  tick_broadcast_switch_to_oneshot+0xd/0x40
	  tick_switch_to_oneshot+0x4f/0xd0
	  hrtimer_run_queues+0xf3/0x130
	  run_local_timers+0x1c/0x50
	  update_process_times+0x1c/0x50
	  tick_periodic+0x26/0xc0
	  tick_handle_periodic+0x1a/0x60
	  smp_apic_timer_interrupt+0x80/0x2a0
	  apic_timer_interrupt+0xf/0x20
	  _raw_spin_unlock_irqrestore+0x4e/0x60
	  rcu_nocb_gp_kthread+0x15d/0x590
	  kthread+0xf3/0x130
	  ret_from_fork+0x3a/0x50
	irq event stamp: 171
	hardirqs last  enabled at (171): [<ffffffff8a201a37>] trace_hardirqs_on_thunk+0x1a/0x1c
	hardirqs last disabled at (170): [<ffffffff8a201a53>] trace_hardirqs_off_thunk+0x1a/0x1c
	softirqs last  enabled at (0): [<ffffffff8a264ee0>] copy_process.part.56+0x650/0x1cb0
	softirqs last disabled at (0): [<0000000000000000>] 0x0

        [...]

To reproduce, run the following rcutorture test:

 $ tools/testing/selftests/rcutorture/bin/kvm.sh --duration 5 --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y" --configs "TASKS03 TREE04"

It turns out that tick_broadcast_offline() was an innocent bystander.
After all, interrupts are supposed to be disabled throughout
take_cpu_down(), and therefore should have been disabled upon entry to
tick_offline_cpu() and thus to tick_broadcast_offline().  This suggests
that one of the CPU-hotplug notifiers was incorrectly enabling interrupts,
and leaving them enabled on return.

Some debugging code showed that the culprit was sched_cpu_dying().
It had irqs enabled after return from sched_tick_stop().  Which in turn
had irqs enabled after return from cancel_delayed_work_sync().  Which is a
wrapper around __cancel_work_timer().  Which can sleep in the case where
something else is concurrently trying to cancel the same delayed work,
and as Thomas Gleixner pointed out on IRC, sleeping is a decidedly bad
idea when you are invoked from take_cpu_down(), regardless of the state
you leave interrupts in upon return.

Code inspection located no reason why the delayed work absolutely
needed to be canceled from sched_tick_stop():  The work is not
bound to the outgoing CPU by design, given that the whole point is
to collect statistics without disturbing the outgoing CPU.

This commit therefore simply drops the cancel_delayed_work_sync() from
sched_tick_stop().  Instead, a new ->state field is added to the tick_work
structure so that the delayed-work handler function sched_tick_remote()
can avoid reposting itself.  A cpu_is_offline() check is also added to
sched_tick_remote() to avoid mucking with the state of an offlined CPU
(though it does appear safe to do so).  The sched_tick_start() and
sched_tick_stop() functions also update ->state, and sched_tick_start()
also schedules the delayed work if ->state indicates that it is not
already in flight.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
[ paulmck: Apply Peter Zijlstra and Frederic Weisbecker atomics feedback. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190625165238.GJ26519@linux.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:51:53 +02:00
Vincent Guittot
f6cad8df6b sched/fair: Fix imbalance due to CPU affinity
The load_balance() has a dedicated mecanism to detect when an imbalance
is due to CPU affinity and must be handled at parent level. In this case,
the imbalance field of the parent's sched_group is set.

The description of sg_imbalanced() gives a typical example of two groups
of 4 CPUs each and 4 tasks each with a cpumask covering 1 CPU of the first
group and 3 CPUs of the second group. Something like:

	{ 0 1 2 3 } { 4 5 6 7 }
	        *     * * *

But the load_balance fails to fix this UC on my octo cores system
made of 2 clusters of quad cores.

Whereas the load_balance is able to detect that the imbalanced is due to
CPU affinity, it fails to fix it because the imbalance field is cleared
before letting parent level a chance to run. In fact, when the imbalance is
detected, the load_balance reruns without the CPU with pinned tasks. But
there is no other running tasks in the situation described above and
everything looks balanced this time so the imbalance field is immediately
cleared.

The imbalance field should not be cleared if there is no other task to move
when the imbalance is detected.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1561996022-28829-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:51:52 +02:00
Valentin Schneider
9434f9f5d1 sched/fair: Change task_numa_work() storage to static
There are no callers outside of fair.c.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: mgorman@suse.de
Cc: riel@surriel.com
Link: https://lkml.kernel.org/r/20190715102508.32434-4-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:51:52 +02:00
Valentin Schneider
b34920d4ce sched/fair: Move task_numa_work() init to init_numa_balancing()
We only need to set the callback_head worker function once, do it
during sched_fork().

While at it, move the comment regarding double task_work addition to
init_numa_balancing(), since the double add sentinel is first set there.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: mgorman@suse.de
Cc: riel@surriel.com
Link: https://lkml.kernel.org/r/20190715102508.32434-3-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:51:51 +02:00
Valentin Schneider
d35927a144 sched/fair: Move init_numa_balancing() below task_numa_work()
To reference task_numa_work() from within init_numa_balancing(), we
need the former to be declared before the latter. Do just that.

This is a pure code movement.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: mgorman@suse.de
Cc: riel@surriel.com
Link: https://lkml.kernel.org/r/20190715102508.32434-2-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:51:51 +02:00
Thomas Gleixner
0c09ab96fc cpu/hotplug: Cache number of online CPUs
Re-evaluating the bitmap wheight of the online cpus bitmap in every
invocation of num_online_cpus() over and over is a pretty useless
exercise. Especially when num_online_cpus() is used in code paths
like the IPI delivery of x86 or the membarrier code.

Cache the number of online CPUs in the core and just return the cached
variable. The accessor function provides only a snapshot when used without
protection against concurrent CPU hotplug.

The storage needs to use an atomic_t because the kexec and reboot code
(ab)use set_cpu_online() in their 'shutdown' handlers without any form of
serialization as pointed out by Mathieu. Regular CPU hotplug usage is
properly serialized.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1907091622590.1634@nanos.tec.linutronix.de
2019-07-25 15:48:01 +02:00
Thomas Gleixner
e797bda3fd smp/hotplug: Track booted once CPUs in a cpumask
The booted once information which is required to deal with the MCE
broadcast issue on X86 correctly is stored in the per cpu hotplug state,
which is perfectly fine for the intended purpose.

X86 needs that information for supporting NMI broadcasting via shortcuts,
but retrieving it from per cpu data is cumbersome.

Move it to a cpumask so the information can be checked against the
cpu_present_mask quickly.

No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190722105219.818822855@linutronix.de
2019-07-25 15:47:37 +02:00
Bart Van Assche
8c779229d0 locking/lockdep: Report more stack trace statistics
Report the number of stack traces and the number of stack trace hash
chains. These two numbers are useful because these allow to estimate
the number of stack trace hash collisions.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Link: https://lkml.kernel.org/r/20190722182443.216015-5-bvanassche@acm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:43:28 +02:00
Bart Van Assche
12593b7467 locking/lockdep: Reduce space occupied by stack traces
Although commit 669de8bda8 ("kernel/workqueue: Use dynamic lockdep keys
for workqueues") unregisters dynamic lockdep keys when a workqueue is
destroyed, a side effect of that commit is that all stack traces
associated with the lockdep key are leaked when a workqueue is destroyed.
Fix this by storing each unique stack trace once. Other changes in this
patch are:

- Use NULL instead of { .nr_entries = 0 } to represent 'no trace'.
- Store a pointer to a stack trace in struct lock_class and struct
  lock_list instead of storing 'nr_entries' and 'offset'.

This patch avoids that the following program triggers the "BUG:
MAX_STACK_TRACE_ENTRIES too low!" complaint:

	#include <fcntl.h>
	#include <unistd.h>

	int main()
	{
		for (;;) {
			int fd = open("/dev/infiniband/rdma_cm", O_RDWR);
			close(fd);
		}
	}

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Reported-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yuyang Du <duyuyang@gmail.com>
Link: https://lkml.kernel.org/r/20190722182443.216015-4-bvanassche@acm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:43:27 +02:00
Bart Van Assche
a297042164 stacktrace: Constify 'entries' arguments
Make it clear to humans and to the compiler that the stack trace
('entries') arguments are not modified.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Link: https://lkml.kernel.org/r/20190722182443.216015-3-bvanassche@acm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:43:26 +02:00
Bart Van Assche
364f6afc4f locking/lockdep: Make it clear that what lock_class::key points at is not modified
This patch does not change the behavior of the lockdep code.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Link: https://lkml.kernel.org/r/20190722182443.216015-2-bvanassche@acm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:43:26 +02:00
Leonard Crestez
4ce54af8b3 perf/core: Fix creating kernel counters for PMUs that override event->cpu
Some hardware PMU drivers will override perf_event.cpu inside their
event_init callback. This causes a lockdep splat when initialized through
the kernel API:

 WARNING: CPU: 0 PID: 250 at kernel/events/core.c:2917 ctx_sched_out+0x78/0x208
 pc : ctx_sched_out+0x78/0x208
 Call trace:
  ctx_sched_out+0x78/0x208
  __perf_install_in_context+0x160/0x248
  remote_function+0x58/0x68
  generic_exec_single+0x100/0x180
  smp_call_function_single+0x174/0x1b8
  perf_install_in_context+0x178/0x188
  perf_event_create_kernel_counter+0x118/0x160

Fix this by calling perf_install_in_context with event->cpu, just like
perf_event_open

Signed-off-by: Leonard Crestez <leonard.crestez@nxp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Frank Li <Frank.li@nxp.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Link: https://lkml.kernel.org/r/c4ebe0503623066896d7046def4d6b1e06e0eb2e.1563972056.git.leonard.crestez@nxp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:41:31 +02:00
Sebastian Andrzej Siewior
6c11c6e3d5 locking/mutex: Test for initialized mutex
An uninitialized/ zeroed mutex will go unnoticed because there is no
check for it. There is a magic check in the unlock's slowpath path which
might go unnoticed if the unlock happens in the fastpath.

Add a ->magic check early in the mutex_lock() and mutex_trylock() path.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190703092125.lsdf4gpsh2plhavb@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:39:27 +02:00
Arnd Bergmann
30a35f79fa locking/lockdep: Clean up #ifdef checks
As Will Deacon points out, CONFIG_PROVE_LOCKING implies TRACE_IRQFLAGS,
so the conditions I added in the previous patch, and some others in the
same file can be simplified by only checking for the former.

No functional change.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Yuyang Du <duyuyang@gmail.com>
Fixes: 886532aee3 ("locking/lockdep: Move mark_lock() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING")
Link: https://lkml.kernel.org/r/20190628102919.2345242-1-arnd@arndb.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:39:26 +02:00
Arnd Bergmann
68037aa782 locking/lockdep: Hide unused 'class' variable
The usage is now hidden in an #ifdef, so we need to move
the variable itself in there as well to avoid this warning:

  kernel/locking/lockdep_proc.c:203:21: error: unused variable 'class' [-Werror,-Wunused-variable]

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yuyang Du <duyuyang@gmail.com>
Cc: frederic@kernel.org
Fixes: 68d41d8c94 ("locking/lockdep: Fix lock used or unused stats error")
Link: https://lkml.kernel.org/r/20190715092809.736834-1-arnd@arndb.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:39:25 +02:00
Peter Zijlstra
6ffddfb9e1 locking/rwsem: Add ACQUIRE comments
Since we just reviewed read_slowpath for ACQUIRE correctness, add a
few coments to retain our findings.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:39:25 +02:00
Peter Zijlstra
99143f82a2 lcoking/rwsem: Add missing ACQUIRE to read_slowpath sleep loop
While reviewing another read_slowpath patch, both Will and I noticed
another missing ACQUIRE, namely:

  X = 0;

  CPU0			CPU1

  rwsem_down_read()
    for (;;) {
      set_current_state(TASK_UNINTERRUPTIBLE);

                        X = 1;
                        rwsem_up_write();
                          rwsem_mark_wake()
                            atomic_long_add(adjustment, &sem->count);
                            smp_store_release(&waiter->task, NULL);

      if (!waiter.task)
        break;

      ...
    }

  r = X;

Allows 'r == 0'.

Reported-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:39:24 +02:00
Jan Stancek
e1b98fa316 locking/rwsem: Add missing ACQUIRE to read_slowpath exit when queue is empty
LTP mtest06 has been observed to occasionally hit "still mapped when
deleted" and following BUG_ON on arm64.

The extra mapcount originated from pagefault handler, which handled
pagefault for vma that has already been detached. vma is detached
under mmap_sem write lock by detach_vmas_to_be_unmapped(), which
also invalidates vmacache.

When the pagefault handler (under mmap_sem read lock) calls
find_vma(), vmacache_valid() wrongly reports vmacache as valid.

After rwsem down_read() returns via 'queue empty' path (as of v5.2),
it does so without an ACQUIRE on sem->count:

  down_read()
    __down_read()
      rwsem_down_read_failed()
        __rwsem_down_read_failed_common()
          raw_spin_lock_irq(&sem->wait_lock);
          if (list_empty(&sem->wait_list)) {
            if (atomic_long_read(&sem->count) >= 0) {
              raw_spin_unlock_irq(&sem->wait_lock);
              return sem;

The problem can be reproduced by running LTP mtest06 in a loop and
building the kernel (-j $NCPUS) in parallel. It does reproduces since
v4.20 on arm64 HPE Apollo 70 (224 CPUs, 256GB RAM, 2 nodes). It
triggers reliably in about an hour.

The patched kernel ran fine for 10+ hours.

Signed-off-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Will Deacon <will@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dbueso@suse.de
Fixes: 4b486b535c ("locking/rwsem: Exit read lock slowpath if queue empty & no writer")
Link: https://lkml.kernel.org/r/50b8914e20d1d62bb2dee42d342836c2c16ebee7.1563438048.git.jstancek@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:39:23 +02:00
Waiman Long
7813430057 locking/rwsem: Don't call owner_on_cpu() on read-owner
For writer, the owner value is cleared on unlock. For reader, it is
left intact on unlock for providing better debugging aid on crash dump
and the unlock of one reader may not mean the lock is free.

As a result, the owner_on_cpu() shouldn't be used on read-owner
as the task pointer value may not be valid and it might have
been freed. That is the case in rwsem_spin_on_owner(), but not in
rwsem_can_spin_on_owner(). This can lead to use-after-free error from
KASAN. For example,

  BUG: KASAN: use-after-free in rwsem_down_write_slowpath
  (/home/miguel/kernel/linux/kernel/locking/rwsem.c:669
  /home/miguel/kernel/linux/kernel/locking/rwsem.c:1125)

Fix this by checking for RWSEM_READER_OWNED flag before calling
owner_on_cpu().

Reported-by: Luis Henriques <lhenriques@suse.com>
Tested-by: Luis Henriques <lhenriques@suse.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Fixes: 94a9717b3c ("locking/rwsem: Make rwsem->owner an atomic_long_t")
Link: https://lkml.kernel.org/r/81e82d5b-5074-77e8-7204-28479bbe0df0@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:39:22 +02:00
Jann Horn
cb361d8cde sched/fair: Use RCU accessors consistently for ->numa_group
The old code used RCU annotations and accessors inconsistently for
->numa_group, which can lead to use-after-frees and NULL dereferences.

Let all accesses to ->numa_group use proper RCU helpers to prevent such
issues.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Fixes: 8c8a743c50 ("sched/numa: Use {cpu, pid} to create task groups for shared faults")
Link: https://lkml.kernel.org/r/20190716152047.14424-3-jannh@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:37:05 +02:00
Jann Horn
16d51a590a sched/fair: Don't free p->numa_faults with concurrent readers
When going through execve(), zero out the NUMA fault statistics instead of
freeing them.

During execve, the task is reachable through procfs and the scheduler. A
concurrent /proc/*/sched reader can read data from a freed ->numa_faults
allocation (confirmed by KASAN) and write it back to userspace.
I believe that it would also be possible for a use-after-free read to occur
through a race between a NUMA fault and execve(): task_numa_fault() can
lead to task_numa_compare(), which invokes task_weight() on the currently
running task of a different CPU.

Another way to fix this would be to make ->numa_faults RCU-managed or add
extra locking, but it seems easier to wipe the NUMA fault statistics on
execve.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Fixes: 82727018b0 ("sched/numa: Call task_numa_free() from do_execve()")
Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:37:04 +02:00
Linus Torvalds
d7852fbd0f access: avoid the RCU grace period for the temporary subjective credentials
It turns out that 'access()' (and 'faccessat()') can cause a lot of RCU
work because it installs a temporary credential that gets allocated and
freed for each system call.

The allocation and freeing overhead is mostly benign, but because
credentials can be accessed under the RCU read lock, the freeing
involves a RCU grace period.

Which is not a huge deal normally, but if you have a lot of access()
calls, this causes a fair amount of seconday damage: instead of having a
nice alloc/free patterns that hits in hot per-CPU slab caches, you have
all those delayed free's, and on big machines with hundreds of cores,
the RCU overhead can end up being enormous.

But it turns out that all of this is entirely unnecessary.  Exactly
because access() only installs the credential as the thread-local
subjective credential, the temporary cred pointer doesn't actually need
to be RCU free'd at all.  Once we're done using it, we can just free it
synchronously and avoid all the RCU overhead.

So add a 'non_rcu' flag to 'struct cred', which can be set by users that
know they only use it in non-RCU context (there are other potential
users for this).  We can make it a union with the rcu freeing list head
that we need for the RCU case, so this doesn't need any extra storage.

Note that this also makes 'get_current_cred()' clear the new non_rcu
flag, in case we have filesystems that take a long-term reference to the
cred and then expect the RCU delayed freeing afterwards.  It's not
entirely clear that this is required, but it makes for clear semantics:
the subjective cred remains non-RCU as long as you only access it
synchronously using the thread-local accessors, but you _can_ use it as
a generic cred if you want to.

It is possible that we should just remove the whole RCU markings for
->cred entirely.  Only ->real_cred is really supposed to be accessed
through RCU, and the long-term cred copies that nfs uses might want to
explicitly re-enable RCU freeing if required, rather than have
get_current_cred() do it implicitly.

But this is a "minimal semantic changes" change for the immediate
problem.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jan Glauber <jglauber@marvell.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Jayachandran Chandrasekharan Nair <jnair@marvell.com>
Cc: Greg KH <greg@kroah.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-24 10:12:09 -07:00
Christoph Hellwig
66d7780f18 dma-mapping: check pfn validity in dma_common_{mmap,get_sgtable}
Check that the pfn returned from arch_dma_coherent_to_pfn refers to
a valid page and reject the mmap / get_sgtable requests otherwise.

Based on the arm implementation of the mmap and get_sgtable methods.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Vignesh Raghavendra <vigneshr@ti.com>
2019-07-24 17:28:54 +02:00
Peng Wang
a581563f1b cgroup: minor tweak for logic to get cgroup css
We could only handle the case that css exists
and css_try_get_online() fails.

Signed-off-by: Peng Wang <rocking@whu.edu.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-07-23 15:46:33 -07:00
Markus Elfring
85db002337 cgroup: Replace a seq_printf() call by seq_puts() in cgroup_print_ss_mask()
A string which did not contain a data format specification should be put
into a sequence. Thus use the corresponding function “seq_puts”.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-07-23 15:44:00 -07:00
Ilya Leoshkevich
d9b8aadaff bpf: fix narrower loads on s390
The very first check in test_pkt_md_access is failing on s390, which
happens because loading a part of a struct __sk_buff field produces
an incorrect result.

The preprocessed code of the check is:

{
	__u8 tmp = *((volatile __u8 *)&skb->len +
		((sizeof(skb->len) - sizeof(__u8)) / sizeof(__u8)));
	if (tmp != ((*(volatile __u32 *)&skb->len) & 0xFF)) return 2;
};

clang generates the following code for it:

      0:	71 21 00 03 00 00 00 00	r2 = *(u8 *)(r1 + 3)
      1:	61 31 00 00 00 00 00 00	r3 = *(u32 *)(r1 + 0)
      2:	57 30 00 00 00 00 00 ff	r3 &= 255
      3:	5d 23 00 1d 00 00 00 00	if r2 != r3 goto +29 <LBB0_10>

Finally, verifier transforms it to:

  0: (61) r2 = *(u32 *)(r1 +104)
  1: (bc) w2 = w2
  2: (74) w2 >>= 24
  3: (bc) w2 = w2
  4: (54) w2 &= 255
  5: (bc) w2 = w2

The problem is that when verifier emits the code to replace a partial
load of a struct __sk_buff field (*(u8 *)(r1 + 3)) with a full load of
struct sk_buff field (*(u32 *)(r1 + 104)), an optional shift and a
bitwise AND, it assumes that the machine is little endian and
incorrectly decides to use a shift.

Adjust shift count calculation to account for endianness.

Fixes: 31fd85816d ("bpf: permits narrower load from bpf program context fields")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-07-23 13:59:33 -07:00
Rafael J. Wysocki
8eb0fd3b55 PM: sleep: Integrate suspend-to-idle with generig suspend flow
After previous changes the suspend-to-idle code flow can be
integrated more tightly with the generic system suspend code flow
by making suspend_enter() call s2idle_loop() later and removing
the direct invocations of dpm_noirq_begin(),
dpm_noirq_suspend_devices(), dpm_noirq_end(), and
dpm_noirq_resume_devices() from the latter, so do that.

This change is not expected to alter functionality.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
2019-07-23 09:46:50 +02:00
Rafael J. Wysocki
56b9918490 PM: sleep: Simplify suspend-to-idle control flow
After commit 33e4f80ee6 ("ACPI / PM: Ignore spurious SCI wakeups
from suspend-to-idle") the "noirq" phases of device suspend and
resume may run for multiple times during suspend-to-idle, if there
are spurious system wakeup events while suspended.  However, this
is complicated and fragile and actually unnecessary.

The main reason for doing this is that on some systems the EC may
signal system wakeup events (power button events, for example) as
well as events that should not cause the system to resume (spurious
system wakeup events).  Thus, in order to determine whether or not
a given event signaled by the EC while suspended is a proper system
wakeup one, the EC GPE needs to be dispatched and to start with that
was achieved by allowing the ACPI SCI action handler to run, which
was only possible after calling resume_device_irqs().

However, dispatching the EC GPE this way turned out to take too much
time in some cases and some EC events might be missed due to that, so
commit 68e2201185 ("ACPI: EC: Dispatch the EC GPE directly on
s2idle wake") started to dispatch the EC GPE right after a wakeup
event has been detected, so in fact the full ACPI SCI action handler
doesn't need to run any more to deal with the wakeups coming from the
EC.

Use this observation to simplify the suspend-to-idle control flow
so that the "noirq" phases of device suspend and resume are each
run only once in every suspend-to-idle cycle, which is reported to
significantly reduce power drawn by some systems when suspended to
idle (by allowing them to reach a deep platform-wide low-power state
through the suspend-to-idle flow).  [What appears to happen is that
the "noirq" resume of devices after a spurious EC wakeup brings some
devices into a state in which they prevent the platform from reaching
the deep low-power state going forward, even after a subsequent
"noirq" suspend phase, and on some systems the EC triggers such
wakeups already when the "noirq" suspend of devices is running for
the first time in the given suspend/resume cycle, so the platform
cannot reach the deep low-power state at all.]

First, make acpi_s2idle_wake() use the acpi_ec_dispatch_gpe() return
value to determine whether or not the wakeup may have been triggered
by the EC (in which case the system wakeup is canceled and ACPI
events are processed in order to determine whether or not the event
is a proper system wakeup one) and use rearm_wake_irq() (introduced
by a previous change) in it to rearm the ACPI SCI for system wakeup
detection in case the system will remain suspended.

Second, drop acpi_s2idle_sync(), which is not needed any more, and
the corresponding global platform suspend-to-idle callback.

Next, drop the pm_wakeup_pending() check (which is an optimization
only) from __device_suspend_noirq() to prevent it from returning
errors on system wakeups occurring before the "noirq" phase of
device suspend is complete (as in the case of suspend-to-idle it is
not known whether or not these wakeups are suprious at that point),
in order to avoid having to carry out a "noirq" resume of devices
on a spurious system wakeup.

Finally, change the code flow in s2idle_loop() to (1) run the
"noirq" suspend of devices once before starting the loop, (2) check
for spurious EC wakeups (via the platform ->wake callback) for the
first time before calling s2idle_enter(), and (3) run the "noirq"
resume of devices once after leaving the loop.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
2019-07-23 09:46:40 +02:00
Rafael J. Wysocki
3a79bc63d9 PCI: irq: Introduce rearm_wake_irq()
Introduce a new function, rearm_wake_irq(), allowing a wakeup IRQ
to be armed for systen wakeup detection again without running any
action handlers associated with it after it has been armed for
wakeup detection and triggered.

That is useful for IRQs, like ACPI SCI, that may deliver wakeup
as well as non-wakeup interrupts when armed for systen wakeup
detection.  In those cases, it may be possible to determine whether
or not the delivered interrupt is a systen wakeup one without
running the entire action handler (or handlers, if the IRQ is
shared) for the IRQ, and if the interrupt turns out to be a
non-wakeup one, the IRQ can be rearmed with the help of the
new function.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
2019-07-23 09:37:51 +02:00
Linus Torvalds
7b5cf701ea Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull preemption Kconfig fix from Thomas Gleixner:
 "The PREEMPT_RT stub config renamed PREEMPT to PREEMPT_LL and defined
  PREEMPT outside of the menu and made it selectable by both PREEMPT_LL
  and PREEMPT_RT.

  Stupid me missed that 114 defconfigs select CONFIG_PREEMPT which
  obviously can't work anymore. oldconfig builds are affected as well,
  but it's more obvious as the user gets asked. [old]defconfig silently
  fixes it up and selects PREEMPT_NONE.

  Unbreak it by undoing the rename and adding a intermediate config
  symbol which is selected by both PREEMPT and PREEMPT_RT. That requires
  to chase down a few #ifdefs, but it's better than tweaking 114
  defconfigs and annoying users"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/rt, Kconfig: Unbreak def/oldconfig with CONFIG_PREEMPT=y
2019-07-22 09:30:34 -07:00
Thomas Gleixner
b8d3349803 sched/rt, Kconfig: Unbreak def/oldconfig with CONFIG_PREEMPT=y
The merge of the CONFIG_PREEMPT_RT stub renamed CONFIG_PREEMPT to
CONFIG_PREEMPT_LL which causes all defconfigs which have CONFIG_PREEMPT=y
set to fall back to CONFIG_PREEMPT_NONE because CONFIG_PREEMPT depends on
the preemption mode choice wich defaults to NONE. This also affects
oldconfig builds.

So rather than changing 114 defconfig files and being an annoyance to
users, revert the rename and select a new config symbol PREEMPTION. That
keeps everything working smoothly and the revelant ifdef's are going to be
fixed up step by step.

Reported-by: Mark Rutland <mark.rutland@arm.com>
Fixes: a50a3f4b6a ("sched/rt, Kconfig: Introduce CONFIG_PREEMPT_RT")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2019-07-22 18:05:11 +02:00
Suren Baghdasaryan
b191d6491b
pidfd: fix a poll race when setting exit_state
There is a race between reading task->exit_state in pidfd_poll and
writing it after do_notify_parent calls do_notify_pidfd. Expected
sequence of events is:

CPU 0                            CPU 1
------------------------------------------------
exit_notify
  do_notify_parent
    do_notify_pidfd
  tsk->exit_state = EXIT_DEAD
                                  pidfd_poll
                                     if (tsk->exit_state)

However nothing prevents the following sequence:

CPU 0                            CPU 1
------------------------------------------------
exit_notify
  do_notify_parent
    do_notify_pidfd
                                   pidfd_poll
                                      if (tsk->exit_state)
  tsk->exit_state = EXIT_DEAD

This causes a polling task to wait forever, since poll blocks because
exit_state is 0 and the waiting task is not notified again. A stress
test continuously doing pidfd poll and process exits uncovered this bug.

To fix it, we make sure that the task's exit_state is always set before
calling do_notify_pidfd.

Fixes: b53b0b9d9a ("pidfd: add polling support")
Cc: kernel-team@android.com
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lore.kernel.org/r/20190717172100.261204-1-joel@joelfernandes.org
[christian@brauner.io: adapt commit message and drop unneeded changes from wait_task_zombie]
Signed-off-by: Christian Brauner <christian@brauner.io>
2019-07-22 16:02:03 +02:00
Dave Hansen
f240652b60 x86/mpx: Remove MPX APIs
MPX is being removed from the kernel due to a lack of support in the
toolchain going forward (gcc).

The first step is to remove the userspace-visible ABIs so that applications
will stop using it.  The most visible one are the enable/disable prctl()s.
Remove them first.

This is the most minimal and least invasive change needed to ensure that
apps stop using MPX with new kernels.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190705175321.DB42F0AD@viggo.jf.intel.com
2019-07-22 11:54:57 +02:00
Linus Torvalds
ac60602a6d dma-mapping fixes for 5.3-rc1
Fix various regressions:
 
  - force unencrypted dma-coherent buffers if encryption bit can't fit
    into the dma coherent mask (Tom Lendacky)
  - avoid limiting request size if swiotlb is not used (me)
  - fix swiotlb handling in dma_direct_sync_sg_for_cpu/device
    (Fugang Duan)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl0zTvELHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYMAsQ/6AleklMsMbc1xPsYYMukmjAOUNf+nvsFG4PRs/KVn
 1/Yohkxx/FN3oXZ+zZEnyd8a5u0ghwkN1WDivEhpclzbDuQP+Z+jEDmb37Oea4aJ
 L6XRLQJYiFwwEA6oJ87FNVMZXK/QUo+/lnDvJg0xNW6+HiR4GAUmnqy+/KyEIRSf
 SX+aiUOX4tUkwHPWyMaWvTlZ4hZgSovXwkUnR08jCwyJFezUwJBr/Yf5G6M1C10B
 hPFTrREhaekXgFd5E1dwKNk5omvfihxGyVUujFZhtMvs//LP8GcFLcVtYRWM/SUZ
 XpKkXxnaRC0gEm2P4/tSEGL3xl1CST/oYde74KNBQDIe0svGFS0QrP68+4zu/1ih
 vaf2gHoCoJciFY2DHglw1OG/gMWW06OtdseOKe9LZXtsGA6HCVBZW4c01V5YHVQT
 TMQMr0UyxJzmrxCo+LafAf9DoQxIii8WapewomwceL0TUtIDIujirzC/ieLhNPKL
 L2Fk+zPtFL24IpVe52S1PngatlW4MioiyiJji1QM0RK1V68+r/nSKPBxeq9s+jR3
 CfGvfhfRDd/NbZ9m66YFUaRzHL6Fpi2hMvJc9O6dgcVEYEBrL0d8J9nH42cqOlfe
 OBGeCxnFNQMuBp4Tw1OZO9PjzR3+pQOb32pOWLDUUs9ed3gtdMrJYTKhw9/cLpyp
 838=
 =Bv+Q
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.3-1' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fixes from Christoph Hellwig:
 "Fix various regressions:

   - force unencrypted dma-coherent buffers if encryption bit can't fit
     into the dma coherent mask (Tom Lendacky)

   - avoid limiting request size if swiotlb is not used (me)

   - fix swiotlb handling in dma_direct_sync_sg_for_cpu/device (Fugang
     Duan)"

* tag 'dma-mapping-5.3-1' of git://git.infradead.org/users/hch/dma-mapping:
  dma-direct: correct the physical addr in dma_direct_sync_sg_for_cpu/device
  dma-direct: only limit the mapping size if swiotlb could be used
  dma-mapping: add a dma_addressing_limited helper
  dma-direct: Force unencrypted DMA under SME for certain DMA masks
2019-07-20 12:09:52 -07:00
Linus Torvalds
e6023adc5c Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core fixes from Thomas Gleixner:

 - A collection of objtool fixes which address recent fallout partially
   exposed by newer toolchains, clang, BPF and general code changes.

 - Force USER_DS for user stack traces

[ Note: the "objtool fixes" are not all to objtool itself, but for
  kernel code that triggers objtool warnings.

  Things like missing function size annotations, or code that confuses
  the unwinder etc.   - Linus]

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
  objtool: Support conditional retpolines
  objtool: Convert insn type to enum
  objtool: Fix seg fault on bad switch table entry
  objtool: Support repeated uses of the same C jump table
  objtool: Refactor jump table code
  objtool: Refactor sibling call detection logic
  objtool: Do frame pointer check before dead end check
  objtool: Change dead_end_function() to return boolean
  objtool: Warn on zero-length functions
  objtool: Refactor function alias logic
  objtool: Track original function across branches
  objtool: Add mcsafe_handle_tail() to the uaccess safe list
  bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()
  x86/uaccess: Remove redundant CLACs in getuser/putuser error paths
  x86/uaccess: Don't leak AC flag into fentry from mcsafe_handle_tail()
  x86/uaccess: Remove ELF function annotation from copy_user_handle_tail()
  x86/head/64: Annotate start_cpu0() as non-callable
  x86/entry: Fix thunk function ELF sizes
  x86/kvm: Don't call kvm_spurious_fault() from .fixup
  x86/kvm: Replace vmx_vmenter()'s call to kvm_spurious_fault() with UD2
  ...
2019-07-20 10:45:15 -07:00
Linus Torvalds
4b01f5a4c9 Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull smp fix from Thomas Gleixner:
 "Add warnings to the smp function calls so callers from wrong contexts
  get detected"

* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  smp: Warn on function calls from softirq context
2019-07-20 10:43:03 -07:00
Linus Torvalds
70e6e1b971 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CONFIG_PREEMPT_RT stub config from Thomas Gleixner:
 "The real-time preemption patch set exists for almost 15 years now and
  while the vast majority of infrastructure and enhancements have found
  their way into the mainline kernel, the final integration of RT is
  still missing.

  Over the course of the last few years, we have worked on reducing the
  intrusivenness of the RT patches by refactoring kernel infrastructure
  to be more real-time friendly. Almost all of these changes were
  benefitial to the mainline kernel on their own, so there was no
  objection to integrate them.

  Though except for the still ongoing printk refactoring, the remaining
  changes which are required to make RT a first class mainline citizen
  are not longer arguable as immediately beneficial for the mainline
  kernel. Most of them are either reordering code flows or adding RT
  specific functionality.

  But this now has hit a wall and turned into a classic hen and egg
  problem:

     Maintainers are rightfully wary vs. these changes as they make only
     sense if the final integration of RT into the mainline kernel takes
     place.

  Adding CONFIG_PREEMPT_RT aims to solve this as a clear sign that RT
  will be fully integrated into the mainline kernel. The final
  integration of the missing bits and pieces will be of course done with
  the same careful approach as we have used in the past.

  While I'm aware that you are not entirely enthusiastic about that, I
  think that RT should receive the same treatment as any other widely
  used out of tree functionality, which we have accepted into mainline
  over the years.

  RT has become the de-facto standard real-time enhancement and is
  shipped by enterprise, embedded and community distros. It's in use
  throughout a wide range of industries: telecommunications, industrial
  automation, professional audio, medical devices, data acquisition,
  automotive - just to name a few major use cases.

  RT development is backed by a Linuxfoundation project which is
  supported by major stakeholders of this technology. The funding will
  continue over the actual inclusion into mainline to make sure that the
  functionality is neither introducing regressions, regressing itself,
  nor becomes subject to bitrot. There is also a lifely user community
  around RT as well, so contrary to the grim situation 5 years ago, it's
  a healthy project.

  As RT is still a good vehicle to exercise rarely used code paths and
  to detect hard to trigger issues, you could at least view it as a QA
  tool if nothing else"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/rt, Kconfig: Introduce CONFIG_PREEMPT_RT
2019-07-20 10:33:44 -07:00
Linus Torvalds
07ab9d5bc5 Mostly bugfixes, but also:
- s390 support for KVM selftests
 - LAPIC timer offloading to housekeeping CPUs
 - Extend an s390 optimization for overcommitted hosts to all architectures
 - Debugging cleanups and improvements
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJdMr1FAAoJEL/70l94x66DvIkH/iVuUX9jO1NoQ7qhxeo04MnT
 GP9mX3XnWoI/iN0zAIRfQSP2/9a6+KblgdiziABhju58j5dCfAZGb5793TQppweb
 3ubl11vy7YkzaXJ0b35K7CFhOU9oSlHHGyi5Uh+yyje5qWNxwmHpizxjynbFTKb6
 +/S7O2Ua1VrAVvx0i0IRtwanIK/jF4dStVButgVaVdUva3zLaQmeI71iaJl9ddXY
 bh50xoYua5Ek6+ENi+nwCNVy4OF152AwDbXlxrU0QbeA1B888Qio7nIqb3bwwPpZ
 /8wMVvPzQgL7RmgtY5E5Z4cCYuu7mK8wgGxhuk3oszlVwZJ5rmnaYwGEl4x1s7o=
 =giag
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull more KVM updates from Paolo Bonzini:
 "Mostly bugfixes, but also:

   - s390 support for KVM selftests

   - LAPIC timer offloading to housekeeping CPUs

   - Extend an s390 optimization for overcommitted hosts to all
     architectures

   - Debugging cleanups and improvements"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits)
  KVM: x86: Add fixed counters to PMU filter
  KVM: nVMX: do not use dangling shadow VMCS after guest reset
  KVM: VMX: dump VMCS on failed entry
  KVM: x86/vPMU: refine kvm_pmu err msg when event creation failed
  KVM: s390: Use kvm_vcpu_wake_up in kvm_s390_vcpu_wakeup
  KVM: Boost vCPUs that are delivering interrupts
  KVM: selftests: Remove superfluous define from vmx.c
  KVM: SVM: Fix detection of AMD Errata 1096
  KVM: LAPIC: Inject timer interrupt via posted interrupt
  KVM: LAPIC: Make lapic timer unpinned
  KVM: x86/vPMU: reset pmc->counter to 0 for pmu fixed_counters
  KVM: nVMX: Ignore segment base for VMX memory operand when segment not FS or GS
  kvm: x86: ioapic and apic debug macros cleanup
  kvm: x86: some tsc debug cleanup
  kvm: vmx: fix coccinelle warnings
  x86: kvm: avoid constant-conversion warning
  x86: kvm: avoid -Wsometimes-uninitized warning
  KVM: x86: expose AVX512_BF16 feature to guest
  KVM: selftests: enable pgste option for the linker on s390
  KVM: selftests: Move kvm_create_max_vcpus test to generic code
  ...
2019-07-20 10:20:27 -07:00
Peter Zijlstra
19dbdcb803 smp: Warn on function calls from softirq context
It's clearly documented that smp function calls cannot be invoked from
softirq handling context. Unfortunately nothing enforces that or emits a
warning.

A single function call can be invoked from softirq context only via
smp_call_function_single_async().

The only legit context is task context, so add a warning to that effect.

Reported-by: luferry <luferry@163.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190718160601.GP3402@hirez.programming.kicks-ass.net
2019-07-20 11:27:16 +02:00
Wanpeng Li
0c5f81dad4 KVM: LAPIC: Inject timer interrupt via posted interrupt
Dedicated instances are currently disturbed by unnecessary jitter due
to the emulated lapic timers firing on the same pCPUs where the
vCPUs reside.  There is no hardware virtual timer on Intel for guest
like ARM, so both programming timer in guest and the emulated timer fires
incur vmexits.  This patch tries to avoid vmexit when the emulated timer
fires, at least in dedicated instance scenario when nohz_full is enabled.

In that case, the emulated timers can be offload to the nearest busy
housekeeping cpus since APICv has been found for several years in server
processors. The guest timer interrupt can then be injected via posted interrupts,
which are delivered by the housekeeping cpu once the emulated timer fires.

The host should tuned so that vCPUs are placed on isolated physical
processors, and with several pCPUs surplus for busy housekeeping.
If disabled mwait/hlt/pause vmexits keep the vCPUs in non-root mode,
~3% redis performance benefit can be observed on Skylake server, and the
number of external interrupt vmexits drops substantially.  Without patch

            VM-EXIT  Samples  Samples%  Time%   Min Time  Max Time   Avg time
EXTERNAL_INTERRUPT    42916    49.43%   39.30%   0.47us   106.09us   0.71us ( +-   1.09% )

While with patch:

            VM-EXIT  Samples  Samples%  Time%   Min Time  Max Time         Avg time
EXTERNAL_INTERRUPT    6871     9.29%     2.96%   0.44us    57.88us   0.72us ( +-   4.02% )

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-20 09:00:40 +02:00
Linus Torvalds
dd4542d282 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:

 - Fix missed wake-up race in padata

 - Use crypto_memneq in ccp

 - Fix version check in ccp

 - Fix fuzz test failure in ccp

 - Fix potential double free in crypto4xx

 - Fix compile warning in stm32

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  padata: use smp_mb in padata_reorder to avoid orphaned padata jobs
  crypto: ccp - Fix SEV_VERSION_GREATER_OR_EQUAL
  crypto: ccp/gcm - use const time tag comparison.
  crypto: ccp - memset structure fields to zero before reuse
  crypto: crypto4xx - fix a potential double free in ppc4xx_trng_probe
  crypto: stm32/hash - Fix incorrect printk modifier for size_t
2019-07-19 12:23:37 -07:00
Linus Torvalds
41ba485ef1 Eiichi Tsukata found a small bug from the fixup of the stack code
Removing ULONG_MAX as the marker for the user stack trace end,
 made the tracing code not know where the end is. The end is now
 marked with a zero (NULL) pointer. Eiichi fixed this in the tracing
 code.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXTHsuRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qgETAQDqRtu1KhJM6ujNlPY1aw6e9ncDAqWn
 6GaumMgAdBqEcAEAxJSjr5UlzXuJsCjUjwE0txLfTscyNwljKW77h4/WNwA=
 =bwtH
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
 "Eiichi Tsukata found a small bug from the fixup of the stack code

  Removing ULONG_MAX as the marker for the user stack trace end, made
  the tracing code not know where the end is. The end is now marked with
  a zero (NULL) pointer. Eiichi fixed this in the tracing code"

* tag 'trace-v5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix user stack trace "??" output
2019-07-19 12:18:46 -07:00
Linus Torvalds
4f5ed1318c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted stuff"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  perf_event_get(): don't bother with fget_raw()
  vfs: update d_make_root() description
2019-07-19 11:35:08 -07:00
Linus Torvalds
933a90bf4f Merge branch 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs mount updates from Al Viro:
 "The first part of mount updates.

  Convert filesystems to use the new mount API"

* 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
  mnt_init(): call shmem_init() unconditionally
  constify ksys_mount() string arguments
  don't bother with registering rootfs
  init_rootfs(): don't bother with init_ramfs_fs()
  vfs: Convert smackfs to use the new mount API
  vfs: Convert selinuxfs to use the new mount API
  vfs: Convert securityfs to use the new mount API
  vfs: Convert apparmorfs to use the new mount API
  vfs: Convert openpromfs to use the new mount API
  vfs: Convert xenfs to use the new mount API
  vfs: Convert gadgetfs to use the new mount API
  vfs: Convert oprofilefs to use the new mount API
  vfs: Convert ibmasmfs to use the new mount API
  vfs: Convert qib_fs/ipathfs to use the new mount API
  vfs: Convert efivarfs to use the new mount API
  vfs: Convert configfs to use the new mount API
  vfs: Convert binfmt_misc to use the new mount API
  convenience helper: get_tree_single()
  convenience helper get_tree_nodev()
  vfs: Kill sget_userns()
  ...
2019-07-19 10:42:02 -07:00
Linus Torvalds
5f4fc6d440 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix AF_XDP cq entry leak, from Ilya Maximets.

 2) Fix handling of PHY power-down on RTL8411B, from Heiner Kallweit.

 3) Add some new PCI IDs to iwlwifi, from Ihab Zhaika.

 4) Fix handling of neigh timers wrt. entries added by userspace, from
    Lorenzo Bianconi.

 5) Various cases of missing of_node_put(), from Nishka Dasgupta.

 6) The new NET_ACT_CT needs to depend upon NF_NAT, from Yue Haibing.

 7) Various RDS layer fixes, from Gerd Rausch.

 8) Fix some more fallout from TCQ_F_CAN_BYPASS generalization, from
    Cong Wang.

 9) Fix FIB source validation checks over loopback, also from Cong Wang.

10) Use promisc for unsupported number of filters, from Justin Chen.

11) Missing sibling route unlink on failure in ipv6, from Ido Schimmel.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (90 commits)
  tcp: fix tcp_set_congestion_control() use from bpf hook
  ag71xx: fix return value check in ag71xx_probe()
  ag71xx: fix error return code in ag71xx_probe()
  usb: qmi_wwan: add D-Link DWM-222 A2 device ID
  bnxt_en: Fix VNIC accounting when enabling aRFS on 57500 chips.
  net: dsa: sja1105: Fix missing unlock on error in sk_buff()
  gve: replace kfree with kvfree
  selftests/bpf: fix test_xdp_noinline on s390
  selftests/bpf: fix "valid read map access into a read-only array 1" on s390
  net/mlx5: Replace kfree with kvfree
  MAINTAINERS: update netsec driver
  ipv6: Unlink sibling route in case of failure
  liquidio: Replace vmalloc + memset with vzalloc
  udp: Fix typo in net/ipv4/udp.c
  net: bcmgenet: use promisc for unsupported filters
  ipv6: rt6_check should return NULL if 'from' is NULL
  tipc: initialize 'validated' field of received packets
  selftests: add a test case for rp_filter
  fib: relax source validation check for loopback packets
  mlxsw: spectrum: Do not process learned records with a dummy FID
  ...
2019-07-19 10:06:06 -07:00
Linus Torvalds
249be8511b Merge branch 'akpm' (patches from Andrew)
Merge yet more updates from Andrew Morton:
 "The rest of MM and a kernel-wide procfs cleanup.

  Summary of the more significant patches:

   - Patch series "mm/memory_hotplug: Factor out memory block
     devicehandling", v3. David Hildenbrand.

     Some spring-cleaning of the memory hotplug code, notably in
     drivers/base/memory.c

   - "mm: thp: fix false negative of shmem vma's THP eligibility". Yang
     Shi.

     Fix /proc/pid/smaps output for THP pages used in shmem.

   - "resource: fix locking in find_next_iomem_res()" + 1. Nadav Amit.

     Bugfix and speedup for kernel/resource.c

   - Patch series "mm: Further memory block device cleanups", David
     Hildenbrand.

     More spring-cleaning of the memory hotplug code.

   - Patch series "mm: Sub-section memory hotplug support". Dan
     Williams.

     Generalise the memory hotplug code so that pmem can use it more
     completely. Then remove the hacks from the libnvdimm code which
     were there to work around the memory-hotplug code's constraints.

   - "proc/sysctl: add shared variables for range check", Matteo Croce.

     We have about 250 instances of

          int zero;
          ...
                  .extra1 = &zero,

     in the tree. This is a tree-wide sweep to make all those private
     "zero"s and "one"s use global variables.

     Alas, it isn't practical to make those two global integers const"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (38 commits)
  proc/sysctl: add shared variables for range check
  mm: migrate: remove unused mode argument
  mm/sparsemem: cleanup 'section number' data types
  libnvdimm/pfn: stop padding pmem namespaces to section alignment
  libnvdimm/pfn: fix fsdax-mode namespace info-block zero-fields
  mm/devm_memremap_pages: enable sub-section remap
  mm: document ZONE_DEVICE memory-model implications
  mm/sparsemem: support sub-section hotplug
  mm/sparsemem: prepare for sub-section ranges
  mm: kill is_dev_zone() helper
  mm/hotplug: kill is_dev_zone() usage in __remove_pages()
  mm/sparsemem: convert kmalloc_section_memmap() to populate_section_memmap()
  mm/hotplug: prepare shrink_{zone, pgdat}_span for sub-section removal
  mm/sparsemem: add helpers track active portions of a section at boot
  mm/sparsemem: introduce a SECTION_IS_EARLY flag
  mm/sparsemem: introduce struct mem_section_usage
  drivers/base/memory.c: get rid of find_memory_block_hinted()
  mm/memory_hotplug: move and simplify walk_memory_blocks()
  mm/memory_hotplug: rename walk_memory_range() and pass start+size instead of pfns
  mm: make register_mem_sect_under_node() static
  ...
2019-07-19 09:45:58 -07:00
Eiichi Tsukata
6d54ceb539 tracing: Fix user stack trace "??" output
Commit c5c27a0a58 ("x86/stacktrace: Remove the pointless ULONG_MAX
marker") removes ULONG_MAX marker from user stack trace entries but
trace_user_stack_print() still uses the marker and it outputs unnecessary
"??".

For example:

            less-1911  [001] d..2    34.758944: <user stack trace>
   =>  <00007f16f2295910>
   => ??
   => ??
   => ??
   => ??
   => ??
   => ??
   => ??

The user stack trace code zeroes the storage before saving the stack, so if
the trace is shorter than the maximum number of entries it can terminate
the print loop if a zero entry is detected.

Link: http://lkml.kernel.org/r/20190630085438.25545-1-devel@etsukata.com

Cc: stable@vger.kernel.org
Fixes: 4285f2fcef ("tracing: Remove the ULONG_MAX stack trace hackery")
Signed-off-by: Eiichi Tsukata <devel@etsukata.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-19 12:12:39 -04:00
Fugang Duan
449fa54d68 dma-direct: correct the physical addr in dma_direct_sync_sg_for_cpu/device
dma_map_sg() may use swiotlb buffer when the kernel command line includes
"swiotlb=force" or the dma_addr is out of dev->dma_mask range.  After
DMA complete the memory moving from device to memory, then user call
dma_sync_sg_for_cpu() to sync with DMA buffer, and copy the original
virtual buffer to other space.

So dma_direct_sync_sg_for_cpu() should use swiotlb physical addr, not
the original physical addr from sg_phys(sg).

dma_direct_sync_sg_for_device() also has the same issue, correct it as
well.

Fixes: 55897af63091("dma-direct: merge swiotlb_dma_ops into the dma_direct code")
Signed-off-by: Fugang Duan <fugang.duan@nxp.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-19 14:09:40 +02:00
Matteo Croce
eec4844fae proc/sysctl: add shared variables for range check
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range.  This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.

On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.

The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:

    $ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
    248

Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.

This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:

    # scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
    add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
    Data                                         old     new   delta
    sysctl_vals                                    -      12     +12
    __kstrtab_sysctl_vals                          -      12     +12
    max                                           14      10      -4
    int_max                                       16       -     -16
    one                                           68       -     -68
    zero                                         128      28    -100
    Total: Before=20583249, After=20583085, chg -0.00%

[mcroce@redhat.com: tipc: remove two unused variables]
  Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
  Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 17:08:07 -07:00
Dan Williams
7cc7867fb0 mm/devm_memremap_pages: enable sub-section remap
Teach devm_memremap_pages() about the new sub-section capabilities of
arch_{add,remove}_memory().  Effectively, just replace all usage of
align_start, align_end, and align_size with res->start, res->end, and
resource_size(res).  The existing sanity check will still make sure that
the two separate remap attempts do not collide within a sub-section (2MB
on x86).

Link: http://lkml.kernel.org/r/156092355542.979959.10060071713397030576.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>	[ppc64]
Cc: Michal Hocko <mhocko@suse.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 17:08:07 -07:00
Nadav Amit
756398750e resource: avoid unnecessary lookups in find_next_iomem_res()
find_next_iomem_res() shows up to be a source for overhead in dax
benchmarks.

Improve performance by not considering children of the tree if the top
level does not match.  Since the range of the parents should include the
range of the children such check is redundant.

Running sysbench on dax (pmem emulation, with write_cache disabled):

  sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
   --file-io-mode=mmap --threads=4 --file-fsync-mode=fdatasync run

Provides the following results:

		events (avg/stddev)
		-------------------
  5.2-rc3:	1247669.0000/16075.39
  w/patch:	1286320.5000/16402.72	(+3%)

Link: http://lkml.kernel.org/r/20190613045903.4922-3-namit@vmware.com
Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 17:08:06 -07:00
Nadav Amit
49f17c26c1 resource: fix locking in find_next_iomem_res()
Since resources can be removed, locking should ensure that the resource
is not removed while accessing it.  However, find_next_iomem_res() does
not hold the lock while copying the data of the resource.

Keep holding the lock while the data is copied.  While at it, change the
return value to a more informative value.  It is disregarded by the
callers.

[akpm@linux-foundation.org: fix find_next_iomem_res() documentation]
Link: http://lkml.kernel.org/r/20190613045903.4922-2-namit@vmware.com
Fixes: ff3cc952d3 ("resource: Add remove_resource interface")
Signed-off-by: Nadav Amit <namit@vmware.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 17:08:06 -07:00
Thomas Gleixner
a50a3f4b6a sched/rt, Kconfig: Introduce CONFIG_PREEMPT_RT
Add a new entry to the preemption menu which enables the real-time support
for the kernel. The choice is only enabled when an architecture supports
it.

It selects PREEMPT as the RT features depend on it. To achieve that the
existing PREEMPT choice is renamed to PREEMPT_LL which select PREEMPT as
well.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Paul E. McKenney <paulmck@linux.ibm.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Clark Williams <williams@redhat.com>
Acked-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Daniel Wagner <wagi@monom.org>
Acked-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Acked-by: Julia Cartwright <julia@ni.com>
Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Acked-by: Gratian Crisan <gratian.crisan@ni.com>
Acked-by: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Andrew Morton <akpm@linuxfoundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1907172200190.1778@nanos.tec.linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-18 23:10:57 +02:00
David S. Miller
bb74523167 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2019-07-18

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) verifier precision propagation fix, from Andrii.

2) BTF size fix for typedefs, from Andrii.

3) a bunch of big endian fixes, from Ilya.

4) wide load from bpf_sock_addr fixes, from Stanislav.

5) a bunch of misc fixes from a number of developers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-18 14:04:45 -07:00
Linus Torvalds
da0acd7c65 Modules updates for v5.3
Summary of modules changes for the 5.3 merge window:
 
 - Code fixes and cleanups
 
 - Fix bug where set_memory_x() wasn't being called when rodata=n
 
 - Fix bug where -EEXIST was being returned for going modules
 
 - Allow arches to override module_exit_section()
 
 Signed-off-by: Jessica Yu <jeyu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJdMET+AAoJEMBFfjjOO8FyWg4P/RFUwHUxdpouLlVjlJVx40ko
 WWFMjQ7zDjhWJVvqataQJS3L3yqhZEC+KrSetqKgP0NxKQ2ev4BwgGF822yorQRw
 Aorw/aHk9xIyfv7tUEjQgLpa7j1zv15uLx+Y4mkRSgc1uZve0MYKgMsU4db97pTk
 XUyronakoM9h4z7NVyV21F9dgprnytzdWwgYdjEx2aMPzF/+3MAPN6H7d8Cb8v4X
 85UwPb8SB5F3U/wzYowrHmlFTlAOkroUrZRErwUOlCn8rvZefkt4NnJwShBC7YPa
 N0CVemw6sgy5KcDUczL7oQJ+kyYwtHN0i0EhvTMixPFKeeCI96Y+UNPZxxQANd6f
 0djYqvRtfqAp7b2bRSfSiApRdxMO3rl17tY/c1MiIb4/Yk5BtofX+8edW3eMX3Xc
 eL5EDL69idgjx1llXmScmjbWiy7Mg61mmo6PN8T/wyIqtDQ/CXqovHdh8Ehbubuf
 5cC7fQiVSRhqvg0N4w2FEr+hV+VpEjdnfxypd46ABDVhYYir7ybMLoHYwBimwKSD
 70qOJWfiRjnEES0qaO3JH/nlXhQGSnPpfe7kwx0LAIyIdTTfarPjGbK/5wcNcReA
 f3A7R2p//G7sNg3DWYSG1cKEjgOAwMjD00SOBov82K9VQ8spwMexIo93Hj6BEbY5
 5cex2njf5G2YNKDfAvoi
 =/sk3
 -----END PGP SIGNATURE-----

Merge tag 'modules-for-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux

Pull module updates from Jessica Yu:
 "Summary of modules changes for the 5.3 merge window:

   - Code fixes and cleanups

   - Fix bug where set_memory_x() wasn't being called when rodata=n

   - Fix bug where -EEXIST was being returned for going modules

   - Allow arches to override module_exit_section()"

* tag 'modules-for-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
  modules: fix compile error if don't have strict module rwx
  ARM: module: recognize unwind exit sections
  module: allow arch overrides for .exit section names
  modules: fix BUG when load module with rodata=n
  kernel/module: Fix mem leak in module_add_modinfo_attrs
  kernel: module: Use struct_size() helper
  kernel/module.c: Only return -EEXIST for modules that have finished loading
2019-07-18 12:06:57 -07:00
Josh Poimboeuf
3193c0836f bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()
On x86-64, with CONFIG_RETPOLINE=n, GCC's "global common subexpression
elimination" optimization results in ___bpf_prog_run()'s jumptable code
changing from this:

	select_insn:
		jmp *jumptable(, %rax, 8)
		...
	ALU64_ADD_X:
		...
		jmp *jumptable(, %rax, 8)
	ALU_ADD_X:
		...
		jmp *jumptable(, %rax, 8)

to this:

	select_insn:
		mov jumptable, %r12
		jmp *(%r12, %rax, 8)
		...
	ALU64_ADD_X:
		...
		jmp *(%r12, %rax, 8)
	ALU_ADD_X:
		...
		jmp *(%r12, %rax, 8)

The jumptable address is placed in a register once, at the beginning of
the function.  The function execution can then go through multiple
indirect jumps which rely on that same register value.  This has a few
issues:

1) Objtool isn't smart enough to be able to track such a register value
   across multiple recursive indirect jumps through the jump table.

2) With CONFIG_RETPOLINE enabled, this optimization actually results in
   a small slowdown.  I measured a ~4.7% slowdown in the test_bpf
   "tcpdump port 22" selftest.

   This slowdown is actually predicted by the GCC manual:

     Note: When compiling a program using computed gotos, a GCC
     extension, you may get better run-time performance if you
     disable the global common subexpression elimination pass by
     adding -fno-gcse to the command line.

So just disable the optimization for this function.

Fixes: e55a73251d ("bpf: Fix ORC unwinding in non-JIT BPF code")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/30c3ca29ba037afcbd860a8672eef0021addf9fe.1563413318.git.jpoimboe@redhat.com
2019-07-18 21:01:06 +02:00
Linus Torvalds
818e95c768 The main changes in this release include:
- Add user space specific memory reading for kprobes
  - Allow kprobes to be executed earlier in boot
 
 The rest are mostly just various clean ups and small fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXS88txQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qhaPAQDHaAmu6wXtZjZE6GU4ZP61UNgDECmZ
 4wlGrNc1AAlqAQD/QC8339p37aDCp9n27VY1wmJwF3nca+jAHfQLqWkkYgw=
 =n/tz
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:
 "The main changes in this release include:

   - Add user space specific memory reading for kprobes

   - Allow kprobes to be executed earlier in boot

  The rest are mostly just various clean ups and small fixes"

* tag 'trace-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (33 commits)
  tracing: Make trace_get_fields() global
  tracing: Let filter_assign_type() detect FILTER_PTR_STRING
  tracing: Pass type into tracing_generic_entry_update()
  ftrace/selftest: Test if set_event/ftrace_pid exists before writing
  ftrace/selftests: Return the skip code when tracing directory not configured in kernel
  tracing/kprobe: Check registered state using kprobe
  tracing/probe: Add trace_event_call accesses APIs
  tracing/probe: Add probe event name and group name accesses APIs
  tracing/probe: Add trace flag access APIs for trace_probe
  tracing/probe: Add trace_event_file access APIs for trace_probe
  tracing/probe: Add trace_event_call register API for trace_probe
  tracing/probe: Add trace_probe init and free functions
  tracing/uprobe: Set print format when parsing command
  tracing/kprobe: Set print format right after parsed command
  kprobes: Fix to init kprobes in subsys_initcall
  tracepoint: Use struct_size() in kmalloc()
  ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESS
  ftrace: Enable trampoline when rec count returns back to one
  tracing/kprobe: Do not run kprobe boot tests if kprobe_event is on cmdline
  tracing: Make a separate config for trace event self tests
  ...
2019-07-18 11:51:00 -07:00
Thomas Gleixner
54f698f31e Merge branch 'x86/debug' into core/urgent
Pick up the two pending objtool patches as the next round of objtool fixes
depend on them.
2019-07-18 20:50:48 +02:00
Linus Torvalds
d4df33b0e9 Merge branch 'for-linus-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb
Pull swiotlb updates from Konrad Rzeszutek Wilk:
 "One compiler fix, and a bug-fix in swiotlb_nr_tbl() and
  swiotlb_max_segment() to check also for no_iotlb_memory"

* 'for-linus-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb:
  swiotlb: fix phys_addr_t overflow warning
  swiotlb: Return consistent SWIOTLB segments/nr_tbl
  swiotlb: Group identical cleanup in swiotlb_cleanup()
2019-07-18 11:48:05 -07:00
Peter Zijlstra
cac9b9a4b0 stacktrace: Force USER_DS for stack_trace_save_user()
When walking userspace stacks, USER_DS needs to be set, otherwise
access_ok() will not function as expected.

Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Reported-by: Eiichi Tsukata <devel@etsukata.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Vegard Nossum <vegard.nossum@oracle.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lkml.kernel.org/r/20190718085754.GM3402@hirez.programming.kicks-ass.net
2019-07-18 16:47:24 +02:00
Daniel Jordan
cf144f81a9 padata: use smp_mb in padata_reorder to avoid orphaned padata jobs
Testing padata with the tcrypt module on a 5.2 kernel...

    # modprobe tcrypt alg="pcrypt(rfc4106(gcm(aes)))" type=3
    # modprobe tcrypt mode=211 sec=1

...produces this splat:

    INFO: task modprobe:10075 blocked for more than 120 seconds.
          Not tainted 5.2.0-base+ #16
    modprobe        D    0 10075  10064 0x80004080
    Call Trace:
     ? __schedule+0x4dd/0x610
     ? ring_buffer_unlock_commit+0x23/0x100
     schedule+0x6c/0x90
     schedule_timeout+0x3b/0x320
     ? trace_buffer_unlock_commit_regs+0x4f/0x1f0
     wait_for_common+0x160/0x1a0
     ? wake_up_q+0x80/0x80
     { crypto_wait_req }             # entries in braces added by hand
     { do_one_aead_op }
     { test_aead_jiffies }
     test_aead_speed.constprop.17+0x681/0xf30 [tcrypt]
     do_test+0x4053/0x6a2b [tcrypt]
     ? 0xffffffffa00f4000
     tcrypt_mod_init+0x50/0x1000 [tcrypt]
     ...

The second modprobe command never finishes because in padata_reorder,
CPU0's load of reorder_objects is executed before the unlocking store in
spin_unlock_bh(pd->lock), causing CPU0 to miss CPU1's increment:

CPU0                                 CPU1

padata_reorder                       padata_do_serial
  LOAD reorder_objects  // 0
                                       INC reorder_objects  // 1
                                       padata_reorder
                                         TRYLOCK pd->lock   // failed
  UNLOCK pd->lock

CPU0 deletes the timer before returning from padata_reorder and since no
other job is submitted to padata, modprobe waits indefinitely.

Add a pair of full barriers to guarantee proper ordering:

CPU0                                 CPU1

padata_reorder                       padata_do_serial
  UNLOCK pd->lock
  smp_mb()
  LOAD reorder_objects
                                       INC reorder_objects
                                       smp_mb__after_atomic()
                                       padata_reorder
                                         TRYLOCK pd->lock

smp_mb__after_atomic is needed so the read part of the trylock operation
comes after the INC, as Andrea points out.   Thanks also to Andrea for
help with writing a litmus test.

Fixes: 16295bec63 ("padata: Generic parallelization/serialization interface")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: <stable@vger.kernel.org>
Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-07-18 13:39:54 +08:00
Linus Torvalds
57a8ec387e Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "VM:
   - z3fold fixes and enhancements by Henry Burns and Vitaly Wool

   - more accurate reclaimed slab caches calculations by Yafang Shao

   - fix MAP_UNINITIALIZED UAPI symbol to not depend on config, by
     Christoph Hellwig

   - !CONFIG_MMU fixes by Christoph Hellwig

   - new novmcoredd parameter to omit device dumps from vmcore, by
     Kairui Song

   - new test_meminit module for testing heap and pagealloc
     initialization, by Alexander Potapenko

   - ioremap improvements for huge mappings, by Anshuman Khandual

   - generalize kprobe page fault handling, by Anshuman Khandual

   - device-dax hotplug fixes and improvements, by Pavel Tatashin

   - enable synchronous DAX fault on powerpc, by Aneesh Kumar K.V

   - add pte_devmap() support for arm64, by Robin Murphy

   - unify locked_vm accounting with a helper, by Daniel Jordan

   - several misc fixes

  core/lib:
   - new typeof_member() macro including some users, by Alexey Dobriyan

   - make BIT() and GENMASK() available in asm, by Masahiro Yamada

   - changed LIST_POISON2 on x86_64 to 0xdead000000000122 for better
     code generation, by Alexey Dobriyan

   - rbtree code size optimizations, by Michel Lespinasse

   - convert struct pid count to refcount_t, by Joel Fernandes

  get_maintainer.pl:
   - add --no-moderated switch to skip moderated ML's, by Joe Perches

  misc:
   - ptrace PTRACE_GET_SYSCALL_INFO interface

   - coda updates

   - gdb scripts, various"

[ Using merge message suggestion from Vlastimil Babka, with some editing - Linus ]

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (100 commits)
  fs/select.c: use struct_size() in kmalloc()
  mm: add account_locked_vm utility function
  arm64: mm: implement pte_devmap support
  mm: introduce ARCH_HAS_PTE_DEVMAP
  mm: clean up is_device_*_page() definitions
  mm/mmap: move common defines to mman-common.h
  mm: move MAP_SYNC to asm-generic/mman-common.h
  device-dax: "Hotremove" persistent memory that is used like normal RAM
  mm/hotplug: make remove_memory() interface usable
  device-dax: fix memory and resource leak if hotplug fails
  include/linux/lz4.h: fix spelling and copy-paste errors in documentation
  ipc/mqueue.c: only perform resource calculation if user valid
  include/asm-generic/bug.h: fix "cut here" for WARN_ON for __WARN_TAINT architectures
  scripts/gdb: add helpers to find and list devices
  scripts/gdb: add lx-genpd-summary command
  drivers/pps/pps.c: clear offset flags in PPS_SETPARAMS ioctl
  kernel/pid.c: convert struct pid count to refcount_t
  drivers/rapidio/devices/rio_mport_cdev.c: NUL terminate some strings
  select: shift restore_saved_sigmask_unless() into poll_select_copy_remaining()
  select: change do_poll() to return -ERESTARTNOHAND rather than -EINTR
  ...
2019-07-17 08:58:04 -07:00
Christoph Hellwig
a5008b59cd dma-direct: only limit the mapping size if swiotlb could be used
Don't just check for a swiotlb buffer, but also if buffering might
be required for this particular device.

Fixes: 133d624b1c ("dma: Introduce dma_max_mapping_size()")
Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2019-07-17 08:25:45 +02:00
Joel Fernandes (Google)
f57e515a1b kernel/pid.c: convert struct pid count to refcount_t
struct pid's count is an atomic_t field used as a refcount.  Use
refcount_t for it which is basically atomic_t but does additional
checking to prevent use-after-free bugs.

For memory ordering, the only change is with the following:

 -	if ((atomic_read(&pid->count) == 1) ||
 -	     atomic_dec_and_test(&pid->count)) {
 +	if (refcount_dec_and_test(&pid->count)) {
 		kmem_cache_free(ns->pid_cachep, pid);

Here the change is from: Fully ordered --> RELEASE + ACQUIRE (as per
refcount-vs-atomic.rst) This ACQUIRE should take care of making sure the
free happens after the refcount_dec_and_test().

The above hunk also removes atomic_read() since it is not needed for the
code to work and it is unclear how beneficial it is.  The removal lets
refcount_dec_and_test() check for cases where get_pid() happened before
the object was freed.

Link: http://lkml.kernel.org/r/20190701183826.191936-1-joel@joelfernandes.org
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: KJ Tsanaktsidis <ktsanaktsidis@zendesk.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-16 19:23:24 -07:00
Oleg Nesterov
b772434be0 signal: simplify set_user_sigmask/restore_user_sigmask
task->saved_sigmask and ->restore_sigmask are only used in the ret-from-
syscall paths.  This means that set_user_sigmask() can save ->blocked in
->saved_sigmask and do set_restore_sigmask() to indicate that ->blocked
was modified.

This way the callers do not need 2 sigset_t's passed to set/restore and
restore_user_sigmask() renamed to restore_saved_sigmask_unless() turns
into the trivial helper which just calls restore_saved_sigmask().

Link: http://lkml.kernel.org/r/20190606113206.GA9464@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Eric Wong <e@80x24.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David Laight <David.Laight@aculab.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-16 19:23:24 -07:00
Elvira Khabirova
201766a20e ptrace: add PTRACE_GET_SYSCALL_INFO request
PTRACE_GET_SYSCALL_INFO is a generic ptrace API that lets ptracer obtain
details of the syscall the tracee is blocked in.

There are two reasons for a special syscall-related ptrace request.

Firstly, with the current ptrace API there are cases when ptracer cannot
retrieve necessary information about syscalls.  Some examples include:

 * The notorious int-0x80-from-64-bit-task issue. See [1] for details.
   In short, if a 64-bit task performs a syscall through int 0x80, its
   tracer has no reliable means to find out that the syscall was, in
   fact, a compat syscall, and misidentifies it.

 * Syscall-enter-stop and syscall-exit-stop look the same for the
   tracer. Common practice is to keep track of the sequence of
   ptrace-stops in order not to mix the two syscall-stops up. But it is
   not as simple as it looks; for example, strace had a (just recently
   fixed) long-standing bug where attaching strace to a tracee that is
   performing the execve system call led to the tracer identifying the
   following syscall-exit-stop as syscall-enter-stop, which messed up
   all the state tracking.

 * Since the introduction of commit 84d77d3f06 ("ptrace: Don't allow
   accessing an undumpable mm"), both PTRACE_PEEKDATA and
   process_vm_readv become unavailable when the process dumpable flag is
   cleared. On such architectures as ia64 this results in all syscall
   arguments being unavailable for the tracer.

Secondly, ptracers also have to support a lot of arch-specific code for
obtaining information about the tracee.  For some architectures, this
requires a ptrace(PTRACE_PEEKUSER, ...) invocation for every syscall
argument and return value.

ptrace(2) man page:

long ptrace(enum __ptrace_request request, pid_t pid,
            void *addr, void *data);
...
PTRACE_GET_SYSCALL_INFO
       Retrieve information about the syscall that caused the stop.
       The information is placed into the buffer pointed by "data"
       argument, which should be a pointer to a buffer of type
       "struct ptrace_syscall_info".
       The "addr" argument contains the size of the buffer pointed to
       by "data" argument (i.e., sizeof(struct ptrace_syscall_info)).
       The return value contains the number of bytes available
       to be written by the kernel.
       If the size of data to be written by the kernel exceeds the size
       specified by "addr" argument, the output is truncated.

[ldv@altlinux.org: selftests/seccomp/seccomp_bpf: update for PTRACE_GET_SYSCALL_INFO]
  Link: http://lkml.kernel.org/r/20190708182904.GA12332@altlinux.org
Link: http://lkml.kernel.org/r/20190510152842.GF28558@altlinux.org
Signed-off-by: Elvira Khabirova <lineprinter@altlinux.org>
Co-developed-by: Dmitry V. Levin <ldv@altlinux.org>
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Cc: Eugene Syromyatnikov <esyr@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Greentime Hu <greentime@andestech.com>
Cc: Helge Deller <deller@gmx.de>	[parisc]
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: kbuild test robot <lkp@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vincent Chen <deanbo422@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-16 19:23:24 -07:00
Weitao Hou
65f50f2553 kernel: fix typos and some coding style in comments
fix lenght to length

Link: http://lkml.kernel.org/r/20190521050937.4370-1-houweitaoo@gmail.com
Signed-off-by: Weitao Hou <houweitaoo@gmail.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-16 19:23:21 -07:00
Tom Lendacky
9087c37584 dma-direct: Force unencrypted DMA under SME for certain DMA masks
If a device doesn't support DMA to a physical address that includes the
encryption bit (currently bit 47, so 48-bit DMA), then the DMA must
occur to unencrypted memory. SWIOTLB is used to satisfy that requirement
if an IOMMU is not active (enabled or configured in passthrough mode).

However, commit fafadcd165 ("swiotlb: don't dip into swiotlb pool for
coherent allocations") modified the coherent allocation support in
SWIOTLB to use the DMA direct coherent allocation support. When an IOMMU
is not active, this resulted in dma_alloc_coherent() failing for devices
that didn't support DMA addresses that included the encryption bit.

Addressing this requires changes to the force_dma_unencrypted() function
in kernel/dma/direct.c. Since the function is now non-trivial and
SME/SEV specific, update the DMA direct support to add an arch override
for the force_dma_unencrypted() function. The arch override is selected
when CONFIG_AMD_MEM_ENCRYPT is set. The arch override function resides in
the arch/x86/mm/mem_encrypt.c file and forces unencrypted DMA when either
SEV is active or SME is active and the device does not support DMA to
physical addresses that include the encryption bit.

Fixes: fafadcd165 ("swiotlb: don't dip into swiotlb pool for coherent allocations")
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
[hch: moved the force_dma_unencrypted declaration to dma-mapping.h,
      fold the s390 fix from Halil Pasic]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-16 22:15:46 +02:00
Linus Torvalds
c309b6f242 docs conversion for v5.3-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+QmuaPwR3wnBdVwACF8+vY7k4RUFAl0tpocACgkQCF8+vY7k
 4RWoxA//b/fmDXP3WPzrjjSmpyB9ml0/epKzPbT5S2j0lftqKBmet29k+PCjVrTx
 Nq2QauehY9ug5h8UMVUCmzPr95F0tSIGRoqk1vrn7z0K3q6k1SHrtvqbY1Bgb2Uk
 Qvh2YFU4fQLJg8WAbExCjxCdbdmBKQVGKTwCtM+tP5OMxwAFOmQrjGaUaKCKIIA2
 7Wzrx8CpSji+bJ3uK/d36c+4M9oDly5eaxBhoboL3BI0y+GqwiSASGwTO7BxrPOg
 0wq5IZHnqS8+bprT9xQdDOqf+UOY9U1cxE/+sqsHxblfUEx9gfLy/R+FLmJn+SS9
 Z3yLy4SqVHQMpWBjEAGodohikF60PAuTdymSC11jqFaKCUxWrIZg5xO+0blMrxPF
 7vYIexutCkaBMHBlNaNsHIqB7B/2FGGKoN7QW64hwvwJCGvF7OmJcV+R4bROGvh4
 nFuis9/Nm66Fq7I3aw37ThyZ0aWZdaQ0QJTH9ksxU/ZCz2hhMNYu/rXggrDvkS4U
 nr77ZT5Gd7nj4b110zf8+99uiGiinY6hTfzPAuTCLBhaxwrv4/xDHAhpwdEB5T4j
 8gOkxV8c0XWtL7sKqhGJvs/RRe2za0Y9XH6fyxsYfWcfuLjEvug8ouXMad9gxFWH
 DL3WnKJEMGLScei2wux4kGOwEbkR1bUf2cHJfh3GpCB/y8vgLOc=
 =smxY
 -----END PGP SIGNATURE-----

Merge tag 'docs/v5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

Pull rst conversion of docs from Mauro Carvalho Chehab:
 "As agreed with Jon, I'm sending this big series directly to you, c/c
  him, as this series required a special care, in order to avoid
  conflicts with other trees"

* tag 'docs/v5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (77 commits)
  docs: kbuild: fix build with pdf and fix some minor issues
  docs: block: fix pdf output
  docs: arm: fix a breakage with pdf output
  docs: don't use nested tables
  docs: gpio: add sysfs interface to the admin-guide
  docs: locking: add it to the main index
  docs: add some directories to the main documentation index
  docs: add SPDX tags to new index files
  docs: add a memory-devices subdir to driver-api
  docs: phy: place documentation under driver-api
  docs: serial: move it to the driver-api
  docs: driver-api: add remaining converted dirs to it
  docs: driver-api: add xilinx driver API documentation
  docs: driver-api: add a series of orphaned documents
  docs: admin-guide: add a series of orphaned documents
  docs: cgroup-v1: add it to the admin-guide book
  docs: aoe: add it to the driver-api book
  docs: add some documentation dirs to the driver-api book
  docs: driver-model: move it to the driver-api book
  docs: lp855x-driver.rst: add it to the driver-api book
  ...
2019-07-16 12:21:41 -07:00
Cong Wang
0aeb1def44 tracing: Make trace_get_fields() global
trace_get_fields() is the only way to read tracepoint fields at
run time, as their fields are defined at compile-time with macros.
Make this function visible to all users and it will be used by
trace event injection code to calculate the size of a tracepoint
entry.
Link: http://lkml.kernel.org/r/20190525165802.25944-4-xiyou.wangcong@gmail.com

Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-16 15:14:48 -04:00
Cong Wang
5967bd5c42 tracing: Let filter_assign_type() detect FILTER_PTR_STRING
filter_assign_type() could detect dynamic string and static
string, but not string pointers. Teach filter_assign_type()
to detect string pointers, and this will be needed by trace
event injection code.

BTW, trace event hist uses FILTER_PTR_STRING too.
Link: http://lkml.kernel.org/r/20190525165802.25944-3-xiyou.wangcong@gmail.com

Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-16 15:14:48 -04:00
Cong Wang
46710f3a34 tracing: Pass type into tracing_generic_entry_update()
All callers of tracing_generic_entry_update() have to initialize
entry->type, so let's just simply move it inside.
Link: http://lkml.kernel.org/r/20190525165802.25944-2-xiyou.wangcong@gmail.com

Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-16 15:14:48 -04:00
Masami Hiramatsu
715fa2fd4c tracing/kprobe: Check registered state using kprobe
Change registered check only by trace_kprobe and remove
TP_FLAG_REGISTERED from trace_probe, since this feature
is only used for trace_kprobe.

Link: http://lkml.kernel.org/r/155931588704.28323.4952266828256245833.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-16 15:14:47 -04:00
Masami Hiramatsu
e3dc9f898e tracing/probe: Add trace_event_call accesses APIs
Add trace_event_call access APIs for trace_probe.
Instead of accessing trace_probe.call directly, use those
accesses by trace_probe_event_call() method. This hides
the relationship of trace_event_call and trace_probe from
trace_kprobe and trace_uprobe.

Link: http://lkml.kernel.org/r/155931587711.28323.8335129014686133120.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-16 15:14:47 -04:00
Masami Hiramatsu
b55ce203a8 tracing/probe: Add probe event name and group name accesses APIs
Add trace_probe_name() and trace_probe_group_name() functions
for accessing probe name and group name of trace_probe.

Link: http://lkml.kernel.org/r/155931586717.28323.8738615064952254761.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-16 15:14:47 -04:00
Masami Hiramatsu
747774d6b0 tracing/probe: Add trace flag access APIs for trace_probe
Add trace_probe_test/set/clear_flag() functions for accessing
trace_probe.flag field.
This flags field should not be accessed directly.

Link: http://lkml.kernel.org/r/155931585683.28323.314290023236905988.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-16 15:14:47 -04:00
Masami Hiramatsu
b5f935ee13 tracing/probe: Add trace_event_file access APIs for trace_probe
Add trace_event_file access APIs for trace_probe data structure.
This simplifies enabling/disabling operations in uprobe and kprobe
events so that those don't touch deep inside the trace_probe.

This also removing a redundant synchronization when the
kprobe event is used from perf, since the perf itself uses
tracepoint_synchronize_unregister() after disabling (ftrace-
defined) event, thus we don't have to synchronize in that
path. Also we don't need to identify local trace_kprobe too
anymore.

Link: http://lkml.kernel.org/r/155931584587.28323.372301976283354629.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-16 15:14:47 -04:00
Masami Hiramatsu
46e5376d40 tracing/probe: Add trace_event_call register API for trace_probe
Since trace_event_call is a field of trace_probe, these
operations should be done in trace_probe.c. trace_kprobe
and trace_uprobe use new functions to register/unregister
trace_event_call.

Link: http://lkml.kernel.org/r/155931583643.28323.14828411185591538876.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-16 15:14:47 -04:00
Masami Hiramatsu
455b289973 tracing/probe: Add trace_probe init and free functions
Add common trace_probe init and cleanup function in
trace_probe.c, and use it from trace_kprobe.c and trace_uprobe.c

Link: http://lkml.kernel.org/r/155931582664.28323.5934870189034740822.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-16 15:14:47 -04:00
Masami Hiramatsu
b4d4b96be8 tracing/uprobe: Set print format when parsing command
Set event call's print format right after parsed command for
simplifying (un)register_uprobe_event().

Link: http://lkml.kernel.org/r/155931581659.28323.5404667166417404076.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-16 15:14:47 -04:00
Masami Hiramatsu
f730e0f2da tracing/kprobe: Set print format right after parsed command
Set event call's print format right after parsed command for
simplifying (un)register_kprobe_event().

Link: http://lkml.kernel.org/r/155931580625.28323.5158822928646225903.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-16 15:14:47 -04:00
Masami Hiramatsu
65fc965c70 kprobes: Fix to init kprobes in subsys_initcall
Since arm64 kernel initializes breakpoint trap vector in arch_initcall(),
initializing kprobe (and run smoke test) in postcore_initcall() causes
a kernel panic.

To fix this issue, move the kprobe initialization in subsys_initcall()
(which is called right afer the arch_initcall).

In-kernel kprobe users (ftrace and bpf) are using fs_initcall() which is
called after subsys_initcall(), so this shouldn't cause more problem.

Link: http://lkml.kernel.org/r/155956708268.12228.10363800793132214198.stgit@devnote2
Link: http://lkml.kernel.org/r/20190709153755.GB10123@lakrids.cambridge.arm.com

Reported-by: Anders Roxell <anders.roxell@linaro.org>
Fixes: b5f8b32c93 ("kprobes: Initialize kprobes at postcore_initcall")
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-16 15:13:45 -04:00
Linus Torvalds
3c69914b4c for-linus-20190715
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXSxgkQAKCRCRxhvAZXjc
 opJ3AP9TWIWhEC0dzbNmzh0STj/Vyl5KYEpdMbi7HNeAmIEAfQD6A3HI4bVJbN08
 jH44U7DLzHJyHefKlB8jHEKEVYJWqgo=
 =74Bg
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190715' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull pidfd and clone3 fixes from Christian Brauner:
 "This contains a bugfix for CLONE_PIDFD when used with the legacy clone
  syscall, two fixes to ensure that syscall numbering and clone3
  entrypoint implementations will stay consistent, and an update for the
  maintainers file:

   - The addition of clone3 broke CLONE_PIDFD for legacy clone on all
     architectures that use do_fork() directly instead of calling the
     clone syscall itself. (Fwiw, cleaning do_fork() up is on my todo.)

     The reason this happened was that during conversion of _do_fork()
     to use struct kernel_clone_args we missed that do_fork() is called
     directly by various architectures. This is fixed by making sure
     that the pidfd argument in struct kernel_clone_args is correctly
     initialized with the parent_tidptr argument passed down from
     do_fork(). Additionally, do_fork() missed a check to make
     CLONE_PIDFD and CLONE_PARENT_SETTID mutually exclusive just a
     clone() does. This is now fixed too.

   - When clone3() was introduced we skipped architectures that require
     special handling for fork-like syscalls. Their syscall tables did
     not contain any mention of clone3().

     To make sure that Arnd's work to make syscall numbers on all
     architectures identical (minus alpha) was not for naught we are
     placing a comment in all syscall tables that do not yet implement
     clone3(). The comment makes it clear that 435 is reserved for
     clone3 and should not be used.

   - Also, this contains a patch to make the clone3() syscall definition
     in asm-generic/unist.h conditional on __ARCH_WANT_SYS_CLONE3. This
     lets us catch new architectures that implicitly make use of clone3
     without setting __ARCH_WANT_SYS_CLONE3 which is a good indicator
     that they did not check whether it needs special treatment or not.

   - Finally, this contains a patch to add me as maintainer for pidfd
     stuff so people can start blaming me (more)"

* tag 'for-linus-20190715' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  MAINTAINERS: add new entry for pidfd api
  unistd: protect clone3 via __ARCH_WANT_SYS_CLONE3
  arch: mark syscall number 435 reserved for clone3
  clone: fix CLONE_PIDFD support
2019-07-16 11:30:07 -07:00
Linus Torvalds
fb4da215ed pci-v5.3-changes
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAl0siFoUHGJoZWxnYWFz
 QGdvb2dsZS5jb20ACgkQWYigwDrT+vzi9A//S4jRyyZrgUr88Az0GbgMhE4b3yqc
 uL7om/Sf+443gG6C+aKkZSM/IE9hrbyIKuYq7GGxDkzZ/HkucZo2yIuAHkPgG4ik
 QQYJ8fJsmMq1bUht87c1ZZwGP0++Deq/Ns2+VNy/WBYqKLulnV0DvEEaJgPs9C5D
 ppwccGdo6UghiujBTpE4ddUBjFjjURWqT6wSnMRDQ4EGwfUhG0MWwwHKI4hbBuaL
 N6refuggdYyUUX5FeUOHa6VF6uTnSSAQ75k+40n4nljdayqoumHLskst77o9q5ZI
 oXjdpwgmuEqYhfp03HEA4Xo/bBxiRj76NuTiEMKvPokxjpanwbLrdV0GhF0OIlM0
 rp1NOI1w+vppFrU+rc2gtq+7hYXFmvdhjS29hFLeD91PP36N5d29jW5NVFpm7GCm
 n4TMGAOsu8RB+bNua6ZbZVcDk2EnPgQeIcM0ZPoBtPK19Fg/rScdEU4u/aFE1Y0Q
 C+Ks7D1qCvFpHzl/xAg0oo9v/jFsWef3qnQWOzot964Zz4W4NSVvB9Ox6Vbfj6C4
 v331LJmlPxG8fxBNA3q28FrTxcG1NW6sgo3WY9VoSp/vc0aqaPKhm7sbraTt5IrI
 TwqA/WhnAHv90MQCGFcofANyYTkjPkKk2QBFK6b0suoAmVdwVWWELi1WaZ+HdvgQ
 JP7YpmC2cXcQBPk=
 =ZGxL
 -----END PGP SIGNATURE-----

Merge tag 'pci-v5.3-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI updates from Bjorn Helgaas:
 "Enumeration changes:

   - Evaluate PCI Boot Configuration _DSM to learn if firmware wants us
     to preserve its resource assignments (Benjamin Herrenschmidt)

   - Simplify resource distribution (Nicholas Johnson)

   - Decode 32 GT/s link speed (Gustavo Pimentel)

  Virtualization:

   - Fix incorrect caching of VF config space size (Alex Williamson)

   - Fix VF driver probing sysfs knobs (Alex Williamson)

  Peer-to-peer DMA:

   - Fix dma_virt_ops check (Logan Gunthorpe)

  Altera host bridge driver:

   - Allow building as module (Ley Foon Tan)

  Armada 8K host bridge driver:

   - add PHYs support (Miquel Raynal)

  DesignWare host bridge driver:

   - Export APIs to support removable loadable module (Vidya Sagar)

   - Enable Relaxed Ordering erratum workaround only on Tegra20 &
     Tegra30 (Vidya Sagar)

  Hyper-V host bridge driver:

   - Fix use-after-free in eject (Dexuan Cui)

  Mobiveil host bridge driver:

   - Clean up and fix many issues, including non-identify mapped
     windows, 64-bit windows, multi-MSI, class code, INTx clearing (Hou
     Zhiqiang)

  Qualcomm host bridge driver:

   - Use clk bulk API for 2.4.0 controllers (Bjorn Andersson)

   - Add QCS404 support (Bjorn Andersson)

   - Assert PERST for at least 100ms (Niklas Cassel)

  R-Car host bridge driver:

   - Add r8a774a1 DT support (Biju Das)

  Tegra host bridge driver:

   - Add support for Gen2, opportunistic UpdateFC and ACK (PCIe protocol
     details) AER, GPIO-based PERST# (Manikanta Maddireddy)

   - Fix many issues, including power-on failure cases, interrupt
     masking in suspend, UPHY settings, AFI dynamic clock gating,
     pending DLL transactions (Manikanta Maddireddy)

  Xilinx host bridge driver:

   - Fix NWL Multi-MSI programming (Bharat Kumar Gogada)

  Endpoint support:

   - Fix 64bit BAR support (Alan Mikhak)

   - Fix pcitest build issues (Alan Mikhak, Andy Shevchenko)

  Bug fixes:

   - Fix NVIDIA GPU multi-function power dependencies (Abhishek Sahu)

   - Fix NVIDIA GPU HDA enablement issue (Lukas Wunner)

   - Ignore lockdep for sysfs "remove" (Marek Vasut)

  Misc:

   - Convert docs to reST (Changbin Du, Mauro Carvalho Chehab)"

* tag 'pci-v5.3-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (107 commits)
  PCI: Enable NVIDIA HDA controllers
  tools: PCI: Fix installation when `make tools/pci_install`
  PCI: dwc: pci-dra7xx: Fix compilation when !CONFIG_GPIOLIB
  PCI: Fix typos and whitespace errors
  PCI: mobiveil: Fix INTx interrupt clearing in mobiveil_pcie_isr()
  PCI: mobiveil: Fix infinite-loop in the INTx handling function
  PCI: mobiveil: Move PCIe PIO enablement out of inbound window routine
  PCI: mobiveil: Add upper 32-bit PCI base address setup in inbound window
  PCI: mobiveil: Add upper 32-bit CPU base address setup in outbound window
  PCI: mobiveil: Mask out hardcoded bits in inbound/outbound windows setup
  PCI: mobiveil: Clear the control fields before updating it
  PCI: mobiveil: Add configured inbound windows counter
  PCI: mobiveil: Fix the valid check for inbound and outbound windows
  PCI: mobiveil: Clean-up program_{ib/ob}_windows()
  PCI: mobiveil: Remove an unnecessary return value check
  PCI: mobiveil: Fix error return values
  PCI: mobiveil: Refactor the MEM/IO outbound window initialization
  PCI: mobiveil: Make some register updates more readable
  PCI: mobiveil: Reformat the code for readability
  dt-bindings: PCI: mobiveil: Change gpio_slave and apb_csr to optional
  ...
2019-07-15 20:44:49 -07:00
Andrii Nakryiko
1acc5d5c58 bpf: fix BTF verifier size resolution logic
BTF verifier has a size resolution bug which in some circumstances leads to
invalid size resolution for, e.g., TYPEDEF modifier.  This happens if we have
[1] PTR -> [2] TYPEDEF -> [3] ARRAY, in which case due to being in pointer
context ARRAY size won't be resolved (because for pointer it doesn't matter, so
it's a sink in pointer context), but it will be permanently remembered as zero
for TYPEDEF and TYPEDEF will be marked as RESOLVED. Eventually ARRAY size will
be resolved correctly, but TYPEDEF resolved_size won't be updated anymore.
This, subsequently, will lead to erroneous map creation failure, if that
TYPEDEF is specified as either key or value, as key_size/value_size won't
correspond to resolved size of TYPEDEF (kernel will believe it's zero).

Note, that if BTF was ordered as [1] ARRAY <- [2] TYPEDEF <- [3] PTR, this
won't be a problem, as by the time we get to TYPEDEF, ARRAY's size is already
calculated and stored.

This bug manifests itself in rejecting BTF-defined maps that use array
typedef as a value type:

typedef int array_t[16];

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __type(value, array_t); /* i.e., array_t *value; */
} test_map SEC(".maps");

The fix consists on not relying on modifier's resolved_size and instead using
modifier's resolved_id (type ID for "concrete" type to which modifier
eventually resolves) and doing size determination for that resolved type. This
allow to preserve existing "early DFS termination" logic for PTR or
STRUCT_OR_ARRAY contexts, but still do correct size determination for modifier
types.

Fixes: eb3f595dab ("bpf: btf: Validate type reference")
Cc: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-15 23:02:17 +02:00
Mauro Carvalho Chehab
da82c92f11 docs: cgroup-v1: add it to the admin-guide book
Those files belong to the admin guide, so add them.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2019-07-15 11:03:02 -03:00
Mauro Carvalho Chehab
5704324702 docs: admin-guide: move sysctl directory to it
The stuff under sysctl describes /sys interface from userspace
point of view. So, add it to the admin-guide and remove the
:orphan: from its index file.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2019-07-15 11:03:01 -03:00
Mauro Carvalho Chehab
53b9537509 docs: sysctl: convert to ReST
Rename the /proc/sys/ documentation files to ReST, using the
README file as a template for an index.rst, adding the other
files there via TOC markup.

Despite being written on different times with different
styles, try to make them somewhat coherent with a similar
look and feel, ensuring that they'll look nice as both
raw text file and as via the html output produced by the
Sphinx build system.

At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2019-07-15 09:20:26 -03:00
Mauro Carvalho Chehab
387b14684f docs: locking: convert docs to ReST and rename to *.rst
Convert the locking documents to ReST and add them to the
kernel development book where it belongs.

Most of the stuff here is just to make Sphinx to properly
parse the text file, as they're already in good shape,
not requiring massive changes in order to be parsed.

The conversion is actually:
  - add blank lines and identation in order to identify paragraphs;
  - fix tables markups;
  - add some lists markups;
  - mark literal blocks;
  - adjust title markups.

At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Federico Vaga <federico.vaga@vaga.pv.it>
2019-07-15 08:53:27 -03:00
Linus Torvalds
fec88ab0af HMM patches for 5.3
Improvements and bug fixes for the hmm interface in the kernel:
 
 - Improve clarity, locking and APIs related to the 'hmm mirror' feature
   merged last cycle. In linux-next we now see AMDGPU and nouveau to be
   using this API.
 
 - Remove old or transitional hmm APIs. These are hold overs from the past
   with no users, or APIs that existed only to manage cross tree conflicts.
   There are still a few more of these cleanups that didn't make the merge
   window cut off.
 
 - Improve some core mm APIs:
   * export alloc_pages_vma() for driver use
   * refactor into devm_request_free_mem_region() to manage
     DEVICE_PRIVATE resource reservations
   * refactor duplicative driver code into the core dev_pagemap
     struct
 
 - Remove hmm wrappers of improved core mm APIs, instead have drivers use
   the simplified API directly
 
 - Remove DEVICE_PUBLIC
 
 - Simplify the kconfig flow for the hmm users and core code
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl0k1zkACgkQOG33FX4g
 mxrO+w//QF/yI/9Hh30RWEBq8W107cODkDlaT0Z/7cVEXfGetZzIUpqzxnJofRfQ
 xTw1XmYkc9WpJe/mTTuFZFewNQwWuMM6X0Xi25fV438/Y64EclevlcJTeD49TIH1
 CIMsz8bX7CnCEq5sz+UypLg9LPnaD9L/JLyuSbyjqjms/o+yzqa7ji7p/DSINuhZ
 Qva9OZL1ZSEDJfNGi8uGpYBqryHoBAonIL12R9sCF5pbJEnHfWrH7C06q7AWOAjQ
 4vjN/p3F4L9l/v2IQ26Kn/S0AhmN7n3GT//0K66e2gJPfXa8fxRKGuFn/Kd79EGL
 YPASn5iu3cM23up1XkbMNtzacL8yiIeTOcMdqw26OaOClojy/9OJduv5AChe6qL/
 VUQIAn1zvPsJTyC5U7mhmkrGuTpP6ivHpxtcaUp+Ovvi1cyK40nLCmSNvLnbN5ES
 bxbb0SjE4uupDG5qU6Yct/hFp6uVMSxMqXZOb9Xy8ZBkbMsJyVOLj71G1/rVIfPU
 hO1AChX5CRG1eJoMo6oBIpiwmSvcOaPp3dqIOQZvwMOqrO869LR8qv7RXyh/g9gi
 FAEKnwLl4GK3YtEO4Kt/1YI5DXYjSFUbfgAs0SPsRKS6hK2+RgRk2M/B/5dAX0/d
 lgOf9WPODPwiSXBYLtJB8qHVDX0DIY8faOyTx6BYIKClUtgbBI8=
 =wKvp
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull HMM updates from Jason Gunthorpe:
 "Improvements and bug fixes for the hmm interface in the kernel:

   - Improve clarity, locking and APIs related to the 'hmm mirror'
     feature merged last cycle. In linux-next we now see AMDGPU and
     nouveau to be using this API.

   - Remove old or transitional hmm APIs. These are hold overs from the
     past with no users, or APIs that existed only to manage cross tree
     conflicts. There are still a few more of these cleanups that didn't
     make the merge window cut off.

   - Improve some core mm APIs:
       - export alloc_pages_vma() for driver use
       - refactor into devm_request_free_mem_region() to manage
         DEVICE_PRIVATE resource reservations
       - refactor duplicative driver code into the core dev_pagemap
         struct

   - Remove hmm wrappers of improved core mm APIs, instead have drivers
     use the simplified API directly

   - Remove DEVICE_PUBLIC

   - Simplify the kconfig flow for the hmm users and core code"

* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (42 commits)
  mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR
  mm: remove the HMM config option
  mm: sort out the DEVICE_PRIVATE Kconfig mess
  mm: simplify ZONE_DEVICE page private data
  mm: remove hmm_devmem_add
  mm: remove hmm_vma_alloc_locked_page
  nouveau: use devm_memremap_pages directly
  nouveau: use alloc_page_vma directly
  PCI/P2PDMA: use the dev_pagemap internal refcount
  device-dax: use the dev_pagemap internal refcount
  memremap: provide an optional internal refcount in struct dev_pagemap
  memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
  memremap: remove the data field in struct dev_pagemap
  memremap: add a migrate_to_ram method to struct dev_pagemap_ops
  memremap: lift the devmap_enable manipulation into devm_memremap_pages
  memremap: pass a struct dev_pagemap to ->kill and ->cleanup
  memremap: move dev_pagemap callbacks into a separate structure
  memremap: validate the pagemap type passed to devm_memremap_pages
  mm: factor out a devm_request_free_mem_region helper
  mm: export alloc_pages_vma
  ...
2019-07-14 19:42:11 -07:00
Linus Torvalds
1d03985933 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "A number of PMU driver corner case fixes, a race fix, an event
  grouping fix, plus a bunch of tooling fixes/updates"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
  perf/x86/intel: Fix spurious NMI on fixed counter
  perf/core: Fix exclusive events' grouping
  perf/x86/amd/uncore: Set the thread mask for F17h L3 PMCs
  perf/x86/amd/uncore: Do not set 'ThreadMask' and 'SliceMask' for non-L3 PMCs
  perf/core: Fix race between close() and fork()
  perf intel-pt: Fix potential NULL pointer dereference found by the smatch tool
  perf intel-bts: Fix potential NULL pointer dereference found by the smatch tool
  perf script: Assume native_arch for pipe mode
  perf scripts python: export-to-sqlite.py: Fix DROP VIEW power_events_view
  perf scripts python: export-to-postgresql.py: Fix DROP VIEW power_events_view
  perf hists browser: Fix potential NULL pointer dereference found by the smatch tool
  perf cs-etm: Fix potential NULL pointer dereference found by the smatch tool
  perf parse-events: Remove unused variable: error
  perf parse-events: Remove unused variable 'i'
  perf metricgroup: Add missing list_del_init() when flushing egroups list
  perf tools: Use list_del_init() more thorougly
  perf tools: Use zfree() where applicable
  tools lib: Adopt zalloc()/zfree() from tools/perf
  perf tools: Move get_current_dir_name() cond prototype out of util.h
  perf namespaces: Move the conditional setns() prototype to namespaces.h
  ...
2019-07-14 11:40:33 -07:00
Linus Torvalds
0c85ce1354 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Ingo Molnar:
 "A single fix for a locking statistics bug"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/lockdep: Fix lock used or unused stats error
2019-07-14 11:39:21 -07:00
Dmitry V. Levin
028b6e8a89
clone: fix CLONE_PIDFD support
The introduction of clone3 syscall accidentally broke CLONE_PIDFD
support in traditional clone syscall on compat x86 and those
architectures that use do_fork to implement clone syscall.

This bug was found by strace test suite.

Link: https://strace.io/logs/strace/2019-07-12
Fixes: 7f192e3cd3 ("fork: add clone3")
Bisected-and-tested-by: Anatoly Pugachev <matorola@gmail.com>
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Link: https://lore.kernel.org/r/20190714162047.GB10389@altlinux.org
Signed-off-by: Christian Brauner <christian@brauner.io>
2019-07-14 20:36:12 +02:00
Linus Torvalds
50ec18819c Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
 "Fix a sched statistics related bug that would trigger a kernel warning
  on certain configs"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Fix preempt warning in ttwu
2019-07-14 11:34:05 -07:00
Yuyang Du
68d41d8c94 locking/lockdep: Fix lock used or unused stats error
The stats variable nr_unused_locks is incremented every time a new lock
class is register and decremented when the lock is first used in
__lock_acquire(). And after all, it is shown and checked in lockdep_stats.

However, under configurations that either CONFIG_TRACE_IRQFLAGS or
CONFIG_PROVE_LOCKING is not defined:

The commit:

  0918065151 ("locking/lockdep: Consolidate lock usage bit initialization")

missed marking the LOCK_USED flag at IRQ usage initialization because
as mark_usage() is not called. And the commit:

  886532aee3 ("locking/lockdep: Move mark_lock() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING")

further made mark_lock() not defined such that the LOCK_USED cannot be
marked at all when the lock is first acquired.

As a result, we fix this by not showing and checking the stats under such
configurations for lockdep_stats.

Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Yuyang Du <duyuyang@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: arnd@arndb.de
Cc: frederic@kernel.org
Link: https://lkml.kernel.org/r/20190709101522.9117-1-duyuyang@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-13 11:24:53 +02:00
Peter Zijlstra
e3d85487fb sched/core: Fix preempt warning in ttwu
John reported a DEBUG_PREEMPT warning caused by commit:

  aacedf26fb ("sched/core: Optimize try_to_wake_up() for local wakeups")

I overlooked that ttwu_stat() requires preemption disabled.

Reported-by: John Stultz <john.stultz@linaro.org>
Tested-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: aacedf26fb ("sched/core: Optimize try_to_wake_up() for local wakeups")
Link: https://lkml.kernel.org/r/20190710105736.GK3402@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-13 11:23:27 +02:00
Alexander Shishkin
8a58ddae23 perf/core: Fix exclusive events' grouping
So far, we tried to disallow grouping exclusive events for the fear of
complications they would cause with moving between contexts. Specifically,
moving a software group to a hardware context would violate the exclusivity
rules if both groups contain matching exclusive events.

This attempt was, however, unsuccessful: the check that we have in the
perf_event_open() syscall is both wrong (looks at wrong PMU) and
insufficient (group leader may still be exclusive), as can be illustrated
by running:

  $ perf record -e '{intel_pt//,cycles}' uname
  $ perf record -e '{cycles,intel_pt//}' uname

ultimately successfully.

Furthermore, we are completely free to trigger the exclusivity violation
by:

   perf -e '{cycles,intel_pt//}' -e '{intel_pt//,instructions}'

even though the helpful perf record will not allow that, the ABI will.

The warning later in the perf_event_open() path will also not trigger, because
it's also wrong.

Fix all this by validating the original group before moving, getting rid
of broken safeguards and placing a useful one to perf_install_in_context().

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: mathieu.poirier@linaro.org
Cc: will.deacon@arm.com
Fixes: bed5b25ad9 ("perf: Add a pmu capability for "exclusive" events")
Link: https://lkml.kernel.org/r/20190701110755.24646-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-13 11:21:28 +02:00
Peter Zijlstra
1cf8dfe8a6 perf/core: Fix race between close() and fork()
Syzcaller reported the following Use-after-Free bug:

	close()						clone()

							  copy_process()
							    perf_event_init_task()
							      perf_event_init_context()
							        mutex_lock(parent_ctx->mutex)
								inherit_task_group()
								  inherit_group()
								    inherit_event()
								      mutex_lock(event->child_mutex)
								      // expose event on child list
								      list_add_tail()
								      mutex_unlock(event->child_mutex)
							        mutex_unlock(parent_ctx->mutex)

							    ...
							    goto bad_fork_*

							  bad_fork_cleanup_perf:
							    perf_event_free_task()

	  perf_release()
	    perf_event_release_kernel()
	      list_for_each_entry()
		mutex_lock(ctx->mutex)
		mutex_lock(event->child_mutex)
		// event is from the failing inherit
		// on the other CPU
		perf_remove_from_context()
		list_move()
		mutex_unlock(event->child_mutex)
		mutex_unlock(ctx->mutex)

							      mutex_lock(ctx->mutex)
							      list_for_each_entry_safe()
							        // event already stolen
							      mutex_unlock(ctx->mutex)

							    delayed_free_task()
							      free_task()

	     list_for_each_entry_safe()
	       list_del()
	       free_event()
	         _free_event()
		   // and so event->hw.target
		   // is the already freed failed clone()
		   if (event->hw.target)
		     put_task_struct(event->hw.target)
		       // WHOOPSIE, already quite dead

Which puts the lie to the the comment on perf_event_free_task():
'unexposed, unused context' not so much.

Which is a 'fun' confluence of fail; copy_process() doing an
unconditional free_task() and not respecting refcounts, and perf having
creative locking. In particular:

  82d94856fa ("perf/core: Fix lock inversion between perf,trace,cpuhp")

seems to have overlooked this 'fun' parade.

Solve it by using the fact that detached events still have a reference
count on their (previous) context. With this perf_event_free_task()
can detect when events have escaped and wait for their destruction.

Debugged-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Reported-by: syzbot+a24c397a29ad22d86c98@syzkaller.appspotmail.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: <stable@vger.kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: 82d94856fa ("perf/core: Fix lock inversion between perf,trace,cpuhp")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-13 11:21:25 +02:00
Linus Torvalds
39ceda5ce1 Kbuild updates for v5.3
- remove headers_{install,check}_all targets
 
 - remove unreasonable 'depends on !UML' from CONFIG_SAMPLES
 
 - re-implement 'make headers_install' more cleanly
 
 - add new header-test-y syntax to compile-test headers
 
 - compile-test exported headers to ensure they are compilable in
   user-space
 
 - compile-test headers under include/ to ensure they are self-contained
 
 - remove -Waggregate-return, -Wno-uninitialized, -Wno-unused-value flags
 
 - add -Werror=unknown-warning-option for Clang
 
 - add 128-bit built-in types support to genksyms
 
 - fix missed rebuild of modules.builtin
 
 - propagate 'No space left on device' error in fixdep to Make
 
 - allow Clang to use its integrated assembler
 
 - improve some coccinelle scripts
 
 - add a new flag KBUILD_ABS_SRCTREE to request Kbuild to use absolute
   path for $(srctree).
 
 - do not ignore errors when compression utility is missing
 
 - misc cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQJSBAABCgA8FiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl0oxNkeHHlhbWFkYS5t
 YXNhaGlyb0Bzb2Npb25leHQuY29tAAoJED2LAQed4NsGnhcP/AuM8s+3SYFiLitJ
 ISbznLFP2Xatq0SPXp5+moez/AMTK6Mm1biPcdo20d+TjVEh4+9F2nq12Ii9U8/D
 tds9A6G8+Bb28r9GMIVQPdFohijW6ijtDziS31iQnIWyPsP/yx6PKfLAD9F4ca1x
 7/4btmu+BOMjtN0NrMWSNz5MM47xUzoWIALL40SV4PzGVXLCQZ2PBNPeSRIk22Jt
 ynDNPuNsmDWcFfwAE+sLSDrhCHZlwM8rg8rf6jmYdc4LcN4cj0oho5+K1TRyC9mn
 fO3PT25juFejthxQulxEfyGggnyLM6BNTgPDGcCHSP4nD7mlXA9GcpZICtJOgGGu
 SlDadMZ0GRMK5zcZ0MF0GQboeyViwsbXgrRcYuXt6cUFWX4P/1SeAQ5Mf4u1EKqf
 hEbwFXV/g81ht0lFS8gyWkvdpoNPtxGHNPusLjp65C4rc0/48/s+7EE/u8JTPl1g
 dQTeIOds6XUOkJgqhEfuq+8gfngbjKc9bYhs+ACbkCzBltQdnb6m5aLgk0ODxe8I
 WbGn0+cQcS9VVwre7E5DnFSVWVOHAG5taiUwj0KDcHB0Jxw9Gvorq9WU1ppHHYH2
 XQIFBx7XHdn28d+plS8R23vAPgDgrGdvE5RYK5tNQLhTJ6BbjlZ1n/Tmxzu62scK
 deG3aCOB13Om7OTzTUh9+C3TC9ZQ
 =E2Rz
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild updates from Masahiro Yamada:

 - remove headers_{install,check}_all targets

 - remove unreasonable 'depends on !UML' from CONFIG_SAMPLES

 - re-implement 'make headers_install' more cleanly

 - add new header-test-y syntax to compile-test headers

 - compile-test exported headers to ensure they are compilable in
   user-space

 - compile-test headers under include/ to ensure they are self-contained

 - remove -Waggregate-return, -Wno-uninitialized, -Wno-unused-value
   flags

 - add -Werror=unknown-warning-option for Clang

 - add 128-bit built-in types support to genksyms

 - fix missed rebuild of modules.builtin

 - propagate 'No space left on device' error in fixdep to Make

 - allow Clang to use its integrated assembler

 - improve some coccinelle scripts

 - add a new flag KBUILD_ABS_SRCTREE to request Kbuild to use absolute
   path for $(srctree).

 - do not ignore errors when compression utility is missing

 - misc cleanups

* tag 'kbuild-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (49 commits)
  kbuild: use -- separater intead of $(filter-out ...) for cc-cross-prefix
  kbuild: Inform user to pass ARCH= for make mrproper
  kbuild: fix compression errors getting ignored
  kbuild: add a flag to force absolute path for srctree
  kbuild: replace KBUILD_SRCTREE with boolean building_out_of_srctree
  kbuild: remove src and obj from the top Makefile
  scripts/tags.sh: remove unused environment variables from comments
  scripts/tags.sh: drop SUBARCH support for ARM
  kbuild: compile-test kernel headers to ensure they are self-contained
  kheaders: include only headers into kheaders_data.tar.xz
  kheaders: remove meaningless -R option of 'ls'
  kbuild: support header-test-pattern-y
  kbuild: do not create wrappers for header-test-y
  kbuild: compile-test exported headers to ensure they are self-contained
  init/Kconfig: add CONFIG_CC_CAN_LINK
  kallsyms: exclude kasan local symbols on s390
  kbuild: add more hints about SUBDIRS replacement
  coccinelle: api/stream_open: treat all wait_.*() calls as blocking
  coccinelle: put_device: Add a cast to an expression for an assignment
  coccinelle: put_device: Adjust a message construction
  ...
2019-07-12 16:03:16 -07:00
Linus Torvalds
9e3a25dc99 dma-mapping updates for Linux 5.3
- move the USB special case that bounced DMA through a device
    bar into the USB code instead of handling it in the common
    DMA code (Laurentiu Tudor and Fredrik Noring)
  - don't dip into the global CMA pool for single page allocations
    (Nicolin Chen)
  - fix a crash when allocating memory for the atomic pool failed
    during boot (Florian Fainelli)
  - move support for MIPS-style uncached segments to the common
    code and use that for MIPS and nios2 (me)
  - make support for DMA_ATTR_NON_CONSISTENT and
    DMA_ATTR_NO_KERNEL_MAPPING generic (me)
  - convert nds32 to the generic remapping allocator (me)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl0nPqgLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYNj2hAAxIv2O3wv6V5xhzWwOVo8e/xW1ZLlGAF0/z92u0do
 32Tm8jkdAGjZDnyxam7qisMSIjCNykpauQzVVxyUNBRSsn1V5t7KSaH3/OXCOVcr
 x2VWBirxGO2BbRseaCBjIcA/2qna+VIDGFcNXCtf6rM00YUK6qaJzkMwBKQAeYcM
 uJMJkaf8qaW4hygLJP8axXiGFdIJyFNLAlJ+ok6kYsJHHJNceOp0bo3CDa2mJBK9
 IhraK2zVkyE5EQkQM5cE/Kw1ppPelUKUkHwjgM4wpz2b18WbLu11nKP0hmUcvKRQ
 heY8xWiKxN0QTgS03ou7EVylyrSAE4dIKgzuA4VO32QCGsWypcAg4iU6s5TX6p9g
 tZEW2ckE6wbmRdQPyKoDpZg299/eQjRHc4MAA1yinT8tFMokw2tk8Fq1FWyltwL1
 8EiP5oNs2qUNvNgqUresl6/f6YOacFi1Q6IhgBVj6d6lyhMhlsHfW4w1XA1siv/I
 6l4qJbLohYab6hY7i+mBOd8iG/KrAlr4P6admnv2jDchswbb5t2j+ABE9xv++PFi
 u1HFqMlxqdWQaXGca2UeCUxUjkwO9N+kHpP+VRz+6D2b64dtCWSu8CN23sYXm2tO
 ubWIlrQQZPhhMkoFg7XqKSTacd+ut+SXN9Nxsyv548ETV0l1xbiLRHIbhyoIESD5
 RAI=
 =01Fr
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.3' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping updates from Christoph Hellwig:

 - move the USB special case that bounced DMA through a device bar into
   the USB code instead of handling it in the common DMA code (Laurentiu
   Tudor and Fredrik Noring)

 - don't dip into the global CMA pool for single page allocations
   (Nicolin Chen)

 - fix a crash when allocating memory for the atomic pool failed during
   boot (Florian Fainelli)

 - move support for MIPS-style uncached segments to the common code and
   use that for MIPS and nios2 (me)

 - make support for DMA_ATTR_NON_CONSISTENT and
   DMA_ATTR_NO_KERNEL_MAPPING generic (me)

 - convert nds32 to the generic remapping allocator (me)

* tag 'dma-mapping-5.3' of git://git.infradead.org/users/hch/dma-mapping: (29 commits)
  dma-mapping: mark dma_alloc_need_uncached as __always_inline
  MIPS: only select ARCH_HAS_UNCACHED_SEGMENT for non-coherent platforms
  usb: host: Fix excessive alignment restriction for local memory allocations
  lib/genalloc.c: Add algorithm, align and zeroed family of DMA allocators
  nios2: use the generic uncached segment support in dma-direct
  nds32: use the generic remapping allocator for coherent DMA allocations
  arc: use the generic remapping allocator for coherent DMA allocations
  dma-direct: handle DMA_ATTR_NO_KERNEL_MAPPING in common code
  dma-direct: handle DMA_ATTR_NON_CONSISTENT in common code
  dma-mapping: add a dma_alloc_need_uncached helper
  openrisc: remove the partial DMA_ATTR_NON_CONSISTENT support
  arc: remove the partial DMA_ATTR_NON_CONSISTENT support
  arm-nommu: remove the partial DMA_ATTR_NON_CONSISTENT support
  ARM: dma-mapping: allow larger DMA mask than supported
  dma-mapping: truncate dma masks to what dma_addr_t can hold
  iommu/dma: Apply dma_{alloc,free}_contiguous functions
  dma-remap: Avoid de-referencing NULL atomic_pool
  MIPS: use the generic uncached segment support in dma-direct
  dma-direct: provide generic support for uncached kernel segments
  au1100fb: fix DMA API abuse
  ...
2019-07-12 15:13:55 -07:00
Linus Torvalds
f632a8170a Driver Core and debugfs changes for 5.3-rc1
Here is the "big" driver core and debugfs changes for 5.3-rc1
 
 It's a lot of different patches, all across the tree due to some api
 changes and lots of debugfs cleanups.  Because of this, there is going
 to be some merge issues with your tree at the moment, I'll follow up
 with the expected resolutions to make it easier for you.
 
 Other than the debugfs cleanups, in this set of changes we have:
 	- bus iteration function cleanups (will cause build warnings
 	  with s390 and coresight drivers in your tree)
 	- scripts/get_abi.pl tool to display and parse Documentation/ABI
 	  entries in a simple way
 	- cleanups to Documenatation/ABI/ entries to make them parse
 	  easier due to typos and other minor things
 	- default_attrs use for some ktype users
 	- driver model documentation file conversions to .rst
 	- compressed firmware file loading
 	- deferred probe fixes
 
 All of these have been in linux-next for a while, with a bunch of merge
 issues that Stephen has been patient with me for.  Other than the merge
 issues, functionality is working properly in linux-next :)
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXSgpnQ8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ykcwgCfS30OR4JmwZydWGJ7zK/cHqk+KjsAnjOxjC1K
 LpRyb3zX29oChFaZkc5a
 =XrEZ
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core and debugfs updates from Greg KH:
 "Here is the "big" driver core and debugfs changes for 5.3-rc1

  It's a lot of different patches, all across the tree due to some api
  changes and lots of debugfs cleanups.

  Other than the debugfs cleanups, in this set of changes we have:

   - bus iteration function cleanups

   - scripts/get_abi.pl tool to display and parse Documentation/ABI
     entries in a simple way

   - cleanups to Documenatation/ABI/ entries to make them parse easier
     due to typos and other minor things

   - default_attrs use for some ktype users

   - driver model documentation file conversions to .rst

   - compressed firmware file loading

   - deferred probe fixes

  All of these have been in linux-next for a while, with a bunch of
  merge issues that Stephen has been patient with me for"

* tag 'driver-core-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (102 commits)
  debugfs: make error message a bit more verbose
  orangefs: fix build warning from debugfs cleanup patch
  ubifs: fix build warning after debugfs cleanup patch
  driver: core: Allow subsystems to continue deferring probe
  drivers: base: cacheinfo: Ensure cpu hotplug work is done before Intel RDT
  arch_topology: Remove error messages on out-of-memory conditions
  lib: notifier-error-inject: no need to check return value of debugfs_create functions
  swiotlb: no need to check return value of debugfs_create functions
  ceph: no need to check return value of debugfs_create functions
  sunrpc: no need to check return value of debugfs_create functions
  ubifs: no need to check return value of debugfs_create functions
  orangefs: no need to check return value of debugfs_create functions
  nfsd: no need to check return value of debugfs_create functions
  lib: 842: no need to check return value of debugfs_create functions
  debugfs: provide pr_fmt() macro
  debugfs: log errors when something goes wrong
  drivers: s390/cio: Fix compilation warning about const qualifiers
  drivers: Add generic helper to match by of_node
  driver_find_device: Unify the match function with class_find_device()
  bus_find_device: Unify the match callback with class_find_device
  ...
2019-07-12 12:24:03 -07:00
Linus Torvalds
ef8f3d48af Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:
 "Am experimenting with splitting MM up into identifiable subsystems
  perhaps with a view to gitifying it in complex ways. Also with more
  verbose "incoming" emails.

  Most of MM is here and a few other trees.

  Subsystems affected by this patch series:
   - hotfixes
   - iommu
   - scripts
   - arch/sh
   - ocfs2
   - mm:slab-generic
   - mm:slub
   - mm:kmemleak
   - mm:kasan
   - mm:cleanups
   - mm:debug
   - mm:pagecache
   - mm:swap
   - mm:memcg
   - mm:gup
   - mm:pagemap
   - mm:infrastructure
   - mm:vmalloc
   - mm:initialization
   - mm:pagealloc
   - mm:vmscan
   - mm:tools
   - mm:proc
   - mm:ras
   - mm:oom-kill

  hotfixes:
      mm: vmscan: scan anonymous pages on file refaults
      mm/nvdimm: add is_ioremap_addr and use that to check ioremap address
      mm/memcontrol: fix wrong statistics in memory.stat
      mm/z3fold.c: lock z3fold page before  __SetPageMovable()
      nilfs2: do not use unexported cpu_to_le32()/le32_to_cpu() in uapi header
      MAINTAINERS: nilfs2: update email address

  iommu:
      include/linux/dmar.h: replace single-char identifiers in macros

  scripts:
      scripts/decode_stacktrace: match basepath using shell prefix operator, not regex
      scripts/decode_stacktrace: look for modules with .ko.debug extension
      scripts/spelling.txt: drop "sepc" from the misspelling list
      scripts/spelling.txt: add spelling fix for prohibited
      scripts/decode_stacktrace: Accept dash/underscore in modules
      scripts/spelling.txt: add more spellings to spelling.txt

  arch/sh:
      arch/sh/configs/sdk7786_defconfig: remove CONFIG_LOGFS
      sh: config: remove left-over BACKLIGHT_LCD_SUPPORT
      sh: prevent warnings when using iounmap

  ocfs2:
      fs: ocfs: fix spelling mistake "hearbeating" -> "heartbeat"
      ocfs2/dlm: use struct_size() helper
      ocfs2: add last unlock times in locking_state
      ocfs2: add locking filter debugfs file
      ocfs2: add first lock wait time in locking_state
      ocfs: no need to check return value of debugfs_create functions
      fs/ocfs2/dlmglue.c: unneeded variable: "status"
      ocfs2: use kmemdup rather than duplicating its implementation

  mm:slab-generic:
    Patch series "mm/slab: Improved sanity checking":
      mm/slab: validate cache membership under freelist hardening
      mm/slab: sanity-check page type when looking up cache
      lkdtm/heap: add tests for freelist hardening

  mm:slub:
      mm/slub.c: avoid double string traverse in kmem_cache_flags()
      slub: don't panic for memcg kmem cache creation failure

  mm:kmemleak:
      mm/kmemleak.c: fix check for softirq context
      mm/kmemleak.c: change error at _write when kmemleak is disabled
      docs: kmemleak: add more documentation details

  mm:kasan:
      mm/kasan: print frame description for stack bugs
      Patch series "Bitops instrumentation for KASAN", v5:
        lib/test_kasan: add bitops tests
        x86: use static_cpu_has in uaccess region to avoid instrumentation
        asm-generic, x86: add bitops instrumentation for KASAN
      Patch series "mm/kasan: Add object validation in ksize()", v3:
        mm/kasan: introduce __kasan_check_{read,write}
        mm/kasan: change kasan_check_{read,write} to return boolean
        lib/test_kasan: Add test for double-kzfree detection
        mm/slab: refactor common ksize KASAN logic into slab_common.c
        mm/kasan: add object validation in ksize()

  mm:cleanups:
      include/linux/pfn_t.h: remove pfn_t_to_virt()
      Patch series "remove ARCH_SELECT_MEMORY_MODEL where it has no effect":
        arm: remove ARCH_SELECT_MEMORY_MODEL
        s390: remove ARCH_SELECT_MEMORY_MODEL
        sparc: remove ARCH_SELECT_MEMORY_MODEL
      mm/gup.c: make follow_page_mask() static
      mm/memory.c: trivial clean up in insert_page()
      mm: make !CONFIG_HUGE_PAGE wrappers into static inlines
      include/linux/mm_types.h: ifdef struct vm_area_struct::swap_readahead_info
      mm: remove the account_page_dirtied export
      mm/page_isolation.c: change the prototype of undo_isolate_page_range()
      include/linux/vmpressure.h: use spinlock_t instead of struct spinlock
      mm: remove the exporting of totalram_pages
      include/linux/pagemap.h: document trylock_page() return value

  mm:debug:
      mm/failslab.c: by default, do not fail allocations with direct reclaim only
      Patch series "debug_pagealloc improvements":
        mm, debug_pagelloc: use static keys to enable debugging
        mm, page_alloc: more extensive free page checking with debug_pagealloc
        mm, debug_pagealloc: use a page type instead of page_ext flag

  mm:pagecache:
      Patch series "fix filler_t callback type mismatches", v2:
        mm/filemap.c: fix an overly long line in read_cache_page
        mm/filemap: don't cast ->readpage to filler_t for do_read_cache_page
        jffs2: pass the correct prototype to read_cache_page
        9p: pass the correct prototype to read_cache_page
      mm/filemap.c: correct the comment about VM_FAULT_RETRY

  mm:swap:
      mm, swap: fix race between swapoff and some swap operations
      mm/swap_state.c: simplify total_swapcache_pages() with get_swap_device()
      mm, swap: use rbtree for swap_extent
      mm/mincore.c: fix race between swapoff and mincore

  mm:memcg:
      memcg, oom: no oom-kill for __GFP_RETRY_MAYFAIL
      memcg, fsnotify: no oom-kill for remote memcg charging
      mm, memcg: introduce memory.events.local
      mm: memcontrol: dump memory.stat during cgroup OOM
      Patch series "mm: reparent slab memory on cgroup removal", v7:
        mm: memcg/slab: postpone kmem_cache memcg pointer initialization to memcg_link_cache()
        mm: memcg/slab: rename slab delayed deactivation functions and fields
        mm: memcg/slab: generalize postponed non-root kmem_cache deactivation
        mm: memcg/slab: introduce __memcg_kmem_uncharge_memcg()
        mm: memcg/slab: unify SLAB and SLUB page accounting
        mm: memcg/slab: don't check the dying flag on kmem_cache creation
        mm: memcg/slab: synchronize access to kmem_cache dying flag using a spinlock
        mm: memcg/slab: rework non-root kmem_cache lifecycle management
        mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages
        mm: memcg/slab: reparent memcg kmem_caches on cgroup removal
      mm, memcg: add a memcg_slabinfo debugfs file

  mm:gup:
      Patch series "switch the remaining architectures to use generic GUP", v4:
        mm: use untagged_addr() for get_user_pages_fast addresses
        mm: simplify gup_fast_permitted
        mm: lift the x86_32 PAE version of gup_get_pte to common code
        MIPS: use the generic get_user_pages_fast code
        sh: add the missing pud_page definition
        sh: use the generic get_user_pages_fast code
        sparc64: add the missing pgd_page definition
        sparc64: define untagged_addr()
        sparc64: use the generic get_user_pages_fast code
        mm: rename CONFIG_HAVE_GENERIC_GUP to CONFIG_HAVE_FAST_GUP
        mm: reorder code blocks in gup.c
        mm: consolidate the get_user_pages* implementations
        mm: validate get_user_pages_fast flags
        mm: move the powerpc hugepd code to mm/gup.c
        mm: switch gup_hugepte to use try_get_compound_head
        mm: mark the page referenced in gup_hugepte
      mm/gup: speed up check_and_migrate_cma_pages() on huge page
      mm/gup.c: remove some BUG_ONs from get_gate_page()
      mm/gup.c: mark undo_dev_pagemap as __maybe_unused

  mm:pagemap:
      asm-generic, x86: introduce generic pte_{alloc,free}_one[_kernel]
      alpha: switch to generic version of pte allocation
      arm: switch to generic version of pte allocation
      arm64: switch to generic version of pte allocation
      csky: switch to generic version of pte allocation
      m68k: sun3: switch to generic version of pte allocation
      mips: switch to generic version of pte allocation
      nds32: switch to generic version of pte allocation
      nios2: switch to generic version of pte allocation
      parisc: switch to generic version of pte allocation
      riscv: switch to generic version of pte allocation
      um: switch to generic version of pte allocation
      unicore32: switch to generic version of pte allocation
      mm/pgtable: drop pgtable_t variable from pte_fn_t functions
      mm/memory.c: fail when offset == num in first check of __vm_map_pages()

  mm:infrastructure:
      mm/mmu_notifier: use hlist_add_head_rcu()

  mm:vmalloc:
      Patch series "Some cleanups for the KVA/vmalloc", v5:
        mm/vmalloc.c: remove "node" argument
        mm/vmalloc.c: preload a CPU with one object for split purpose
        mm/vmalloc.c: get rid of one single unlink_va() when merge
        mm/vmalloc.c: switch to WARN_ON() and move it under unlink_va()
      mm/vmalloc.c: spelling> s/informaion/information/

  mm:initialization:
      mm/large system hash: use vmalloc for size > MAX_ORDER when !hashdist
      mm/large system hash: clear hashdist when only one node with memory is booted

  mm:pagealloc:
      arm64: move jump_label_init() before parse_early_param()
      Patch series "add init_on_alloc/init_on_free boot options", v10:
        mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options
        mm: init: report memory auto-initialization features at boot time

  mm:vmscan:
      mm: vmscan: remove double slab pressure by inc'ing sc->nr_scanned
      mm: vmscan: correct some vmscan counters for THP swapout

  mm:tools:
      tools/vm/slabinfo: order command line options
      tools/vm/slabinfo: add partial slab listing to -X
      tools/vm/slabinfo: add option to sort by partial slabs
      tools/vm/slabinfo: add sorting info to help menu

  mm:proc:
      proc: use down_read_killable mmap_sem for /proc/pid/maps
      proc: use down_read_killable mmap_sem for /proc/pid/smaps_rollup
      proc: use down_read_killable mmap_sem for /proc/pid/pagemap
      proc: use down_read_killable mmap_sem for /proc/pid/clear_refs
      proc: use down_read_killable mmap_sem for /proc/pid/map_files
      mm: use down_read_killable for locking mmap_sem in access_remote_vm
      mm: smaps: split PSS into components
      mm: vmalloc: show number of vmalloc pages in /proc/meminfo

  mm:ras:
      mm/memory-failure.c: clarify error message

  mm:oom-kill:
      mm: memcontrol: use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks()
      mm, oom: refactor dump_tasks for memcg OOMs
      mm, oom: remove redundant task_in_mem_cgroup() check
      oom: decouple mems_allowed from oom_unkillable_task
      mm/oom_kill.c: remove redundant OOM score normalization in select_bad_process()"

* akpm: (147 commits)
  mm/oom_kill.c: remove redundant OOM score normalization in select_bad_process()
  oom: decouple mems_allowed from oom_unkillable_task
  mm, oom: remove redundant task_in_mem_cgroup() check
  mm, oom: refactor dump_tasks for memcg OOMs
  mm: memcontrol: use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks()
  mm/memory-failure.c: clarify error message
  mm: vmalloc: show number of vmalloc pages in /proc/meminfo
  mm: smaps: split PSS into components
  mm: use down_read_killable for locking mmap_sem in access_remote_vm
  proc: use down_read_killable mmap_sem for /proc/pid/map_files
  proc: use down_read_killable mmap_sem for /proc/pid/clear_refs
  proc: use down_read_killable mmap_sem for /proc/pid/pagemap
  proc: use down_read_killable mmap_sem for /proc/pid/smaps_rollup
  proc: use down_read_killable mmap_sem for /proc/pid/maps
  tools/vm/slabinfo: add sorting info to help menu
  tools/vm/slabinfo: add option to sort by partial slabs
  tools/vm/slabinfo: add partial slab listing to -X
  tools/vm/slabinfo: order command line options
  mm: vmscan: correct some vmscan counters for THP swapout
  mm: vmscan: remove double slab pressure by inc'ing sc->nr_scanned
  ...
2019-07-12 11:40:28 -07:00
Aneesh Kumar K.V
9bd3bb6703 mm/nvdimm: add is_ioremap_addr and use that to check ioremap address
Architectures like powerpc use different address range to map ioremap
and vmalloc range.  The memunmap() check used by the nvdimm layer was
wrongly using is_vmalloc_addr() to check for ioremap range which fails
for ppc64.  This result in ppc64 not freeing the ioremap mapping.  The
side effect of this is an unbind failure during module unload with
papr_scm nvdimm driver

Link: http://lkml.kernel.org/r/20190701134038.14165-1-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Fixes: b5beae5e22 ("powerpc/pseries: Add driver for PAPR SCM regions")
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12 11:05:40 -07:00
Gustavo A. R. Silva
ed4ed4043a bpf: verifier: avoid fall-through warnings
In preparation to enabling -Wimplicit-fallthrough, this patch silences
the following warning:

kernel/bpf/verifier.c: In function ‘check_return_code’:
kernel/bpf/verifier.c:6106:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
   if (env->prog->expected_attach_type == BPF_CGROUP_UDP4_RECVMSG ||
      ^
kernel/bpf/verifier.c:6109:2: note: here
  case BPF_PROG_TYPE_CGROUP_SKB:
  ^~~~

Warning level 3 was used: -Wimplicit-fallthrough=3

Notice that is much clearer to explicitly add breaks in each case
statement (that actually contains some code), rather than letting
the code to fall through.

This patch is part of the ongoing efforts to enable
-Wimplicit-fallthrough.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-12 15:13:20 +02:00
Andrii Nakryiko
b3b50f05dc bpf: fix precision bit propagation for BPF_ST instructions
When backtracking instructions to propagate precision bit for registers
and stack slots, one class of instructions (BPF_ST) weren't handled
causing extra stack slots to be propagated into parent state. Parent
state might not have that much stack allocated, though, which causes
warning on invalid stack slot usage.

This patch adds handling of BPF_ST instructions:

BPF_MEM | <size> | BPF_ST:   *(size *) (dst_reg + off) = imm32

Reported-by: syzbot+4da3ff23081bafe74fc2@syzkaller.appspotmail.com
Fixes: b5dc0163d8 ("bpf: precise scalar_value tracking")
Cc: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-12 14:48:52 +02:00
Vincent Whitchurch
c9dccacfcc printk: Do not lose last line in kmsg buffer dump
kmsg_dump_get_buffer() is supposed to select all the youngest log
messages which fit into the provided buffer.  It determines the correct
start index by using msg_print_text() with a NULL buffer to calculate
the size of each entry.  However, when performing the actual writes,
msg_print_text() only writes the entry to the buffer if the written len
is lesser than the size of the buffer.  So if the lengths of the
selected youngest log messages happen to precisely fill up the provided
buffer, the last log message is not included.

We don't want to modify msg_print_text() to fill up the buffer and start
returning a length which is equal to the size of the buffer, since
callers of its other users, such as kmsg_dump_get_line(), depend upon
the current behaviour.

Instead, fix kmsg_dump_get_buffer() to compensate for this.

For example, with the following two final prints:

[    6.427502] AAAAAAAAAAAAA
[    6.427769] BBBBBBBB12345

A dump of a 64-byte buffer filled by kmsg_dump_get_buffer(), before this
patch:

 00000000: 3c 30 3e 5b 20 20 20 20 36 2e 35 32 32 31 39 37  <0>[    6.522197
 00000010: 5d 20 41 41 41 41 41 41 41 41 41 41 41 41 41 0a  ] AAAAAAAAAAAAA.
 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................

After this patch:

 00000000: 3c 30 3e 5b 20 20 20 20 36 2e 34 35 36 36 37 38  <0>[    6.456678
 00000010: 5d 20 42 42 42 42 42 42 42 42 31 32 33 34 35 0a  ] BBBBBBBB12345.
 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................

Link: http://lkml.kernel.org/r/20190711142937.4083-1-vincent.whitchurch@axis.com
Fixes: e2ae715d66 ("kmsg - kmsg_dump() use iterator to receive log buffer content")
To: rostedt@goodmis.org
Cc: linux-kernel@vger.kernel.org
Cc: <stable@vger.kernel.org> # v3.5+
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-07-12 14:10:58 +02:00
Linus Torvalds
db0457338e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching
Pull livepatching updates from Jiri Kosina:

 - stacktrace handling improvements from Miroslav benes

 - debug output improvements from Petr Mladek

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching:
  livepatch: Remove duplicate warning about missing reliable stacktrace support
  Revert "livepatch: Remove reliable stacktrace check in klp_try_switch_task()"
  stacktrace: Remove weak version of save_stack_trace_tsk_reliable()
  livepatch: Use static buffer for debugging messages under rq lock
  livepatch: Remove stale kobj_added entries from kernel-doc descriptions
2019-07-11 15:30:05 -07:00
Linus Torvalds
d7fe42a64a Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
 "Two small fixes from the timer departement:

   - Prevent the compiler from converting the nanoseconds adjustment
     loop in the VDSO update function to a division (__udivdi3) by using
     the __iter_div_u64_rem() inline function which exists to prevent
     exactly that problem.

   - Fix the wrong argument order of the GENMASK macro in the NPCM timer
     driver"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping/vsyscall: Use __iter_div_u64_rem()
  clocksource/drivers/npcm: Fix misuse of GENMASK macro
2019-07-11 13:52:23 -07:00
Linus Torvalds
02150fab6a Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull stacktrace fix from Thomas Gleixner:
 "Fix yet another instance of kernel thread check which ignores that
  kernel threads can call use_mm()"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  stacktrace: Use PF_KTHREAD to check for kernel threads
2019-07-11 13:50:44 -07:00
Linus Torvalds
237f83dfbe Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Some highlights from this development cycle:

   1) Big refactoring of ipv6 route and neigh handling to support
      nexthop objects configurable as units from userspace. From David
      Ahern.

   2) Convert explored_states in BPF verifier into a hash table,
      significantly decreased state held for programs with bpf2bpf
      calls, from Alexei Starovoitov.

   3) Implement bpf_send_signal() helper, from Yonghong Song.

   4) Various classifier enhancements to mvpp2 driver, from Maxime
      Chevallier.

   5) Add aRFS support to hns3 driver, from Jian Shen.

   6) Fix use after free in inet frags by allocating fqdirs dynamically
      and reworking how rhashtable dismantle occurs, from Eric Dumazet.

   7) Add act_ctinfo packet classifier action, from Kevin
      Darbyshire-Bryant.

   8) Add TFO key backup infrastructure, from Jason Baron.

   9) Remove several old and unused ISDN drivers, from Arnd Bergmann.

  10) Add devlink notifications for flash update status to mlxsw driver,
      from Jiri Pirko.

  11) Lots of kTLS offload infrastructure fixes, from Jakub Kicinski.

  12) Add support for mv88e6250 DSA chips, from Rasmus Villemoes.

  13) Various enhancements to ipv6 flow label handling, from Eric
      Dumazet and Willem de Bruijn.

  14) Support TLS offload in nfp driver, from Jakub Kicinski, Dirk van
      der Merwe, and others.

  15) Various improvements to axienet driver including converting it to
      phylink, from Robert Hancock.

  16) Add PTP support to sja1105 DSA driver, from Vladimir Oltean.

  17) Add mqprio qdisc offload support to dpaa2-eth, from Ioana
      Radulescu.

  18) Add devlink health reporting to mlx5, from Moshe Shemesh.

  19) Convert stmmac over to phylink, from Jose Abreu.

  20) Add PTP PHC (Physical Hardware Clock) support to mlxsw, from
      Shalom Toledo.

  21) Add nftables SYNPROXY support, from Fernando Fernandez Mancera.

  22) Convert tcp_fastopen over to use SipHash, from Ard Biesheuvel.

  23) Track spill/fill of constants in BPF verifier, from Alexei
      Starovoitov.

  24) Support bounded loops in BPF, from Alexei Starovoitov.

  25) Various page_pool API fixes and improvements, from Jesper Dangaard
      Brouer.

  26) Just like ipv4, support ref-countless ipv6 route handling. From
      Wei Wang.

  27) Support VLAN offloading in aquantia driver, from Igor Russkikh.

  28) Add AF_XDP zero-copy support to mlx5, from Maxim Mikityanskiy.

  29) Add flower GRE encap/decap support to nfp driver, from Pieter
      Jansen van Vuuren.

  30) Protect against stack overflow when using act_mirred, from John
      Hurley.

  31) Allow devmap map lookups from eBPF, from Toke Høiland-Jørgensen.

  32) Use page_pool API in netsec driver, Ilias Apalodimas.

  33) Add Google gve network driver, from Catherine Sullivan.

  34) More indirect call avoidance, from Paolo Abeni.

  35) Add kTLS TX HW offload support to mlx5, from Tariq Toukan.

  36) Add XDP_REDIRECT support to bnxt_en, from Andy Gospodarek.

  37) Add MPLS manipulation actions to TC, from John Hurley.

  38) Add sending a packet to connection tracking from TC actions, and
      then allow flower classifier matching on conntrack state. From
      Paul Blakey.

  39) Netfilter hw offload support, from Pablo Neira Ayuso"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2080 commits)
  net/mlx5e: Return in default case statement in tx_post_resync_params
  mlx5: Return -EINVAL when WARN_ON_ONCE triggers in mlx5e_tls_resync().
  net: dsa: add support for BRIDGE_MROUTER attribute
  pkt_sched: Include const.h
  net: netsec: remove static declaration for netsec_set_tx_de()
  net: netsec: remove superfluous if statement
  netfilter: nf_tables: add hardware offload support
  net: flow_offload: rename tc_cls_flower_offload to flow_cls_offload
  net: flow_offload: add flow_block_cb_is_busy() and use it
  net: sched: remove tcf block API
  drivers: net: use flow block API
  net: sched: use flow block API
  net: flow_offload: add flow_block_cb_{priv, incref, decref}()
  net: flow_offload: add list handling functions
  net: flow_offload: add flow_block_cb_alloc() and flow_block_cb_free()
  net: flow_offload: rename TCF_BLOCK_BINDER_TYPE_* to FLOW_BLOCK_BINDER_TYPE_*
  net: flow_offload: rename TC_BLOCK_{UN}BIND to FLOW_BLOCK_{UN}BIND
  net: flow_offload: add flow_block_cb_setup_simple()
  net: hisilicon: Add an tx_desc to adapt HI13X1_GMAC
  net: hisilicon: Add an rx_desc to adapt HI13X1_GMAC
  ...
2019-07-11 10:55:49 -07:00
Linus Torvalds
8f6ccf6159 clone3-v5.3
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXSMhhgAKCRCRxhvAZXjc
 or7kAP9VzDcQaK/WoDd2ezh2C7Wh5hNy9z/qJVCa6Tb+N+g1UgEAxbhFUg55uGOA
 JNf7fGar5JF5hBMIXR+NqOi1/sb4swg=
 =ELWo
 -----END PGP SIGNATURE-----

Merge tag 'clone3-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull clone3 system call from Christian Brauner:
 "This adds the clone3 syscall which is an extensible successor to clone
  after we snagged the last flag with CLONE_PIDFD during the 5.2 merge
  window for clone(). It cleanly supports all of the flags from clone()
  and thus all legacy workloads.

  There are few user visible differences between clone3 and clone.
  First, CLONE_DETACHED will cause EINVAL with clone3 so we can reuse
  this flag. Second, the CSIGNAL flag is deprecated and will cause
  EINVAL to be reported. It is superseeded by a dedicated "exit_signal"
  argument in struct clone_args thus freeing up even more flags. And
  third, clone3 gives CLONE_PIDFD a dedicated return argument in struct
  clone_args instead of abusing CLONE_PARENT_SETTID's parent_tidptr
  argument.

  The clone3 uapi is designed to be easy to handle on 32- and 64 bit:

    /* uapi */
    struct clone_args {
            __aligned_u64 flags;
            __aligned_u64 pidfd;
            __aligned_u64 child_tid;
            __aligned_u64 parent_tid;
            __aligned_u64 exit_signal;
            __aligned_u64 stack;
            __aligned_u64 stack_size;
            __aligned_u64 tls;
    };

  and a separate kernel struct is used that uses proper kernel typing:

    /* kernel internal */
    struct kernel_clone_args {
            u64 flags;
            int __user *pidfd;
            int __user *child_tid;
            int __user *parent_tid;
            int exit_signal;
            unsigned long stack;
            unsigned long stack_size;
            unsigned long tls;
    };

  The system call comes with a size argument which enables the kernel to
  detect what version of clone_args userspace is passing in. clone3
  validates that any additional bytes a given kernel does not know about
  are set to zero and that the size never exceeds a page.

  A nice feature is that this patchset allowed us to cleanup and
  simplify various core kernel codepaths in kernel/fork.c by making the
  internal _do_fork() function take struct kernel_clone_args even for
  legacy clone().

  This patch also unblocks the time namespace patchset which wants to
  introduce a new CLONE_TIMENS flag.

  Note, that clone3 has only been wired up for x86{_32,64}, arm{64}, and
  xtensa. These were the architectures that did not require special
  massaging.

  Other architectures treat fork-like system calls individually and
  after some back and forth neither Arnd nor I felt confident that we
  dared to add clone3 unconditionally to all architectures. We agreed to
  leave this up to individual architecture maintainers. This is why
  there's an additional patch that introduces __ARCH_WANT_SYS_CLONE3
  which any architecture can set once it has implemented support for
  clone3. The patch also adds a cond_syscall(clone3) for architectures
  such as nios2 or h8300 that generate their syscall table by simply
  including asm-generic/unistd.h. The hope is to get rid of
  __ARCH_WANT_SYS_CLONE3 and cond_syscall() rather soon"

* tag 'clone3-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  arch: handle arches who do not yet define clone3
  arch: wire-up clone3() syscall
  fork: add clone3
2019-07-11 10:09:44 -07:00
Linus Torvalds
5450e8a316 pidfd-updates-v5.3
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXSMhUgAKCRCRxhvAZXjc
 okkiAQC3Hlg/O2JoIb4PqgEvBkpHSdVxyuWagn0ksjACW9ANKQEAl5OadMhvOq16
 UHGhKlpE/M8HflknIffoEGlIAWHrdwU=
 =7kP5
 -----END PGP SIGNATURE-----

Merge tag 'pidfd-updates-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull pidfd updates from Christian Brauner:
 "This adds two main features.

   - First, it adds polling support for pidfds. This allows process
     managers to know when a (non-parent) process dies in a race-free
     way.

     The notification mechanism used follows the same logic that is
     currently used when the parent of a task is notified of a child's
     death. With this patchset it is possible to put pidfds in an
     {e}poll loop and get reliable notifications for process (i.e.
     thread-group) exit.

   - The second feature compliments the first one by making it possible
     to retrieve pollable pidfds for processes that were not created
     using CLONE_PIDFD.

     A lot of processes get created with traditional PID-based calls
     such as fork() or clone() (without CLONE_PIDFD). For these
     processes a caller can currently not create a pollable pidfd. This
     is a problem for Android's low memory killer (LMK) and service
     managers such as systemd.

  Both patchsets are accompanied by selftests.

  It's perhaps worth noting that the work done so far and the work done
  in this branch for pidfd_open() and polling support do already see
  some adoption:

   - Android is in the process of backporting this work to all their LTS
     kernels [1]

   - Service managers make use of pidfd_send_signal but will need to
     wait until we enable waiting on pidfds for full adoption.

   - And projects I maintain make use of both pidfd_send_signal and
     CLONE_PIDFD [2] and will use polling support and pidfd_open() too"

[1] https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.9+backport%22
    https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.14+backport%22
    https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.19+backport%22

[2] aab6e3eb73/src/lxc/start.c (L1753)

* tag 'pidfd-updates-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  tests: add pidfd_open() tests
  arch: wire-up pidfd_open()
  pid: add pidfd_open()
  pidfd: add polling selftests
  pidfd: add polling support
2019-07-10 22:17:21 -07:00
Arnd Bergmann
0df1c9868c timekeeping/vsyscall: Use __iter_div_u64_rem()
On 32-bit x86 when building with clang-9, the 'division' loop gets turned
back into an inefficient division that causes a link error:

kernel/time/vsyscall.o: In function `update_vsyscall':
vsyscall.c:(.text+0xe3): undefined reference to `__udivdi3'

Use the existing __iter_div_u64_rem() function which is used to address the
same issue in other places.

Fixes: 44f57d788e ("timekeeping: Provide a generic update_vsyscall() implementation")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Link: https://lkml.kernel.org/r/20190710130206.1670830-1-arnd@arndb.de
2019-07-10 20:37:49 +02:00
Linus Torvalds
e9a83bd232 It's been a relatively busy cycle for docs:
- A fair pile of RST conversions, many from Mauro.  These create more
    than the usual number of simple but annoying merge conflicts with other
    trees, unfortunately.  He has a lot more of these waiting on the wings
    that, I think, will go to you directly later on.
 
  - A new document on how to use merges and rebases in kernel repos, and one
    on Spectre vulnerabilities.
 
  - Various improvements to the build system, including automatic markup of
    function() references because some people, for reasons I will never
    understand, were of the opinion that :c:func:``function()`` is
    unattractive and not fun to type.
 
  - We now recommend using sphinx 1.7, but still support back to 1.4.
 
  - Lots of smaller improvements, warning fixes, typo fixes, etc.
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAl0krAEPHGNvcmJldEBs
 d24ubmV0AAoJEBdDWhNsDH5Yg98H/AuLqO9LpOgUjF4LhyjxGPdzJkY9RExSJ7km
 gznyreLCZgFaJR+AY6YDsd4Jw6OJlPbu1YM/Qo3C3WrZVFVhgL/s2ebvBgCo50A8
 raAFd8jTf4/mGCHnAqRotAPQ3mETJUk315B66lBJ6Oc+YdpRhwXWq8ZW2bJxInFF
 3HDvoFgMf0KhLuMHUkkL0u3fxH1iA+KvDu8diPbJYFjOdOWENz/CV8wqdVkXRSEW
 DJxIq89h/7d+hIG3d1I7Nw+gibGsAdjSjKv4eRKauZs4Aoxd1Gpl62z0JNk6aT3m
 dtq4joLdwScydonXROD/Twn2jsu4xYTrPwVzChomElMowW/ZBBY=
 =D0eO
 -----END PGP SIGNATURE-----

Merge tag 'docs-5.3' of git://git.lwn.net/linux

Pull Documentation updates from Jonathan Corbet:
 "It's been a relatively busy cycle for docs:

   - A fair pile of RST conversions, many from Mauro. These create more
     than the usual number of simple but annoying merge conflicts with
     other trees, unfortunately. He has a lot more of these waiting on
     the wings that, I think, will go to you directly later on.

   - A new document on how to use merges and rebases in kernel repos,
     and one on Spectre vulnerabilities.

   - Various improvements to the build system, including automatic
     markup of function() references because some people, for reasons I
     will never understand, were of the opinion that
     :c:func:``function()`` is unattractive and not fun to type.

   - We now recommend using sphinx 1.7, but still support back to 1.4.

   - Lots of smaller improvements, warning fixes, typo fixes, etc"

* tag 'docs-5.3' of git://git.lwn.net/linux: (129 commits)
  docs: automarkup.py: ignore exceptions when seeking for xrefs
  docs: Move binderfs to admin-guide
  Disable Sphinx SmartyPants in HTML output
  doc: RCU callback locks need only _bh, not necessarily _irq
  docs: format kernel-parameters -- as code
  Doc : doc-guide : Fix a typo
  platform: x86: get rid of a non-existent document
  Add the RCU docs to the core-api manual
  Documentation: RCU: Add TOC tree hooks
  Documentation: RCU: Rename txt files to rst
  Documentation: RCU: Convert RCU UP systems to reST
  Documentation: RCU: Convert RCU linked list to reST
  Documentation: RCU: Convert RCU basic concepts to reST
  docs: filesystems: Remove uneeded .rst extension on toctables
  scripts/sphinx-pre-install: fix out-of-tree build
  docs: zh_CN: submitting-drivers.rst: Remove a duplicated Documentation/
  Documentation: PGP: update for newer HW devices
  Documentation: Add section about CPU vulnerabilities for Spectre
  Documentation: platform: Delete x86-laptop-drivers.txt
  docs: Note that :c:func: should no longer be used
  ...
2019-07-09 12:34:26 -07:00
Linus Torvalds
608745f124 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "The main changes in this cycle on the kernel side were:

   - CPU PMU and uncore driver updates to Intel Snow Ridge, IceLake,
     KabyLake, AmberLake and WhiskeyLake CPUs.

   - Rework the MSR probing infrastructure to make it more robust, make
     it work better on virtualized systems and to better expose it on
     sysfs.

   - Rework PMU attributes group support based on the feedback from
     Greg. The core sysfs patch that adds sysfs_update_groups() was
     acked by Greg.

  There's a lot of perf tooling changes as well, all around the place:

   - vendor updates to Intel, cs-etm (ARM), ARM64, s390,

   - various enhancements to Intel PT tooling support:
      - Improve CBR (Core to Bus Ratio) packets support.
      - Export power and ptwrite events to sqlite and postgresql.
      - Add support for decoding PEBS via PT packets.
      - Add support for samples to contain IPC ratio, collecting cycles
        information from CYC packets, showing the IPC info periodically
      - Allow using time ranges

   - lots of updates to perf pmu, perf stat, perf trace, eBPF support,
     perf record, perf diff, etc. - please see the shortlog and Git log
     for details"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (252 commits)
  tools arch x86: Sync asm/cpufeatures.h with the with the kernel
  tools build: Check if gettid() is available before providing helper
  perf jvmti: Address gcc string overflow warning for strncpy()
  perf python: Remove -fstack-protector-strong if clang doesn't have it
  perf annotate TUI browser: Do not use member from variable within its own initialization
  perf tests: Fix record+probe_libc_inet_pton.sh for powerpc64
  perf evsel: Do not rely on errno values for precise_ip fallback
  perf thread: Allow references to thread objects after machine__exit()
  perf header: Assign proper ff->ph in perf_event__synthesize_features()
  tools arch kvm: Sync kvm headers with the kernel sources
  perf script: Allow specifying the files to process guest samples
  perf tools metric: Don't include duration_time in group
  perf list: Avoid extra : for --raw metrics
  perf vendor events intel: Metric fixes for SKX/CLX
  perf tools: Fix typos / broken sentences
  perf jevents: Add support for Hisi hip08 L3C PMU aliasing
  perf jevents: Add support for Hisi hip08 HHA PMU aliasing
  perf jevents: Add support for Hisi hip08 DDRC PMU aliasing
  perf pmu: Support more complex PMU event aliasing
  perf diff: Documentation -c cycles option
  ...
2019-07-09 11:15:52 -07:00
Linus Torvalds
3b99107f0e for-5.3/block-20190708
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl0jrIMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgptlFD/9CNsBX+Aap2lO6wKNr6QISwNAK76GMzEay
 s4LSY2kGkXvzv8i89mCuY+8UVNI8WH2/22WnU+8CBAJOjWyFQMsIwH/mrq0oZWRD
 J6STJE8rTr6Fc2MvJUWryp/xdBh3+eDIsAdIZVHVAkIzqYPBnpIAwEIeIw8t0xsm
 v9ngpQ3WD6ep8tOj9pnG1DGKFg1CmukZCC/Y4CQV1vZtmm2I935zUwNV/TB+Egfx
 G8JSC0cSV02LMK88HCnA6MnC/XSUC0qgfXbnmP+TpKlgjVX+P/fuB3oIYcZEu2Rk
 3YBpIkhsQytKYbF42KRLsmBH72u6oB9G+tNZTgB1STUDrZqdtD9xwX1rjDlY0ZzP
 EUDnk48jl/cxbs+VZrHoE2TcNonLiymV7Kb92juHXdIYmKFQStprGcQUbMaTkMfB
 6BYrYLifWx0leu1JJ1i7qhNmug94BYCSCxcRmH0p6kPazPcY9LXNmDWMfMuBPZT7
 z79VLZnHF2wNXJyT1cBluwRYYJRT4osWZ3XUaBWFKDgf1qyvXJfrN/4zmgkEIyW7
 ivXC+KLlGkhntDlWo2pLKbbyOIKY1HmU6aROaI11k5Zyh0ixKB7tHKavK39l+NOo
 YB41+4l6VEpQEyxyRk8tO0sbHpKaKB+evVIK3tTwbY+Q0qTExErxjfWUtOgRWhjx
 iXJssPRo4w==
 =VSYT
 -----END PGP SIGNATURE-----

Merge tag 'for-5.3/block-20190708' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "This is the main block updates for 5.3. Nothing earth shattering or
  major in here, just fixes, additions, and improvements all over the
  map. This contains:

   - Series of documentation fixes (Bart)

   - Optimization of the blk-mq ctx get/put (Bart)

   - null_blk removal race condition fix (Bob)

   - req/bio_op() cleanups (Chaitanya)

   - Series cleaning up the segment accounting, and request/bio mapping
     (Christoph)

   - Series cleaning up the page getting/putting for bios (Christoph)

   - block cgroup cleanups and moving it to where it is used (Christoph)

   - block cgroup fixes (Tejun)

   - Series of fixes and improvements to bcache, most notably a write
     deadlock fix (Coly)

   - blk-iolatency STS_AGAIN and accounting fixes (Dennis)

   - Series of improvements and fixes to BFQ (Douglas, Paolo)

   - debugfs_create() return value check removal for drbd (Greg)

   - Use struct_size(), where appropriate (Gustavo)

   - Two lighnvm fixes (Heiner, Geert)

   - MD fixes, including a read balance and corruption fix (Guoqing,
     Marcos, Xiao, Yufen)

   - block opal shadow mbr additions (Jonas, Revanth)

   - sbitmap compare-and-exhange improvemnts (Pavel)

   - Fix for potential bio->bi_size overflow (Ming)

   - NVMe pull requests:
       - improved PCIe suspent support (Keith Busch)
       - error injection support for the admin queue (Akinobu Mita)
       - Fibre Channel discovery improvements (James Smart)
       - tracing improvements including nvmetc tracing support (Minwoo Im)
       - misc fixes and cleanups (Anton Eidelman, Minwoo Im, Chaitanya
         Kulkarni)"

   - Various little fixes and improvements to drivers and core"

* tag 'for-5.3/block-20190708' of git://git.kernel.dk/linux-block: (153 commits)
  blk-iolatency: fix STS_AGAIN handling
  block: nr_phys_segments needs to be zero for REQ_OP_WRITE_ZEROES
  blk-mq: simplify blk_mq_make_request()
  blk-mq: remove blk_mq_put_ctx()
  sbitmap: Replace cmpxchg with xchg
  block: fix .bi_size overflow
  block: sed-opal: check size of shadow mbr
  block: sed-opal: ioctl for writing to shadow mbr
  block: sed-opal: add ioctl for done-mark of shadow mbr
  block: never take page references for ITER_BVEC
  direct-io: use bio_release_pages in dio_bio_complete
  block_dev: use bio_release_pages in bio_unmap_user
  block_dev: use bio_release_pages in blkdev_bio_end_io
  iomap: use bio_release_pages in iomap_dio_bio_end_io
  block: use bio_release_pages in bio_map_user_iov
  block: use bio_release_pages in bio_unmap_user
  block: optionally mark pages dirty in bio_release_pages
  block: move the BIO_NO_PAGE_REF check into bio_release_pages
  block: skd_main.c: Remove call to memset after dma_alloc_coherent
  block: mtip32xx: Remove call to memset after dma_alloc_coherent
  ...
2019-07-09 10:45:06 -07:00
Linus Torvalds
cf2d213e49 Power management updates for 5.3-rc1
- Improve the handling of shared ACPI power resources in the PCI
    bus type layer (Mika Westerberg).
 
  - Make the PCI layer take link delays required by the PCIe spec
    into account as appropriate and avoid polling devices in D3cold
    for PME (Mika Westerberg).
 
  - Fix some corner case issues in ACPI device power management and
    in the PCI bus type layer, optimiza and clean up the handling of
    runtime-suspended PCI devices during system-wide transitions to
    sleep states (Rafael Wysocki).
 
  - Rework hibernation handling in the ACPI core and the PCI bus type
    to resume runtime-suspended devices before hibernation (which
    allows some functional problems to be avoided) and fix some ACPI
    power management issues related to hiberation (Rafael Wysocki).
 
  - Extend the operating performance points (OPP) framework to support
    a wider range of devices (Rajendra Nayak, Stehpen Boyd).
 
  - Fix issues related to genpd_virt_devs and issues with platforms
    using the set_opp() callback in the OPP framework (Viresh Kumar,
    Dmitry Osipenko).
 
  - Add new cpufreq driver for Raspberry Pi (Nicolas Saenz Julienne).
 
  - Add new cpufreq driver for imx8m and imx7d chips (Leonard Crestez).
 
  - Fix and clean up the pcc-cpufreq, brcmstb-avs-cpufreq, s5pv210,
    and armada-37xx cpufreq drivers (David Arcari, Florian Fainelli,
    Paweł Chmiel, YueHaibing).
 
  - Clean up and fix the cpufreq core (Viresh Kumar, Daniel Lezcano).
 
  - Fix minor issue in the ACPI system sleep support code and export
    one function from it (Lenny Szubowicz, Dexuan Cui).
 
  - Clean up assorted pieces of PM code and documentation (Kefeng Wang,
    Andy Shevchenko, Bart Van Assche, Greg Kroah-Hartman, Fuqian Huang,
    Geert Uytterhoeven, Mathieu Malaterre, Rafael Wysocki).
 
  - Update the pm-graph utility to v5.4 (Todd Brandt).
 
  - Fix and clean up the cpupower utility (Abhishek Goel, Nick Black).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl0jK18SHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxgEAP/RbPe71Y9ufKu64L3EtgV6mS9iuLEhux
 /Ad9laLNeM1b0oceT3QxGk7xQCacnZcBlcaqXVWI4NRsn4RBZp1cYZngpgJ9DP6E
 ONr8hzyzDOMVReba3XJIF8H+WoTKjywMYtFutjdx6dRe2ZJLutqnuZ0JbH1YSSK7
 IxOt0mJVALf2M4Zz7F17d+n3yGE/4xAPBVbj/rBRcTEsGYlR/Hoxs7iF6EBau7fy
 R5drUH6XSrWk8adc+z7l3BTGqMMYj9deRSfAWB3wpM4YK7Fv7msX/amBoGINkdn6
 xP/ZcrHvhKKzE89MS8OUGP4rGVwq+7tu6mktnYL/tpKgutJqqx5LVvrLsGDSWr+W
 /aJExN8Eb4Jh98C6vog3XUJoqBxkVGbU8qoCBU3jlFsaznFEWjW9IKhBHs5CIaqz
 MXte6AsJ8lvFzxILjvx0m2206wNpRJRXYLX3a/BSBxa4OgOESjIpBTmbPfOwbxwj
 8z9hIVlDTTDtnF6BEyDQr1fjPi3Mxl7ibGnoqRrJm36VKBy9VZNNwG/0Y2oSvm6k
 Es8CiTWA3A/46dCZxGr18/9Vbfxn1Yvg9QZ1lCE5Fqij+0F2yRApbHZgUro+1rji
 6J8OWC5r5JccdKuGHh4RH/asMFhD0cAR/gUsRzS4dz/wz2jVIN1FstdD1aKN5GBy
 d0lchx/AKR5H
 =aBN3
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "These update PCI and ACPI power management (improved handling of ACPI
  power resources and PCIe link delays, fixes related to corner cases,
  hibernation handling rework), fix and extend the operating performance
  points (OPP) framework, add new cpufreq drivers for Raspberry Pi and
  imx8m chips, update some other cpufreq drivers, clean up assorted
  pieces of PM code and documentation and update tools.

  Specifics:

   - Improve the handling of shared ACPI power resources in the PCI bus
     type layer (Mika Westerberg).

   - Make the PCI layer take link delays required by the PCIe spec into
     account as appropriate and avoid polling devices in D3cold for PME
     (Mika Westerberg).

   - Fix some corner case issues in ACPI device power management and in
     the PCI bus type layer, optimiza and clean up the handling of
     runtime-suspended PCI devices during system-wide transitions to
     sleep states (Rafael Wysocki).

   - Rework hibernation handling in the ACPI core and the PCI bus type
     to resume runtime-suspended devices before hibernation (which
     allows some functional problems to be avoided) and fix some ACPI
     power management issues related to hiberation (Rafael Wysocki).

   - Extend the operating performance points (OPP) framework to support
     a wider range of devices (Rajendra Nayak, Stehpen Boyd).

   - Fix issues related to genpd_virt_devs and issues with platforms
     using the set_opp() callback in the OPP framework (Viresh Kumar,
     Dmitry Osipenko).

   - Add new cpufreq driver for Raspberry Pi (Nicolas Saenz Julienne).

   - Add new cpufreq driver for imx8m and imx7d chips (Leonard Crestez).

   - Fix and clean up the pcc-cpufreq, brcmstb-avs-cpufreq, s5pv210, and
     armada-37xx cpufreq drivers (David Arcari, Florian Fainelli, Paweł
     Chmiel, YueHaibing).

   - Clean up and fix the cpufreq core (Viresh Kumar, Daniel Lezcano).

   - Fix minor issue in the ACPI system sleep support code and export
     one function from it (Lenny Szubowicz, Dexuan Cui).

   - Clean up assorted pieces of PM code and documentation (Kefeng Wang,
     Andy Shevchenko, Bart Van Assche, Greg Kroah-Hartman, Fuqian Huang,
     Geert Uytterhoeven, Mathieu Malaterre, Rafael Wysocki).

   - Update the pm-graph utility to v5.4 (Todd Brandt).

   - Fix and clean up the cpupower utility (Abhishek Goel, Nick Black)"

* tag 'pm-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (57 commits)
  ACPI: PM: Make acpi_sleep_state_supported() non-static
  PM: sleep: Drop dev_pm_skip_next_resume_phases()
  ACPI: PM: Unexport acpi_device_get_power()
  Documentation: ABI: power: Add missing newline at end of file
  ACPI: PM: Drop unused function and function header
  ACPI: PM: Introduce "poweroff" callbacks for ACPI PM domain and LPSS
  ACPI: PM: Simplify and fix PM domain hibernation callbacks
  PCI: PM: Simplify bus-level hibernation callbacks
  PM: ACPI/PCI: Resume all devices during hibernation
  cpufreq: Avoid calling cpufreq_verify_current_freq() from handle_update()
  cpufreq: Consolidate cpufreq_update_current_freq() and __cpufreq_get()
  kernel: power: swap: use kzalloc() instead of kmalloc() followed by memset()
  cpufreq: Don't skip frequency validation for has_target() drivers
  PCI: PM/ACPI: Refresh all stale power state data in pci_pm_complete()
  PCI / ACPI: Add _PR0 dependent devices
  ACPI / PM: Introduce concept of a _PR0 dependent device
  PCI / ACPI: Use cached ACPI device state to get PCI device power state
  ACPI: PM: Allow transitions to D0 to occur in special cases
  ACPI: PM: Avoid evaluating _PS3 on transitions from D3hot to D3cold
  cpufreq: Use has_target() instead of !setpolicy
  ...
2019-07-09 10:05:22 -07:00
Josh Poimboeuf
e55a73251d bpf: Fix ORC unwinding in non-JIT BPF code
Objtool previously ignored ___bpf_prog_run() because it didn't understand
the jump table.  This resulted in the ORC unwinder not being able to unwind
through non-JIT BPF code.

Now that objtool knows how to read jump tables, remove the whitelist and
annotate the jump table so objtool can recognize it.

Also add an additional "const" to the jump table definition to clarify that
the text pointers are constant.  Otherwise GCC sets the section writable
flag and the assembler spits out warnings.

Fixes: d15d356887 ("perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER")
Reported-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kairui Song <kasong@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lkml.kernel.org/r/881939122b88f32be4c374d248c09d7527a87e35.1561685471.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-09 13:55:57 +02:00
Linus Torvalds
5ad18b2e60 Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull force_sig() argument change from Eric Biederman:
 "A source of error over the years has been that force_sig has taken a
  task parameter when it is only safe to use force_sig with the current
  task.

  The force_sig function is built for delivering synchronous signals
  such as SIGSEGV where the userspace application caused a synchronous
  fault (such as a page fault) and the kernel responded with a signal.

  Because the name force_sig does not make this clear, and because the
  force_sig takes a task parameter the function force_sig has been
  abused for sending other kinds of signals over the years. Slowly those
  have been fixed when the oopses have been tracked down.

  This set of changes fixes the remaining abusers of force_sig and
  carefully rips out the task parameter from force_sig and friends
  making this kind of error almost impossible in the future"

* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
  signal/x86: Move tsk inside of CONFIG_MEMORY_FAILURE in do_sigbus
  signal: Remove the signal number and task parameters from force_sig_info
  signal: Factor force_sig_info_to_task out of force_sig_info
  signal: Generate the siginfo in force_sig
  signal: Move the computation of force into send_signal and correct it.
  signal: Properly set TRACE_SIGNAL_LOSE_INFO in __send_signal
  signal: Remove the task parameter from force_sig_fault
  signal: Use force_sig_fault_to_task for the two calls that don't deliver to current
  signal: Explicitly call force_sig_fault on current
  signal/unicore32: Remove tsk parameter from __do_user_fault
  signal/arm: Remove tsk parameter from __do_user_fault
  signal/arm: Remove tsk parameter from ptrace_break
  signal/nds32: Remove tsk parameter from send_sigtrap
  signal/riscv: Remove tsk parameter from do_trap
  signal/sh: Remove tsk parameter from force_sig_info_fault
  signal/um: Remove task parameter from send_sigtrap
  signal/x86: Remove task parameter from send_sigtrap
  signal: Remove task parameter from force_sig_mceerr
  signal: Remove task parameter from force_sig
  signal: Remove task parameter from force_sigsegv
  ...
2019-07-08 21:48:15 -07:00
Linus Torvalds
92c1d65221 Merge branch 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "Documentation updates and the addition of cgroup_parse_float() which
  will be used by new controllers including blk-iocost"

* 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  docs: cgroup-v1: convert docs to ReST and rename to *.rst
  cgroup: Move cgroup_parse_float() implementation out of CONFIG_SYSFS
  cgroup: add cgroup_parse_float()
2019-07-08 21:35:12 -07:00
Linus Torvalds
df2a40f549 Merge branch 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
 "Just a couple cleanup patches.  No functional changes."

* 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Remove GPF argument from alloc_workqueue_attrs()
  workqueue: Make alloc/apply/free_workqueue_attrs() static
2019-07-08 21:10:01 -07:00
Linus Torvalds
8b68150883 Merge branch 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
Pull integrity updates from Mimi Zohar:
 "Bug fixes, code clean up, and new features:

   - IMA policy rules can be defined in terms of LSM labels, making the
     IMA policy dependent on LSM policy label changes, in particular LSM
     label deletions. The new environment, in which IMA-appraisal is
     being used, frequently updates the LSM policy and permits LSM label
     deletions.

   - Prevent an mmap'ed shared file opened for write from also being
     mmap'ed execute. In the long term, making this and other similar
     changes at the VFS layer would be preferable.

   - The IMA per policy rule template format support is needed for a
     couple of new/proposed features (eg. kexec boot command line
     measurement, appended signatures, and VFS provided file hashes).

   - Other than the "boot-aggregate" record in the IMA measuremeent
     list, all other measurements are of file data. Measuring and
     storing the kexec boot command line in the IMA measurement list is
     the first buffer based measurement included in the measurement
     list"

* 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
  integrity: Introduce struct evm_xattr
  ima: Update MAX_TEMPLATE_NAME_LEN to fit largest reasonable definition
  KEXEC: Call ima_kexec_cmdline to measure the boot command line args
  IMA: Define a new template field buf
  IMA: Define a new hook to measure the kexec boot command line arguments
  IMA: support for per policy rule template formats
  integrity: Fix __integrity_init_keyring() section mismatch
  ima: Use designated initializers for struct ima_event_data
  ima: use the lsm policy update notifier
  LSM: switch to blocking policy update notifiers
  x86/ima: fix the Kconfig dependency for IMA_ARCH_POLICY
  ima: Make arch_policy_entry static
  ima: prevent a file already mmap'ed write to be mmap'ed execute
  x86/ima: check EFI SetupMode too
2019-07-08 20:28:59 -07:00
David S. Miller
af144a9834 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two cases of overlapping changes, nothing fancy.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08 19:48:57 -07:00
Linus Torvalds
c84ca912b0 Keyrings namespacing
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAXRU89Pu3V2unywtrAQIdBBAAmMBsrfv+LUN4Vru/D6KdUO4zdYGcNK6m
 S56bcNfP6oIDEj6HrNNnzKkWIZpdZ61Odv1zle96+v4WZ/6rnLCTpcsdaFNTzaoO
 YT2jk7jplss0ImrMv1DSoykGqO3f0ThMIpGCxHKZADGSu0HMbjSEh+zLPV4BaMtT
 BVuF7P3eZtDRLdDtMtYcgvf5UlbdoBEY8w1FUjReQx8hKGxVopGmCo5vAeiY8W9S
 ybFSZhPS5ka33ynVrLJH2dqDo5A8pDhY8I4bdlcxmNtRhnPCYZnuvTqeAzyUKKdI
 YN9zJeDu1yHs9mi8dp45NPJiKy6xLzWmUwqH8AvR8MWEkrwzqbzNZCEHZ41j74hO
 YZWI0JXi72cboszFvOwqJERvITKxrQQyVQLPRQE2vVbG0bIZPl8i7oslFVhitsl+
 evWqHb4lXY91rI9cC6JIXR1OiUjp68zXPv7DAnxv08O+PGcioU1IeOvPivx8QSx4
 5aUeCkYIIAti/GISzv7xvcYh8mfO76kBjZSB35fX+R9DkeQpxsHmmpWe+UCykzWn
 EwhHQn86+VeBFP6RAXp8CgNCLbrwkEhjzXQl/70s1eYbwvK81VcpDAQ6+cjpf4Hb
 QUmrUJ9iE0wCNl7oqvJZoJvWVGlArvPmzpkTJk3N070X2R0T7x1WCsMlPDMJGhQ2
 fVHvA3QdgWs=
 =Push
 -----END PGP SIGNATURE-----

Merge tag 'keys-namespace-20190627' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull keyring namespacing from David Howells:
 "These patches help make keys and keyrings more namespace aware.

  Firstly some miscellaneous patches to make the process easier:

   - Simplify key index_key handling so that the word-sized chunks
     assoc_array requires don't have to be shifted about, making it
     easier to add more bits into the key.

   - Cache the hash value in the key so that we don't have to calculate
     on every key we examine during a search (it involves a bunch of
     multiplications).

   - Allow keying_search() to search non-recursively.

  Then the main patches:

   - Make it so that keyring names are per-user_namespace from the point
     of view of KEYCTL_JOIN_SESSION_KEYRING so that they're not
     accessible cross-user_namespace.

     keyctl_capabilities() shows KEYCTL_CAPS1_NS_KEYRING_NAME for this.

   - Move the user and user-session keyrings to the user_namespace
     rather than the user_struct. This prevents them propagating
     directly across user_namespaces boundaries (ie. the KEY_SPEC_*
     flags will only pick from the current user_namespace).

   - Make it possible to include the target namespace in which the key
     shall operate in the index_key. This will allow the possibility of
     multiple keys with the same description, but different target
     domains to be held in the same keyring.

     keyctl_capabilities() shows KEYCTL_CAPS1_NS_KEY_TAG for this.

   - Make it so that keys are implicitly invalidated by removal of a
     domain tag, causing them to be garbage collected.

   - Institute a network namespace domain tag that allows keys to be
     differentiated by the network namespace in which they operate. New
     keys that are of a type marked 'KEY_TYPE_NET_DOMAIN' are assigned
     the network domain in force when they are created.

   - Make it so that the desired network namespace can be handed down
     into the request_key() mechanism. This allows AFS, NFS, etc. to
     request keys specific to the network namespace of the superblock.

     This also means that the keys in the DNS record cache are
     thenceforth namespaced, provided network filesystems pass the
     appropriate network namespace down into dns_query().

     For DNS, AFS and NFS are good, whilst CIFS and Ceph are not. Other
     cache keyrings, such as idmapper keyrings, also need to set the
     domain tag - for which they need access to the network namespace of
     the superblock"

* tag 'keys-namespace-20190627' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  keys: Pass the network namespace into request_key mechanism
  keys: Network namespace domain tag
  keys: Garbage collect keys for which the domain has been removed
  keys: Include target namespace in match criteria
  keys: Move the user and user-session keyrings to the user_namespace
  keys: Namespace keyring names
  keys: Add a 'recurse' flag for keyring searches
  keys: Cache the hash value to avoid lots of recalculation
  keys: Simplify key description management
2019-07-08 19:36:47 -07:00
Linus Torvalds
c236b6dd48 request_key improvements
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAXRPObfu3V2unywtrAQJLKA//WENO5pZDHe49T+4GCY0ZmnGHKBUnU7g9
 DUjxSNS8a/nwCyEdApZk9uHp2xsOedP6pjQ4VRWMQfrIPx0Yh9o3J+BQxvyP7PDf
 jEH+5CYC8dZnJJjjteWCcPEGrUoNb1YKfDRBU745YY+rLdHWvhHc27B6SYBg5BGT
 OwW3qyHvp0WMp7TehMALdnkqGph5gR5QMr45tOrH6DkGAhN8mAIKD699d3MqZG73
 +S5KlQOlDlEVrxbD/BgzlzEJQUBQyq8hd61taBFT7LXBNlLJJOnMhd7UJY5IJE7J
 Vi9NpcLj4Emwv4wvZ2xneV0rMbsCbxRMKZLDRuqQ6Tm17xjpjro4n1ujneTAqmmy
 d+XlrVQ2ZMciMNmGleezOoBib9QbY5NWdilc2ls5ydFGiBVL73bIOYtEQNai8lWd
 LBBIIrxOmLO7bnipgqVKRnqeMdMkpWaLISoRfSeJbRt4lGxmka9bDBrSgONnxzJK
 JG+sB8ahSVZaBbhERW8DKnBz61Yf8ka7ijVvjH3zCXu0rbLTy+LLUz5kbzbBP9Fc
 LiUapLV/v420gD2ZRCgPQwtQui4TpBkSGJKS1Ippyn7LGBNCZLM4Y8vOoo4nqr7z
 RhpEKbKeOdVjORaYjO8Zttj8gN9rT6WnPcyCTHdNEnyjotU1ykyVBkzexj+VYvjM
 C3eIdjG7Jk0=
 =c2FO
 -----END PGP SIGNATURE-----

Merge tag 'keys-request-20190626' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull request_key improvements from David Howells:
 "These are all request_key()-related, including a fix and some improvements:

   - Fix the lack of a Link permission check on a key found by
     request_key(), thereby enabling request_key() to link keys that
     don't grant this permission to the target keyring (which must still
     grant Write permission).

     Note that the key must be in the caller's keyrings already to be
     found.

   - Invalidate used request_key authentication keys rather than
     revoking them, so that they get cleaned up immediately rather than
     hanging around till the expiry time is passed.

   - Move the RCU locks outwards from the keyring search functions so
     that a request_key_rcu() can be provided. This can be called in RCU
     mode, so it can't sleep and can't upcall - but it can be called
     from LOOKUP_RCU pathwalk mode.

   - Cache the latest positive result of request_key*() temporarily in
     task_struct so that filesystems that make a lot of request_key()
     calls during pathwalk can take advantage of it to avoid having to
     redo the searching. This requires CONFIG_KEYS_REQUEST_CACHE=y.

     It is assumed that the key just found is likely to be used multiple
     times in each step in an RCU pathwalk, and is likely to be reused
     for the next step too.

     Note that the cleanup of the cache is done on TIF_NOTIFY_RESUME,
     just before userspace resumes, and on exit"

* tag 'keys-request-20190626' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  keys: Kill off request_key_async{,_with_auxdata}
  keys: Cache result of request_key*() temporarily in task_struct
  keys: Provide request_key_rcu()
  keys: Move the RCU locks outwards from the keyring search functions
  keys: Invalidate used request_key authentication keys
  keys: Fix request_key() lack of Link perm check on found key
2019-07-08 19:19:37 -07:00
Linus Torvalds
d44a62742d Keyrings miscellany
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAXQo23fu3V2unywtrAQJghA/+Oi2W9tSfz67zMupYiqa71x5Zg5XlUVIz
 RJxSIwYhE4bhGwodTmqgRlT6f64Gbgt0K8YapGUIbtV/T6d1w02oEmt0V9vad9Zi
 wTH79hH5QKNvewUDhrWODsWhtOBWu1sGt9OozI+c65lsvTpHY4Ox7zIl4DtfBdNK
 nLUxl82h7EHF9H4TtIKxfKlLkIkmt7NRbK3z1eUP+IG/7MBzoyXgXo/gvoHUCOMR
 lhGxttZfxYdZuR9JoR2FBckvKulgafbwjoUc69EDfr8a8IZZrpaUuSTvSPbCfzj1
 j0yXfoowiWvsI1lFFBHeE0BfteJRQ9O2Pkwh1Z9M6v4zjwNNprDOw9a3VroeSgS/
 OWJyHNjeNLDMMZDm1YYCYs0B416q+lZtdAoE/nhR/lGZlBfKTyAa6Cfo4r0RBpYb
 zAxk6K4HcLBL0dkxkTXkxUJPnoDts5bMEL3YuZeVWd7Ef5s5GHW34JI+CFrMR29s
 fC9W+ZEZ74fVo2goPz2ekeiSyp28TkWusXxUCk07g0BsXQzB7v5XXUGtU9hAJ6pe
 aMBfLwAvQkkGi56CPnGWn6WlZ+AgxbRqnlYWpWf0q+PLiuyo4OeRZzhn6AdNQcCR
 2QsTBILOvZbhjEki84ZfsuLLq2k79C2xluEd9JlSAvx5/D93xjMB2qVzR1M6DbdA
 +u1nS8Z6WHA=
 =Oy7N
 -----END PGP SIGNATURE-----

Merge tag 'keys-misc-20190619' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull misc keyring updates from David Howells:
 "These are some miscellaneous keyrings fixes and improvements:

   - Fix a bunch of warnings from sparse, including missing RCU bits and
     kdoc-function argument mismatches

   - Implement a keyctl to allow a key to be moved from one keyring to
     another, with the option of prohibiting key replacement in the
     destination keyring.

   - Grant Link permission to possessors of request_key_auth tokens so
     that upcall servicing daemons can more easily arrange things such
     that only the necessary auth key is passed to the actual service
     program, and not all the auth keys a daemon might possesss.

   - Improvement in lookup_user_key().

   - Implement a keyctl to allow keyrings subsystem capabilities to be
     queried.

  The keyutils next branch has commits to make available, document and
  test the move-key and capabilities code:

        https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log

  They're currently on the 'next' branch"

* tag 'keys-misc-20190619' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  keys: Add capability-checking keyctl function
  keys: Reuse keyring_index_key::desc_len in lookup_user_key()
  keys: Grant Link permission to possessers of request_key auth keys
  keys: Add a keyctl to move a key between keyrings
  keys: Hoist locking out of __key_link_begin()
  keys: Break bits out of key_unlink()
  keys: Change keyring_serialise_link_sem to a mutex
  keys: sparse: Fix kdoc mismatches
  keys: sparse: Fix incorrect RCU accesses
  keys: sparse: Fix key_fs[ug]id_changed()
2019-07-08 19:02:11 -07:00
Linus Torvalds
61fc5771f5 audit/stable-5.3 PR 20190702
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAl0bgNYUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXONcRAAqpeGVh3/eU5bmGeiOWZJ5TREx0Qf
 4M8Z3CElxtbPF4nz1nARUbH424zF91AOa0B4JVO8BFCgxWN5M3dDOLjqLLfJkfbE
 mQMmiPoua1qXTMRi/9S+3kNFYO4IL/sFFiiqY6XVcW6xIUzp3rLwEjcHC/deszP7
 /e8IqLUFAqj853W0k7qyLMRFEQVBzrABgtiSX+X06sCB8OmAVxhpevSRR1lmmfEu
 sjwuAvxexVlmojwI6HkoANyRzqJRX6y7sMGSbr10I/T9YJTk4VPfeFwSS3qBsf15
 z9gTbvFrRcXKoA9U8iG45K0lUinka9OuGxJD/AxuJv+ncyJjWqX+aokvzeo7Wmv6
 sbAyD+ikl9kxvE+sZ3l9yZEVHjFIbjmZY/gzG+ZZD2EEwKBuaQBN5mmSjrUkySJk
 sbF+oBABLptitJIa/cZJ5QHeAPR1NBqSXKhnhG26IR8iwQqpZhefa8yXpF/x3Tn8
 FckvY+YpIakOAMQ/ezVvFaaEELieiRZqqI/ShrochJzwRXHnnbCTPRtNb9NyjOeU
 DZCBASPhrYfBJz3n0fZR2HCnpMZwCSGBgmVn3jmh3YyxKnILdQ4DxKgJCv730jwh
 9T1+1g2/MW554Gted7KLlkE+aj+BzORx6XJ9H8SKmYB85NF5KnnJMiVktjfl4Jr4
 A8meV9KGwAcyBOU=
 =8HBN
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20190702' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit updates from Paul Moore:
 "This pull request is a bit early, but with some vacation time coming
  up I wanted to send this out now just in case the remote Internet Gods
  decide not to smile on me once the merge window opens. The patchset
  for v5.3 is pretty minor this time, the highlights include:

   - When the audit daemon is sent a signal, ensure we deliver
     information about the sender even when syscall auditing is not
     enabled/supported.

   - Add the ability to filter audit records based on network address
     family.

   - Tighten the audit field filtering restrictions on string based
     fields.

   - Cleanup the audit field filtering verification code.

   - Remove a few BUG() calls from the audit code"

* tag 'audit-pr-20190702' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: remove the BUG() calls in the audit rule comparison functions
  audit: enforce op for string fields
  audit: add saddr_fam filter field
  audit: re-structure audit field valid checks
  audit: deliver signal_info regarless of syscall
2019-07-08 18:55:42 -07:00
Masahiro Yamada
7199ff7d74 kheaders: include only headers into kheaders_data.tar.xz
Currently, kheaders_data.tar.xz contains some build scripts as well as
headers. None of them is needed in the header archive.

For ARCH=x86, this commit excludes the following from the archive:

  arch/x86/include/asm/Kbuild
  arch/x86/include/uapi/asm/Kbuild
  include/asm-generic/Kbuild
  include/config/auto.conf
  include/config/kernel.release
  include/config/tristate.conf
  include/uapi/asm-generic/Kbuild
  include/uapi/Kbuild
  kernel/gen_kheaders.sh

This change is actually motivated for the planned header compile-testing
because it will generate more build artifacts, which should not be
included in the archive.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2019-07-09 10:10:52 +09:00
Masahiro Yamada
b60b7c2ea9 kheaders: remove meaningless -R option of 'ls'
The -R option of 'ls' is supposed to be used for directories.

   -R, --recursive
          list subdirectories recursively

Since 'find ... -type f' only matches to regular files, we do not
expect directories passed to the 'ls' command here.

Giving -R is harmless at least, but unneeded.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2019-07-09 10:10:52 +09:00
Linus Torvalds
dad1c12ed8 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:

 - Remove the unused per rq load array and all its infrastructure, by
   Dietmar Eggemann.

 - Add utilization clamping support by Patrick Bellasi. This is a
   refinement of the energy aware scheduling framework with support for
   boosting of interactive and capping of background workloads: to make
   sure critical GUI threads get maximum frequency ASAP, and to make
   sure background processing doesn't unnecessarily move to cpufreq
   governor to higher frequencies and less energy efficient CPU modes.

 - Add the bare minimum of tracepoints required for LISA EAS regression
   testing, by Qais Yousef - which allows automated testing of various
   power management features, including energy aware scheduling.

 - Restructure the former tsk_nr_cpus_allowed() facility that the -rt
   kernel used to modify the scheduler's CPU affinity logic such as
   migrate_disable() - introduce the task->cpus_ptr value instead of
   taking the address of &task->cpus_allowed directly - by Sebastian
   Andrzej Siewior.

 - Misc optimizations, fixes, cleanups and small enhancements - see the
   Git log for details.

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  sched/uclamp: Add uclamp support to energy_compute()
  sched/uclamp: Add uclamp_util_with()
  sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks
  sched/uclamp: Set default clamps for RT tasks
  sched/uclamp: Reset uclamp values on RESET_ON_FORK
  sched/uclamp: Extend sched_setattr() to support utilization clamping
  sched/core: Allow sched_setattr() to use the current policy
  sched/uclamp: Add system default clamps
  sched/uclamp: Enforce last task's UCLAMP_MAX
  sched/uclamp: Add bucket local max tracking
  sched/uclamp: Add CPU's clamp buckets refcounting
  sched/fair: Rename weighted_cpuload() to cpu_runnable_load()
  sched/debug: Export the newly added tracepoints
  sched/debug: Add sched_overutilized tracepoint
  sched/debug: Add new tracepoint to track PELT at se level
  sched/debug: Add new tracepoints to track PELT at rq level
  sched/debug: Add a new sched_trace_*() helper functions
  sched/autogroup: Make autogroup_path() always available
  sched/wait: Deduplicate code with do-while
  sched/topology: Remove unused 'sd' parameter from arch_scale_cpu_capacity()
  ...
2019-07-08 16:39:53 -07:00
Linus Torvalds
e192832869 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "The main changes in this cycle are:

   - rwsem scalability improvements, phase #2, by Waiman Long, which are
     rather impressive:

       "On a 2-socket 40-core 80-thread Skylake system with 40 reader
        and writer locking threads, the min/mean/max locking operations
        done in a 5-second testing window before the patchset were:

         40 readers, Iterations Min/Mean/Max = 1,807/1,808/1,810
         40 writers, Iterations Min/Mean/Max = 1,807/50,344/151,255

        After the patchset, they became:

         40 readers, Iterations Min/Mean/Max = 30,057/31,359/32,741
         40 writers, Iterations Min/Mean/Max = 94,466/95,845/97,098"

     There's a lot of changes to the locking implementation that makes
     it similar to qrwlock, including owner handoff for more fair
     locking.

     Another microbenchmark shows how across the spectrum the
     improvements are:

       "With a locking microbenchmark running on 5.1 based kernel, the
        total locking rates (in kops/s) on a 2-socket Skylake system
        with equal numbers of readers and writers (mixed) before and
        after this patchset were:

        # of Threads   Before Patch      After Patch
        ------------   ------------      -----------
             2            2,618             4,193
             4            1,202             3,726
             8              802             3,622
            16              729             3,359
            32              319             2,826
            64              102             2,744"

     The changes are extensive and the patch-set has been through
     several iterations addressing various locking workloads. There
     might be more regressions, but unless they are pathological I
     believe we want to use this new implementation as the baseline
     going forward.

   - jump-label optimizations by Daniel Bristot de Oliveira: the primary
     motivation was to remove IPI disturbance of isolated RT-workload
     CPUs, which resulted in the implementation of batched jump-label
     updates. Beyond the improvement of the real-time characteristics
     kernel, in one test this patchset improved static key update
     overhead from 57 msecs to just 1.4 msecs - which is a nice speedup
     as well.

   - atomic64_t cross-arch type cleanups by Mark Rutland: over the last
     ~10 years of atomic64_t existence the various types used by the
     APIs only had to be self-consistent within each architecture -
     which means they became wildly inconsistent across architectures.
     Mark puts and end to this by reworking all the atomic64
     implementations to use 's64' as the base type for atomic64_t, and
     to ensure that this type is consistently used for parameters and
     return values in the API, avoiding further problems in this area.

   - A large set of small improvements to lockdep by Yuyang Du: type
     cleanups, output cleanups, function return type and othr cleanups
     all around the place.

   - A set of percpu ops cleanups and fixes by Peter Zijlstra.

   - Misc other changes - please see the Git log for more details"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (82 commits)
  locking/lockdep: increase size of counters for lockdep statistics
  locking/atomics: Use sed(1) instead of non-standard head(1) option
  locking/lockdep: Move mark_lock() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING
  x86/jump_label: Make tp_vec_nr static
  x86/percpu: Optimize raw_cpu_xchg()
  x86/percpu, sched/fair: Avoid local_clock()
  x86/percpu, x86/irq: Relax {set,get}_irq_regs()
  x86/percpu: Relax smp_processor_id()
  x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
  locking/rwsem: Guard against making count negative
  locking/rwsem: Adaptive disabling of reader optimistic spinning
  locking/rwsem: Enable time-based spinning on reader-owned rwsem
  locking/rwsem: Make rwsem->owner an atomic_long_t
  locking/rwsem: Enable readers spinning on writer
  locking/rwsem: Clarify usage of owner's nonspinaable bit
  locking/rwsem: Wake up almost all readers in wait queue
  locking/rwsem: More optimal RT task handling of null owner
  locking/rwsem: Always release wait_lock before waking up tasks
  locking/rwsem: Implement lock handoff to prevent lock starvation
  locking/rwsem: Make rwsem_spin_on_owner() return owner state
  ...
2019-07-08 16:12:03 -07:00
Linus Torvalds
46f1ec23a4 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The changes in this cycle are:

   - RCU flavor consolidation cleanups and optmizations

   - Documentation updates

   - Miscellaneous fixes

   - SRCU updates

   - RCU-sync flavor consolidation

   - Torture-test updates

   - Linux-kernel memory-consistency-model updates, most notably the
     addition of plain C-language accesses"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits)
  tools/memory-model: Improve data-race detection
  tools/memory-model: Change definition of rcu-fence
  tools/memory-model: Expand definition of barrier
  tools/memory-model: Do not use "herd" to refer to "herd7"
  tools/memory-model: Fix comment in MP+poonceonces.litmus
  Documentation: atomic_t.txt: Explain ordering provided by smp_mb__{before,after}_atomic()
  rcu: Don't return a value from rcu_assign_pointer()
  rcu: Force inlining of rcu_read_lock()
  rcu: Fix irritating whitespace error in rcu_assign_pointer()
  rcu: Upgrade sync_exp_work_done() to smp_mb()
  rcutorture: Upper case solves the case of the vanishing NULL pointer
  torture: Suppress propagating trace_printk() warning
  rcutorture: Dump trace buffer for callback pipe drain failures
  torture: Add --trust-make to suppress "make clean"
  torture: Make --cpus override idleness calculations
  torture: Run kernel build in source directory
  torture: Add function graph-tracing cheat sheet
  torture: Capture qemu output
  rcutorture: Tweak kvm options
  rcutorture: Add trivial RCU implementation
  ...
2019-07-08 15:45:14 -07:00
Linus Torvalds
0902d5011c Merge branch 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x96 apic updates from Thomas Gleixner:
 "Updates for the x86 APIC interrupt handling and APIC timer:

   - Fix a long standing issue with spurious interrupts which was caused
     by the big vector management rework a few years ago. Robert Hodaszi
     provided finally enough debug data and an excellent initial failure
     analysis which allowed to understand the underlying issues.

     This contains a change to the core interrupt management code which
     is required to handle this correctly for the APIC/IO_APIC. The core
     changes are NOOPs for most architectures except ARM64. ARM64 is not
     impacted by the change as confirmed by Marc Zyngier.

   - Newer systems allow to disable the PIT clock for power saving
     causing panic in the timer interrupt delivery check of the IO/APIC
     when the HPET timer is not enabled either. While the clock could be
     turned on this would cause an endless whack a mole game to chase
     the proper register in each affected chipset.

     These systems provide the relevant frequencies for TSC, CPU and the
     local APIC timer via CPUID and/or MSRs, which allows to avoid the
     PIT/HPET based calibration. As the calibration code is the only
     usage of the legacy timers on modern systems and is skipped anyway
     when the frequencies are known already, there is no point in
     setting up the PIT and actually checking for the interrupt delivery
     via IO/APIC.

     To achieve this on a wide variety of platforms, the CPUID/MSR based
     frequency readout has been made more robust, which also allowed to
     remove quite some workarounds which turned out to be not longer
     required. Thanks to Daniel Drake for analysis, patches and
     verification"

* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/irq: Seperate unused system vectors from spurious entry again
  x86/irq: Handle spurious interrupt after shutdown gracefully
  x86/ioapic: Implement irq_get_irqchip_state() callback
  genirq: Add optional hardware synchronization for shutdown
  genirq: Fix misleading synchronize_irq() documentation
  genirq: Delay deactivation in free_irq()
  x86/timer: Skip PIT initialization on modern chipsets
  x86/apic: Use non-atomic operations when possible
  x86/apic: Make apic_bsp_setup() static
  x86/tsc: Set LAPIC timer period to crystal clock frequency
  x86/apic: Rename 'lapic_timer_frequency' to 'lapic_timer_period'
  x86/tsc: Use CPUID.0x16 to calculate missing crystal frequency
2019-07-08 11:22:57 -07:00
Linus Torvalds
927ba67a63 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "The timer and timekeeping departement delivers:

  Core:

   - The consolidation of the VDSO code into a generic library including
     the conversion of x86 and ARM64. Conversion of ARM and MIPS are en
     route through the relevant maintainer trees and should end up in
     5.4.

     This gets rid of the unnecessary different copies of the same code
     and brings all architectures on the same level of VDSO
     functionality.

   - Make the NTP user space interface more robust by restricting the
     TAI offset to prevent undefined behaviour. Includes a selftest.

   - Validate user input in the compat settimeofday() syscall to catch
     invalid values which would be turned into valid values by a
     multiplication overflow

   - Consolidate the time accessors

   - Small fixes, improvements and cleanups all over the place

  Drivers:

   - Support for the NXP system counter, TI davinci timer

   - Move the Microsoft HyperV clocksource/events code into the
     drivers/clocksource directory so it can be shared between x86 and
     ARM64.

   - Overhaul of the Tegra driver

   - Delay timer support for IXP4xx

   - Small fixes, improvements and cleanups as usual"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
  time: Validate user input in compat_settimeofday()
  timer: Document TIMER_PINNED
  clocksource/drivers: Continue making Hyper-V clocksource ISA agnostic
  clocksource/drivers: Make Hyper-V clocksource ISA agnostic
  MAINTAINERS: Fix Andy's surname and the directory entries of VDSO
  hrtimer: Use a bullet for the returns bullet list
  arm64: vdso: Fix compilation with clang older than 8
  arm64: compat: Fix __arch_get_hw_counter() implementation
  arm64: Fix __arch_get_hw_counter() implementation
  lib/vdso: Make delta calculation work correctly
  MAINTAINERS: Add entry for the generic VDSO library
  arm64: compat: No need for pre-ARMv7 barriers on an ARMv8 system
  arm64: vdso: Remove unnecessary asm-offsets.c definitions
  vdso: Remove superfluous #ifdef __KERNEL__ in vdso/datapage.h
  clocksource/drivers/davinci: Add support for clocksource
  clocksource/drivers/davinci: Add support for clockevents
  clocksource/drivers/tegra: Set up maximum-ticks limit properly
  clocksource/drivers/tegra: Cycles can't be 0
  clocksource/drivers/tegra: Restore base address before cleanup
  clocksource/drivers/tegra: Add verbose definition for 1MHz constant
  ...
2019-07-08 11:06:29 -07:00
Linus Torvalds
2a1ccd3142 Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
 "The irq departement provides the usual mixed bag:

  Core:

   - Further improvements to the irq timings code which aims to predict
     the next interrupt for power state selection to achieve better
     latency/power balance

   - Add interrupt statistics to the core NMI handlers

   - The usual small fixes and cleanups

  Drivers:

   - Support for Renesas RZ/A1, Annapurna Labs FIC, Meson-G12A SoC and
     Amazon Gravition AMR/GIC interrupt controllers.

   - Rework of the Renesas INTC controller driver

   - ACPI support for Socionext SoCs

   - Enhancements to the CSKY interrupt controller

   - The usual small fixes and cleanups"

* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
  irq/irqdomain: Fix comment typo
  genirq: Update irq stats from NMI handlers
  irqchip/gic-pm: Remove PM_CLK dependency
  irqchip/al-fic: Introduce Amazon's Annapurna Labs Fabric Interrupt Controller Driver
  dt-bindings: interrupt-controller: Add Amazon's Annapurna Labs FIC
  softirq: Use __this_cpu_write() in takeover_tasklets()
  irqchip/mbigen: Stop printing kernel addresses
  irqchip/gic: Add dependency for ARM_GIC_MAX_NR
  genirq/affinity: Remove unused argument from [__]irq_build_affinity_masks()
  genirq/timings: Add selftest for next event computation
  genirq/timings: Add selftest for irqs circular buffer
  genirq/timings: Add selftest for circular array
  genirq/timings: Encapsulate storing function
  genirq/timings: Encapsulate timings push
  genirq/timings: Optimize the period detection speed
  genirq/timings: Fix timings buffer inspection
  genirq/timings: Fix next event index function
  irqchip/qcom: Use struct_size() in devm_kzalloc()
  irqchip/irq-csky-mpintc: Remove unnecessary loop in interrupt handler
  dt-bindings: interrupt-controller: Update csky mpintc
  ...
2019-07-08 11:01:13 -07:00
Linus Torvalds
e0e86b111b Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull SMP/hotplug updates from Thomas Gleixner:
 "A small set of updates for SMP and CPU hotplug:

   - Abort disabling secondary CPUs in the freezer when a wakeup is
     pending instead of evaluating it only after all CPUs have been
     offlined.

   - Remove the shared annotation for the strict per CPU cfd_data in the
     smp function call core code.

   - Remove the return values of smp_call_function() and on_each_cpu()
     as they are unconditionally 0. Fixup the few callers which actually
     bothered to check the return value"

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  smp: Remove smp_call_function() and on_each_cpu() return values
  smp: Do not mark call_function_data as shared
  cpu/hotplug: Abort disabling secondary CPUs if wakeup is pending
  cpu/hotplug: Fix notify_cpu_starting() reference in bringup_wait_for_ap()
2019-07-08 10:39:56 -07:00
Linus Torvalds
1758feddb0 s390 updates for the 5.3 merge window
- Improve stop_machine wait logic: replace cpu_relax_yield call in generic
    stop_machine function with a weak stop_machine_yield function. This is
    overridden on s390, which yields the current cpu to the neighbouring cpu
    after a couple of retries, instead of blindly giving up the cpu to the
    hipervisor. This significantly improves stop_machine performance on s390 in
    overcommitted scenarios.
    This includes common code changes which have been Acked by Peter Zijlstra
    and Thomas Gleixner.
 
  - Improve jump label transformation speed: transform jump labels without
    using stop_machine.
 
  - Refactoring of the vfio-ccw cp handling, simplifying the code and
    avoiding unneeded allocating/copying.
 
  - Various vfio-ccw fixes (ccw translation, state machine).
 
  - Add support for vfio-ap queue interrupt control in the guest.
    This includes s390 kvm changes which have been Acked by Christian
    Borntraeger.
 
  - Add protected virtualization support for virtio-ccw.
 
  - Enforce both CONFIG_SMP and CONFIG_HOTPLUG_CPU, which allows to remove some
    code which most likely isn't working at all, besides that s390 didn't even
    compile for !CONFIG_SMP.
 
  - Support for special flagged EP11 CPRBs for zcrypt.
 
  - Handle PCI devices with no support for new MIO instructions.
 
  - Avoid KASAN false positives in reworked stack unwinder.
 
  - Couple of fixes for the QDIO layer.
 
  - Convert s390 specific documentation to ReST format.
 
  - Let s390 crypto modules return -ENODEV instead of -EOPNOTSUPP if hardware is
    missing. This way our modules behave like most other modules and which is
    also what systemd's systemd-modules-load.service expects.
 
  - Replace defconfig with performance_defconfig, so there is one config file
    less to maintain.
 
  - Remove the SCLP call home device driver, which was never useful.
 
  - Cleanups all over the place.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAl0iEpcACgkQjYWKoQLX
 FBgtZwf8DOJ6COUG91jKP0RSDlc2YvIMBxopQ38ql1lIsTj5t6DvJ2z3X5uct1wy
 6mMiF01VuyD4V4UXbTJQrihzNx7D4dUh47s2sS+diGHxJyXacVxlmjS5k+6pLIUO
 AyLvtCcoqDPPiThqnSTZFRm/TcfO/25fCG/IdjrFGj1MD09wHpUCh16tmRPTGFlC
 BWZeilDT77fVXnh7Ggn3JB0mQay5PAw2ODOxELHTUBaLmYF8RJPPVKBPmXGl9P1W
 84ESm2p+iALGGWDiTOUad9eu8wyQci/V/R+hFgs0Bz/HRcjznNH5EVvfQNCD4VNF
 g/PET10nIQYZv2BNdi0cwRjR9jCFbw==
 =jp0i
 -----END PGP SIGNATURE-----

Merge tag 's390-5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 updates from Vasily Gorbik:

 - Improve stop_machine wait logic: replace cpu_relax_yield call in
   generic stop_machine function with a weak stop_machine_yield
   function. This is overridden on s390, which yields the current cpu to
   the neighbouring cpu after a couple of retries, instead of blindly
   giving up the cpu to the hipervisor. This significantly improves
   stop_machine performance on s390 in overcommitted scenarios.

   This includes common code changes which have been Acked by Peter
   Zijlstra and Thomas Gleixner.

 - Improve jump label transformation speed: transform jump labels
   without using stop_machine.

 - Refactoring of the vfio-ccw cp handling, simplifying the code and
   avoiding unneeded allocating/copying.

 - Various vfio-ccw fixes (ccw translation, state machine).

 - Add support for vfio-ap queue interrupt control in the guest. This
   includes s390 kvm changes which have been Acked by Christian
   Borntraeger.

 - Add protected virtualization support for virtio-ccw.

 - Enforce both CONFIG_SMP and CONFIG_HOTPLUG_CPU, which allows to
   remove some code which most likely isn't working at all, besides that
   s390 didn't even compile for !CONFIG_SMP.

 - Support for special flagged EP11 CPRBs for zcrypt.

 - Handle PCI devices with no support for new MIO instructions.

 - Avoid KASAN false positives in reworked stack unwinder.

 - Couple of fixes for the QDIO layer.

 - Convert s390 specific documentation to ReST format.

 - Let s390 crypto modules return -ENODEV instead of -EOPNOTSUPP if
   hardware is missing. This way our modules behave like most other
   modules and which is also what systemd's systemd-modules-load.service
   expects.

 - Replace defconfig with performance_defconfig, so there is one config
   file less to maintain.

 - Remove the SCLP call home device driver, which was never useful.

 - Cleanups all over the place.

* tag 's390-5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (83 commits)
  docs: s390: s390dbf: typos and formatting, update crash command
  docs: s390: unify and update s390dbf kdocs at debug.c
  docs: s390: restore important non-kdoc parts of s390dbf.rst
  vfio-ccw: Fix the conversion of Format-0 CCWs to Format-1
  s390/pci: correctly handle MIO opt-out
  s390/pci: deal with devices that have no support for MIO instructions
  s390: ap: kvm: Enable PQAP/AQIC facility for the guest
  s390: ap: implement PAPQ AQIC interception in kernel
  vfio: ap: register IOMMU VFIO notifier
  s390: ap: kvm: add PQAP interception for AQIC
  s390/unwind: cleanup unused READ_ONCE_TASK_STACK
  s390/kasan: avoid false positives during stack unwind
  s390/qdio: don't touch the dsci in tiqdio_add_input_queues()
  s390/qdio: (re-)initialize tiqdio list entries
  s390/dasd: Fix a precision vs width bug in dasd_feature_list()
  s390/cio: introduce driver_override on the css bus
  vfio-ccw: make convert_ccw0_to_ccw1 static
  vfio-ccw: Remove copy_ccw_from_iova()
  vfio-ccw: Factor out the ccw0-to-ccw1 transition
  vfio-ccw: Copy CCW data outside length calculation
  ...
2019-07-08 10:06:12 -07:00
Linus Torvalds
dfd437a257 arm64 updates for 5.3:
- arm64 support for syscall emulation via PTRACE_SYSEMU{,_SINGLESTEP}
 
 - Wire up VM_FLUSH_RESET_PERMS for arm64, allowing the core code to
   manage the permissions of executable vmalloc regions more strictly
 
 - Slight performance improvement by keeping softirqs enabled while
   touching the FPSIMD/SVE state (kernel_neon_begin/end)
 
 - Expose a couple of ARMv8.5 features to user (HWCAP): CondM (new XAFLAG
   and AXFLAG instructions for floating point comparison flags
   manipulation) and FRINT (rounding floating point numbers to integers)
 
 - Re-instate ARM64_PSEUDO_NMI support which was previously marked as
   BROKEN due to some bugs (now fixed)
 
 - Improve parking of stopped CPUs and implement an arm64-specific
   panic_smp_self_stop() to avoid warning on not being able to stop
   secondary CPUs during panic
 
 - perf: enable the ARM Statistical Profiling Extensions (SPE) on ACPI
   platforms
 
 - perf: DDR performance monitor support for iMX8QXP
 
 - cache_line_size() can now be set from DT or ACPI/PPTT if provided to
   cope with a system cache info not exposed via the CPUID registers
 
 - Avoid warning on hardware cache line size greater than
   ARCH_DMA_MINALIGN if the system is fully coherent
 
 - arm64 do_page_fault() and hugetlb cleanups
 
 - Refactor set_pte_at() to avoid redundant READ_ONCE(*ptep)
 
 - Ignore ACPI 5.1 FADTs reported as 5.0 (infer from the 'arm_boot_flags'
   introduced in 5.1)
 
 - CONFIG_RANDOMIZE_BASE now enabled in defconfig
 
 - Allow the selection of ARM64_MODULE_PLTS, currently only done via
   RANDOMIZE_BASE (and an erratum workaround), allowing modules to spill
   over into the vmalloc area
 
 - Make ZONE_DMA32 configurable
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAl0eHqcACgkQa9axLQDI
 XvFyNA/+L+bnkz8m3ncydlqqfXomQn4eJJVQ8Uksb0knJz+1+3CUxxbO4ry4jXZN
 fMkbggYrDPRKpDbsUl0lsRipj7jW9bqan+N37c3SWqCkgb6HqDaHViwxdx6Ec/Uk
 gHudozDSPh/8c7hxGcSyt/CFyuW6b+8eYIQU5rtIgz8aVY2BypBvS/7YtYCbIkx0
 w4CFleRTK1zXD5mJQhrc6jyDx659sVkrAvdhf6YIymOY8nBTv40vwdNo3beJMYp8
 Po/+0Ixu+VkHUNtmYYZQgP/AGH96xiTcRnUqd172JdtRPpCLqnLqwFokXeVIlUKT
 KZFMDPzK+756Ayn4z4huEePPAOGlHbJje8JVNnFyreKhVVcCotW7YPY/oJR10bnc
 eo7yD+DxABTn+93G2yP436bNVa8qO1UqjOBfInWBtnNFJfANIkZweij/MQ6MjaTA
 o7KtviHnZFClefMPoiI7HDzwL8XSmsBDbeQ04s2Wxku1Y2xUHLx4iLmadwLQ1ZPb
 lZMTZP3N/T1554MoURVA1afCjAwiqU3bt1xDUGjbBVjLfSPBAn/25IacsG9Li9AF
 7Rp1M9VhrfLftjFFkB2HwpbhRASOxaOSx+EI3kzEfCtM2O9I1WHgP3rvCdc3l0HU
 tbK0/IggQicNgz7GSZ8xDlWPwwSadXYGLys+xlMZEYd3pDIOiFc=
 =0TDT
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Catalin Marinas:

 - arm64 support for syscall emulation via PTRACE_SYSEMU{,_SINGLESTEP}

 - Wire up VM_FLUSH_RESET_PERMS for arm64, allowing the core code to
   manage the permissions of executable vmalloc regions more strictly

 - Slight performance improvement by keeping softirqs enabled while
   touching the FPSIMD/SVE state (kernel_neon_begin/end)

 - Expose a couple of ARMv8.5 features to user (HWCAP): CondM (new
   XAFLAG and AXFLAG instructions for floating point comparison flags
   manipulation) and FRINT (rounding floating point numbers to integers)

 - Re-instate ARM64_PSEUDO_NMI support which was previously marked as
   BROKEN due to some bugs (now fixed)

 - Improve parking of stopped CPUs and implement an arm64-specific
   panic_smp_self_stop() to avoid warning on not being able to stop
   secondary CPUs during panic

 - perf: enable the ARM Statistical Profiling Extensions (SPE) on ACPI
   platforms

 - perf: DDR performance monitor support for iMX8QXP

 - cache_line_size() can now be set from DT or ACPI/PPTT if provided to
   cope with a system cache info not exposed via the CPUID registers

 - Avoid warning on hardware cache line size greater than
   ARCH_DMA_MINALIGN if the system is fully coherent

 - arm64 do_page_fault() and hugetlb cleanups

 - Refactor set_pte_at() to avoid redundant READ_ONCE(*ptep)

 - Ignore ACPI 5.1 FADTs reported as 5.0 (infer from the
   'arm_boot_flags' introduced in 5.1)

 - CONFIG_RANDOMIZE_BASE now enabled in defconfig

 - Allow the selection of ARM64_MODULE_PLTS, currently only done via
   RANDOMIZE_BASE (and an erratum workaround), allowing modules to spill
   over into the vmalloc area

 - Make ZONE_DMA32 configurable

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (54 commits)
  perf: arm_spe: Enable ACPI/Platform automatic module loading
  arm_pmu: acpi: spe: Add initial MADT/SPE probing
  ACPI/PPTT: Add function to return ACPI 6.3 Identical tokens
  ACPI/PPTT: Modify node flag detection to find last IDENTICAL
  x86/entry: Simplify _TIF_SYSCALL_EMU handling
  arm64: rename dump_instr as dump_kernel_instr
  arm64/mm: Drop [PTE|PMD]_TYPE_FAULT
  arm64: Implement panic_smp_self_stop()
  arm64: Improve parking of stopped CPUs
  arm64: Expose FRINT capabilities to userspace
  arm64: Expose ARMv8.5 CondM capability to userspace
  arm64: defconfig: enable CONFIG_RANDOMIZE_BASE
  arm64: ARM64_MODULES_PLTS must depend on MODULES
  arm64: bpf: do not allocate executable memory
  arm64/kprobes: set VM_FLUSH_RESET_PERMS on kprobe instruction pages
  arm64/mm: wire up CONFIG_ARCH_HAS_SET_DIRECT_MAP
  arm64: module: create module allocations without exec permissions
  arm64: Allow user selection of ARM64_MODULE_PLTS
  acpi/arm64: ignore 5.1 FADTs that are reported as 5.0
  arm64: Allow selecting Pseudo-NMI again
  ...
2019-07-08 09:54:55 -07:00
Ingo Molnar
552a031ba1 Linux 5.2
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl0idTweHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGZesIAJDKicw2Voyx8K8m
 3pXSK+71RuO/d3Y9M51mdfTMKRP4PHR9/4wVZ9wHPwC4dV6wxgsmIYCF69a1Wety
 LD1MpDCP1DK5wVfPNKVX2xmj7ua6iutPtSsJHzdzM2TlscgsrFKjmUccqJ5JLwL5
 c34nqwXWnzzRyI5Ga9cQSlwzAXq0vDHXyML3AnCosSsLX0lKFrHlK1zttdOPNkfj
 dXRN62g3q+9kVQozzhDXb8atZZ7IkBk8Q0lujpNXW83Ci1VjaVNv3SB8GZTXIlLj
 U15VdyuwfJDfpBgFBN6/unzVaAB6FFrEKy0jT1aeTyKarMKDKgOnJjn10aKjDNno
 /bXsKKc=
 =TVqV
 -----END PGP SIGNATURE-----

Merge tag 'v5.2' into perf/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-08 18:04:41 +02:00
YueHaibing
6705fea0c7 bpf: cgroup: Fix build error without CONFIG_NET
If CONFIG_NET is not set and CONFIG_CGROUP_BPF=y,
gcc building fails:

kernel/bpf/cgroup.o: In function `cg_sockopt_func_proto':
cgroup.c:(.text+0x237e): undefined reference to `bpf_sk_storage_get_proto'
cgroup.c:(.text+0x2394): undefined reference to `bpf_sk_storage_delete_proto'
kernel/bpf/cgroup.o: In function `__cgroup_bpf_run_filter_getsockopt':
(.text+0x2a1f): undefined reference to `lock_sock_nested'
(.text+0x2ca2): undefined reference to `release_sock'
kernel/bpf/cgroup.o: In function `__cgroup_bpf_run_filter_setsockopt':
(.text+0x3006): undefined reference to `lock_sock_nested'
(.text+0x32bb): undefined reference to `release_sock'

Reported-by: Hulk Robot <hulkci@huawei.com>
Suggested-by: Stanislav Fomichev <sdf@fomichev.me>
Fixes: 0d01da6afc ("bpf: implement getsockopt and setsockopt hooks")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-08 17:17:00 +02:00
Rafael J. Wysocki
3dbeb44854 Merge branch 'pm-sleep'
* pm-sleep:
  PM: sleep: Drop dev_pm_skip_next_resume_phases()
  ACPI: PM: Drop unused function and function header
  ACPI: PM: Introduce "poweroff" callbacks for ACPI PM domain and LPSS
  ACPI: PM: Simplify and fix PM domain hibernation callbacks
  PCI: PM: Simplify bus-level hibernation callbacks
  PM: ACPI/PCI: Resume all devices during hibernation
  kernel: power: swap: use kzalloc() instead of kmalloc() followed by memset()
  PM: sleep: Update struct wakeup_source documentation
  drivers: base: power: remove wakeup_sources_stats_dentry variable
  PM: suspend: Rename pm_suspend_via_s2idle()
  PM: sleep: Show how long dpm_suspend_start() and dpm_suspend_end() take
  PM: hibernate: powerpc: Expose pfn_is_nosave() prototype
2019-07-08 10:51:25 +02:00
zhengbin
9176ab1b84 time: Validate user input in compat_settimeofday()
The user value is validated after converting the timeval to a timespec, but
for a wide range of negative tv_usec values the multiplication overflow turns
them in positive numbers. So the 'validated later' is not catching the
invalid input.

Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1562460701-113301-1-git-send-email-zhengbin13@huawei.com
2019-07-07 12:05:40 +02:00
Zenghui Yu
3a1d24ca95 irq/irqdomain: Fix comment typo
Fix typo in the comment on top of __irq_domain_add().

Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1562388072-23492-1-git-send-email-yuzenghui@huawei.com
2019-07-06 10:40:20 +02:00
Shijith Thotton
c09cb12935 genirq: Update irq stats from NMI handlers
The NMI handlers handle_percpu_devid_fasteoi_nmi() and handle_fasteoi_nmi()
do not update the interrupt counts. Due to that the NMI interrupt count
does not show up correctly in /proc/interrupts.

Add the statistics and treat the NMI handlers in the same way as per cpu
interrupts and prevent them from updating irq_desc::tot_count as this might
be corrupted due to concurrency.

[ tglx: Massaged changelog ]

Fixes: 2dcf1fbcad ("genirq: Provide NMI handlers")
Signed-off-by: Shijith Thotton <sthotton@marvell.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1562313336-11888-1-git-send-email-sthotton@marvell.com
2019-07-06 10:40:19 +02:00
David S. Miller
c4cde5804d Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2019-07-03

The following pull-request contains BPF updates for your *net-next* tree.

There is a minor merge conflict in mlx5 due to 8960b38932 ("linux/dim:
Rename externally used net_dim members") which has been pulled into your
tree in the meantime, but resolution seems not that bad ... getting current
bpf-next out now before there's coming more on mlx5. ;) I'm Cc'ing Saeed
just so he's aware of the resolution below:

** First conflict in drivers/net/ethernet/mellanox/mlx5/core/en_main.c:

  <<<<<<< HEAD
  static int mlx5e_open_cq(struct mlx5e_channel *c,
                           struct dim_cq_moder moder,
                           struct mlx5e_cq_param *param,
                           struct mlx5e_cq *cq)
  =======
  int mlx5e_open_cq(struct mlx5e_channel *c, struct net_dim_cq_moder moder,
                    struct mlx5e_cq_param *param, struct mlx5e_cq *cq)
  >>>>>>> e5a3e259ef

Resolution is to take the second chunk and rename net_dim_cq_moder into
dim_cq_moder. Also the signature for mlx5e_open_cq() in ...

  drivers/net/ethernet/mellanox/mlx5/core/en.h +977

... and in mlx5e_open_xsk() ...

  drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c +64

... needs the same rename from net_dim_cq_moder into dim_cq_moder.

** Second conflict in drivers/net/ethernet/mellanox/mlx5/core/en_main.c:

  <<<<<<< HEAD
          int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix));
          struct dim_cq_moder icocq_moder = {0, 0};
          struct net_device *netdev = priv->netdev;
          struct mlx5e_channel *c;
          unsigned int irq;
  =======
          struct net_dim_cq_moder icocq_moder = {0, 0};
  >>>>>>> e5a3e259ef

Take the second chunk and rename net_dim_cq_moder into dim_cq_moder
as well.

Let me know if you run into any issues. Anyway, the main changes are:

1) Long-awaited AF_XDP support for mlx5e driver, from Maxim.

2) Addition of two new per-cgroup BPF hooks for getsockopt and
   setsockopt along with a new sockopt program type which allows more
   fine-grained pass/reject settings for containers. Also add a sock_ops
   callback that can be selectively enabled on a per-socket basis and is
   executed for every RTT to help tracking TCP statistics, both features
   from Stanislav.

3) Follow-up fix from loops in precision tracking which was not propagating
   precision marks and as a result verifier assumed that some branches were
   not taken and therefore wrongly removed as dead code, from Alexei.

4) Fix BPF cgroup release synchronization race which could lead to a
   double-free if a leaf's cgroup_bpf object is released and a new BPF
   program is attached to the one of ancestor cgroups in parallel, from Roman.

5) Support for bulking XDP_TX on veth devices which improves performance
   in some cases by around 9%, from Toshiaki.

6) Allow for lookups into BPF devmap and improve feedback when calling into
   bpf_redirect_map() as lookup is now performed right away in the helper
   itself, from Toke.

7) Add support for fq's Earliest Departure Time to the Host Bandwidth
   Manager (HBM) sample BPF program, from Lawrence.

8) Various cleanups and minor fixes all over the place from many others.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-04 12:48:21 -07:00
Jann Horn
6994eefb00 ptrace: Fix ->ptracer_cred handling for PTRACE_TRACEME
Fix two issues:

When called for PTRACE_TRACEME, ptrace_link() would obtain an RCU
reference to the parent's objective credentials, then give that pointer
to get_cred().  However, the object lifetime rules for things like
struct cred do not permit unconditionally turning an RCU reference into
a stable reference.

PTRACE_TRACEME records the parent's credentials as if the parent was
acting as the subject, but that's not the case.  If a malicious
unprivileged child uses PTRACE_TRACEME and the parent is privileged, and
at a later point, the parent process becomes attacker-controlled
(because it drops privileges and calls execve()), the attacker ends up
with control over two processes with a privileged ptrace relationship,
which can be abused to ptrace a suid binary and obtain root privileges.

Fix both of these by always recording the credentials of the process
that is requesting the creation of the ptrace relationship:
current_cred() can't change under us, and current is the proper subject
for access control.

This change is theoretically userspace-visible, but I am not aware of
any code that it will actually break.

Fixes: 64b875f7ac ("ptrace: Capture the ptracer's creds not PT_PTRACE_CAP")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-05 02:00:41 +09:00
Linus Torvalds
550d1f5bda This includes three fixes:
- Fixes a deadlock from a previous fix to keep module loading
    and function tracing text modifications from stepping on each other.
    (this has a few patches to help document the issue in comments)
 
  - Fix a crash when the snapshot buffer gets out of sync with the
    main ring buffer.
 
  - Fix a memory leak when reading the memory logs
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXRzBCBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qnDaAP9qTFBOFtgIGCT5wVP8xjQeESxh1b8R
 tbaT7/U2oPpeiwEAvp1mYo5UYcc8KauBqVaLSLJVN4pv07xiZF5Qgh9C1QE=
 =m2IT
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "This includes three fixes:

   - Fix a deadlock from a previous fix to keep module loading and
     function tracing text modifications from stepping on each other
     (this has a few patches to help document the issue in comments)

   - Fix a crash when the snapshot buffer gets out of sync with the main
     ring buffer

   - Fix a memory leak when reading the memory logs"

* tag 'trace-v5.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace/x86: Anotate text_mutex split between ftrace_arch_code_modify_post_process() and ftrace_arch_code_modify_prepare()
  tracing/snapshot: Resize spare buffer if size changed
  tracing: Fix memory leak in tracing_err_log_open()
  ftrace/x86: Add a comment to why we take text_mutex in ftrace_arch_code_modify_prepare()
  ftrace/x86: Remove possible deadlock between register_kprobe() and ftrace_run_update_code()
2019-07-04 10:26:17 +09:00
David S. Miller
c3ead2df97 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2019-07-03

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix the interpreter to properly handle BPF_ALU32 | BPF_ARSH
   on BE architectures, from Jiong.

2) Fix several bugs in the x32 BPF JIT for handling shifts by 0,
   from Luke and Xi.

3) Fix NULL pointer deref in btf_type_is_resolve_source_only(),
   from Stanislav.

4) Properly handle the check that forwarding is enabled on the device
   in bpf_ipv6_fib_lookup() helper code, from Anton.

5) Fix UAPI bpf_prog_info fields alignment for archs that have 16 bit
   alignment such as m68k, from Baruch.

6) Fix kernel hanging in unregister_netdevice loop while unregistering
   device bound to XDP socket, from Ilya.

7) Properly terminate tail update in xskq_produce_flush_desc(), from Nathan.

8) Fix broken always_inline handling in test_lwt_seg6local, from Jiri.

9) Fix bpftool to use correct argument in cgroup errors, from Jakub.

10) Fix detaching dummy prog in XDP redirect sample code, from Prashant.

11) Add Jonathan to AF_XDP reviewers, from Björn.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03 12:09:00 -07:00
Greg Kroah-Hartman
1be51474f9 swiotlb: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20190612144314.GA16803@kroah.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03 16:57:18 +02:00
Alexei Starovoitov
a3ce685dd0 bpf: fix precision tracking
When equivalent state is found the current state needs to propagate precision marks.
Otherwise the verifier will prune the search incorrectly.

There is a price for correctness:
                      before      before    broken    fixed
                      cnst spill  precise   precise
bpf_lb-DLB_L3.o       1923        8128      1863      1898
bpf_lb-DLB_L4.o       3077        6707      2468      2666
bpf_lb-DUNKNOWN.o     1062        1062      544       544
bpf_lxc-DDROP_ALL.o   166729      380712    22629     36823
bpf_lxc-DUNKNOWN.o    174607      440652    28805     45325
bpf_netdev.o          8407        31904     6801      7002
bpf_overlay.o         5420        23569     4754      4858
bpf_lxc_jit.o         39389       359445    50925     69631
Overall precision tracking is still very effective.

Fixes: b5dc0163d8 ("bpf: precise scalar_value tracking")
Reported-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Tested-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03 11:12:14 +02:00
Thomas Gleixner
3419240495 Merge branch 'timers/vdso' into timers/core
so the hyper-v clocksource update can be applied.
2019-07-03 10:50:21 +02:00
Thomas Gleixner
62e0468650 genirq: Add optional hardware synchronization for shutdown
free_irq() ensures that no hardware interrupt handler is executing on a
different CPU before actually releasing resources and deactivating the
interrupt completely in a domain hierarchy.

But that does not catch the case where the interrupt is on flight at the
hardware level but not yet serviced by the target CPU. That creates an
interesing race condition:

   CPU 0                  CPU 1               IRQ CHIP

                                              interrupt is raised
                                              sent to CPU1
			  Unable to handle
			  immediately
			  (interrupts off,
			   deep idle delay)
   mask()
   ...
   free()
     shutdown()
     synchronize_irq()
     release_resources()
                          do_IRQ()
                            -> resources are not available

That might be harmless and just trigger a spurious interrupt warning, but
some interrupt chips might get into a wedged state.

Utilize the existing irq_get_irqchip_state() callback for the
synchronization in free_irq().

synchronize_hardirq() is not using this mechanism as it might actually
deadlock unter certain conditions, e.g. when called with interrupts
disabled and the target CPU is the one on which the synchronization is
invoked. synchronize_irq() uses it because that function cannot be called
from non preemtible contexts as it might sleep.

No functional change intended and according to Marc the existing GIC
implementations where the driver supports the callback should be able
to cope with that core change. Famous last words.

Fixes: 464d12309e ("x86/vector: Switch IOAPIC to global reservation mode")
Reported-by: Robert Hodaszi <Robert.Hodaszi@digi.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Tested-by: Marc Zyngier <marc.zyngier@arm.com>
Link: https://lkml.kernel.org/r/20190628111440.279463375@linutronix.de
2019-07-03 10:12:29 +02:00
Thomas Gleixner
1d21f2af85 genirq: Fix misleading synchronize_irq() documentation
The function might sleep, so it cannot be called from interrupt
context. Not even with care.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Link: https://lkml.kernel.org/r/20190628111440.189241552@linutronix.de
2019-07-03 10:12:29 +02:00
Thomas Gleixner
4001d8e876 genirq: Delay deactivation in free_irq()
When interrupts are shutdown, they are immediately deactivated in the
irqdomain hierarchy. While this looks obviously correct there is a subtle
issue:

There might be an interrupt in flight when free_irq() is invoking the
shutdown. This is properly handled at the irq descriptor / primary handler
level, but the deactivation might completely disable resources which are
required to acknowledge the interrupt.

Split the shutdown code and deactivate the interrupt after synchronization
in free_irq(). Fixup all other usage sites where this is not an issue to
invoke the combined shutdown_and_deactivate() function instead.

This still might be an issue if the interrupt in flight servicing is
delayed on a remote CPU beyond the invocation of synchronize_irq(), but
that cannot be handled at that level and needs to be handled in the
synchronize_irq() context.

Fixes: f8264e3496 ("irqdomain: Introduce new interfaces to support hierarchy irqdomains")
Reported-by: Robert Hodaszi <Robert.Hodaszi@digi.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Link: https://lkml.kernel.org/r/20190628111440.098196390@linutronix.de
2019-07-03 10:12:28 +02:00
Thomas Gleixner
7e8e6816c6 stacktrace: Use PF_KTHREAD to check for kernel threads
!current->mm is not a reliable indicator for kernel threads as they might
temporarily use a user mm. Check for PF_KTHREAD instead.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1907021750100.1802@nanos.tec.linutronix.de
2019-07-03 09:04:06 +02:00
Jason Gunthorpe
cc5dfd59e3 Merge branch 'hmm-devmem-cleanup.4' into rdma.git hmm
Christoph Hellwig says:

====================
Below is a series that cleans up the dev_pagemap interface so that it is
more easily usable, which removes the need to wrap it in hmm and thus
allowing to kill a lot of code

Changes since v3:
 - pull in "mm/swap: Fix release_pages() when releasing devmap pages" and
   rebase the other patches on top of that
 - fold the hmm_devmem_add_resource into the DEVICE_PUBLIC memory removal
   patch
 - remove _vm_normal_page as it isn't needed without DEVICE_PUBLIC memory
 - pick up various ACKs

Changes since v2:
 - fix nvdimm kunit build
 - add a new memory type for device dax
 - fix a few issues in intermediate patches that didn't show up in the end
   result
 - incorporate feedback from Michal Hocko, including killing of
   the DEVICE_PUBLIC memory type entirely

Changes since v1:
 - rebase
 - also switch p2pdma to the internal refcount
 - add type checking for pgmap->type
 - rename the migrate method to migrate_to_ram
 - cleanup the altmap_valid flag
 - various tidbits from the reviews
====================

Conflicts resolved by:
 - Keeping Ira's version of the code in swap.c
 - Using the delete for the section in hmm.rst
 - Using the delete for the devmap code in hmm.c and .h

* branch 'hmm-devmem-cleanup.4': (24 commits)
  mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR
  mm: remove the HMM config option
  mm: sort out the DEVICE_PRIVATE Kconfig mess
  mm: simplify ZONE_DEVICE page private data
  mm: remove hmm_devmem_add
  mm: remove hmm_vma_alloc_locked_page
  nouveau: use devm_memremap_pages directly
  nouveau: use alloc_page_vma directly
  PCI/P2PDMA: use the dev_pagemap internal refcount
  device-dax: use the dev_pagemap internal refcount
  memremap: provide an optional internal refcount in struct dev_pagemap
  memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
  memremap: remove the data field in struct dev_pagemap
  memremap: add a migrate_to_ram method to struct dev_pagemap_ops
  memremap: lift the devmap_enable manipulation into devm_memremap_pages
  memremap: pass a struct dev_pagemap to ->kill and ->cleanup
  memremap: move dev_pagemap callbacks into a separate structure
  memremap: validate the pagemap type passed to devm_memremap_pages
  mm: factor out a devm_request_free_mem_region helper
  mm: export alloc_pages_vma
  ...

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-07-02 15:10:45 -03:00
Jason Gunthorpe
9ec3f4cb35 Linux 5.2-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl0YK7ceHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGWfcH/36ep8GZHY9H1ARV
 RJJoGoMnwENoq2o4eKhH3iZgUIGPq2uonazequhePwnIsrOdFGT7AeMHWSW7W0o4
 wNlNFdUrTe0bvU00m+YtDwNIqgNCnFEoUbqn9H+VhAAWpSydKvhh2mlebTFO50KN
 hb9+jh59Q8tbxrQdCuNF6yJATdf4hcj1V/ZZMGgF34kx+dFY4wOooSfu/eaIxXIl
 fBDKN9K4Mmw8HWJvebV+ocOMZ7Zqknt1lbjx69OxpJmgxhb2Ks7heqSZanLTBPBB
 oZxOlEdNPSyOjBQUlsDC2S8VJ7g5gINZk1JcFjByzE7cIPOQ2UXE72R++wwANngm
 SR054NQ=
 =WlA8
 -----END PGP SIGNATURE-----

Merge tag 'v5.2-rc7' into rdma.git hmm

Required for dependencies in the next patches.
2019-07-02 14:34:43 -03:00
Christoph Hellwig
24917f6b10 memremap: provide an optional internal refcount in struct dev_pagemap
Provide an internal refcounting logic if no ->ref field is provided
in the pagemap passed into devm_memremap_pages so that callers don't
have to reinvent it poorly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-07-02 14:32:44 -03:00
Christoph Hellwig
514caf23a7 memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
Add a flags field to struct dev_pagemap to replace the altmap_valid
boolean to be a little more extensible.  Also add a pgmap_altmap() helper
to find the optional altmap and clean up the code using the altmap using
it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-07-02 14:32:44 -03:00
Christoph Hellwig
80a72d0af0 memremap: remove the data field in struct dev_pagemap
struct dev_pagemap is always embedded into a containing structure, so
there is no need to an additional private data field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-07-02 14:32:44 -03:00
Christoph Hellwig
897e6365cd memremap: add a migrate_to_ram method to struct dev_pagemap_ops
This replaces the hacky ->fault callback, which is currently directly
called from common code through a hmm specific data structure as an
exercise in layering violations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-07-02 14:32:44 -03:00
Christoph Hellwig
f6a55e1a3f memremap: lift the devmap_enable manipulation into devm_memremap_pages
Just check if there is a ->page_free operation set and take care of the
static key enable, as well as the put using device managed resources.
Also check that a ->page_free is provided for the pgmaps types that
require it, and check for a valid type as well while we are at it.

Note that this also fixes the fact that hmm never called
dev_pagemap_put_ops and thus would leave the slow path enabled forever,
even after a device driver unload or disable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-07-02 14:32:44 -03:00
Christoph Hellwig
d8668bb045 memremap: pass a struct dev_pagemap to ->kill and ->cleanup
Passing the actual typed structure leads to more understandable code
vs just passing the ref member.

Reported-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-07-02 14:32:44 -03:00
Christoph Hellwig
1e240e8d4a memremap: move dev_pagemap callbacks into a separate structure
The dev_pagemap is a growing too many callbacks.  Move them into a
separate ops structure so that they are not duplicated for multiple
instances, and an attacker can't easily overwrite them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-07-02 14:32:44 -03:00
Christoph Hellwig
3ed2dcdf54 memremap: validate the pagemap type passed to devm_memremap_pages
Most pgmap types are only supported when certain config options are
enabled.  Check for a type that is valid for the current configuration
before setting up the pagemap.  For this the usage of the 0 type for
device dax gets replaced with an explicit MEMORY_DEVICE_DEVDAX type.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-07-02 14:32:44 -03:00
Christoph Hellwig
0092908d16 mm: factor out a devm_request_free_mem_region helper
Keep the physical address allocation that hmm_add_device does with the
rest of the resource code, and allow future reuse of it without the hmm
wrapper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-07-02 14:32:44 -03:00
Christian Brauner
28dd29c06d
fork: return proper negative error code
Make sure to return a proper negative error code from copy_process()
when anon_inode_getfile() fails with CLONE_PIDFD.
Otherwise _do_fork() will not detect an error and get_task_pid() will
operator on a nonsensical pointer:

R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006dbc2c
R13: 00007ffc15fbb0ff R14: 00007ff07e47e9c0 R15: 0000000000000000
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 7990 Comm: syz-executor290 Not tainted 5.2.0-rc6+ #9
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline]
RIP: 0010:get_task_pid+0xe1/0x210 kernel/pid.c:372
Code: 89 ff e8 62 27 5f 00 49 8b 07 44 89 f1 4c 8d bc c8 90 01 00 00 eb 0c
e8 0d fe 25 00 49 81 c7 38 05 00 00 4c 89 f8 48 c1 e8 03 <80> 3c 18 00 74
08 4c 89 ff e8 31 27 5f 00 4d 8b 37 e8 f9 47 12 00
RSP: 0018:ffff88808a4a7d78 EFLAGS: 00010203
RAX: 00000000000000a7 RBX: dffffc0000000000 RCX: ffff888088180600
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff88808a4a7d90 R08: ffffffff814fb3a8 R09: ffffed1015d66bf8
R10: ffffed1015d66bf8 R11: 1ffff11015d66bf7 R12: 0000000000041ffc
R13: 1ffff11011494fbc R14: 0000000000000000 R15: 000000000000053d
FS:  00007ff07e47e700(0000) GS:ffff8880aeb00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004b5100 CR3: 0000000094df2000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  _do_fork+0x1b9/0x5f0 kernel/fork.c:2360
  __do_sys_clone kernel/fork.c:2454 [inline]
  __se_sys_clone kernel/fork.c:2448 [inline]
  __x64_sys_clone+0xc1/0xd0 kernel/fork.c:2448
  do_syscall_64+0xfe/0x140 arch/x86/entry/common.c:301
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Link: https://lore.kernel.org/lkml/000000000000e0dc0d058c9e7142@google.com
Reported-and-tested-by: syzbot+002e636502bc4b64eb5c@syzkaller.appspotmail.com
Fixes: 6fd2fe494b ("copy_process(): don't use ksys_close() on cleanups")
Cc: Jann Horn <jannh@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christian Brauner <christian@brauner.io>
2019-07-01 16:43:30 +02:00
Jens Axboe
5be1f9d82f Linux 5.2-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl0Os1seHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGtx4H/j6i482XzcGFKTBm
 A7mBoQpy+kLtoUov4EtBAR62OuwI8rsahW9di37QKndPoQrczWaKBmr3De6LCdPe
 v3pl3O6wBbvH5ru+qBPFX9PdNbDvimEChh7LHxmMxNQq3M+AjZAZVJyfpoiFnx35
 Fbge+LZaH/k8HMwZmkMr5t9Mpkip715qKg2o9Bua6dkH0AqlcpLlC8d9a+HIVw/z
 aAsyGSU8jRwhoAOJsE9bJf0acQ/pZSqmFp0rDKqeFTSDMsbDRKLGq/dgv4nW0RiW
 s7xqsjb/rdcvirRj3rv9+lcTVkOtEqwk0PVdL9WOf7g4iYrb3SOIZh8ZyViaDSeH
 VTS5zps=
 =huBY
 -----END PGP SIGNATURE-----

Merge tag 'v5.2-rc6' into for-5.3/block

Merge 5.2-rc6 into for-5.3/block, so we get the same page merge leak
fix. Otherwise we end up having conflicts with future patches between
for-5.3/block and master that touch this area. In particular, it makes
the bio_full() fix hard to backport to stable.

* tag 'v5.2-rc6': (482 commits)
  Linux 5.2-rc6
  Revert "iommu/vt-d: Fix lock inversion between iommu->lock and device_domain_lock"
  Bluetooth: Fix regression with minimum encryption key size alignment
  tcp: refine memory limit test in tcp_fragment()
  x86/vdso: Prevent segfaults due to hoisted vclock reads
  SUNRPC: Fix a credential refcount leak
  Revert "SUNRPC: Declare RPC timers as TIMER_DEFERRABLE"
  net :sunrpc :clnt :Fix xps refcount imbalance on the error path
  NFS4: Only set creation opendata if O_CREAT
  ARM: 8867/1: vdso: pass --be8 to linker if necessary
  KVM: nVMX: reorganize initial steps of vmx_set_nested_state
  KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries
  habanalabs: use u64_to_user_ptr() for reading user pointers
  nfsd: replace Jeff by Chuck as nfsd co-maintainer
  inet: clear num_timeout reqsk_alloc()
  PCI/P2PDMA: Ignore root complex whitelist when an IOMMU is present
  net: mvpp2: debugfs: Add pmap to fs dump
  ipv6: Default fib6_type to RTN_UNICAST when not set
  net: hns3: Fix inconsistent indenting
  net/af_iucv: always register net_device notifier
  ...
2019-07-01 08:16:08 -06:00
Prakhar Srivastava
6a31fcd4cf KEXEC: Call ima_kexec_cmdline to measure the boot command line args
During soft reboot(kexec_file_load) boot command line
arguments are not measured.

Call ima hook ima_kexec_cmdline to measure the boot command line
arguments into IMA measurement list.

- call ima_kexec_cmdline from kexec_file_load.
- move the call ima_add_kexec_buffer after the cmdline
args have been measured.

Signed-off-by: Prakhar Srivastava <prsriva02@gmail.com>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2019-06-30 17:54:39 -04:00
Linus Torvalds
7c15f41e87 Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull SMP fixes from Thomas Gleixner:
 "Two small changes for the cpu hotplug code:

   - Prevent out of bounds access which actually might crash the machine
     caused by a missing bounds check in the fail injection code

   - Warn about unsupported migitation mode command line arguments to
     make people aware that they typoed the paramater. Not necessarily a
     fix but quite some people tripped over that"

* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu/hotplug: Fix out-of-bounds read when setting fail state
  cpu/speculation: Warn on unsupported mitigations= parameter
2019-06-30 11:19:17 +08:00
Linus Torvalds
57103eb7c6 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Various fixes, most of them related to bugs perf fuzzing found in the
  x86 code"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/regs: Use PERF_REG_EXTENDED_MASK
  perf/x86: Remove pmu->pebs_no_xmm_regs
  perf/x86: Clean up PEBS_XMM_REGS
  perf/x86/regs: Check reserved bits
  perf/x86: Disable extended registers for non-supported PMUs
  perf/ioctl: Add check for the sample_period value
  perf/core: Fix perf_sample_regs_user() mm check
2019-06-29 19:39:17 +08:00
Linus Torvalds
2407e48606 Power management fix for 5.2-rc7
Avoid skipping bus-level PCI power management during system
 resume for PCIe ports left in D0 during the preceding suspend
 transition on platforms where the power states of those ports
 can change out of the PCI layer's control.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl0XJzcSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxylwQAI8owd3eQV6UNDybkT5MiP0lWb9nbl83
 2ouxla+FtAzRFJC0yW4RK86cW4i/Yl8767KV2yqX/69ftmz4XhZBJ63ijKAEoG6o
 tHFyY7twy7Sr0MvPRD9rtjUkmdOx9z0OFKHgLhSzC/V4PvgGZTt+eYBm1Bp3icZp
 ZY9CFx/bSt9tURY//VqXhvBWT6pEpn1B1D7hsiAp041EwhtTONNs7xAa7ucIP+aG
 Ufyb0waVYmiFCX+Lrt/gHzEO2YIpTHIUw3DaMcbR8plHc1gpYtbuZ2ZMScgt2TgL
 f0s7GeMOXtF3sODOd/1mhg127ShWbqUkf8EHDyU3JAWa9aesLr3BoFGtKyAT1rbg
 O9nyJGBGj5ByUNefua0S8+q0kWI2XHdLAQ8CHBlBQx5W1x1Yg2EeV2Kosxjuhfdp
 5K9wFIiPG0F/rtGoAA61dMH9tt87NnY8PgeCyHLFUCoJbhySWr18kwrwrdkimqa5
 9FR8OTa8CHGQ/0bPvw+w8S9FdxiEM6yw4wuMLIy3c+a22+lgIiPvkgqzdsWYULdX
 CrI62jvz5SvoTwK/UEp9PrCnnHbp4crbSp73Vgo1o1bi5eeaaSobRECq+IbN0T3P
 X1H/xn+18mUqmCg4WtDX++14Fe1rMHoe/5CqqE/mp8aCqE9q/3fbAs9INnWJcyrP
 a2O0Wk0jLE76
 =eGQi
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.2-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fix from Rafael Wysocki:
 "Avoid skipping bus-level PCI power management during system resume for
  PCIe ports left in D0 during the preceding suspend transition on
  platforms where the power states of those ports can change out of the
  PCI layer's control"

* tag 'pm-5.2-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PCI: PM: Avoid skipping bus-level PM on platforms without ACPI
2019-06-29 19:29:45 +08:00
Andrea Arcangeli
1bf4580e00 fork,memcg: alloc_thread_stack_node needs to set tsk->stack
Commit 5eed6f1dff ("fork,memcg: fix crash in free_thread_stack on
memcg charge fail") corrected two instances, but there was a third
instance of this bug.

Without setting tsk->stack, if memcg_charge_kernel_stack fails, it'll
execute free_thread_stack() on a dangling pointer.

Enterprise kernels are compiled with VMAP_STACK=y so this isn't
critical, but custom VMAP_STACK=n builds should have some performance
advantage, with the drawback of risking to fail fork because compaction
didn't succeed.  So as long as VMAP_STACK=n is a supported option it's
worth fixing it upstream.

Link: http://lkml.kernel.org/r/20190619011450.28048-1-aarcange@redhat.com
Fixes: 9b6f7e163c ("mm: rework memcg kernel stack accounting")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-29 16:43:45 +08:00
Oleg Nesterov
97abc889ee signal: remove the wrong signal_pending() check in restore_user_sigmask()
This is the minimal fix for stable, I'll send cleanups later.

Commit 854a6ed568 ("signal: Add restore_user_sigmask()") introduced
the visible change which breaks user-space: a signal temporary unblocked
by set_user_sigmask() can be delivered even if the caller returns
success or timeout.

Change restore_user_sigmask() to accept the additional "interrupted"
argument which should be used instead of signal_pending() check, and
update the callers.

Eric said:

: For clarity.  I don't think this is required by posix, or fundamentally to
: remove the races in select.  It is what linux has always done and we have
: applications who care so I agree this fix is needed.
:
: Further in any case where the semantic change that this patch rolls back
: (aka where allowing a signal to be delivered and the select like call to
: complete) would be advantage we can do as well if not better by using
: signalfd.
:
: Michael is there any chance we can get this guarantee of the linux
: implementation of pselect and friends clearly documented.  The guarantee
: that if the system call completes successfully we are guaranteed that no
: signal that is unblocked by using sigmask will be delivered?

Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com
Fixes: 854a6ed568 ("signal: Add restore_user_sigmask()")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Eric Wong <e@80x24.org>
Tested-by: Eric Wong <e@80x24.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: <stable@vger.kernel.org>	[5.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-29 16:43:45 +08:00
Toke Høiland-Jørgensen
0cdbb4b09a devmap: Allow map lookups from eBPF
We don't currently allow lookups into a devmap from eBPF, because the map
lookup returns a pointer directly to the dev->ifindex, which shouldn't be
modifiable from eBPF.

However, being able to do lookups in devmaps is useful to know (e.g.)
whether forwarding to a specific interface is enabled. Currently, programs
work around this by keeping a shadow map of another type which indicates
whether a map index is valid.

Since we now have a flag to make maps read-only from the eBPF side, we can
simply lift the lookup restriction if we make sure this flag is always set.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-29 01:31:09 +02:00
Toke Høiland-Jørgensen
d5df2830ca devmap/cpumap: Use flush list instead of bitmap
The socket map uses a linked list instead of a bitmap to keep track of
which entries to flush. Do the same for devmap and cpumap, as this means we
don't have to care about the map index when enqueueing things into the
map (and so we can cache the map lookup).

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-29 01:31:08 +02:00
Toke Høiland-Jørgensen
c8af5cd75e xskmap: Move non-standard list manipulation to helper
Add a helper in list.h for the non-standard way of clearing a list that is
used in xskmap. This makes it easier to reuse it in the other map types,
and also makes sure this usage is not forgotten in any list refactorings in
the future.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-29 01:31:08 +02:00
Eiichi Tsukata
46cc0b4442 tracing/snapshot: Resize spare buffer if size changed
Current snapshot implementation swaps two ring_buffers even though their
sizes are different from each other, that can cause an inconsistency
between the contents of buffer_size_kb file and the current buffer size.

For example:

  # cat buffer_size_kb
  7 (expanded: 1408)
  # echo 1 > events/enable
  # grep bytes per_cpu/cpu0/stats
  bytes: 1441020
  # echo 1 > snapshot             // current:1408, spare:1408
  # echo 123 > buffer_size_kb     // current:123,  spare:1408
  # echo 1 > snapshot             // current:1408, spare:123
  # grep bytes per_cpu/cpu0/stats
  bytes: 1443700
  # cat buffer_size_kb
  123                             // != current:1408

And also, a similar per-cpu case hits the following WARNING:

Reproducer:

  # echo 1 > per_cpu/cpu0/snapshot
  # echo 123 > buffer_size_kb
  # echo 1 > per_cpu/cpu0/snapshot

WARNING:

  WARNING: CPU: 0 PID: 1946 at kernel/trace/trace.c:1607 update_max_tr_single.part.0+0x2b8/0x380
  Modules linked in:
  CPU: 0 PID: 1946 Comm: bash Not tainted 5.2.0-rc6 #20
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
  RIP: 0010:update_max_tr_single.part.0+0x2b8/0x380
  Code: ff e8 dc da f9 ff 0f 0b e9 88 fe ff ff e8 d0 da f9 ff 44 89 ee bf f5 ff ff ff e8 33 dc f9 ff 41 83 fd f5 74 96 e8 b8 da f9 ff <0f> 0b eb 8d e8 af da f9 ff 0f 0b e9 bf fd ff ff e8 a3 da f9 ff 48
  RSP: 0018:ffff888063e4fca0 EFLAGS: 00010093
  RAX: ffff888066214380 RBX: ffffffff99850fe0 RCX: ffffffff964298a8
  RDX: 0000000000000000 RSI: 00000000fffffff5 RDI: 0000000000000005
  RBP: 1ffff1100c7c9f96 R08: ffff888066214380 R09: ffffed100c7c9f9b
  R10: ffffed100c7c9f9a R11: 0000000000000003 R12: 0000000000000000
  R13: 00000000ffffffea R14: ffff888066214380 R15: ffffffff99851060
  FS:  00007f9f8173c700(0000) GS:ffff88806d000000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000714dc0 CR3: 0000000066fa6000 CR4: 00000000000006f0
  Call Trace:
   ? trace_array_printk_buf+0x140/0x140
   ? __mutex_lock_slowpath+0x10/0x10
   tracing_snapshot_write+0x4c8/0x7f0
   ? trace_printk_init_buffers+0x60/0x60
   ? selinux_file_permission+0x3b/0x540
   ? tracer_preempt_off+0x38/0x506
   ? trace_printk_init_buffers+0x60/0x60
   __vfs_write+0x81/0x100
   vfs_write+0x1e1/0x560
   ksys_write+0x126/0x250
   ? __ia32_sys_read+0xb0/0xb0
   ? do_syscall_64+0x1f/0x390
   do_syscall_64+0xc1/0x390
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

This patch adds resize_buffer_duplicate_size() to check if there is a
difference between current/spare buffer sizes and resize a spare buffer
if necessary.

Link: http://lkml.kernel.org/r/20190625012910.13109-1-devel@etsukata.com

Cc: stable@vger.kernel.org
Fixes: ad909e21bb ("tracing: Add internal tracing_snapshot() functions")
Signed-off-by: Eiichi Tsukata <devel@etsukata.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-06-28 14:58:52 -04:00
Takeshi Misawa
d122ed6288 tracing: Fix memory leak in tracing_err_log_open()
When tracing_err_log_open() calls seq_open(), allocated memory is not freed.

kmemleak report:

unreferenced object 0xffff92c0781d1100 (size 128):
  comm "tail", pid 15116, jiffies 4295163855 (age 22.704s)
  hex dump (first 32 bytes):
    00 f0 08 e5 c0 92 ff ff 00 10 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<000000000d0687d5>] kmem_cache_alloc+0x11f/0x1e0
    [<000000003e3039a8>] seq_open+0x2f/0x90
    [<000000008dd36b7d>] tracing_err_log_open+0x67/0x140
    [<000000005a431ae2>] do_dentry_open+0x1df/0x3a0
    [<00000000a2910603>] vfs_open+0x2f/0x40
    [<0000000038b0a383>] path_openat+0x2e8/0x1690
    [<00000000fe025bda>] do_filp_open+0x9b/0x110
    [<00000000483a5091>] do_sys_open+0x1ba/0x260
    [<00000000c558b5fd>] __x64_sys_openat+0x20/0x30
    [<000000006881ec07>] do_syscall_64+0x5a/0x130
    [<00000000571c2e94>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fix this by calling seq_release() in tracing_err_log_fops.release().

Link: http://lkml.kernel.org/r/20190628105640.GA1863@DESKTOP

Fixes: 8a062902be ("tracing: Add tracing error log")
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Takeshi Misawa <jeliantsurux@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-06-28 14:57:23 -04:00
Petr Mladek
d5b844a2cf ftrace/x86: Remove possible deadlock between register_kprobe() and ftrace_run_update_code()
The commit 9f255b632b ("module: Fix livepatch/ftrace module text
permissions race") causes a possible deadlock between register_kprobe()
and ftrace_run_update_code() when ftrace is using stop_machine().

The existing dependency chain (in reverse order) is:

-> #1 (text_mutex){+.+.}:
       validate_chain.isra.21+0xb32/0xd70
       __lock_acquire+0x4b8/0x928
       lock_acquire+0x102/0x230
       __mutex_lock+0x88/0x908
       mutex_lock_nested+0x32/0x40
       register_kprobe+0x254/0x658
       init_kprobes+0x11a/0x168
       do_one_initcall+0x70/0x318
       kernel_init_freeable+0x456/0x508
       kernel_init+0x22/0x150
       ret_from_fork+0x30/0x34
       kernel_thread_starter+0x0/0xc

-> #0 (cpu_hotplug_lock.rw_sem){++++}:
       check_prev_add+0x90c/0xde0
       validate_chain.isra.21+0xb32/0xd70
       __lock_acquire+0x4b8/0x928
       lock_acquire+0x102/0x230
       cpus_read_lock+0x62/0xd0
       stop_machine+0x2e/0x60
       arch_ftrace_update_code+0x2e/0x40
       ftrace_run_update_code+0x40/0xa0
       ftrace_startup+0xb2/0x168
       register_ftrace_function+0x64/0x88
       klp_patch_object+0x1a2/0x290
       klp_enable_patch+0x554/0x980
       do_one_initcall+0x70/0x318
       do_init_module+0x6e/0x250
       load_module+0x1782/0x1990
       __s390x_sys_finit_module+0xaa/0xf0
       system_call+0xd8/0x2d0

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(text_mutex);
                               lock(cpu_hotplug_lock.rw_sem);
                               lock(text_mutex);
  lock(cpu_hotplug_lock.rw_sem);

It is similar problem that has been solved by the commit 2d1e38f566
("kprobes: Cure hotplug lock ordering issues"). Many locks are involved.
To be on the safe side, text_mutex must become a low level lock taken
after cpu_hotplug_lock.rw_sem.

This can't be achieved easily with the current ftrace design.
For example, arm calls set_all_modules_text_rw() already in
ftrace_arch_code_modify_prepare(), see arch/arm/kernel/ftrace.c.
This functions is called:

  + outside stop_machine() from ftrace_run_update_code()
  + without stop_machine() from ftrace_module_enable()

Fortunately, the problematic fix is needed only on x86_64. It is
the only architecture that calls set_all_modules_text_rw()
in ftrace path and supports livepatching at the same time.

Therefore it is enough to move text_mutex handling from the generic
kernel/trace/ftrace.c into arch/x86/kernel/ftrace.c:

   ftrace_arch_code_modify_prepare()
   ftrace_arch_code_modify_post_process()

This patch basically reverts the ftrace part of the problematic
commit 9f255b632b ("module: Fix livepatch/ftrace module
text permissions race"). And provides x86_64 specific-fix.

Some refactoring of the ftrace code will be needed when livepatching
is implemented for arm or nds32. These architectures call
set_all_modules_text_rw() and use stop_machine() at the same time.

Link: http://lkml.kernel.org/r/20190627081334.12793-1-pmladek@suse.com

Fixes: 9f255b632b ("module: Fix livepatch/ftrace module text permissions race")
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
[
  As reviewed by Miroslav Benes <mbenes@suse.cz>, removed return value of
  ftrace_run_update_code() as it is a void function.
]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-06-28 14:20:25 -04:00
Ingo Molnar
83086d654d Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull rcu/next + tools/memory-model changes from Paul E. McKenney:

 - RCU flavor consolidation cleanups and optmizations
 - Documentation updates
 - Miscellaneous fixes
 - SRCU updates
 - RCU-sync flavor consolidation
 - Torture-test updates
 - Linux-kernel memory-consistency-model updates, most notably the addition of plain C-language accesses

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-28 19:46:47 +02:00
Christian Brauner
32fcb426ec
pid: add pidfd_open()
This adds the pidfd_open() syscall. It allows a caller to retrieve pollable
pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a
process that is created via traditional fork()/clone() calls that is only
referenced by a PID:

int pidfd = pidfd_open(1234, 0);
ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0);

With the introduction of pidfds through CLONE_PIDFD it is possible to
created pidfds at process creation time.
However, a lot of processes get created with traditional PID-based calls
such as fork() or clone() (without CLONE_PIDFD). For these processes a
caller can currently not create a pollable pidfd. This is a problem for
Android's low memory killer (LMK) and service managers such as systemd.
Both are examples of tools that want to make use of pidfds to get reliable
notification of process exit for non-parents (pidfd polling) and race-free
signal sending (pidfd_send_signal()). They intend to switch to this API for
process supervision/management as soon as possible. Having no way to get
pollable pidfds from PID-only processes is one of the biggest blockers for
them in adopting this api. With pidfd_open() making it possible to retrieve
pidfds for PID-based processes we enable them to adopt this api.

In line with Arnd's recent changes to consolidate syscall numbers across
architectures, I have added the pidfd_open() syscall to all architectures
at the same time.

Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: David Howells <dhowells@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jannh@google.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
2019-06-28 12:17:55 +02:00
Joel Fernandes (Google)
b53b0b9d9a
pidfd: add polling support
This patch adds polling support to pidfd.

Android low memory killer (LMK) needs to know when a process dies once
it is sent the kill signal. It does so by checking for the existence of
/proc/pid which is both racy and slow. For example, if a PID is reused
between when LMK sends a kill signal and checks for existence of the
PID, since the wrong PID is now possibly checked for existence.
Using the polling support, LMK will be able to get notified when a process
exists in race-free and fast way, and allows the LMK to do other things
(such as by polling on other fds) while awaiting the process being killed
to die.

For notification to polling processes, we follow the same existing
mechanism in the kernel used when the parent of the task group is to be
notified of a child's death (do_notify_parent). This is precisely when the
tasks waiting on a poll of pidfd are also awakened in this patch.

We have decided to include the waitqueue in struct pid for the following
reasons:
1. The wait queue has to survive for the lifetime of the poll. Including
   it in task_struct would not be option in this case because the task can
   be reaped and destroyed before the poll returns.

2. By including the struct pid for the waitqueue means that during
   de_thread(), the new thread group leader automatically gets the new
   waitqueue/pid even though its task_struct is different.

Appropriate test cases are added in the second patch to provide coverage of
all the cases the patch is handling.

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Jonathan Kowalski <bl0pbl33p@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@android.com
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Co-developed-by: Daniel Colascione <dancol@google.com>
Signed-off-by: Daniel Colascione <dancol@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Christian Brauner <christian@brauner.io>
2019-06-28 12:17:55 +02:00
Fuqian Huang
2f02a7ecd5 kernel: power: swap: use kzalloc() instead of kmalloc() followed by memset()
Use zeroing allocator instead of using allocator
followed with memset with 0

Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
[ rjw: Subject ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-06-28 10:20:39 +02:00
David S. Miller
d96ff269a0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The new route handling in ip_mc_finish_output() from 'net' overlapped
with the new support for returning congestion notifications from BPF
programs.

In order to handle this I had to take the dev_loopback_xmit() calls
out of the switch statement.

The aquantia driver conflicts were simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-27 21:06:39 -07:00
Stanislav Fomichev
0d01da6afc bpf: implement getsockopt and setsockopt hooks
Implement new BPF_PROG_TYPE_CGROUP_SOCKOPT program type and
BPF_CGROUP_{G,S}ETSOCKOPT cgroup hooks.

BPF_CGROUP_SETSOCKOPT can modify user setsockopt arguments before
passing them down to the kernel or bypass kernel completely.
BPF_CGROUP_GETSOCKOPT can can inspect/modify getsockopt arguments that
kernel returns.
Both hooks reuse existing PTR_TO_PACKET{,_END} infrastructure.

The buffer memory is pre-allocated (because I don't think there is
a precedent for working with __user memory from bpf). This might be
slow to do for each {s,g}etsockopt call, that's why I've added
__cgroup_bpf_prog_array_is_empty that exits early if there is nothing
attached to a cgroup. Note, however, that there is a race between
__cgroup_bpf_prog_array_is_empty and BPF_PROG_RUN_ARRAY where cgroup
program layout might have changed; this should not be a problem
because in general there is a race between multiple calls to
{s,g}etsocktop and user adding/removing bpf progs from a cgroup.

The return code of the BPF program is handled as follows:
* 0: EPERM
* 1: success, continue with next BPF program in the cgroup chain

v9:
* allow overwriting setsockopt arguments (Alexei Starovoitov):
  * use set_fs (same as kernel_setsockopt)
  * buffer is always kzalloc'd (no small on-stack buffer)

v8:
* use s32 for optlen (Andrii Nakryiko)

v7:
* return only 0 or 1 (Alexei Starovoitov)
* always run all progs (Alexei Starovoitov)
* use optval=0 as kernel bypass in setsockopt (Alexei Starovoitov)
  (decided to use optval=-1 instead, optval=0 might be a valid input)
* call getsockopt hook after kernel handlers (Alexei Starovoitov)

v6:
* rework cgroup chaining; stop as soon as bpf program returns
  0 or 2; see patch with the documentation for the details
* drop Andrii's and Martin's Acked-by (not sure they are comfortable
  with the new state of things)

v5:
* skip copy_to_user() and put_user() when ret == 0 (Martin Lau)

v4:
* don't export bpf_sk_fullsock helper (Martin Lau)
* size != sizeof(__u64) for uapi pointers (Martin Lau)
* offsetof instead of bpf_ctx_range when checking ctx access (Martin Lau)

v3:
* typos in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY comments (Andrii Nakryiko)
* reverse christmas tree in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY (Andrii
  Nakryiko)
* use __bpf_md_ptr instead of __u32 for optval{,_end} (Martin Lau)
* use BPF_FIELD_SIZEOF() for consistency (Martin Lau)
* new CG_SOCKOPT_ACCESS macro to wrap repeated parts

v2:
* moved bpf_sockopt_kern fields around to remove a hole (Martin Lau)
* aligned bpf_sockopt_kern->buf to 8 bytes (Martin Lau)
* bpf_prog_array_is_empty instead of bpf_prog_array_length (Martin Lau)
* added [0,2] return code check to verifier (Martin Lau)
* dropped unused buf[64] from the stack (Martin Lau)
* use PTR_TO_SOCKET for bpf_sockopt->sk (Martin Lau)
* dropped bpf_target_off from ctx rewrites (Martin Lau)
* use return code for kernel bypass (Martin Lau & Andrii Nakryiko)

Cc: Andrii Nakryiko <andriin@fb.com>
Cc: Martin Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-06-27 15:25:16 -07:00
Mauro Carvalho Chehab
516337048f hrtimer: Use a bullet for the returns bullet list
That gets rid of this warning:

   ./kernel/time/hrtimer.c:1119: WARNING: Block quote ends without a blank line; unexpected unindent.

and displays nicely both at the source code and at the produced
documentation.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linux Doc Mailing List <linux-doc@vger.kernel.org>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Link: https://lkml.kernel.org/r/74ddad7dac331b4e5ce4a90e15c8a49e3a16d2ac.1561372382.git.mchehab+samsung@kernel.org
2019-06-27 23:30:04 +02:00
Thomas Gleixner
be69d00d97 workqueue: Remove GPF argument from alloc_workqueue_attrs()
All callers use GFP_KERNEL. No point in having that argument.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-06-27 14:12:19 -07:00
Thomas Gleixner
2c9858ecbe workqueue: Make alloc/apply/free_workqueue_attrs() static
None of those functions have any users outside of workqueue.c. Confine
them.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-06-27 14:12:15 -07:00
Roman Gushchin
e5c891a349 bpf: fix cgroup bpf release synchronization
Since commit 4bfc0bb2c6 ("bpf: decouple the lifetime of cgroup_bpf
from cgroup itself"), cgroup_bpf release occurs asynchronously
(from a worker context), and before the release of the cgroup itself.

This introduced a previously non-existing race between the release
and update paths. E.g. if a leaf's cgroup_bpf is released and a new
bpf program is attached to the one of ancestor cgroups at the same
time. The race may result in double-free and other memory corruptions.

To fix the problem, let's protect the body of cgroup_bpf_release()
with cgroup_mutex, as it was effectively previously, when all this
code was called from the cgroup release path with cgroup mutex held.

Also let's skip cgroups, which have no chances to invoke a bpf
program, on the update path. If the cgroup bpf refcnt reached 0,
it means that the cgroup is offline (no attached processes), and
there are no associated sockets left. It means there is no point
in updating effective progs array! And it can lead to a leak,
if it happens after the release. So, let's skip such cgroups.

Big thanks for Tejun Heo for discovering and debugging of this problem!

Fixes: 4bfc0bb2c6 ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-27 22:51:58 +02:00
Al Viro
6fd2fe494b
copy_process(): don't use ksys_close() on cleanups
anon_inode_getfd() should be used *ONLY* in situations when we are
guaranteed to be past the last failure point (including copying the
descriptor number to userland, at that).  And ksys_close() should
not be used for cleanups at all.

anon_inode_getfile() is there for all nontrivial cases like that.
Just use that...

Fixes: b3e5838252 ("clone: add CLONE_PIDFD")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner <christian@brauner.io>
2019-06-27 12:24:03 +02:00
Eiichi Tsukata
33d4a5a7a5 cpu/hotplug: Fix out-of-bounds read when setting fail state
Setting invalid value to /sys/devices/system/cpu/cpuX/hotplug/fail
can control `struct cpuhp_step *sp` address, results in the following
global-out-of-bounds read.

Reproducer:

  # echo -2 > /sys/devices/system/cpu/cpu0/hotplug/fail

KASAN report:

  BUG: KASAN: global-out-of-bounds in write_cpuhp_fail+0x2cd/0x2e0
  Read of size 8 at addr ffffffff89734438 by task bash/1941

  CPU: 0 PID: 1941 Comm: bash Not tainted 5.2.0-rc6+ #31
  Call Trace:
   write_cpuhp_fail+0x2cd/0x2e0
   dev_attr_store+0x58/0x80
   sysfs_kf_write+0x13d/0x1a0
   kernfs_fop_write+0x2bc/0x460
   vfs_write+0x1e1/0x560
   ksys_write+0x126/0x250
   do_syscall_64+0xc1/0x390
   entry_SYSCALL_64_after_hwframe+0x49/0xbe
  RIP: 0033:0x7f05e4f4c970

  The buggy address belongs to the variable:
   cpu_hotplug_lock+0x98/0xa0

  Memory state around the buggy address:
   ffffffff89734300: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00
   ffffffff89734380: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00
  >ffffffff89734400: 00 00 00 00 fa fa fa fa 00 00 00 00 fa fa fa fa
                                          ^
   ffffffff89734480: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
   ffffffff89734500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Add a sanity check for the value written from user space.

Fixes: 1db49484f2 ("smp/hotplug: Hotplug state fail injection")
Signed-off-by: Eiichi Tsukata <devel@etsukata.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: peterz@infradead.org
Link: https://lkml.kernel.org/r/20190627024732.31672-1-devel@etsukata.com
2019-06-27 09:34:04 +02:00
Al Viro
02e5ad9738 perf_event_get(): don't bother with fget_raw()
... since we immediately follow that with check that it *is* an
opened perf file, with O_PATH ones ending with with the same
-EBADF we'd get for descriptor that isn't opened at all.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-06-26 20:43:53 -04:00
Rafael J. Wysocki
471a739a47 PCI: PM: Avoid skipping bus-level PM on platforms without ACPI
There are platforms that do not call pm_set_suspend_via_firmware(),
so pm_suspend_via_firmware() returns 'false' on them, but the power
states of PCI devices (PCIe ports in particular) are changed as a
result of powering down core platform components during system-wide
suspend.  Thus the pm_suspend_via_firmware() checks in
pci_pm_suspend_noirq() and pci_pm_resume_noirq() introduced by
commit 3e26c5feed ("PCI: PM: Skip devices in D0 for suspend-to-
idle") are not sufficient to determine that devices left in D0
during suspend will remain in D0 during resume and so the bus-level
power management can be skipped for them.

For this reason, introduce a new global suspend flag,
PM_SUSPEND_FLAG_NO_PLATFORM, set it for suspend-to-idle only
and replace the pm_suspend_via_firmware() checks mentioned above
with checks against this flag.

Fixes: 3e26c5feed ("PCI: PM: Skip devices in D0 for suspend-to-idle")
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
2019-06-26 23:51:56 +02:00
David Howells
0f44e4d976 keys: Move the user and user-session keyrings to the user_namespace
Move the user and user-session keyrings to the user_namespace struct rather
than pinning them from the user_struct struct.  This prevents these
keyrings from propagating across user-namespaces boundaries with regard to
the KEY_SPEC_* flags, thereby making them more useful in a containerised
environment.

The issue is that a single user_struct may be represent UIDs in several
different namespaces.

The way the patch does this is by attaching a 'register keyring' in each
user_namespace and then sticking the user and user-session keyrings into
that.  It can then be searched to retrieve them.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jann Horn <jannh@google.com>
2019-06-26 21:02:32 +01:00
David Howells
b206f281d0 keys: Namespace keyring names
Keyring names are held in a single global list that any process can pick
from by means of keyctl_join_session_keyring (provided the keyring grants
Search permission).  This isn't very container friendly, however.

Make the following changes:

 (1) Make default session, process and thread keyring names begin with a
     '.' instead of '_'.

 (2) Keyrings whose names begin with a '.' aren't added to the list.  Such
     keyrings are system specials.

 (3) Replace the global list with per-user_namespace lists.  A keyring adds
     its name to the list for the user_namespace that it is currently in.

 (4) When a user_namespace is deleted, it just removes itself from the
     keyring name list.

The global keyring_name_lock is retained for accessing the name lists.
This allows (4) to work.

This can be tested by:

	# keyctl newring foo @s
	995906392
	# unshare -U
	$ keyctl show
	...
	 995906392 --alswrv  65534 65534   \_ keyring: foo
	...
	$ keyctl session foo
	Joined session keyring: 935622349

As can be seen, a new session keyring was created.

The capability bit KEYCTL_CAPS1_NS_KEYRING_NAME is set if the kernel is
employing this feature.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric W. Biederman <ebiederm@xmission.com>
2019-06-26 21:02:32 +01:00
Yang Yingliang
93651f80dc modules: fix compile error if don't have strict module rwx
If CONFIG_ARCH_HAS_STRICT_MODULE_RWX is not defined,
we need stub for module_enable_nx() and module_enable_x().

If CONFIG_ARCH_HAS_STRICT_MODULE_RWX is defined, but
CONFIG_STRICT_MODULE_RWX is disabled, we need stub for
module_enable_nx.

Move frob_text() outside of the CONFIG_STRICT_MODULE_RWX,
because it is needed anyway.

Fixes: 2eef1399a8 ("modules: fix BUG when load module with rodata=n")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2019-06-26 19:27:59 +02:00
Geert Uytterhoeven
1bf7272028 cpu/speculation: Warn on unsupported mitigations= parameter
Currently, if the user specifies an unsupported mitigation strategy on the
kernel command line, it will be ignored silently.  The code will fall back
to the default strategy, possibly leaving the system more vulnerable than
expected.

This may happen due to e.g. a simple typo, or, for a stable kernel release,
because not all mitigation strategies have been backported.

Inform the user by printing a message.

Fixes: 98af845294 ("cpu/speculation: Add 'mitigations=' cmdline option")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20190516070935.22546-1-geert@linux-m68k.org
2019-06-26 16:56:21 +02:00
Jiong Wang
75672dda27 bpf: fix BPF_ALU32 | BPF_ARSH on BE arches
Yauheni reported the following code do not work correctly on BE arches:

       ALU_ARSH_X:
               DST = (u64) (u32) ((*(s32 *) &DST) >> SRC);
               CONT;
       ALU_ARSH_K:
               DST = (u64) (u32) ((*(s32 *) &DST) >> IMM);
               CONT;

and are causing failure of test_verifier test 'arsh32 on imm 2' on BE
arches.

The code is taking address and interpreting memory directly, so is not
endianness neutral. We should instead perform standard C type casting on
the variable. A u64 to s32 conversion will drop the high 32-bit and reserve
the low 32-bit as signed integer, this is all we want.

Fixes: 2dc6b100f9 ("bpf: interpreter support BPF_ALU | BPF_ARSH")
Reported-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 14:48:20 +02:00
Yonghong Song
9db1ff0a41 bpf: fix compiler warning with CONFIG_MODULES=n
With CONFIG_MODULES=n, the following compiler warning occurs:
  /data/users/yhs/work/net-next/kernel/trace/bpf_trace.c:605:13: warning:
      ‘do_bpf_send_signal’ defined but not used [-Wunused-function]
  static void do_bpf_send_signal(struct irq_work *entry)

The __init function send_signal_irq_work_init(), which calls
do_bpf_send_signal(), is defined under CONFIG_MODULES. Hence,
when CONFIG_MODULES=n, nobody calls static function do_bpf_send_signal(),
hence the warning.

The init function send_signal_irq_work_init() should work without
CONFIG_MODULES. Moving it out of CONFIG_MODULES
code section fixed the compiler warning, and also make bpf_send_signal()
helper work without CONFIG_MODULES.

Fixes: 8b401f9ed2 ("bpf: implement bpf_send_signal() helper")
Reported-By: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 14:44:07 +02:00
Christoph Hellwig
d98849aff8 dma-direct: handle DMA_ATTR_NO_KERNEL_MAPPING in common code
DMA_ATTR_NO_KERNEL_MAPPING is generally implemented by allocating
normal cacheable pages or CMA memory, and then returning the page
pointer as the opaque handle.  Lift that code from the xtensa and
generic dma remapping implementations into the generic dma-direct
code so that we don't even call arch_dma_alloc for these allocations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-25 14:28:05 +02:00
Christoph Hellwig
c2f2124e0d dma-direct: handle DMA_ATTR_NON_CONSISTENT in common code
Only call into arch_dma_alloc if we require an uncached mapping,
and remove the parisc code manually doing normal cached
DMA_ATTR_NON_CONSISTENT allocations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Helge Deller <deller@gmx.de> # parisc
2019-06-25 14:27:58 +02:00
Toshiaki Makita
e7d4798960 xdp: Add tracepoint for bulk XDP_TX
This is introduced for admins to check what is happening on XDP_TX when
bulk XDP_TX is in use, which will be first introduced in veth in next
commit.

v3:
- Add act field to be in line with other XDP tracepoints.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-25 14:26:50 +02:00
Kobe Wu
9156e54576 locking/lockdep: increase size of counters for lockdep statistics
When system has been running for a long time, signed integer
counters are not enough for some lockdep statistics. Using
unsigned long counters can satisfy the requirement. Besides,
most of lockdep statistics are unsigned. It is better to use
unsigned int instead of int.

Remove unused variables.
- max_recursion_depth
- nr_cyclic_check_recursions
- nr_find_usage_forwards_recursions
- nr_find_usage_backwards_recursions

Signed-off-by: Kobe Wu <kobe-cp.wu@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <linux-mediatek@lists.infradead.org>
Cc: <wsd_upstream@mediatek.com>
Cc: Eason Lin <eason-yh.lin@mediatek.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: https://lkml.kernel.org/r/1561365348-16050-1-git-send-email-kobe-cp.wu@mediatek.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-25 10:17:08 +02:00
Arnd Bergmann
886532aee3 locking/lockdep: Move mark_lock() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING
The last cleanup patch triggered another issue, as now another function
should be moved into the same section:

 kernel/locking/lockdep.c:3580:12: error: 'mark_lock' defined but not used [-Werror=unused-function]
  static int mark_lock(struct task_struct *curr, struct held_lock *this,

Move mark_lock() into the same #ifdef section as its only caller, and
remove the now-unused mark_lock_irq() stub helper.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yuyang Du <duyuyang@gmail.com>
Fixes: 0d2cc3b345 ("locking/lockdep: Move valid_state() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING")
Link: https://lkml.kernel.org/r/20190617124718.1232976-1-arnd@arndb.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-25 10:17:07 +02:00
Christoph Hellwig
4b85faed21 dma-mapping: add a dma_alloc_need_uncached helper
Check if we need to allocate uncached memory for a device given the
allocation flags.  Switch over the uncached segment check to this helper
to deal with architectures that do not support the dma_cache_sync
operation and thus should not returned cacheable memory for
DMA_ATTR_NON_CONSISTENT allocations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-25 08:14:43 +02:00
Christoph Hellwig
4a54d16f61 dma-mapping: truncate dma masks to what dma_addr_t can hold
The dma masks in struct device are always 64-bits wide.  But for builds
using a 32-bit dma_addr_t we need to ensure we don't store an
unsupportable value.  Before Linux 5.0 this was handled at least by
the ARM dma mapping code by never allowing to set a larger dma_mask,
but these days we allow the driver to just set the largest supported
value and never fall back to a smaller one.  Ensure this always works
by truncating the value.

Fixes: 9eb9e96e97 ("Documentation/DMA-API-HOWTO: update dma_mask sections")
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-25 07:54:06 +02:00
Ian Rogers
fd7d55172d perf/cgroups: Don't rotate events for cgroups unnecessarily
Currently perf_rotate_context assumes that if the context's nr_events !=
nr_active a rotation is necessary for perf event multiplexing. With
cgroups, nr_events is the total count of events for all cgroups and
nr_active will not include events in a cgroup other than the current
task's. This makes rotation appear necessary for cgroups when it is not.

Add a perf_event_context flag that is set when rotation is necessary.
Clear the flag during sched_out and set it when a flexible sched_in
fails due to resources.

Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lkml.kernel.org/r/20190601082722.44543-1-irogers@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:30:04 +02:00
Ingo Molnar
b9271f0c65 Linux 5.2-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl0Os1seHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGtx4H/j6i482XzcGFKTBm
 A7mBoQpy+kLtoUov4EtBAR62OuwI8rsahW9di37QKndPoQrczWaKBmr3De6LCdPe
 v3pl3O6wBbvH5ru+qBPFX9PdNbDvimEChh7LHxmMxNQq3M+AjZAZVJyfpoiFnx35
 Fbge+LZaH/k8HMwZmkMr5t9Mpkip715qKg2o9Bua6dkH0AqlcpLlC8d9a+HIVw/z
 aAsyGSU8jRwhoAOJsE9bJf0acQ/pZSqmFp0rDKqeFTSDMsbDRKLGq/dgv4nW0RiW
 s7xqsjb/rdcvirRj3rv9+lcTVkOtEqwk0PVdL9WOf7g4iYrb3SOIZh8ZyViaDSeH
 VTS5zps=
 =huBY
 -----END PGP SIGNATURE-----

Merge tag 'v5.2-rc6' into perf/core, to refresh branch

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:25:52 +02:00
Patrick Bellasi
af24bde8df sched/uclamp: Add uclamp support to energy_compute()
The Energy Aware Scheduler (EAS) estimates the energy impact of waking
up a task on a given CPU. This estimation is based on:

 a) an (active) power consumption defined for each CPU frequency
 b) an estimation of which frequency will be used on each CPU
 c) an estimation of the busy time (utilization) of each CPU

Utilization clamping can affect both b) and c).

A CPU is expected to run:

 - on an higher than required frequency, but for a shorter time, in case
   its estimated utilization will be smaller than the minimum utilization
   enforced by uclamp
 - on a smaller than required frequency, but for a longer time, in case
   its estimated utilization is bigger than the maximum utilization
   enforced by uclamp

While compute_energy() already accounts clamping effects on busy time,
the clamping effects on frequency selection are currently ignored.

Fix it by considering how CPU clamp values will be affected by a
task waking up and being RUNNABLE on that CPU.

Do that by refactoring schedutil_freq_util() to take an additional
task_struct* which allows EAS to evaluate the impact on clamp values of
a task being eventually queued in a CPU. Clamp values are applied to the
RT+CFS utilization only when a FREQUENCY_UTIL is required by
compute_energy().

Do note that switching from ENERGY_UTIL to FREQUENCY_UTIL in the
computation of the cpu_util signal implies that we are more likely to
estimate the highest OPP when a RT task is running in another CPU of
the same performance domain. This can have an impact on energy
estimation but:

 - it's not easy to say which approach is better, since it depends on
   the use case
 - the original approach could still be obtained by setting a smaller
   task-specific util_min whenever required

Since we are at that:

 - rename schedutil_freq_util() into schedutil_cpu_util(),
   since it's not only used for frequency selection.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-12-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:49 +02:00
Patrick Bellasi
9d20ad7dfc sched/uclamp: Add uclamp_util_with()
So far uclamp_util() allows to clamp a specified utilization considering
the clamp values requested by RUNNABLE tasks in a CPU. For the Energy
Aware Scheduler (EAS) it is interesting to test how clamp values will
change when a task is becoming RUNNABLE on a given CPU.
For example, EAS is interested in comparing the energy impact of
different scheduling decisions and the clamp values can play a role on
that.

Add uclamp_util_with() which allows to clamp a given utilization by
considering the possible impact on CPU clamp values of a specified task.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-11-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:48 +02:00
Patrick Bellasi
982d9cdc22 sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks
Each time a frequency update is required via schedutil, a frequency is
selected to (possibly) satisfy the utilization reported by each
scheduling class and irqs. However, when utilization clamping is in use,
the frequency selection should consider userspace utilization clamping
hints.  This will allow, for example, to:

 - boost tasks which are directly affecting the user experience
   by running them at least at a minimum "requested" frequency

 - cap low priority tasks not directly affecting the user experience
   by running them only up to a maximum "allowed" frequency

These constraints are meant to support a per-task based tuning of the
frequency selection thus supporting a fine grained definition of
performance boosting vs energy saving strategies in kernel space.

Add support to clamp the utilization of RUNNABLE FAIR and RT tasks
within the boundaries defined by their aggregated utilization clamp
constraints.

Do that by considering the max(min_util, max_util) to give boosted tasks
the performance they need even when they happen to be co-scheduled with
other capped tasks.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-10-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:48 +02:00
Patrick Bellasi
1a00d99997 sched/uclamp: Set default clamps for RT tasks
By default FAIR tasks start without clamps, i.e. neither boosted nor
capped, and they run at the best frequency matching their utilization
demand.  This default behavior does not fit RT tasks which instead are
expected to run at the maximum available frequency, if not otherwise
required by explicitly capping them.

Enforce the correct behavior for RT tasks by setting util_min to max
whenever:

 1. the task is switched to the RT class and it does not already have a
    user-defined clamp value assigned.

 2. an RT task is forked from a parent with RESET_ON_FORK set.

NOTE: utilization clamp values are cross scheduling class attributes and
thus they are never changed/reset once a value has been explicitly
defined from user-space.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-9-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:47 +02:00
Patrick Bellasi
a87498ace5 sched/uclamp: Reset uclamp values on RESET_ON_FORK
A forked tasks gets the same clamp values of its parent however, when
the RESET_ON_FORK flag is set on parent, e.g. via:

   sys_sched_setattr()
      sched_setattr()
         __sched_setscheduler(attr::SCHED_FLAG_RESET_ON_FORK)

the new forked task is expected to start with all attributes reset to
default values.

Do that for utilization clamp values too by checking the reset request
from the existing uclamp_fork() call which already provides the required
initialization for other uclamp related bits.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-8-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:47 +02:00
Patrick Bellasi
a509a7cd79 sched/uclamp: Extend sched_setattr() to support utilization clamping
The SCHED_DEADLINE scheduling class provides an advanced and formal
model to define tasks requirements that can translate into proper
decisions for both task placements and frequencies selections. Other
classes have a more simplified model based on the POSIX concept of
priorities.

Such a simple priority based model however does not allow to exploit
most advanced features of the Linux scheduler like, for example, driving
frequencies selection via the schedutil cpufreq governor. However, also
for non SCHED_DEADLINE tasks, it's still interesting to define tasks
properties to support scheduler decisions.

Utilization clamping exposes to user-space a new set of per-task
attributes the scheduler can use as hints about the expected/required
utilization for a task. This allows to implement a "proactive" per-task
frequency control policy, a more advanced policy than the current one
based just on "passive" measured task utilization. For example, it's
possible to boost interactive tasks (e.g. to get better performance) or
cap background tasks (e.g. to be more energy/thermal efficient).

Introduce a new API to set utilization clamping values for a specified
task by extending sched_setattr(), a syscall which already allows to
define task specific properties for different scheduling classes. A new
pair of attributes allows to specify a minimum and maximum utilization
the scheduler can consider for a task.

Do that by validating the required clamp values before and then applying
the required changes using _the_ same pattern already in use for
__setscheduler(). This ensures that the task is re-enqueued with the new
clamp values.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-7-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:46 +02:00
Patrick Bellasi
1d6362fa0c sched/core: Allow sched_setattr() to use the current policy
The sched_setattr() syscall mandates that a policy is always specified.
This requires to always know which policy a task will have when
attributes are configured and this makes it impossible to add more
generic task attributes valid across different scheduling policies.
Reading the policy before setting generic tasks attributes is racy since
we cannot be sure it is not changed concurrently.

Introduce the required support to change generic task attributes without
affecting the current task policy. This is done by adding an attribute flag
(SCHED_FLAG_KEEP_POLICY) to enforce the usage of the current policy.

Add support for the SETPARAM_POLICY policy, which is already used by the
sched_setparam() POSIX syscall, to the sched_setattr() non-POSIX
syscall.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-6-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:46 +02:00
Patrick Bellasi
e8f14172c6 sched/uclamp: Add system default clamps
Tasks without a user-defined clamp value are considered not clamped
and by default their utilization can have any value in the
[0..SCHED_CAPACITY_SCALE] range.

Tasks with a user-defined clamp value are allowed to request any value
in that range, and the required clamp is unconditionally enforced.
However, a "System Management Software" could be interested in limiting
the range of clamp values allowed for all tasks.

Add a privileged interface to define a system default configuration via:

  /proc/sys/kernel/sched_uclamp_util_{min,max}

which works as an unconditional clamp range restriction for all tasks.

With the default configuration, the full SCHED_CAPACITY_SCALE range of
values is allowed for each clamp index. Otherwise, the task-specific
clamp is capped by the corresponding system default value.

Do that by tracking, for each task, the "effective" clamp value and
bucket the task has been refcounted in at enqueue time. This
allows to lazy aggregate "requested" and "system default" values at
enqueue time and simplifies refcounting updates at dequeue time.

The cached bucket ids are used to avoid (relatively) more expensive
integer divisions every time a task is enqueued.

An active flag is used to report when the "effective" value is valid and
thus the task is actually refcounted in the corresponding rq's bucket.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-5-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:45 +02:00
Patrick Bellasi
e496187da7 sched/uclamp: Enforce last task's UCLAMP_MAX
When a task sleeps it removes its max utilization clamp from its CPU.
However, the blocked utilization on that CPU can be higher than the max
clamp value enforced while the task was running. This allows undesired
CPU frequency increases while a CPU is idle, for example, when another
CPU on the same frequency domain triggers a frequency update, since
schedutil can now see the full not clamped blocked utilization of the
idle CPU.

Fix this by using:

  uclamp_rq_dec_id(p, rq, UCLAMP_MAX)
    uclamp_rq_max_value(rq, UCLAMP_MAX, clamp_value)

to detect when a CPU has no more RUNNABLE clamped tasks and to flag this
condition.

Don't track any minimum utilization clamps since an idle CPU never
requires a minimum frequency. The decay of the blocked utilization is
good enough to reduce the CPU frequency.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-4-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:45 +02:00
Patrick Bellasi
60daf9c194 sched/uclamp: Add bucket local max tracking
Because of bucketization, different task-specific clamp values are
tracked in the same bucket.  For example, with 20% bucket size and
assuming to have:

  Task1: util_min=25%
  Task2: util_min=35%

both tasks will be refcounted in the [20..39]% bucket and always boosted
only up to 20% thus implementing a simple floor aggregation normally
used in histograms.

In systems with only few and well-defined clamp values, it would be
useful to track the exact clamp value required by a task whenever
possible. For example, if a system requires only 23% and 47% boost
values then it's possible to track the exact boost required by each
task using only 3 buckets of ~33% size each.

Introduce a mechanism to max aggregate the requested clamp values of
RUNNABLE tasks in the same bucket. Keep it simple by resetting the
bucket value to its base value only when a bucket becomes inactive.
Allow a limited and controlled overboosting margin for tasks recounted
in the same bucket.

In systems where the boost values are not known in advance, it is still
possible to control the maximum acceptable overboosting margin by tuning
the number of clamp groups. For example, 20 groups ensure a 5% maximum
overboost.

Remove the rq bucket initialization code since a correct bucket value
is now computed when a task is refcounted into a CPU's rq.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-3-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:44 +02:00
Patrick Bellasi
69842cba9a sched/uclamp: Add CPU's clamp buckets refcounting
Utilization clamping allows to clamp the CPU's utilization within a
[util_min, util_max] range, depending on the set of RUNNABLE tasks on
that CPU. Each task references two "clamp buckets" defining its minimum
and maximum (util_{min,max}) utilization "clamp values". A CPU's clamp
bucket is active if there is at least one RUNNABLE tasks enqueued on
that CPU and refcounting that bucket.

When a task is {en,de}queued {on,from} a rq, the set of active clamp
buckets on that CPU can change. If the set of active clamp buckets
changes for a CPU a new "aggregated" clamp value is computed for that
CPU. This is because each clamp bucket enforces a different utilization
clamp value.

Clamp values are always MAX aggregated for both util_min and util_max.
This ensures that no task can affect the performance of other
co-scheduled tasks which are more boosted (i.e. with higher util_min
clamp) or less capped (i.e. with higher util_max clamp).

A task has:
   task_struct::uclamp[clamp_id]::bucket_id
to track the "bucket index" of the CPU's clamp bucket it refcounts while
enqueued, for each clamp index (clamp_id).

A runqueue has:
   rq::uclamp[clamp_id]::bucket[bucket_id].tasks
to track how many RUNNABLE tasks on that CPU refcount each
clamp bucket (bucket_id) of a clamp index (clamp_id).
It also has a:
   rq::uclamp[clamp_id]::bucket[bucket_id].value
to track the clamp value of each clamp bucket (bucket_id) of a clamp
index (clamp_id).

The rq::uclamp::bucket[clamp_id][] array is scanned every time it's
needed to find a new MAX aggregated clamp value for a clamp_id. This
operation is required only when it's dequeued the last task of a clamp
bucket tracking the current MAX aggregated clamp value. In this case,
the CPU is either entering IDLE or going to schedule a less boosted or
more clamped task.
The expected number of different clamp values configured at build time
is small enough to fit the full unordered array into a single cache
line, for configurations of up to 7 buckets.

Add to struct rq the basic data structures required to refcount the
number of RUNNABLE tasks for each clamp bucket. Add also the max
aggregation required to update the rq's clamp value at each
enqueue/dequeue event.

Use a simple linear mapping of clamp values into clamp buckets.
Pre-compute and cache bucket_id to avoid integer divisions at
enqueue/dequeue time.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-2-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:44 +02:00
Dietmar Eggemann
a3df067974 sched/fair: Rename weighted_cpuload() to cpu_runnable_load()
The term 'weighted' is not needed since there is no 'unweighted' load.
Instead use the term 'runnable' to distinguish 'runnable' load
(avg.runnable_load_avg) used in load balance from load (avg.load_avg)
which is the sum of 'runnable' and 'blocked' load.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/57f27a7f-2775-d832-e965-0f4d51bb1954@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:43 +02:00
Qais Yousef
a056a5bed7 sched/debug: Export the newly added tracepoints
So that external modules can hook into them and extract the info they
need. Since these new tracepoints have no events associated with them
exporting these tracepoints make them useful for external modules to
perform testing and debugging. There's no other way otherwise to access
them.

BPF doesn't have infrastructure to access these bare tracepoints either.

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pavankumar Kondeti <pkondeti@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uwe Kleine-Konig <u.kleine-koenig@pengutronix.de>
Link: https://lkml.kernel.org/r/20190604111459.2862-7-qais.yousef@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:43 +02:00
Qais Yousef
f9f240f96e sched/debug: Add sched_overutilized tracepoint
The new tracepoint allows us to track the changes in overutilized
status.

Overutilized status is associated with EAS. It indicates that the system
is in high performance state. EAS is disabled when the system is in this
state since there's not much energy savings while high performance tasks
are pushing the system to the limit and it's better to default to the
spreading behavior of the scheduler.

This tracepoint helps understanding and debugging the conditions under
which this happens.

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pavankumar Kondeti <pkondeti@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uwe Kleine-Konig <u.kleine-koenig@pengutronix.de>
Link: https://lkml.kernel.org/r/20190604111459.2862-6-qais.yousef@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:42 +02:00
Qais Yousef
8de6242cca sched/debug: Add new tracepoint to track PELT at se level
The new tracepoint allows tracking PELT signals at sched_entity level.
Which is supported in CFS tasks and taskgroups only.

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pavankumar Kondeti <pkondeti@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uwe Kleine-Konig <u.kleine-koenig@pengutronix.de>
Link: https://lkml.kernel.org/r/20190604111459.2862-5-qais.yousef@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:42 +02:00
Qais Yousef
ba19f51fcb sched/debug: Add new tracepoints to track PELT at rq level
The new tracepoints allow tracking PELT signals at rq level for all
scheduling classes + irq.

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pavankumar Kondeti <pkondeti@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uwe Kleine-Konig <u.kleine-koenig@pengutronix.de>
Link: https://lkml.kernel.org/r/20190604111459.2862-4-qais.yousef@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:41 +02:00
Qais Yousef
3c93a0c04d sched/debug: Add a new sched_trace_*() helper functions
The new functions allow modules to access internal data structures of
unexported struct cfs_rq and struct rq to extract important information
from the tracepoints to be introduced in later patches.

While at it fix alphabetical order of struct declarations in sched.h

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pavankumar Kondeti <pkondeti@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uwe Kleine-Konig <u.kleine-koenig@pengutronix.de>
Link: https://lkml.kernel.org/r/20190604111459.2862-3-qais.yousef@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:41 +02:00
Qais Yousef
9ba5090aec sched/autogroup: Make autogroup_path() always available
Remove the #ifdef CONFIG_SCHED_DEBUG.

Some of the tracepoints to be introduced in later patches need to access
this function. Hence make it always available since the tracepoints are
not protected by CONFIG_SCHED_DEBUG.

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pavankumar Kondeti <pkondeti@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uwe Kleine-Konig <u.kleine-koenig@pengutronix.de>
Link: https://lkml.kernel.org/r/20190604111459.2862-2-qais.yousef@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:40 +02:00
Pavel Begunkov
016190a4b5 sched/wait: Deduplicate code with do-while
Statements in the loop's body and before it are identical.
Use do-while to not repeat it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/43ffea6ee2152b90dedf962eac851609e4197218.1560256112.git.asml.silence@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:40 +02:00
Vincent Guittot
8ec59c0f5f sched/topology: Remove unused 'sd' parameter from arch_scale_cpu_capacity()
The 'struct sched_domain *sd' parameter to arch_scale_cpu_capacity() is
unused since commit:

  765d0af19f ("sched/topology: Remove the ::smt_gain field from 'struct sched_domain'")

Remove it.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: gregkh@linuxfoundation.org
Cc: linux@armlinux.org.uk
Cc: quentin.perret@arm.com
Cc: rafael@kernel.org
Link: https://lkml.kernel.org/r/1560783617-5827-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:39 +02:00
Ingo Molnar
d2abae71eb Linux 5.2-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl0Os1seHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGtx4H/j6i482XzcGFKTBm
 A7mBoQpy+kLtoUov4EtBAR62OuwI8rsahW9di37QKndPoQrczWaKBmr3De6LCdPe
 v3pl3O6wBbvH5ru+qBPFX9PdNbDvimEChh7LHxmMxNQq3M+AjZAZVJyfpoiFnx35
 Fbge+LZaH/k8HMwZmkMr5t9Mpkip715qKg2o9Bua6dkH0AqlcpLlC8d9a+HIVw/z
 aAsyGSU8jRwhoAOJsE9bJf0acQ/pZSqmFp0rDKqeFTSDMsbDRKLGq/dgv4nW0RiW
 s7xqsjb/rdcvirRj3rv9+lcTVkOtEqwk0PVdL9WOf7g4iYrb3SOIZh8ZyViaDSeH
 VTS5zps=
 =huBY
 -----END PGP SIGNATURE-----

Merge tag 'v5.2-rc6' into sched/core, to refresh the branch

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:19:53 +02:00
Kan Liang
e321d02db8 perf/x86: Disable extended registers for non-supported PMUs
The perf fuzzer caused Skylake machine to crash:

[ 9680.085831] Call Trace:
[ 9680.088301]  <IRQ>
[ 9680.090363]  perf_output_sample_regs+0x43/0xa0
[ 9680.094928]  perf_output_sample+0x3aa/0x7a0
[ 9680.099181]  perf_event_output_forward+0x53/0x80
[ 9680.103917]  __perf_event_overflow+0x52/0xf0
[ 9680.108266]  ? perf_trace_run_bpf_submit+0xc0/0xc0
[ 9680.113108]  perf_swevent_hrtimer+0xe2/0x150
[ 9680.117475]  ? check_preempt_wakeup+0x181/0x230
[ 9680.122091]  ? check_preempt_curr+0x62/0x90
[ 9680.126361]  ? ttwu_do_wakeup+0x19/0x140
[ 9680.130355]  ? try_to_wake_up+0x54/0x460
[ 9680.134366]  ? reweight_entity+0x15b/0x1a0
[ 9680.138559]  ? __queue_work+0x103/0x3f0
[ 9680.142472]  ? update_dl_rq_load_avg+0x1cd/0x270
[ 9680.147194]  ? timerqueue_del+0x1e/0x40
[ 9680.151092]  ? __remove_hrtimer+0x35/0x70
[ 9680.155191]  __hrtimer_run_queues+0x100/0x280
[ 9680.159658]  hrtimer_interrupt+0x100/0x220
[ 9680.163835]  smp_apic_timer_interrupt+0x6a/0x140
[ 9680.168555]  apic_timer_interrupt+0xf/0x20
[ 9680.172756]  </IRQ>

The XMM registers can only be collected by PEBS hardware events on the
platforms with PEBS baseline support, e.g. Icelake, not software/probe
events.

Add capabilities flag PERF_PMU_CAP_EXTENDED_REGS to indicate the PMU
which support extended registers. For X86, the extended registers are
XMM registers.

Add has_extended_regs() to check if extended registers are applied.

The generic code define the mask of extended registers as 0 if arch
headers haven't overridden it.

Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 878068ea27 ("perf/x86: Support outputting XMM registers")
Link: https://lkml.kernel.org/r/1559081314-9714-1-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:19:23 +02:00
Ravi Bangoria
913a90bc5a perf/ioctl: Add check for the sample_period value
perf_event_open() limits the sample_period to 63 bits. See:

  0819b2e30c ("perf: Limit perf_event_attr::sample_period to 63 bits")

Make ioctl() consistent with it.

Also on PowerPC, negative sample_period could cause a recursive
PMIs leading to a hang (reported when running perf-fuzzer).

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: maddy@linux.vnet.ibm.com
Cc: mpe@ellerman.id.au
Fixes: 0819b2e30c ("perf: Limit perf_event_attr::sample_period to 63 bits")
Link: https://lkml.kernel.org/r/20190604042953.914-1-ravi.bangoria@linux.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:19:22 +02:00
Stanislav Fomichev
e4f0712021 bpf: fix NULL deref in btf_type_is_resolve_source_only
Commit 1dc9285184 ("bpf: kernel side support for BTF Var and DataSec")
added invocations of btf_type_is_resolve_source_only before
btf_type_nosize_or_null which checks for the NULL pointer.
Swap the order of btf_type_nosize_or_null and
btf_type_is_resolve_source_only to make sure the do the NULL pointer
check first.

Fixes: 1dc9285184 ("bpf: kernel side support for BTF Var and DataSec")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-24 15:53:19 +02:00
Dmitry V. Levin
9014143bab
fork: don't check parent_tidptr with CLONE_PIDFD
Give userspace a cheap and reliable way to tell whether CLONE_PIDFD is
supported by the kernel or not. The easiest way is to pass an invalid
file descriptor value in parent_tidptr, perform the syscall and verify
that parent_tidptr has been changed to a valid file descriptor value.

CLONE_PIDFD uses parent_tidptr to return pidfds. CLONE_PARENT_SETTID
will use parent_tidptr to return the tid of the parent. The two flags
cannot be used together. Old kernels that only support
CLONE_PARENT_SETTID will not verify the value pointed to by
parent_tidptr. This behavior is unchanged even with the introduction of
CLONE_PIDFD.
However, if CLONE_PIDFD is specified the kernel will currently check the
value pointed to by parent_tidptr before placing the pidfd in the memory
pointed to. EINVAL will be returned if the value in parent_tidptr is not
0.

If CLONE_PIDFD is supported and fd 0 is closed, then the returned pidfd
can and likely will be 0 and parent_tidptr will be unchanged. This means
userspace must either check CLONE_PIDFD support beforehand or check that
fd 0 is not closed when invoking CLONE_PIDFD.

The check for pidfd == 0 was introduced during the v5.2 merge window by
commit b3e5838252 ("clone: add CLONE_PIDFD") to ensure that
CLONE_PIDFD could be potentially extended by passing in flags through
the return argument.

However, that extension would look horrible, and with the upcoming
introduction of the clone3 syscall in v5.3 there is no need to extend
legacy clone syscall this way. (Even if it would need to be extended,
CLONE_DETACHED can be reused with CLONE_PIDFD.)

So remove the pidfd == 0 check. Userspace that needs to be portable to
kernels without CLONE_PIDFD support can then be advised to initialize
pidfd to -1 and check the pidfd value returned by CLONE_PIDFD.

Fixes: b3e5838252 ("clone: add CLONE_PIDFD")
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Signed-off-by: Christian Brauner <christian@brauner.io>
2019-06-24 15:52:54 +02:00
Matthias Schiffer
38b37d631a module: allow arch overrides for .exit section names
Some archs like ARM store unwind information for .exit.text in sections
with unusual names. As this unwind information refers to .exit.text, it
must not be loaded when .exit.text is not loaded (when CONFIG_MODULE_UNLOAD
is unset); otherwise, loading a module can fail due to relocation failures.

Signed-off-by: Matthias Schiffer <matthias.schiffer@ew.tq-group.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2019-06-24 14:00:32 +02:00
Yang Yingliang
2eef1399a8 modules: fix BUG when load module with rodata=n
When loading a module with rodata=n, it causes an executing
NX-protected page BUG.

[   32.379191] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[   32.382917] BUG: unable to handle page fault for address: ffffffffc0005000
[   32.385947] #PF: supervisor instruction fetch in kernel mode
[   32.387662] #PF: error_code(0x0011) - permissions violation
[   32.389352] PGD 240c067 P4D 240c067 PUD 240e067 PMD 421a52067 PTE 8000000421a53063
[   32.391396] Oops: 0011 [#1] SMP PTI
[   32.392478] CPU: 7 PID: 2697 Comm: insmod Tainted: G           O      5.2.0-rc5+ #202
[   32.394588] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
[   32.398157] RIP: 0010:ko_test_init+0x0/0x1000 [ko_test]
[   32.399662] Code: Bad RIP value.
[   32.400621] RSP: 0018:ffffc900029f3ca8 EFLAGS: 00010246
[   32.402171] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[   32.404332] RDX: 00000000000004c7 RSI: 0000000000000cc0 RDI: ffffffffc0005000
[   32.406347] RBP: ffffffffc0005000 R08: ffff88842fbebc40 R09: ffffffff810ede4a
[   32.408392] R10: ffffea00108e3480 R11: 0000000000000000 R12: ffff88842bee21a0
[   32.410472] R13: 0000000000000001 R14: 0000000000000001 R15: ffffc900029f3e78
[   32.412609] FS:  00007fb4f0c0a700(0000) GS:ffff88842fbc0000(0000) knlGS:0000000000000000
[   32.414722] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   32.416290] CR2: ffffffffc0004fd6 CR3: 0000000421a90004 CR4: 0000000000020ee0
[   32.418471] Call Trace:
[   32.419136]  do_one_initcall+0x41/0x1df
[   32.420199]  ? _cond_resched+0x10/0x40
[   32.421433]  ? kmem_cache_alloc_trace+0x36/0x160
[   32.422827]  do_init_module+0x56/0x1f7
[   32.423946]  load_module+0x1e67/0x2580
[   32.424947]  ? __alloc_pages_nodemask+0x150/0x2c0
[   32.426413]  ? map_vm_area+0x2d/0x40
[   32.427530]  ? __vmalloc_node_range+0x1ef/0x260
[   32.428850]  ? __do_sys_init_module+0x135/0x170
[   32.430060]  ? _cond_resched+0x10/0x40
[   32.431249]  __do_sys_init_module+0x135/0x170
[   32.432547]  do_syscall_64+0x43/0x120
[   32.433853]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Because if rodata=n, set_memory_x() can't be called, fix this by
calling set_memory_x in complete_formation();

Fixes: f2c65fb322 ("x86/modules: Avoid breaking W^X while loading modules")
Suggested-by: Jian Cheng <cj.chengjian@huawei.com>
Reviewed-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2019-06-24 12:11:41 +02:00
Muchun Song
8afecaa68d softirq: Use __this_cpu_write() in takeover_tasklets()
The code is executed with interrupts disabled, so it's safe to use
__this_cpu_write().

[ tglx: Massaged changelog ]

Signed-off-by: Muchun Song <smuchun@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: joel@joelfernandes.org
Cc: rostedt@goodmis.org
Cc: frederic@kernel.org
Cc: paulmck@linux.vnet.ibm.com
Cc: alexander.levin@verizon.com
Cc: peterz@infradead.org
Link: https://lkml.kernel.org/r/20190618143305.2038-1-smuchun@gmail.com
2019-06-23 18:14:27 +02:00
Nadav Amit
caa759323c smp: Remove smp_call_function() and on_each_cpu() return values
The return value is fixed. Remove it and amend the callers.

[ tglx: Fixup arm/bL_switcher and powerpc/rtas ]

Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/20190613064813.8102-2-namit@vmware.com
2019-06-23 14:26:26 +02:00
Nadav Amit
a22793c79d smp: Do not mark call_function_data as shared
cfd_data is marked as shared, but although it hold pointers to shared
data structures, it is private per core.

Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@surriel.com>
Link: https://lkml.kernel.org/r/20190613064813.8102-8-namit@vmware.com
2019-06-23 14:26:25 +02:00
Nathan Huckleberry
a9314773a9 timer_list: Guard procfs specific code
With CONFIG_PROC_FS=n the following warning is emitted:

kernel/time/timer_list.c:361:36: warning: unused variable
'timer_list_sops' [-Wunused-const-variable]
   static const struct seq_operations timer_list_sops = {

Add #ifdef guard around procfs specific code.

Signed-off-by: Nathan Huckleberry <nhuck@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Cc: john.stultz@linaro.org
Cc: sboyd@kernel.org
Cc: clang-built-linux@googlegroups.com
Link: https://github.com/ClangBuiltLinux/linux/issues/534
Link: https://lkml.kernel.org/r/20190614181604.112297-1-nhuck@google.com
2019-06-23 00:08:52 +02:00
Vincenzo Frascino
44f57d788e timekeeping: Provide a generic update_vsyscall() implementation
The new generic VDSO library allows to unify the update_vsyscall[_tz]()
implementations.

Provide a generic implementation based on the x86 code and the bindings
which need to be implemented in architecture specific code.

[ tglx: Moved it into kernel/time where it belongs. Removed the pointless
  	line breaks in the stub functions. Massaged changelog ]

Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Shijith Thotton <sthotton@marvell.com>
Tested-by: Andre Przywara <andre.przywara@arm.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-mips@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Mark Salyzyn <salyzyn@android.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Huw Davies <huw@codeweavers.com>
Link: https://lkml.kernel.org/r/20190621095252.32307-4-vincenzo.frascino@arm.com
2019-06-22 21:21:06 +02:00
David S. Miller
92ad6325cb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Minor SPDX change conflict.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-22 08:59:24 -04:00
Sebastian Andrzej Siewior
7586addb99 posix-timers: Use spin_lock_irq() in itimer_delete()
itimer_delete() uses spin_lock_irqsave() to obtain a `flags' variable
which can then be passed to unlock_timer(). It uses already spin_lock
locking for the structure instead of lock_timer() because it has a timer
which can not be removed by others at this point. The cleanup is always
performed with enabled interrupts.

Use spin_lock_irq() / spin_unlock_irq() so the `flags' variable can be
removed.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190621143643.25649-3-bigeasy@linutronix.de
2019-06-22 12:14:22 +02:00
Sebastian Andrzej Siewior
12063d4310 posix-timers: Remove "it_signal = NULL" assignment in itimer_delete()
itimer_delete() is invoked during do_exit(). At this point it is the
last thread in the group dying and doing the clean up.
Since it is the last thread in the group, there can not be any other
task attempting to lock the itimer which means the NULL assignment (which
avoids lookups in __lock_timer()) is not required.

The assignment and comment was copied in commit 0e568881178ff ("[PATCH]
fix posix-timers to have proper per-process scope") from
sys_timer_delete() which was/is the syscall interface and requires the
assignment.

Remove the superfluous ->it_signal = NULL assignment.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190621143643.25649-2-bigeasy@linutronix.de
2019-06-22 12:14:22 +02:00
Jason A. Donenfeld
9285ec4c8b timekeeping: Use proper clock specifier names in functions
This makes boot uniformly boottime and tai uniformly clocktai, to
address the remaining oversights.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lkml.kernel.org/r/20190621203249.3909-2-Jason@zx2c4.com
2019-06-22 12:11:27 +02:00
Jason A. Donenfeld
0354c1a3cd timekeeping: Use proper ktime_add when adding nsecs in coarse offset
While this doesn't actually amount to a real difference, since the macro
evaluates to the same thing, every place else operates on ktime_t using
these functions, so let's not break the pattern.

Fixes: e3ff9c3678 ("timekeeping: Repair ktime_get_coarse*() granularity")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lkml.kernel.org/r/20190621203249.3909-1-Jason@zx2c4.com
2019-06-22 12:11:27 +02:00
Thomas Gleixner
6808acb57a Merge branch 'linus' into timers/core
Pick up upstream fixes for pending changes.
2019-06-22 12:07:35 +02:00
Miroslav Lichvar
d897a4ab11 ntp: Limit TAI-UTC offset
Don't allow the TAI-UTC offset of the system clock to be set by adjtimex()
to a value larger than 100000 seconds.

This prevents an overflow in the conversion to int, prevents the CLOCK_TAI
clock from getting too far ahead of the CLOCK_REALTIME clock, and it is
still large enough to allow leap seconds to be inserted at the maximum rate
currently supported by the kernel (once per day) for the next ~270 years,
however unlikely it is that someone can survive a catastrophic event which
slowed down the rotation of the Earth so much.

Reported-by: Weikang shi <swkhack@gmail.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Stephen Boyd <sboyd@kernel.org>
Link: https://lkml.kernel.org/r/20190618154713.20929-1-mlichvar@redhat.com
2019-06-22 11:28:53 +02:00
Linus Torvalds
c884d8ac7f SPDX update for 5.2-rc6
Another round of SPDX updates for 5.2-rc6
 
 Here is what I am guessing is going to be the last "big" SPDX update for
 5.2.  It contains all of the remaining GPLv2 and GPLv2+ updates that
 were "easy" to determine by pattern matching.  The ones after this are
 going to be a bit more difficult and the people on the spdx list will be
 discussing them on a case-by-case basis now.
 
 Another 5000+ files are fixed up, so our overall totals are:
 	Files checked:            64545
 	Files with SPDX:          45529
 
 Compared to the 5.1 kernel which was:
 	Files checked:            63848
 	Files with SPDX:          22576
 This is a huge improvement.
 
 Also, we deleted another 20000 lines of boilerplate license crud, always
 nice to see in a diffstat.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXQyQYA8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ymnGQCghETUBotn1p3hTjY56VEs6dGzpHMAnRT0m+lv
 kbsjBGEJpLbMRB2krnaU
 =RMcT
 -----END PGP SIGNATURE-----

Merge tag 'spdx-5.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx

Pull still more SPDX updates from Greg KH:
 "Another round of SPDX updates for 5.2-rc6

  Here is what I am guessing is going to be the last "big" SPDX update
  for 5.2. It contains all of the remaining GPLv2 and GPLv2+ updates
  that were "easy" to determine by pattern matching. The ones after this
  are going to be a bit more difficult and the people on the spdx list
  will be discussing them on a case-by-case basis now.

  Another 5000+ files are fixed up, so our overall totals are:
	Files checked:            64545
	Files with SPDX:          45529

  Compared to the 5.1 kernel which was:
	Files checked:            63848
	Files with SPDX:          22576

  This is a huge improvement.

  Also, we deleted another 20000 lines of boilerplate license crud,
  always nice to see in a diffstat"

* tag 'spdx-5.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx: (65 commits)
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 507
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 506
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 505
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 504
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 503
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 502
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 501
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 499
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 498
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 497
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 496
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 495
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 491
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 490
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 489
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 488
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 487
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 486
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 485
  ...
2019-06-21 09:58:42 -07:00
Julien Thierry
17ce302f31 arm64: Fix interrupt tracing in the presence of NMIs
In the presence of any form of instrumentation, nmi_enter() should be
done before calling any traceable code and any instrumentation code.

Currently, nmi_enter() is done in handle_domain_nmi(), which is much
too late as instrumentation code might get called before. Move the
nmi_enter/exit() calls to the arch IRQ vector handler.

On arm64, it is not possible to know if the IRQ vector handler was
called because of an NMI before acknowledging the interrupt. However, It
is possible to know whether normal interrupts could be taken in the
interrupted context (i.e. if taking an NMI in that context could
introduce a potential race condition).

When interrupting a context with IRQs disabled, call nmi_enter() as soon
as possible. In contexts with IRQs enabled, defer this to the interrupt
controller, which is in a better position to know if an interrupt taken
is an NMI.

Fixes: bc3c03ccb4 ("arm64: Enable the support of pseudo-NMIs")
Cc: <stable@vger.kernel.org> # 5.1.x-
Cc: Will Deacon <will.deacon@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2019-06-21 15:49:58 +01:00
Christoph Hellwig
474a280036 cgroup: export css_next_descendant_pre for bfq
The bfq schedule now uses css_next_descendant_pre directly after
the stats functionality depending on it has been from the core
blk-cgroup code to bfq.  Export the symbol so that bfq can still
be build modular.

Fixes: d6258980da ("bfq-iosched: move bfq_stat_recursive_sum into the only caller")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-21 02:48:34 -06:00
Christian Brauner
d68dbb0c9a
arch: handle arches who do not yet define clone3
This cleanly handles arches who do not yet define clone3.

clone3() was initially placed under __ARCH_WANT_SYS_CLONE under the
assumption that this would cleanly handle all architectures. It does
not.
Architectures such as nios2 or h8300 simply take the asm-generic syscall
definitions and generate their syscall table from it. Since they don't
define __ARCH_WANT_SYS_CLONE the build would fail complaining about
sys_clone3 missing. The reason this doesn't happen for legacy clone is
that nios2 and h8300 provide assembly stubs for sys_clone. This seems to
be done for architectural reasons.

The build failures for nios2 and h8300 were caught int -next luckily.
The solution is to define __ARCH_WANT_SYS_CLONE3 that architectures can
add. Additionally, we need a cond_syscall(clone3) for architectures such
as nios2 or h8300 that generate their syscall table in the way I
explained above.

Fixes: 8f3220a806 ("arch: wire-up clone3() syscall")
Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Adrian Reber <adrian@lisas.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: x86@kernel.org
2019-06-21 01:54:53 +02:00
Petr Mladek
ac59a471e9 livepatch: Remove duplicate warning about missing reliable stacktrace support
WARN_ON_ONCE() could not be called safely under rq lock because
of console deadlock issues. Moreover WARN_ON_ONCE() is superfluous in
klp_check_stack(), because stack_trace_save_tsk_reliable() cannot return
-ENOSYS thanks to klp_have_reliable_stack() check in
klp_try_switch_task().

[ mbenes: changelog edited ]
Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-06-20 16:19:13 +02:00
Miroslav Benes
67059d65f7 Revert "livepatch: Remove reliable stacktrace check in klp_try_switch_task()"
This reverts commit 1d98a69e5c. Commit
31adf2308f ("livepatch: Convert error about unsupported reliable
stacktrace into a warning") weakened the enforcement for architectures
to have reliable stack traces support. The system only warns now about
it.

It only makes sense to reintroduce the compile time checking in
klp_try_switch_task() again and bail out early.

Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-06-20 16:08:41 +02:00
Miroslav Benes
380178ef7f stacktrace: Remove weak version of save_stack_trace_tsk_reliable()
Recent rework of stack trace infrastructure introduced a new set of
helpers for common stack trace operations (commit e9b98e162a
("stacktrace: Provide helpers for common stack trace operations") and
related). As a result, save_stack_trace_tsk_reliable() is not directly
called anywhere. Livepatch, currently the only user of the reliable
stack trace feature, now calls stack_trace_save_tsk_reliable().

When CONFIG_HAVE_RELIABLE_STACKTRACE is set and depending on
CONFIG_ARCH_STACKWALK, stack_trace_save_tsk_reliable() calls either
arch_stack_walk_reliable() or mentioned save_stack_trace_tsk_reliable().
x86_64 defines the former, ppc64le the latter. All other architectures
do not have HAVE_RELIABLE_STACKTRACE and include/linux/stacktrace.h
defines -ENOSYS returning version for them.

In short, stack_trace_save_tsk_reliable() returning -ENOSYS defined in
include/linux/stacktrace.h serves the same purpose as the old weak
version of save_stack_trace_tsk_reliable() which is therefore no longer
needed.

Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-06-20 15:43:31 +02:00
David S. Miller
dca73a65a6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2019-06-19

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) new SO_REUSEPORT_DETACH_BPF setsocktopt, from Martin.

2) BTF based map definition, from Andrii.

3) support bpf_map_lookup_elem for xskmap, from Jonathan.

4) bounded loops and scalar precision logic in the verifier, from Alexei.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-20 00:06:27 -04:00
Paul E. McKenney
11ca7a9d54 Merge branches 'consolidate.2019.05.28a', 'doc.2019.05.28a', 'fixes.2019.06.13a', 'srcu.2019.05.28a', 'sync.2019.05.28a' and 'torture.2019.05.28a' into HEAD
consolidate.2019.05.28a: RCU flavor consolidation cleanups and optmizations.
doc.2019.05.28a: Documentation updates.
fixes.2019.06.13a: Miscellaneous fixes.
srcu.2019.05.28a: SRCU updates.
sync.2019.05.28a: RCU-sync flavor consolidation.
torture.2019.05.28a: Torture-test updates.
2019-06-19 09:21:46 -07:00
Jesper Dangaard Brouer
6bf071bf09 xdp: page_pool related fix to cpumap
When converting an xdp_frame into an SKB, and sending this into the network
stack, then the underlying XDP memory model need to release associated
resources, because the network stack don't have callbacks for XDP memory
models.  The only memory model that needs this is page_pool, when a driver
use the DMA-mapping feature.

Introduce page_pool_release_page(), which basically does the same as
page_pool_unmap_page(). Add xdp_release_frame() as the XDP memory model
interface for calling it, if the memory model match MEM_TYPE_PAGE_POOL, to
save the function call overhead for others. Have cpumap call
xdp_release_frame() before xdp_scrub_frame().

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 11:23:13 -04:00
David Howells
7743c48e54 keys: Cache result of request_key*() temporarily in task_struct
If a filesystem uses keys to hold authentication tokens, then it needs a
token for each VFS operation that might perform an authentication check -
either by passing it to the server, or using to perform a check based on
authentication data cached locally.

For open files this isn't a problem, since the key should be cached in the
file struct since it represents the subject performing operations on that
file descriptor.

During pathwalk, however, there isn't anywhere to cache the key, except
perhaps in the nameidata struct - but that isn't exposed to the
filesystems.  Further, a pathwalk can incur a lot of operations, calling
one or more of the following, for instance:

	->lookup()
	->permission()
	->d_revalidate()
	->d_automount()
	->get_acl()
	->getxattr()

on each dentry/inode it encounters - and each one may need to call
request_key().  And then, at the end of pathwalk, it will call the actual
operation:

	->mkdir()
	->mknod()
	->getattr()
	->open()
	...

which may need to go and get the token again.

However, it is very likely that all of the operations on a single
dentry/inode - and quite possibly a sequence of them - will all want to use
the same authentication token, which suggests that caching it would be a
good idea.

To this end:

 (1) Make it so that a positive result of request_key() and co. that didn't
     require upcalling to userspace is cached temporarily in task_struct.

 (2) The cache is 1 deep, so a new result displaces the old one.

 (3) The key is released by exit and by notify-resume.

 (4) The cache is cleared in a newly forked process.

Signed-off-by: David Howells <dhowells@redhat.com>
2019-06-19 16:10:15 +01:00
Thomas Gleixner
d2912cb15b treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500
Based on 2 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 as
  published by the free software foundation

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 as
  published by the free software foundation #

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 4122 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19 17:09:55 +02:00
Thomas Gleixner
f85d208658 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 451
Based on 1 normalized pattern(s):

  this file is subject to the terms and conditions of version 2 of the
  gnu general public license see the file copying in the main
  directory of the linux distribution for more details

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 5 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081200.872755311@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19 17:09:08 +02:00
Thomas Gleixner
8092f73c51 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 248
Based on 1 normalized pattern(s):

  this file is released under the gpl v2

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 3 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190602204655.103854853@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19 17:09:08 +02:00
Thomas Gleixner
40b0b3f8fb treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 230
Based on 2 normalized pattern(s):

  this source code is licensed under the gnu general public license
  version 2 see the file copying for more details

  this source code is licensed under general public license version 2
  see

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 52 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190602204653.449021192@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19 17:09:06 +02:00
Konrad Rzeszutek Wilk
8492101e15 Merge branch 'stable/for-linus-5.2' into devel/for-linus-5.2
* stable/for-linus-5.2:
  swiotlb: fix phys_addr_t overflow warning
2019-06-19 10:29:24 -04:00
Arnd Bergmann
9c106119f6 swiotlb: fix phys_addr_t overflow warning
On architectures that have a larger dma_addr_t than phys_addr_t,
the swiotlb_tbl_map_single() function truncates its return code
in the failure path, making it impossible to identify the error
later, as we compare to the original value:

kernel/dma/swiotlb.c:551:9: error: implicit conversion from 'dma_addr_t' (aka 'unsigned long long') to 'phys_addr_t' (aka 'unsigned int') changes value from 18446744073709551615 to 4294967295 [-Werror,-Wconstant-conversion]
        return DMA_MAPPING_ERROR;

Use an explicit typecast here to convert it to the narrower type,
and use the same expression in the error handling later.

Fixes: b907e20508 ("swiotlb: remove SWIOTLB_MAP_ERROR")
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2019-06-19 10:28:54 -04:00
Rafael J. Wysocki
0b385a0c3b PM: suspend: Rename pm_suspend_via_s2idle()
The name of pm_suspend_via_s2idle() is confusing, as it doesn't
reflect the purpose of the function precisely enough and it is
very similar to pm_suspend_via_firmware(), which has a different
purpose, so rename it as pm_suspend_default_s2idle() and update
its only caller, i8042_register_ports(), accordingly.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2019-06-19 11:41:26 +02:00
Alexei Starovoitov
b5dc0163d8 bpf: precise scalar_value tracking
Introduce precision tracking logic that
helps cilium programs the most:
                  old clang  old clang    new clang  new clang
                          with all patches         with all patches
bpf_lb-DLB_L3.o      1838     2283         1923       1863
bpf_lb-DLB_L4.o      3218     2657         3077       2468
bpf_lb-DUNKNOWN.o    1064     545          1062       544
bpf_lxc-DDROP_ALL.o  26935    23045        166729     22629
bpf_lxc-DUNKNOWN.o   34439    35240        174607     28805
bpf_netdev.o         9721     8753         8407       6801
bpf_overlay.o        6184     7901         5420       4754
bpf_lxc_jit.o        39389    50925        39389      50925

Consider code:
654: (85) call bpf_get_hash_recalc#34
655: (bf) r7 = r0
656: (15) if r8 == 0x0 goto pc+29
657: (bf) r2 = r10
658: (07) r2 += -48
659: (18) r1 = 0xffff8881e41e1b00
661: (85) call bpf_map_lookup_elem#1
662: (15) if r0 == 0x0 goto pc+23
663: (69) r1 = *(u16 *)(r0 +0)
664: (15) if r1 == 0x0 goto pc+21
665: (bf) r8 = r7
666: (57) r8 &= 65535
667: (bf) r2 = r8
668: (3f) r2 /= r1
669: (2f) r2 *= r1
670: (bf) r1 = r8
671: (1f) r1 -= r2
672: (57) r1 &= 255
673: (25) if r1 > 0x1e goto pc+12
 R0=map_value(id=0,off=0,ks=20,vs=64,imm=0) R1_w=inv(id=0,umax_value=30,var_off=(0x0; 0x1f))
674: (67) r1 <<= 1
675: (0f) r0 += r1

At this point the verifier will notice that scalar R1 is used in map pointer adjustment.
R1 has to be precise for later operations on R0 to be validated properly.

The verifier will backtrack the above code in the following way:
last_idx 675 first_idx 664
regs=2 stack=0 before 675: (0f) r0 += r1         // started backtracking R1 regs=2 is a bitmask
regs=2 stack=0 before 674: (67) r1 <<= 1
regs=2 stack=0 before 673: (25) if r1 > 0x1e goto pc+12
regs=2 stack=0 before 672: (57) r1 &= 255
regs=2 stack=0 before 671: (1f) r1 -= r2         // now both R1 and R2 has to be precise -> regs=6 mask
regs=6 stack=0 before 670: (bf) r1 = r8          // after this insn R8 and R2 has to be precise
regs=104 stack=0 before 669: (2f) r2 *= r1       // after this one R8, R2, and R1
regs=106 stack=0 before 668: (3f) r2 /= r1
regs=106 stack=0 before 667: (bf) r2 = r8
regs=102 stack=0 before 666: (57) r8 &= 65535
regs=102 stack=0 before 665: (bf) r8 = r7
regs=82 stack=0 before 664: (15) if r1 == 0x0 goto pc+21
 // this is the end of verifier state. The following regs will be marked precised:
 R1_rw=invP(id=0,umax_value=65535,var_off=(0x0; 0xffff)) R7_rw=invP(id=0)
parent didn't have regs=82 stack=0 marks         // so backtracking continues into parent state
last_idx 663 first_idx 655
regs=82 stack=0 before 663: (69) r1 = *(u16 *)(r0 +0)   // R1 was assigned no need to track it further
regs=80 stack=0 before 662: (15) if r0 == 0x0 goto pc+23    // keep tracking R7
regs=80 stack=0 before 661: (85) call bpf_map_lookup_elem#1  // keep tracking R7
regs=80 stack=0 before 659: (18) r1 = 0xffff8881e41e1b00
regs=80 stack=0 before 658: (07) r2 += -48
regs=80 stack=0 before 657: (bf) r2 = r10
regs=80 stack=0 before 656: (15) if r8 == 0x0 goto pc+29
regs=80 stack=0 before 655: (bf) r7 = r0                // here the assignment into R7
 // mark R0 to be precise:
 R0_rw=invP(id=0)
parent didn't have regs=1 stack=0 marks                 // regs=1 -> tracking R0
last_idx 654 first_idx 644
regs=1 stack=0 before 654: (85) call bpf_get_hash_recalc#34 // and in the parent frame it was a return value
  // nothing further to backtrack

Two scalar registers not marked precise are equivalent from state pruning point of view.
More details in the patch comments.

It doesn't support bpf2bpf calls yet and enabled for root only.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-19 02:22:52 +02:00
Alexei Starovoitov
eea1c227b9 bpf: fix callees pruning callers
The commit 7640ead939 partially resolved the issue of callees
incorrectly pruning the callers.
With introduction of bounded loops and jmps_processed heuristic
single verifier state may contain multiple branches and calls.
It's possible that new verifier state (for future pruning) will be
allocated inside callee. Then callee will exit (still within the same
verifier state). It will go back to the caller and there R6-R9 registers
will be read and will trigger mark_reg_read. But the reg->live for all frames
but the top frame is not set to LIVE_NONE. Hence mark_reg_read will fail
to propagate liveness into parent and future walking will incorrectly
conclude that the states are equivalent because LIVE_READ is not set.
In other words the rule for parent/live should be:
whenever register parentage chain is set the reg->live should be set to LIVE_NONE.
is_state_visited logic already follows this rule for spilled registers.

Fixes: 7640ead939 ("bpf: verifier: make sure callees don't prune with caller differences")
Fixes: f4d7e40a5b ("bpf: introduce function calls (verification)")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-19 02:22:51 +02:00
Alexei Starovoitov
2589726d12 bpf: introduce bounded loops
Allow the verifier to validate the loops by simulating their execution.
Exisiting programs have used '#pragma unroll' to unroll the loops
by the compiler. Instead let the verifier simulate all iterations
of the loop.
In order to do that introduce parentage chain of bpf_verifier_state and
'branches' counter for the number of branches left to explore.
See more detailed algorithm description in bpf_verifier.h

This algorithm borrows the key idea from Edward Cree approach:
https://patchwork.ozlabs.org/patch/877222/
Additional state pruning heuristics make such brute force loop walk
practical even for large loops.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-19 02:22:51 +02:00
Alexei Starovoitov
fb8d251ee2 bpf: extend is_branch_taken to registers
This patch extends is_branch_taken() logic from JMP+K instructions
to JMP+X instructions.
Conditional branches are often done when src and dst registers
contain known scalars. In such case the verifier can follow
the branch that is going to be taken when program executes.
That speeds up the verification and is essential feature to support
bounded loops.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-19 02:22:51 +02:00
Alexei Starovoitov
f7cf25b202 bpf: track spill/fill of constants
Compilers often spill induction variables into the stack,
hence it is necessary for the verifier to track scalar values
of the registers through stack slots.

Also few bpf programs were incorrectly rejected in the past,
since the verifier was not able to track such constants while
they were used to compute offsets into packet headers.

Tracking constants through the stack significantly decreases
the chances of state pruning, since two different constants
are considered to be different by state equivalency.
End result that cilium tests suffer serious degradation in the number
of states processed and corresponding verification time increase.

                     before  after
bpf_lb-DLB_L3.o      1838    6441
bpf_lb-DLB_L4.o      3218    5908
bpf_lb-DUNKNOWN.o    1064    1064
bpf_lxc-DDROP_ALL.o  26935   93790
bpf_lxc-DUNKNOWN.o   34439   123886
bpf_netdev.o         9721    31413
bpf_overlay.o        6184    18561
bpf_lxc_jit.o        39389   359445

After further debugging turned out that cillium progs are
getting hurt by clang due to the same constant tracking issue.
Newer clang generates better code by spilling less to the stack.
Instead it keeps more constants in the registers which
hurts state pruning since the verifier already tracks constants
in the registers:
                  old clang  new clang
                         (no spill/fill tracking introduced by this patch)
bpf_lb-DLB_L3.o      1838    1923
bpf_lb-DLB_L4.o      3218    3077
bpf_lb-DUNKNOWN.o    1064    1062
bpf_lxc-DDROP_ALL.o  26935   166729
bpf_lxc-DUNKNOWN.o   34439   174607
bpf_netdev.o         9721    8407
bpf_overlay.o        6184    5420
bpf_lcx_jit.o        39389   39389

The final table is depressing:
                  old clang  old clang    new clang  new clang
                           const spill/fill        const spill/fill
bpf_lb-DLB_L3.o      1838    6441          1923      8128
bpf_lb-DLB_L4.o      3218    5908          3077      6707
bpf_lb-DUNKNOWN.o    1064    1064          1062      1062
bpf_lxc-DDROP_ALL.o  26935   93790         166729    380712
bpf_lxc-DUNKNOWN.o   34439   123886        174607    440652
bpf_netdev.o         9721    31413         8407      31904
bpf_overlay.o        6184    18561         5420      23569
bpf_lxc_jit.o        39389   359445        39389     359445

Tracking constants in the registers hurts state pruning already.
Adding tracking of constants through stack hurts pruning even more.
The later patch address this general constant tracking issue
with coarse/precise logic.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-19 02:22:51 +02:00
David S. Miller
13091aa305 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Honestly all the conflicts were simple overlapping changes,
nothing really interesting to report.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17 20:20:36 -07:00
Gustavo A. R. Silva
f0553dcb97 tracepoint: Use struct_size() in kmalloc()
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct tp_probes {
	...
        struct tracepoint_func probes[0];
};

instance = kmalloc(sizeof(sizeof(struct tp_probes) +
			sizeof(struct tracepoint_func) * count, GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

instance = kmalloc(struct_size(instance, probes, count) GFP_KERNEL);

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-06-17 21:13:32 -04:00
Linus Torvalds
da0f382029 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Lots of bug fixes here:

   1) Out of bounds access in __bpf_skc_lookup, from Lorenz Bauer.

   2) Fix rate reporting in cfg80211_calculate_bitrate_he(), from John
      Crispin.

   3) Use after free in psock backlog workqueue, from John Fastabend.

   4) Fix source port matching in fdb peer flow rule of mlx5, from Raed
      Salem.

   5) Use atomic_inc_not_zero() in fl6_sock_lookup(), from Eric Dumazet.

   6) Network header needs to be set for packet redirect in nfp, from
      John Hurley.

   7) Fix udp zerocopy refcnt, from Willem de Bruijn.

   8) Don't assume linear buffers in vxlan and geneve error handlers,
      from Stefano Brivio.

   9) Fix TOS matching in mlxsw, from Jiri Pirko.

  10) More SCTP cookie memory leak fixes, from Neil Horman.

  11) Fix VLAN filtering in rtl8366, from Linus Walluij.

  12) Various TCP SACK payload size and fragmentation memory limit fixes
      from Eric Dumazet.

  13) Use after free in pneigh_get_next(), also from Eric Dumazet.

  14) LAPB control block leak fix from Jeremy Sowden"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (145 commits)
  lapb: fixed leak of control-blocks.
  tipc: purge deferredq list for each grp member in tipc_group_delete
  ax25: fix inconsistent lock state in ax25_destroy_timer
  neigh: fix use-after-free read in pneigh_get_next
  tcp: fix compile error if !CONFIG_SYSCTL
  hv_sock: Suppress bogus "may be used uninitialized" warnings
  be2net: Fix number of Rx queues used for flow hashing
  net: handle 802.1P vlan 0 packets properly
  tcp: enforce tcp_min_snd_mss in tcp_mtu_probing()
  tcp: add tcp_min_snd_mss sysctl
  tcp: tcp_fragment() should apply sane memory limits
  tcp: limit payload size of sacked skbs
  Revert "net: phylink: set the autoneg state in phylink_phy_change"
  bpf: fix nested bpf tracepoints with per-cpu data
  bpf: Fix out of bounds memory access in bpf_sk_storage
  vsock/virtio: set SOCK_DONE on peer shutdown
  net: dsa: rtl8366: Fix up VLAN filtering
  net: phylink: set the autoneg state in phylink_phy_change
  net: add high_order_alloc_disable sysctl/static key
  tcp: add tcp_tx_skb_cache sysctl
  ...
2019-06-17 15:55:34 -07:00
Peter Zijlstra
8dc2d993cf x86/percpu, sched/fair: Avoid local_clock()
Nadav reported that code-gen changed because of the this_cpu_*()
constraints, avoid this for select_idle_cpu() because that runs with
preemption (and IRQs) disabled anyway.

Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:43:43 +02:00
Ingo Molnar
bddb363673 Merge branch 'x86/cpu' into perf/core, to pick up dependent changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:29:16 +02:00
Waiman Long
a15ea1a35f locking/rwsem: Guard against making count negative
The upper bits of the count field is used as reader count. When
sufficient number of active readers are present, the most significant
bit will be set and the count becomes negative. If the number of active
readers keep on piling up, we may eventually overflow the reader counts.
This is not likely to happen unless the number of bits reserved for
reader count is reduced because those bits are need for other purpose.

To prevent this count overflow from happening, the most significant
bit is now treated as a guard bit (RWSEM_FLAG_READFAIL). Read-lock
attempts will now fail for both the fast and slow paths whenever this
bit is set. So all those extra readers will be put to sleep in the wait
list. Wakeup will not happen until the reader count reaches 0.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-17-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:28:11 +02:00
Waiman Long
5cfd92e12e locking/rwsem: Adaptive disabling of reader optimistic spinning
Reader optimistic spinning is helpful when the reader critical section
is short and there aren't that many readers around. It makes readers
relatively more preferred than writers. When a writer times out spinning
on a reader-owned lock and set the nospinnable bits, there are two main
reasons for that.

 1) The reader critical section is long, perhaps the task sleeps after
    acquiring the read lock.
 2) There are just too many readers contending the lock causing it to
    take a while to service all of them.

In the former case, long reader critical section will impede the progress
of writers which is usually more important for system performance.
In the later case, reader optimistic spinning tends to make the reader
groups that contain readers that acquire the lock together smaller
leading to more of them. That may hurt performance in some cases. In
other words, the setting of nonspinnable bits indicates that reader
optimistic spinning may not be helpful for those workloads that cause it.

Therefore, any writers that have observed the setting of the writer
nonspinnable bit for a given rwsem after they fail to acquire the lock
via optimistic spinning will set the reader nonspinnable bit once they
acquire the write lock. Similarly, readers that observe the setting
of reader nonspinnable bit at slowpath entry will also set the reader
nonspinnable bit when they acquire the read lock via the wakeup path.

Once the reader nonspinnable bit is on, it will only be reset when
a writer is able to acquire the rwsem in the fast path or somehow a
reader or writer in the slowpath doesn't observe the nonspinable bit.

This is to discourage reader optmistic spinning on that particular
rwsem and make writers more preferred. This adaptive disabling of reader
optimistic spinning will alleviate some of the negative side effect of
this feature.

In addition, this patch tries to make readers in the spinning queue
follow the phase-fair principle after quitting optimistic spinning
by checking if another reader has somehow acquired a read lock after
this reader enters the optimistic spinning queue. If so and the rwsem
is still reader-owned, this reader is in the right read-phase and can
attempt to acquire the lock.

On a 2-socket 40-core 80-thread Skylake system, the page_fault1 test of
the will-it-scale benchmark was run with various number of threads. The
number of operations done before reader optimistic spinning patches,
this patch and after this patch were:

  Threads  Before rspin  Before patch  After patch    %change
  -------  ------------  ------------  -----------    -------
    20        5541068      5345484       5455667    -3.5%/ +2.1%
    40       10185150      7292313       9219276   -28.5%/+26.4%
    60        8196733      6460517       7181209   -21.2%/+11.2%
    80        9508864      6739559       8107025   -29.1%/+20.3%

This patch doesn't recover all the lost performance, but it is more
than half. Given the fact that reader optimistic spinning does benefit
some workloads, this is a good compromise.

Using the rwsem locking microbenchmark with very short critical section,
this patch doesn't have too much impact on locking performance as shown
by the locking rates (kops/s) below with equal numbers of readers and
writers before and after this patch:

   # of Threads  Pre-patch    Post-patch
   ------------  ---------    ----------
        2          4,730        4,969
        4          4,814        4,786
        8          4,866        4,815
       16          4,715        4,511
       32          3,338        3,500
       64          3,212        3,389
       80          3,110        3,044

When running the locking microbenchmark with 40 dedicated reader and writer
threads, however, the reader performance is curtailed to favor the writer.

Before patch:

  40 readers, Iterations Min/Mean/Max = 204,026/234,309/254,816
  40 writers, Iterations Min/Mean/Max = 88,515/95,884/115,644

After patch:

  40 readers, Iterations Min/Mean/Max = 33,813/35,260/36,791
  40 writers, Iterations Min/Mean/Max = 95,368/96,565/97,798

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-16-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:28:09 +02:00
Waiman Long
7d43f1ce9d locking/rwsem: Enable time-based spinning on reader-owned rwsem
When the rwsem is owned by reader, writers stop optimistic spinning
simply because there is no easy way to figure out if all the readers
are actively running or not. However, there are scenarios where
the readers are unlikely to sleep and optimistic spinning can help
performance.

This patch provides a simple mechanism for spinning on a reader-owned
rwsem by a writer. It is a time threshold based spinning where the
allowable spinning time can vary from 10us to 25us depending on the
condition of the rwsem.

When the time threshold is exceeded, the nonspinnable bits will be set
in the owner field to indicate that no more optimistic spinning will
be allowed on this rwsem until it becomes writer owned again. Not even
readers is allowed to acquire the reader-locked rwsem by optimistic
spinning for fairness.

We also want a writer to acquire the lock after the readers hold the
lock for a relatively long time. In order to give preference to writers
under such a circumstance, the single RWSEM_NONSPINNABLE bit is now split
into two - one for reader and one for writer. When optimistic spinning
is disabled, both bits will be set. When the reader count drop down
to 0, the writer nonspinnable bit will be cleared to allow writers to
spin on the lock, but not the readers. When a writer acquires the lock,
it will write its own task structure pointer into sem->owner and clear
the reader nonspinnable bit in the process.

The time taken for each iteration of the reader-owned rwsem spinning
loop varies. Below are sample minimum elapsed times for 16 iterations
of the loop.

      System                 Time for 16 Iterations
      ------                 ----------------------
  1-socket Skylake                  ~800ns
  4-socket Broadwell                ~300ns
  2-socket ThunderX2 (arm64)        ~250ns

When the lock cacheline is contended, we can see up to almost 10X
increase in elapsed time.  So 25us will be at most 500, 1300 and 1600
iterations for each of the above systems.

With a locking microbenchmark running on 5.1 based kernel, the total
locking rates (in kops/s) on a 8-socket IvyBridge-EX system with
equal numbers of readers and writers before and after this patch were
as follows:

   # of Threads  Pre-patch    Post-patch
   ------------  ---------    ----------
        2          1,759        6,684
        4          1,684        6,738
        8          1,074        7,222
       16            900        7,163
       32            458        7,316
       64            208          520
      128            168          425
      240            143          474

This patch gives a big boost in performance for mixed reader/writer
workloads.

With 32 locking threads, the rwsem lock event data were:

rwsem_opt_fail=79850
rwsem_opt_nospin=5069
rwsem_opt_rlock=597484
rwsem_opt_wlock=957339
rwsem_sleep_reader=57782
rwsem_sleep_writer=55663

With 64 locking threads, the data looked like:

rwsem_opt_fail=346723
rwsem_opt_nospin=6293
rwsem_opt_rlock=1127119
rwsem_opt_wlock=1400628
rwsem_sleep_reader=308201
rwsem_sleep_writer=72281

So a lot more threads acquired the lock in the slowpath and more threads
went to sleep.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-15-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:28:07 +02:00
Waiman Long
94a9717b3c locking/rwsem: Make rwsem->owner an atomic_long_t
The rwsem->owner contains not just the task structure pointer, it also
holds some flags for storing the current state of the rwsem. Some of
the flags may have to be atomically updated. To reflect the new reality,
the owner is now changed to an atomic_long_t type.

New helper functions are added to properly separate out the task
structure pointer and the embedded flags.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-14-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:28:06 +02:00
Waiman Long
cf69482d62 locking/rwsem: Enable readers spinning on writer
This patch enables readers to optimistically spin on a
rwsem when it is owned by a writer instead of going to sleep
directly.  The rwsem_can_spin_on_owner() function is extracted
out of rwsem_optimistic_spin() and is called directly by
rwsem_down_read_slowpath() and rwsem_down_write_slowpath().

With a locking microbenchmark running on 5.1 based kernel, the total
locking rates (in kops/s) on a 8-socket IvyBrige-EX system with equal
numbers of readers and writers before and after the patch were as
follows:

   # of Threads  Pre-patch    Post-patch
   ------------  ---------    ----------
        4          1,674        1,684
        8          1,062        1,074
       16            924          900
       32            300          458
       64            195          208
      128            164          168
      240            149          143

The performance change wasn't significant in this case, but this change
is required by a follow-on patch.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-13-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:28:05 +02:00
Waiman Long
02f1082b00 locking/rwsem: Clarify usage of owner's nonspinaable bit
Bit 1 of sem->owner (RWSEM_ANONYMOUSLY_OWNED) is used to designate an
anonymous owner - readers or an anonymous writer. The setting of this
anonymous bit is used as an indicator that optimistic spinning cannot
be done on this rwsem.

With the upcoming reader optimistic spinning patches, a reader-owned
rwsem can be spinned on for a limit period of time. We still need
this bit to indicate a rwsem is nonspinnable, but not setting this
bit loses its meaning that the owner is known. So rename the bit
to RWSEM_NONSPINNABLE to clarify its meaning.

This patch also fixes a DEBUG_RWSEMS_WARN_ON() bug in __up_write().

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-12-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:28:03 +02:00
Waiman Long
d3681e269f locking/rwsem: Wake up almost all readers in wait queue
When the front of the wait queue is a reader, other readers
immediately following the first reader will also be woken up at the
same time. However, if there is a writer in between. Those readers
behind the writer will not be woken up.

Because of optimistic spinning, the lock acquisition order is not FIFO
anyway. The lock handoff mechanism will ensure that lock starvation
will not happen.

Assuming that the lock hold times of the other readers still in the
queue will be about the same as the readers that are being woken up,
there is really not much additional cost other than the additional
latency due to the wakeup of additional tasks by the waker. Therefore
all the readers up to a maximum of 256 in the queue are woken up when
the first waiter is a reader to improve reader throughput. This is
somewhat similar in concept to a phase-fair R/W lock.

With a locking microbenchmark running on 5.1 based kernel, the total
locking rates (in kops/s) on a 8-socket IvyBridge-EX system with
equal numbers of readers and writers before and after this patch were
as follows:

   # of Threads  Pre-Patch   Post-patch
   ------------  ---------   ----------
        4          1,641        1,674
        8            731        1,062
       16            564          924
       32             78          300
       64             38          195
      240             50          149

There is no performance gain at low contention level. At high contention
level, however, this patch gives a pretty decent performance boost.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-11-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:28:02 +02:00
Waiman Long
990fa7384a locking/rwsem: More optimal RT task handling of null owner
An RT task can do optimistic spinning only if the lock holder is
actually running. If the state of the lock holder isn't known, there
is a possibility that high priority of the RT task may block forward
progress of the lock holder if it happens to reside on the same CPU.
This will lead to deadlock. So we have to make sure that an RT task
will not spin on a reader-owned rwsem.

When the owner is temporarily set to NULL, there are two cases
where we may want to continue spinning:

 1) The lock owner is in the process of releasing the lock, sem->owner
    is cleared but the lock has not been released yet.

 2) The lock was free and owner cleared, but another task just comes
    in and acquire the lock before we try to get it. The new owner may
    be a spinnable writer.

So an RT task is now made to retry one more time to see if it can
acquire the lock or continue spinning on the new owning writer.

When testing on a 8-socket IvyBridge-EX system, the one additional retry
seems to improve locking performance of RT write locking threads under
heavy contentions. The table below shows the locking rates (in kops/s)
with various write locking threads before and after the patch.

    Locking threads     Pre-patch     Post-patch
    ---------------     ---------     -----------
            4             2,753          2,608
            8             2,529          2,520
           16             1,727          1,918
           32             1,263          1,956
           64               889          1,343

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-10-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:28:01 +02:00
Waiman Long
00f3c5a3df locking/rwsem: Always release wait_lock before waking up tasks
With the use of wake_q, we can do task wakeups without holding the
wait_lock. There is one exception in the rwsem code, though. It is
when the writer in the slowpath detects that there are waiters ahead
but the rwsem is not held by a writer. This can lead to a long wait_lock
hold time especially when a large number of readers are to be woken up.

Remediate this situation by releasing the wait_lock before waking
up tasks and re-acquiring it afterward. The rwsem_try_write_lock()
function is also modified to read the rwsem count directly to avoid
stale count value.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-9-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:28:00 +02:00
Waiman Long
4f23dbc1e6 locking/rwsem: Implement lock handoff to prevent lock starvation
Because of writer lock stealing, it is possible that a constant
stream of incoming writers will cause a waiting writer or reader to
wait indefinitely leading to lock starvation.

This patch implements a lock handoff mechanism to disable lock stealing
and force lock handoff to the first waiter or waiters (for readers)
in the queue after at least a 4ms waiting period unless it is a RT
writer task which doesn't need to wait. The waiting period is used to
avoid discouraging lock stealing too much to affect performance.

The setting and clearing of the handoff bit is serialized by the
wait_lock. So racing is not possible.

A rwsem microbenchmark was run for 5 seconds on a 2-socket 40-core
80-thread Skylake system with a v5.1 based kernel and 240 write_lock
threads with 5us sleep critical section.

Before the patch, the min/mean/max numbers of locking operations for
the locking threads were 1/7,792/173,696. After the patch, the figures
became 5,842/6,542/7,458.  It can be seen that the rwsem became much
more fair, though there was a drop of about 16% in the mean locking
operations done which was a tradeoff of having better fairness.

Making the waiter set the handoff bit right after the first wakeup can
impact performance especially with a mixed reader/writer workload. With
the same microbenchmark with short critical section and equal number of
reader and writer threads (40/40), the reader/writer locking operation
counts with the current patch were:

  40 readers, Iterations Min/Mean/Max = 1,793/1,794/1,796
  40 writers, Iterations Min/Mean/Max = 1,793/34,956/86,081

By making waiter set handoff bit immediately after wakeup:

  40 readers, Iterations Min/Mean/Max = 43/44/46
  40 writers, Iterations Min/Mean/Max = 43/1,263/3,191

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-8-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:27:59 +02:00
Waiman Long
3f6d517a3e locking/rwsem: Make rwsem_spin_on_owner() return owner state
This patch modifies rwsem_spin_on_owner() to return four possible
values to better reflect the state of lock holder which enables us to
make a better decision of what to do next.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-7-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:27:59 +02:00
Waiman Long
6cef7ff6e4 locking/rwsem: Code cleanup after files merging
After merging all the relevant rwsem code into one single file, there
are a number of optimizations and cleanups that can be done:

 1) Remove all the EXPORT_SYMBOL() calls for functions that are not
    accessed elsewhere.
 2) Remove all the __visible tags as none of the functions will be
    called from assembly code anymore.
 3) Make all the internal functions static.
 4) Remove some unneeded blank lines.
 5) Remove the intermediate rwsem_down_{read|write}_failed*() functions
    and rename __rwsem_down_{read|write}_failed_common() to
    rwsem_down_{read|write}_slowpath().
 6) Remove "__" prefix of __rwsem_mark_wake().
 7) Use atomic_long_try_cmpxchg_acquire() as much as possible.
 8) Remove the rwsem_rtrylock and rwsem_wtrylock lock events as they
    are not that useful.

That enables the compiler to do better optimization and reduce code
size. The text+data size of rwsem.o on an x86-64 machine with gcc8 was
reduced from 10237 bytes to 5030 bytes with this change.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-6-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:27:58 +02:00
Waiman Long
5dec94d492 locking/rwsem: Merge rwsem.h and rwsem-xadd.c into rwsem.c
Now we only have one implementation of rwsem. Even though we still use
xadd to handle reader locking, we use cmpxchg for writer instead. So
the filename rwsem-xadd.c is not strictly correct. Also no one outside
of the rwsem code need to know the internal implementation other than
function prototypes for two internal functions that are called directly
from percpu-rwsem.c.

So the rwsem-xadd.c and rwsem.h files are now merged into rwsem.c in
the following order:

  <upper part of rwsem.h>
  <rwsem-xadd.c>
  <lower part of rwsem.h>
  <rwsem.c>

The rwsem.h file now contains only 2 function declarations for
__up_read() and __down_read().

This is a code relocation patch with no code change at all except
making __up_read() and __down_read() non-static functions so they
can be used by percpu-rwsem.c.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-5-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:27:57 +02:00
Waiman Long
64489e7800 locking/rwsem: Implement a new locking scheme
The current way of using various reader, writer and waiting biases
in the rwsem code are confusing and hard to understand. I have to
reread the rwsem count guide in the rwsem-xadd.c file from time to
time to remind myself how this whole thing works. It also makes the
rwsem code harder to be optimized.

To make rwsem more sane, a new locking scheme similar to the one in
qrwlock is now being used.  The atomic long count has the following
bit definitions:

  Bit  0   - writer locked bit
  Bit  1   - waiters present bit
  Bits 2-7 - reserved for future extension
  Bits 8-X - reader count (24/56 bits)

The cmpxchg instruction is now used to acquire the write lock. The read
lock is still acquired with xadd instruction, so there is no change here.
This scheme will allow up to 16M/64P active readers which should be
more than enough. We can always use some more reserved bits if necessary.

With that change, we can deterministically know if a rwsem has been
write-locked. Looking at the count alone, however, one cannot determine
for certain if a rwsem is owned by readers or not as the readers that
set the reader count bits may be in the process of backing out. So we
still need the reader-owned bit in the owner field to be sure.

With a locking microbenchmark running on 5.1 based kernel, the total
locking rates (in kops/s) of the benchmark on a 8-socket 120-core
IvyBridge-EX system before and after the patch were as follows:

                  Before Patch      After Patch
   # of Threads  wlock    rlock    wlock    rlock
   ------------  -----    -----    -----    -----
        1        30,659   31,341   31,055   31,283
        2         8,909   16,457    9,884   17,659
        4         9,028   15,823    8,933   20,233
        8         8,410   14,212    7,230   17,140
       16         8,217   25,240    7,479   24,607

The locking rates of the benchmark on a Power8 system were as follows:

                  Before Patch      After Patch
   # of Threads  wlock    rlock    wlock    rlock
   ------------  -----    -----    -----    -----
        1        12,963   13,647   13,275   13,601
        2         7,570   11,569    7,902   10,829
        4         5,232    5,516    5,466    5,435
        8         5,233    3,386    5,467    3,168

The locking rates of the benchmark on a 2-socket ARM64 system were
as follows:

                  Before Patch      After Patch
   # of Threads  wlock    rlock    wlock    rlock
   ------------  -----    -----    -----    -----
        1        21,495   21,046   21,524   21,074
        2         5,293   10,502    5,333   10,504
        4         5,325   11,463    5,358   11,631
        8         5,391   11,712    5,470   11,680

The performance are roughly the same before and after the patch. There
are run-to-run variations in performance. Runs with higher variances
usually have higher throughput.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-4-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:27:56 +02:00
Waiman Long
5c1ec49b60 locking/rwsem: Remove rwsem_wake() wakeup optimization
After the following commit:

  59aabfc7e9 ("locking/rwsem: Reduce spinlock contention in wakeup after up_read()/up_write()")

the rwsem_wake() forgoes doing a wakeup if the wait_lock cannot be directly
acquired and an optimistic spinning locker is present.  This can help performance
by avoiding spinning on the wait_lock when it is contended.

With the later commit:

  133e89ef5e ("locking/rwsem: Enable lockless waiter wakeup(s)")

the performance advantage of the above optimization diminishes as the average
wait_lock hold time become much shorter.

With a later patch that supports rwsem lock handoff, we can no
longer relies on the fact that the presence of an optimistic spinning
locker will ensure that the lock will be acquired by a task soon and
rwsem_wake() will be called later on to wake up waiters. This can lead
to missed wakeup and application hang.

So the original 59aabfc7e9 commit has to be reverted.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-3-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:27:55 +02:00
Waiman Long
c71fd893f6 locking/rwsem: Make owner available even if !CONFIG_RWSEM_SPIN_ON_OWNER
The owner field in the rw_semaphore structure is used primarily for
optimistic spinning. However, identifying the rwsem owner can also be
helpful in debugging as well as tracing locking related issues when
analyzing crash dump. The owner field may also store state information
that can be important to the operation of the rwsem.

So the owner field is now made a permanent member of the rw_semaphore
structure irrespective of CONFIG_RWSEM_SPIN_ON_OWNER.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-2-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:27:54 +02:00
bsegall@google.com
66567fcbae sched/fair: Don't push cfs_bandwith slack timers forward
When a cfs_rq sleeps and returns its quota, we delay for 5ms before
waking any throttled cfs_rqs to coalesce with other cfs_rqs going to
sleep, as this has to be done outside of the rq lock we hold.

The current code waits for 5ms without any sleeps, instead of waiting
for 5ms from the first sleep, which can delay the unthrottle more than
we want. Switch this around so that we can't push this forward forever.

This requires an extra flag rather than using hrtimer_active, since we
need to start a new timer if the current one is in the process of
finishing.

Signed-off-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Acked-by: Phil Auld <pauld@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/xm26a7euy6iq.fsf_-_@bsegall-linux.svl.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:16:01 +02:00
Peter Zijlstra
aacedf26fb sched/core: Optimize try_to_wake_up() for local wakeups
Jens reported that significant performance can be had on some block
workloads by special casing local wakeups. That is, wakeups on the
current task before it schedules out.

Given something like the normal wait pattern:

	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);

		if (cond)
			break;

		schedule();
	}
	__set_current_state(TASK_RUNNING);

Any wakeup (on this CPU) after set_current_state() and before
schedule() would benefit from this.

Normal wakeups take p->pi_lock, which serializes wakeups to the same
task. By eliding that we gain concurrency on:

 - ttwu_stat(); we already had concurrency on rq stats, this now also
   brings it to task stats. -ENOCARE

 - tracepoints; it is now possible to get multiple instances of
   trace_sched_waking() (and possibly trace_sched_wakeup()) for the
   same task. Tracers will have to learn to cope.

Furthermore, p->pi_lock is used by set_special_state(), to order
against TASK_RUNNING stores from other CPUs. But since this is
strictly CPU local, we don't need the lock, and set_special_state()'s
disabling of IRQs is sufficient.

After the normal wakeup takes p->pi_lock it issues
smp_mb__after_spinlock(), in order to ensure the woken task must
observe prior stores before we observe the p->state. If this is CPU
local, this will be satisfied with a compiler barrier, and we rely on
try_to_wake_up() being a funcation call, which implies such.

Since, when 'p == current', 'p->on_rq' must be true, the normal wakeup
would continue into the ttwu_remote() branch, which normally is
concerned with exactly this wakeup scenario, except from a remote CPU.
IOW we're waking a task that is still running. In this case, we can
trivially avoid taking rq->lock, all that's left from this is to set
p->state.

This then yields an extremely simple and fast path for 'p == current'.

Reported-by: Jens Axboe <axboe@kernel.dk>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: gkohli@codeaurora.org
Cc: hch@lst.de
Cc: oleg@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:15:59 +02:00
Qian Cai
509466b7d4 sched/fair: Fix "runnable_avg_yN_inv" not used warnings
runnable_avg_yN_inv[] is only used in kernel/sched/pelt.c but was
included in several other places because they need other macros all
came from kernel/sched/sched-pelt.h which was generated by
Documentation/scheduler/sched-pelt. As the result, it causes compilation
a lot of warnings,

  kernel/sched/sched-pelt.h:4:18: warning: 'runnable_avg_yN_inv' defined but not used [-Wunused-const-variable=]
  kernel/sched/sched-pelt.h:4:18: warning: 'runnable_avg_yN_inv' defined but not used [-Wunused-const-variable=]
  kernel/sched/sched-pelt.h:4:18: warning: 'runnable_avg_yN_inv' defined but not used [-Wunused-const-variable=]
  ...

Silence it by appending the __maybe_unused attribute for it, so all
generated variables and macros can still be kept in the same file.

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1559596304-31581-1-git-send-email-cai@lca.pw
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:15:58 +02:00
Valentin Schneider
b0c7922441 sched/fair: Clean up definition of NOHZ blocked load functions
cfs_rq_has_blocked() and others_have_blocked() are only used within
update_blocked_averages(). The !CONFIG_FAIR_GROUP_SCHED version of the
latter calls them within a #define CONFIG_NO_HZ_COMMON block, whereas
the CONFIG_FAIR_GROUP_SCHED one calls them unconditionnally.

As reported by Qian, the above leads to this warning in
!CONFIG_NO_HZ_COMMON configs:

  kernel/sched/fair.c: In function 'update_blocked_averages':
  kernel/sched/fair.c:7750:7: warning: variable 'done' set but not used [-Wunused-but-set-variable]

It wouldn't be wrong to keep cfs_rq_has_blocked() and
others_have_blocked() as they are, but since their only current use is
to figure out when we can stop calling update_blocked_averages() on
fully decayed NOHZ idle CPUs, we can give them a new definition for
!CONFIG_NO_HZ_COMMON.

Change the definition of cfs_rq_has_blocked() and
others_have_blocked() for !CONFIG_NO_HZ_COMMON so that the
NOHZ-specific blocks of update_blocked_averages() become no-ops and
the 'done' variable gets optimised out.

While at it, remove the CONFIG_NO_HZ_COMMON block from the
!CONFIG_FAIR_GROUP_SCHED definition of update_blocked_averages() by
using the newly-introduced update_blocked_load_status() helper.

No change in functionality intended.

[ Additions by Peter Zijlstra. ]

Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190603115424.7951-1-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:15:57 +02:00
Gao Xiang
e3b929b0a1 sched/core: Add __sched tag for io_schedule()
Non-inline io_schedule() was introduced in:

  commit 10ab56434f ("sched/core: Separate out io_schedule_prepare() and io_schedule_finish()")

Keep in line with io_schedule_timeout(), otherwise "/proc/<pid>/wchan" will
report io_schedule() rather than its callers when waiting for IO.

Reported-by: Jilong Kou <koujilong@huawei.com>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Miao Xie <miaoxie@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 10ab56434f ("sched/core: Separate out io_schedule_prepare() and io_schedule_finish()")
Link: https://lkml.kernel.org/r/20190603091338.2695-1-gaoxiang25@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:15:56 +02:00
Ingo Molnar
23da766ab1 Linux 5.2-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl0Gj1MeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGctkH/0At3+SQPY2JJSy8
 i6+TDeytFx9OggeGLPHChRfehkAlvMb/kd34QHnuEvDqUuCAMU6HZQJFKoK9mvFI
 sDJVayPGDSqpm+iv8qLpMBPShiCXYVnGZeVfOdv36jUswL0k6wHV1pz4avFkDeZa
 1F4pmI6O2XRkNTYQawbUaFkAngWUCBG9ECLnHJnuIY6ohShBvjI4+E2JUaht+8gO
 M2h2b9ieddWmjxV3LTKgsK1v+347RljxdZTWnJ62SCDSEVZvsgSA9W2wnebVhBkJ
 drSmrFLxNiM+W45mkbUFmQixRSmjv++oRR096fxAnodBxMw0TDxE1RiMQWE6rVvG
 N6MC6xA=
 =+B0P
 -----END PGP SIGNATURE-----

Merge tag 'v5.2-rc5' into sched/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:12:27 +02:00
Peter Zijlstra
085ebfe937 perf/core: Fix perf_sample_regs_user() mm check
perf_sample_regs_user() uses 'current->mm' to test for the presence of
userspace, but this is insufficient, consider use_mm().

A better test is: '!(current->flags & PF_KTHREAD)', exec() clears
PF_KTHREAD after it sets the new ->mm but before it drops to userspace
for the first time.

Possibly obsoletes: bf05fc25f2 ("powerpc/perf: Fix oops when kthread execs user process")

Reported-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
Reported-by: Young Xiao <92siuyang@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 4018994f3d ("perf: Add ability to attach user level registers dump to sample")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:11:58 +02:00
Kobe Wu
dd471efe34 locking/lockdep: Remove unnecessary DEBUG_LOCKS_WARN_ON()
DEBUG_LOCKS_WARN_ON() will turn off debug_locks and
makes print_unlock_imbalance_bug() return directly.

Remove a redundant whitespace.

Signed-off-by: Kobe Wu <kobe-cp.wu@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <linux-mediatek@lists.infradead.org>
Cc: <wsd_upstream@mediatek.com>
Cc: Eason Lin <eason-yh.lin@mediatek.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: https://lkml.kernel.org/r/1559217575-30298-1-git-send-email-kobe-cp.wu@mediatek.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:09:37 +02:00
Daniel Bristot de Oliveira
c2ba8a15f3 jump_label: Batch updates if arch supports it
If the architecture supports the batching of jump label updates, use it!

An easy way to see the benefits of this patch is switching the
schedstats on and off. For instance:

-------------------------- %< ----------------------------
  #!/bin/sh
  while [ true ]; do
      sysctl -w kernel.sched_schedstats=1
      sleep 2
      sysctl -w kernel.sched_schedstats=0
      sleep 2
  done
-------------------------- >% ----------------------------

while watching the IPI count:

-------------------------- %< ----------------------------
  # watch -n1 "cat /proc/interrupts | grep Function"
-------------------------- >% ----------------------------

With the current mode, it is possible to see +- 168 IPIs each 2 seconds,
while with this patch the number of IPIs goes to 3 each 2 seconds.

Regarding the performance impact of this patch set, I made two measurements:

    The time to update a key (the task that is causing the change)
    The time to run the int3 handler (the side effect on a thread that
                                      hits the code being changed)

The schedstats static key was chosen as the key to being switched on and off.
The reason being is that it is used in more than 56 places, in a hot path. The
change in the schedstats static key will be done with the following command:

while [ true ]; do
    sysctl -w kernel.sched_schedstats=1
    usleep 500000
    sysctl -w kernel.sched_schedstats=0
    usleep 500000
done

In this way, they key will be updated twice per second. To force the hit of the
int3 handler, the system will also run a kernel compilation with two jobs per
CPU. The test machine is a two nodes/24 CPUs box with an Intel Xeon processor
@2.27GHz.

Regarding the update part, on average, the regular kernel takes 57 ms to update
the schedstats key, while the kernel with the batch updates takes just 1.4 ms
on average. Although it seems to be too good to be true, it makes sense: the
schedstats key is used in 56 places, so it was expected that it would take
around 56 times to update the keys with the current implementation, as the
IPIs are the most expensive part of the update.

Regarding the int3 handler, the non-batch handler takes 45 ns on average, while
the batch version takes around 180 ns. At first glance, it seems to be a high
value. But it is not, considering that it is doing 56 updates, rather than one!
It is taking four times more, only. This gain is possible because the patch
uses a binary search in the vector: log2(56)=5.8. So, it was expected to have
an overhead within four times.

(voice of tv propaganda) But, that is not all! As the int3 handler keeps on for
a shorter period (because the update part is on for a shorter time), the number
of hits in the int3 handler decreased by 10%.

The question then is: Is it worth paying the price of "135 ns" more in the int3
handler?

Considering that, in this test case, we are saving the handling of 53 IPIs,
that takes more than these 135 ns, it seems to be a meager price to be paid.
Moreover, the test case was forcing the hit of the int3, in practice, it
does not take that often. While the IPI takes place on all CPUs, hitting
the int3 handler or not!

For instance, in an isolated CPU with a process running in user-space
(nohz_full use-case), the chances of hitting the int3 handler is barely zero,
while there is no way to avoid the IPIs. By bounding the IPIs, we are improving
a lot this scenario.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/acc891dbc2dbc9fd616dd680529a2337b1d1274c.1560325897.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:09:22 +02:00
Daniel Bristot de Oliveira
0f133021bd jump_label: Sort entries of the same key by the code
In the batching mode, all the entries of a given key are updated at once.
During the update of a key, a hit in the int3 handler will check if the
hitting code address belongs to one of these keys.

To optimize the search of a given code in the vector of entries being
updated, a binary search is used. The binary search relies on the order
of the entries of a key by its code. Hence the keys need to be sorted
by the code too, so sort the entries of a given key by the code.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/f57ae83e0592418ba269866bb7ade570fc8632e0.1560325897.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:09:21 +02:00
Daniel Bristot de Oliveira
e1aacb3f4a jump_label: Add a jump_label_can_update() helper
Move the check if a jump_entry is valid to a function. No functional
change.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/56b69bd3f8e644ed64f2dbde7c088030b8cbe76b.1560325897.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:09:19 +02:00
Ingo Molnar
410df0c574 Linux 5.2-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl0Gj1MeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGctkH/0At3+SQPY2JJSy8
 i6+TDeytFx9OggeGLPHChRfehkAlvMb/kd34QHnuEvDqUuCAMU6HZQJFKoK9mvFI
 sDJVayPGDSqpm+iv8qLpMBPShiCXYVnGZeVfOdv36jUswL0k6wHV1pz4avFkDeZa
 1F4pmI6O2XRkNTYQawbUaFkAngWUCBG9ECLnHJnuIY6ohShBvjI4+E2JUaht+8gO
 M2h2b9ieddWmjxV3LTKgsK1v+347RljxdZTWnJ62SCDSEVZvsgSA9W2wnebVhBkJ
 drSmrFLxNiM+W45mkbUFmQixRSmjv++oRR096fxAnodBxMw0TDxE1RiMQWE6rVvG
 N6MC6xA=
 =+B0P
 -----END PGP SIGNATURE-----

Merge tag 'v5.2-rc5' into locking/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:06:34 +02:00
Linus Torvalds
efba92d58f Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
 "A set of small fixes:

   - Repair the ktime_get_coarse() functions so they actually deliver
     what they are supposed to: tick granular time stamps. The current
     code missed to add the accumulated nanoseconds part of the
     timekeeper so the resulting granularity was 1 second.

   - Prevent the tracer from infinitely recursing into time getter
     functions in the arm architectured timer by marking these functions
     notrace

   - Fix a trivial compiler warning caused by wrong qualifier ordering"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping: Repair ktime_get_coarse*() granularity
  clocksource/drivers/arm_arch_timer: Don't trace count reader functions
  clocksource/drivers/timer-ti-dm: Change to new style declaration
2019-06-16 07:22:56 -10:00
David S. Miller
1eb4169c1e Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2019-06-15

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) fix stack layout of JITed x64 bpf code, from Alexei.

2) fix out of bounds memory access in bpf_sk_storage, from Arthur.

3) fix lpm trie walk, from Jonathan.

4) fix nested bpf_perf_event_output, from Matt.

5) and several other fixes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15 18:19:47 -07:00
Matt Mullins
9594dc3c7e bpf: fix nested bpf tracepoints with per-cpu data
BPF_PROG_TYPE_RAW_TRACEPOINTs can be executed nested on the same CPU, as
they do not increment bpf_prog_active while executing.

This enables three levels of nesting, to support
  - a kprobe or raw tp or perf event,
  - another one of the above that irq context happens to call, and
  - another one in nmi context
(at most one of which may be a kprobe or perf event).

Fixes: 20b9d7ac48 ("bpf: avoid excessive stack usage for perf_sample_data")
Signed-off-by: Matt Mullins <mmullins@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-06-15 16:33:35 -07:00
Linus Torvalds
6a71398c6a This includes the following fixes:
- Out of range read of stack trace output
  - Fix for NULL pointer dereference in trace_uprobe_create()
  - Fix to a livepatching / ftrace permission race in the module code
  - Fix for NULL pointer dereference in free_ftrace_func_mapper()
  - A couple of build warning clean ups
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXQToxhQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qusmAP4/mmJPgsDchnu5ui0wB8BByzJlsPn8
 luXFDuqI4f34zgD+JCmeYbj5LLh98D9XkaaEgP4yz3yKsWeSdwWPCU0vTgo=
 =M/+E
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Out of range read of stack trace output

 - Fix for NULL pointer dereference in trace_uprobe_create()

 - Fix to a livepatching / ftrace permission race in the module code

 - Fix for NULL pointer dereference in free_ftrace_func_mapper()

 - A couple of build warning clean ups

* tag 'trace-v5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Fix NULL pointer dereference in free_ftrace_func_mapper()
  module: Fix livepatch/ftrace module text permissions race
  tracing/uprobe: Fix obsolete comment on trace_uprobe_create()
  tracing/uprobe: Fix NULL pointer dereference in trace_uprobe_create()
  tracing: Make two symbols static
  tracing: avoid build warning with HAVE_NOP_MCOUNT
  tracing: Fix out-of-range read in trace_stack_print()
2019-06-15 07:24:11 -10:00
Heiko Carstens
4ecf0a43e7 processor: get rid of cpu_relax_yield
stop_machine is the only user left of cpu_relax_yield. Given that it
now has special semantics which are tied to stop_machine introduce a
weak stop_machine_yield function which architectures can override, and
get rid of the generic cpu_relax_yield implementation.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2019-06-15 12:25:55 +02:00
Martin Schwidefsky
38f2c691a4 s390: improve wait logic of stop_machine
The stop_machine loop to advance the state machine and to wait for all
affected CPUs to check-in calls cpu_relax_yield in a tight loop until
the last missing CPUs acknowledged the state transition.

On a virtual system where not all logical CPUs are backed by real CPUs
all the time it can take a while for all CPUs to check-in. With the
current definition of cpu_relax_yield a diagnose 0x44 is done which
tells the hypervisor to schedule *some* other CPU. That can be any
CPU and not necessarily one of the CPUs that need to run in order to
advance the state machine. This can lead to a pretty bad diagnose 0x44
storm until the last missing CPU finally checked-in.

Replace the undirected cpu_relax_yield based on diagnose 0x44 with a
directed yield. Each CPU in the wait loop will pick up the next CPU
in the cpumask of stop_machine. The diagnose 0x9c is used to tell the
hypervisor to run this next CPU instead of the current one. If there
is only a limited number of real CPUs backing the virtual CPUs we
end up with the real CPUs passed around in a round-robin fashion.

[heiko.carstens@de.ibm.com]:
    Use cpumask_next_wrap as suggested by Peter Zijlstra.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2019-06-15 12:25:52 +02:00
Linus Torvalds
0011572c88 Merge branch 'for-5.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
 "This has an unusually high density of tricky fixes:

   - task_get_css() could deadlock when it races against a dying cgroup.

   - cgroup.procs didn't list thread group leaders with live threads.

     This could mislead readers to think that a cgroup is empty when
     it's not. Fixed by making PROCS iterator include dead tasks. I made
     a couple mistakes making this change and this pull request contains
     a couple follow-up patches.

   - When cpusets run out of online cpus, it updates cpusmasks of member
     tasks in bizarre ways. Joel improved the behavior significantly"

* 'for-5.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cpuset: restore sanity to cpuset_cpus_allowed_fallback()
  cgroup: Fix css_task_iter_advance_css_set() cset skip condition
  cgroup: css_task_iter_skip()'d iterators must be advanced before accessed
  cgroup: Include dying leaders with live threads in PROCS iterations
  cgroup: Implement css_task_iter_skip()
  cgroup: Call cgroup_release() before __exit_signal()
  docs cgroups: add another example size for hugetlb
  cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css()
2019-06-14 17:46:14 -10:00
Eric Dumazet
a8e11e5c56 sysctl: define proc_do_static_key()
Convert proc_dointvec_minmax_bpf_stats() into a more generic
helper, since we are going to use jump labels more often.

Note that sysctl_bpf_stats_enabled is removed, since
it is no longer needed/used.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 20:18:27 -07:00
Toshiaki Makita
86723c8640 bpf, devmap: Add missing RCU read lock on flush
.ndo_xdp_xmit() assumes it is called under RCU. For example virtio_net
uses RCU to detect it has setup the resources for tx. The assumption
accidentally broke when introducing bulk queue in devmap.

Fixes: 5d053f9da4 ("bpf: devmap prepare xdp frames for bulking")
Reported-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-15 00:58:51 +02:00
Toshiaki Makita
edabf4d9dd bpf, devmap: Add missing bulk queue free
dev_map_free() forgot to free bulk queue when freeing its entries.

Fixes: 5d053f9da4 ("bpf: devmap prepare xdp frames for bulking")
Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-15 00:58:47 +02:00
Toshiaki Makita
d4dd153d55 bpf, devmap: Fix premature entry free on destroying map
dev_map_free() waits for flush_needed bitmap to be empty in order to
ensure all flush operations have completed before freeing its entries.
However the corresponding clear_bit() was called before using the
entries, so the entries could be used after free.

All access to the entries needs to be done before clearing the bit.
It seems commit a5e2da6e97 ("bpf: netdev is never null in
__dev_map_flush") accidentally changed the clear_bit() and memory access
order.

Note that the problem happens only in __dev_map_flush(), not in
dev_map_flush_old(). dev_map_flush_old() is called only after nulling
out the corresponding netdev_map entry, so dev_map_free() never frees
the entry thus no such race happens there.

Fixes: a5e2da6e97 ("bpf: netdev is never null in __dev_map_flush")
Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-15 00:58:42 +02:00
Wei Li
04e03d9a61 ftrace: Fix NULL pointer dereference in free_ftrace_func_mapper()
The mapper may be NULL when called from register_ftrace_function_probe()
with probe->data == NULL.

This issue can be reproduced as follow (it may be covered by compiler
optimization sometime):

/ # cat /sys/kernel/debug/tracing/set_ftrace_filter
#### all functions enabled ####
/ # echo foo_bar:dump > /sys/kernel/debug/tracing/set_ftrace_filter
[  206.949100] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[  206.952402] Mem abort info:
[  206.952819]   ESR = 0x96000006
[  206.955326]   Exception class = DABT (current EL), IL = 32 bits
[  206.955844]   SET = 0, FnV = 0
[  206.956272]   EA = 0, S1PTW = 0
[  206.956652] Data abort info:
[  206.957320]   ISV = 0, ISS = 0x00000006
[  206.959271]   CM = 0, WnR = 0
[  206.959938] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000419f3a000
[  206.960483] [0000000000000000] pgd=0000000411a87003, pud=0000000411a83003, pmd=0000000000000000
[  206.964953] Internal error: Oops: 96000006 [#1] SMP
[  206.971122] Dumping ftrace buffer:
[  206.973677]    (ftrace buffer empty)
[  206.975258] Modules linked in:
[  206.976631] Process sh (pid: 281, stack limit = 0x(____ptrval____))
[  206.978449] CPU: 10 PID: 281 Comm: sh Not tainted 5.2.0-rc1+ #17
[  206.978955] Hardware name: linux,dummy-virt (DT)
[  206.979883] pstate: 60000005 (nZCv daif -PAN -UAO)
[  206.980499] pc : free_ftrace_func_mapper+0x2c/0x118
[  206.980874] lr : ftrace_count_free+0x68/0x80
[  206.982539] sp : ffff0000182f3ab0
[  206.983102] x29: ffff0000182f3ab0 x28: ffff8003d0ec1700
[  206.983632] x27: ffff000013054b40 x26: 0000000000000001
[  206.984000] x25: ffff00001385f000 x24: 0000000000000000
[  206.984394] x23: ffff000013453000 x22: ffff000013054000
[  206.984775] x21: 0000000000000000 x20: ffff00001385fe28
[  206.986575] x19: ffff000013872c30 x18: 0000000000000000
[  206.987111] x17: 0000000000000000 x16: 0000000000000000
[  206.987491] x15: ffffffffffffffb0 x14: 0000000000000000
[  206.987850] x13: 000000000017430e x12: 0000000000000580
[  206.988251] x11: 0000000000000000 x10: cccccccccccccccc
[  206.988740] x9 : 0000000000000000 x8 : ffff000013917550
[  206.990198] x7 : ffff000012fac2e8 x6 : ffff000012fac000
[  206.991008] x5 : ffff0000103da588 x4 : 0000000000000001
[  206.991395] x3 : 0000000000000001 x2 : ffff000013872a28
[  206.991771] x1 : 0000000000000000 x0 : 0000000000000000
[  206.992557] Call trace:
[  206.993101]  free_ftrace_func_mapper+0x2c/0x118
[  206.994827]  ftrace_count_free+0x68/0x80
[  206.995238]  release_probe+0xfc/0x1d0
[  206.995555]  register_ftrace_function_probe+0x4a8/0x868
[  206.995923]  ftrace_trace_probe_callback.isra.4+0xb8/0x180
[  206.996330]  ftrace_dump_callback+0x50/0x70
[  206.996663]  ftrace_regex_write.isra.29+0x290/0x3a8
[  206.997157]  ftrace_filter_write+0x44/0x60
[  206.998971]  __vfs_write+0x64/0xf0
[  206.999285]  vfs_write+0x14c/0x2f0
[  206.999591]  ksys_write+0xbc/0x1b0
[  206.999888]  __arm64_sys_write+0x3c/0x58
[  207.000246]  el0_svc_common.constprop.0+0x408/0x5f0
[  207.000607]  el0_svc_handler+0x144/0x1c8
[  207.000916]  el0_svc+0x8/0xc
[  207.003699] Code: aa0003f8 a9025bf5 aa0103f5 f946ea80 (f9400303)
[  207.008388] ---[ end trace 7b6d11b5f542bdf1 ]---
[  207.010126] Kernel panic - not syncing: Fatal exception
[  207.011322] SMP: stopping secondary CPUs
[  207.013956] Dumping ftrace buffer:
[  207.014595]    (ftrace buffer empty)
[  207.015632] Kernel Offset: disabled
[  207.017187] CPU features: 0x002,20006008
[  207.017985] Memory Limit: none
[  207.019825] ---[ end Kernel panic - not syncing: Fatal exception ]---

Link: http://lkml.kernel.org/r/20190606031754.10798-1-liwei391@huawei.com

Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-06-14 17:40:21 -04:00
Mauro Carvalho Chehab
151f4e2bdc docs: power: convert docs to ReST and rename to *.rst
Convert the PM documents to ReST, in order to allow them to
build with Sphinx.

The conversion is actually:
  - add blank lines and indentation in order to identify paragraphs;
  - fix tables markups;
  - add some lists markups;
  - mark literal blocks;
  - adjust title markups.

At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Mark Brown <broonie@kernel.org>
Acked-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
2019-06-14 16:08:36 -05:00
Josh Poimboeuf
9f255b632b module: Fix livepatch/ftrace module text permissions race
It's possible for livepatch and ftrace to be toggling a module's text
permissions at the same time, resulting in the following panic:

  BUG: unable to handle page fault for address: ffffffffc005b1d9
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0003) - permissions violation
  PGD 3ea0c067 P4D 3ea0c067 PUD 3ea0e067 PMD 3cc13067 PTE 3b8a1061
  Oops: 0003 [#1] PREEMPT SMP PTI
  CPU: 1 PID: 453 Comm: insmod Tainted: G           O  K   5.2.0-rc1-a188339ca5 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014
  RIP: 0010:apply_relocate_add+0xbe/0x14c
  Code: fa 0b 74 21 48 83 fa 18 74 38 48 83 fa 0a 75 40 eb 08 48 83 38 00 74 33 eb 53 83 38 00 75 4e 89 08 89 c8 eb 0a 83 38 00 75 43 <89> 08 48 63 c1 48 39 c8 74 2e eb 48 83 38 00 75 32 48 29 c1 89 08
  RSP: 0018:ffffb223c00dbb10 EFLAGS: 00010246
  RAX: ffffffffc005b1d9 RBX: 0000000000000000 RCX: ffffffff8b200060
  RDX: 000000000000000b RSI: 0000004b0000000b RDI: ffff96bdfcd33000
  RBP: ffffb223c00dbb38 R08: ffffffffc005d040 R09: ffffffffc005c1f0
  R10: ffff96bdfcd33c40 R11: ffff96bdfcd33b80 R12: 0000000000000018
  R13: ffffffffc005c1f0 R14: ffffffffc005e708 R15: ffffffff8b2fbc74
  FS:  00007f5f447beba8(0000) GS:ffff96bdff900000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: ffffffffc005b1d9 CR3: 000000003cedc002 CR4: 0000000000360ea0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   klp_init_object_loaded+0x10f/0x219
   ? preempt_latency_start+0x21/0x57
   klp_enable_patch+0x662/0x809
   ? virt_to_head_page+0x3a/0x3c
   ? kfree+0x8c/0x126
   patch_init+0x2ed/0x1000 [livepatch_test02]
   ? 0xffffffffc0060000
   do_one_initcall+0x9f/0x1c5
   ? kmem_cache_alloc_trace+0xc4/0xd4
   ? do_init_module+0x27/0x210
   do_init_module+0x5f/0x210
   load_module+0x1c41/0x2290
   ? fsnotify_path+0x3b/0x42
   ? strstarts+0x2b/0x2b
   ? kernel_read+0x58/0x65
   __do_sys_finit_module+0x9f/0xc3
   ? __do_sys_finit_module+0x9f/0xc3
   __x64_sys_finit_module+0x1a/0x1c
   do_syscall_64+0x52/0x61
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

The above panic occurs when loading two modules at the same time with
ftrace enabled, where at least one of the modules is a livepatch module:

CPU0					CPU1
klp_enable_patch()
  klp_init_object_loaded()
    module_disable_ro()
    					ftrace_module_enable()
					  ftrace_arch_code_modify_post_process()
				    	    set_all_modules_text_ro()
      klp_write_object_relocations()
        apply_relocate_add()
	  *patches read-only code* - BOOM

A similar race exists when toggling ftrace while loading a livepatch
module.

Fix it by ensuring that the livepatch and ftrace code patching
operations -- and their respective permissions changes -- are protected
by the text_mutex.

Link: http://lkml.kernel.org/r/ab43d56ab909469ac5d2520c5d944ad6d4abd476.1560474114.git.jpoimboe@redhat.com

Reported-by: Johannes Erdfelt <johannes@erdfelt.com>
Fixes: 444d13ff10 ("modules: add ro_after_init support")
Acked-by: Jessica Yu <jeyu@kernel.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-06-14 17:01:50 -04:00
Eiichi Tsukata
a4158345ec tracing/uprobe: Fix obsolete comment on trace_uprobe_create()
Commit 0597c49c69 ("tracing/uprobes: Use dyn_event framework for
uprobe events") cleaned up the usage of trace_uprobe_create(), and the
function has been no longer used for removing uprobe/uretprobe.

Link: http://lkml.kernel.org/r/20190614074026.8045-2-devel@etsukata.com

Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Eiichi Tsukata <devel@etsukata.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-06-14 17:00:36 -04:00
Eiichi Tsukata
f01098c74b tracing/uprobe: Fix NULL pointer dereference in trace_uprobe_create()
Just like the case of commit 8b05a3a750 ("tracing/kprobes: Fix NULL
pointer dereference in trace_kprobe_create()"), writing an incorrectly
formatted string to uprobe_events can trigger NULL pointer dereference.

Reporeducer:

  # echo r > /sys/kernel/debug/tracing/uprobe_events

dmesg:

  BUG: kernel NULL pointer dereference, address: 0000000000000000
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 8000000079d12067 P4D 8000000079d12067 PUD 7b7ab067 PMD 0
  Oops: 0000 [#1] PREEMPT SMP PTI
  CPU: 0 PID: 1903 Comm: bash Not tainted 5.2.0-rc3+ #15
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
  RIP: 0010:strchr+0x0/0x30
  Code: c0 eb 0d 84 c9 74 18 48 83 c0 01 48 39 d0 74 0f 0f b6 0c 07 3a 0c 06 74 ea 19 c0 83 c8 01 c3 31 c0 c3 0f 1f 84 00 00 00 00 00 <0f> b6 07 89 f2 40 38 f0 75 0e eb 13 0f b6 47 01 48 83 c
  RSP: 0018:ffffb55fc0403d10 EFLAGS: 00010293

  RAX: ffff993ffb793400 RBX: 0000000000000000 RCX: ffffffffa4852625
  RDX: 0000000000000000 RSI: 000000000000002f RDI: 0000000000000000
  RBP: ffffb55fc0403dd0 R08: ffff993ffb793400 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
  R13: ffff993ff9cc1668 R14: 0000000000000001 R15: 0000000000000000
  FS:  00007f30c5147700(0000) GS:ffff993ffda00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 000000007b628000 CR4: 00000000000006f0
  Call Trace:
   trace_uprobe_create+0xe6/0xb10
   ? __kmalloc_track_caller+0xe6/0x1c0
   ? __kmalloc+0xf0/0x1d0
   ? trace_uprobe_create+0xb10/0xb10
   create_or_delete_trace_uprobe+0x35/0x90
   ? trace_uprobe_create+0xb10/0xb10
   trace_run_command+0x9c/0xb0
   trace_parse_run_command+0xf9/0x1eb
   ? probes_open+0x80/0x80
   __vfs_write+0x43/0x90
   vfs_write+0x14a/0x2a0
   ksys_write+0xa2/0x170
   do_syscall_64+0x7f/0x200
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

Link: http://lkml.kernel.org/r/20190614074026.8045-1-devel@etsukata.com

Cc: stable@vger.kernel.org
Fixes: 0597c49c69 ("tracing/uprobes: Use dyn_event framework for uprobe events")
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Eiichi Tsukata <devel@etsukata.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-06-14 16:51:14 -04:00
YueHaibing
ff585c5b9a tracing: Make two symbols static
Fix sparse warnings:

kernel/trace/trace.c:6927:24: warning:
 symbol 'get_tracing_log_err' was not declared. Should it be static?
kernel/trace/trace.c:8196:15: warning:
 symbol 'trace_instance_dir' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/20190614153210.24424-1-yuehaibing@huawei.com

Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-06-14 16:49:26 -04:00
Vasily Gorbik
cbdaeaf050 tracing: avoid build warning with HAVE_NOP_MCOUNT
Selecting HAVE_NOP_MCOUNT enables -mnop-mcount (if gcc supports it)
and sets CC_USING_NOP_MCOUNT. Reuse __is_defined (which is suitable for
testing CC_USING_* defines) to avoid conditional compilation and fix
the following gcc 9 warning on s390:

kernel/trace/ftrace.c:2514:1: warning: ‘ftrace_code_disable’ defined
but not used [-Wunused-function]

Link: http://lkml.kernel.org/r/patch.git-1a82d13f33ac.your-ad-here.call-01559732716-ext-6629@work.hours

Fixes: 2f4df0017b ("tracing: Add -mcount-nop option support")
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-06-14 16:34:57 -04:00
Mauro Carvalho Chehab
d6a3b24762 docs: scheduler: convert docs to ReST and rename to *.rst
In order to prepare to add them to the Kernel API book,
convert the files to ReST format.

The conversion is actually:
  - add blank lines and identation in order to identify paragraphs;
  - fix tables markups;
  - add some lists markups;
  - mark literal blocks;
  - adjust title markups.

At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2019-06-14 14:32:18 -06:00
Mauro Carvalho Chehab
99c8b231ae docs: cgroup-v1: convert docs to ReST and rename to *.rst
Convert the cgroup-v1 files to ReST format, in order to
allow a later addition to the admin-guide.

The conversion is actually:
  - add blank lines and identation in order to identify paragraphs;
  - fix tables markups;
  - add some lists markups;
  - mark literal blocks;
  - adjust title markups.

At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-06-14 13:29:54 -07:00
Eiichi Tsukata
becf33f694 tracing: Fix out-of-range read in trace_stack_print()
Puts range check before dereferencing the pointer.

Reproducer:

  # echo stacktrace > trace_options
  # echo 1 > events/enable
  # cat trace > /dev/null

KASAN report:

  ==================================================================
  BUG: KASAN: use-after-free in trace_stack_print+0x26b/0x2c0
  Read of size 8 at addr ffff888069d20000 by task cat/1953

  CPU: 0 PID: 1953 Comm: cat Not tainted 5.2.0-rc3+ #5
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
  Call Trace:
   dump_stack+0x8a/0xce
   print_address_description+0x60/0x224
   ? trace_stack_print+0x26b/0x2c0
   ? trace_stack_print+0x26b/0x2c0
   __kasan_report.cold+0x1a/0x3e
   ? trace_stack_print+0x26b/0x2c0
   kasan_report+0xe/0x20
   trace_stack_print+0x26b/0x2c0
   print_trace_line+0x6ea/0x14d0
   ? tracing_buffers_read+0x700/0x700
   ? trace_find_next_entry_inc+0x158/0x1d0
   s_show+0xea/0x310
   seq_read+0xaa7/0x10e0
   ? seq_escape+0x230/0x230
   __vfs_read+0x7c/0x100
   vfs_read+0x16c/0x3a0
   ksys_read+0x121/0x240
   ? kernel_write+0x110/0x110
   ? perf_trace_sys_enter+0x8a0/0x8a0
   ? syscall_slow_exit_work+0xa9/0x410
   do_syscall_64+0xb7/0x390
   ? prepare_exit_to_usermode+0x165/0x200
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f867681f910
  Code: b6 fe ff ff 48 8d 3d 0f be 08 00 48 83 ec 08 e8 06 db 01 00 66 0f 1f 44 00 00 83 3d f9 2d 2c 00 00 75 10 b8 00 00 00 00 04
  RSP: 002b:00007ffdabf23488 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
  RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f867681f910
  RDX: 0000000000020000 RSI: 00007f8676cde000 RDI: 0000000000000003
  RBP: 00007f8676cde000 R08: ffffffffffffffff R09: 0000000000000000
  R10: 0000000000000871 R11: 0000000000000246 R12: 00007f8676cde000
  R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000000ec0

  Allocated by task 1214:
   save_stack+0x1b/0x80
   __kasan_kmalloc.constprop.0+0xc2/0xd0
   kmem_cache_alloc+0xaf/0x1a0
   getname_flags+0xd2/0x5b0
   do_sys_open+0x277/0x5a0
   do_syscall_64+0xb7/0x390
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  Freed by task 1214:
   save_stack+0x1b/0x80
   __kasan_slab_free+0x12c/0x170
   kmem_cache_free+0x8a/0x1c0
   putname+0xe1/0x120
   do_sys_open+0x2c5/0x5a0
   do_syscall_64+0xb7/0x390
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  The buggy address belongs to the object at ffff888069d20000
   which belongs to the cache names_cache of size 4096
  The buggy address is located 0 bytes inside of
   4096-byte region [ffff888069d20000, ffff888069d21000)
  The buggy address belongs to the page:
  page:ffffea0001a74800 refcount:1 mapcount:0 mapping:ffff88806ccd1380 index:0x0 compound_mapcount: 0
  flags: 0x100000000010200(slab|head)
  raw: 0100000000010200 dead000000000100 dead000000000200 ffff88806ccd1380
  raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
  page dumped because: kasan: bad access detected

  Memory state around the buggy address:
   ffff888069d1ff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
   ffff888069d1ff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  >ffff888069d20000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                     ^
   ffff888069d20080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ffff888069d20100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ==================================================================

Link: http://lkml.kernel.org/r/20190610040016.5598-1-devel@etsukata.com

Fixes: 4285f2fcef ("tracing: Remove the ULONG_MAX stack trace hackery")
Signed-off-by: Eiichi Tsukata <devel@etsukata.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-06-14 16:28:42 -04:00
Tejun Heo
38cf3a687f cgroup: Move cgroup_parse_float() implementation out of CONFIG_SYSFS
a5e112e642 ("cgroup: add cgroup_parse_float()") accidentally added
cgroup_parse_float() inside CONFIG_SYSFS block.  Move it outside so
that it doesn't cause failures on !CONFIG_SYSFS builds.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: a5e112e642 ("cgroup: add cgroup_parse_float()")
2019-06-14 10:14:44 -07:00
Yangtao Li
141e1ecda3 alarmtimer: Fix kerneldoc comment for alarmtimer_suspend()
This brings the kernel doc in line with the function signature.

Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: john.stultz@linaro.org
Cc: sboyd@kernel.org
Link: https://lkml.kernel.org/r/20190525183925.18963-1-tiny.windzz@gmail.com
2019-06-14 17:04:04 +02:00
Mathieu Malaterre
0f48b41f59 clocksource: Move inline keyword to the beginning of function declarations
The inline keyword was not at the beginning of the function declarations.
Fix the following warnings triggered when using W=1:

  kernel/time/clocksource.c:108:1: warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]
  kernel/time/clocksource.c:113:1: warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: trivial@kernel.org
Cc: kernel-janitors@vger.kernel.org
Cc: John Stultz <john.stultz@linaro.org>
Cc: Stephen Boyd <sboyd@kernel.org>
Link: https://lkml.kernel.org/r/20190524103339.28787-1-malat@debian.org
2019-06-14 17:04:03 +02:00
Florian Fainelli
4b4b077cbd dma-remap: Avoid de-referencing NULL atomic_pool
With architectures allowing the kernel to be placed almost arbitrarily
in memory (e.g.: ARM64), it is possible to have the kernel resides at
physical addresses above 4GB, resulting in neither the default CMA area,
nor the atomic pool from successfully allocating. This does not prevent
specific peripherals from working though, one example is XHCI, which
still operates correctly.

Trouble comes when the XHCI driver gets suspended and resumed, since we
can now trigger the following NPD:

[   12.664170] usb usb1: root hub lost power or was reset
[   12.669387] usb usb2: root hub lost power or was reset
[   12.674662] Unable to handle kernel NULL pointer dereference at virtual address 00000008
[   12.682896] pgd = ffffffc1365a7000
[   12.686386] [00000008] *pgd=0000000136500003, *pud=0000000136500003, *pmd=0000000000000000
[   12.694897] Internal error: Oops: 96000006 [#1] SMP
[   12.699843] Modules linked in:
[   12.702980] CPU: 0 PID: 1499 Comm: pml Not tainted 4.9.135-1.13pre #51
[   12.709577] Hardware name: BCM97268DV (DT)
[   12.713736] task: ffffffc136bb6540 task.stack: ffffffc1366cc000
[   12.719740] PC is at addr_in_gen_pool+0x4/0x48
[   12.724253] LR is at __dma_free+0x64/0xbc
[   12.728325] pc : [<ffffff80083c0df8>] lr : [<ffffff80080979e0>] pstate: 60000145
[   12.735825] sp : ffffffc1366cf990
[   12.739196] x29: ffffffc1366cf990 x28: ffffffc1366cc000
[   12.744608] x27: 0000000000000000 x26: ffffffc13a8568c8
[   12.750020] x25: 0000000000000000 x24: ffffff80098f9000
[   12.755433] x23: 000000013a5ff000 x22: ffffff8009c57000
[   12.760844] x21: ffffffc13a856810 x20: 0000000000000000
[   12.766255] x19: 0000000000001000 x18: 000000000000000a
[   12.771667] x17: 0000007f917553e0 x16: 0000000000001002
[   12.777078] x15: 00000000000a36cb x14: ffffff80898feb77
[   12.782490] x13: ffffffffffffffff x12: 0000000000000030
[   12.787899] x11: 00000000fffffffe x10: ffffff80098feb7f
[   12.793311] x9 : 0000000005f5e0ff x8 : 65776f702074736f
[   12.798723] x7 : 6c2062756820746f x6 : ffffff80098febb1
[   12.804134] x5 : ffffff800809797c x4 : 0000000000000000
[   12.809545] x3 : 000000013a5ff000 x2 : 0000000000000fff
[   12.814955] x1 : ffffff8009c57000 x0 : 0000000000000000
[   12.820363]
[   12.821907] Process pml (pid: 1499, stack limit = 0xffffffc1366cc020)
[   12.828421] Stack: (0xffffffc1366cf990 to 0xffffffc1366d0000)
[   12.834240] f980:                                   ffffffc1366cf9e0 ffffff80086004d0
[   12.842186] f9a0: ffffffc13ab08238 0000000000000010 ffffff80097c2218 ffffffc13a856810
[   12.850131] f9c0: ffffff8009c57000 000000013a5ff000 0000000000000008 000000013a5ff000
[   12.858076] f9e0: ffffffc1366cfa50 ffffff80085f9250 ffffffc13ab08238 0000000000000004
[   12.866021] fa00: ffffffc13ab08000 ffffff80097b6000 ffffffc13ab08130 0000000000000001
[   12.873966] fa20: 0000000000000008 ffffffc13a8568c8 0000000000000000 ffffffc1366cc000
[   12.881911] fa40: ffffffc13ab08130 0000000000000001 ffffffc1366cfa90 ffffff80085e3de8
[   12.889856] fa60: ffffffc13ab08238 0000000000000000 ffffffc136b75b00 0000000000000000
[   12.897801] fa80: 0000000000000010 ffffff80089ccb92 ffffffc1366cfac0 ffffff80084ad040
[   12.905746] faa0: ffffffc13a856810 0000000000000000 ffffff80084ad004 ffffff80084b91a8
[   12.913691] fac0: ffffffc1366cfae0 ffffff80084b91b4 ffffffc13a856810 ffffff80080db5cc
[   12.921636] fae0: ffffffc1366cfb20 ffffff80084b96bc ffffffc13a856810 0000000000000010
[   12.929581] fb00: ffffffc13a856870 0000000000000000 ffffffc13a856810 ffffff800984d2b8
[   12.937526] fb20: ffffffc1366cfb50 ffffff80084baa70 ffffff8009932ad0 ffffff800984d260
[   12.945471] fb40: 0000000000000010 00000002eff0a065 ffffffc1366cfbb0 ffffff80084bafbc
[   12.953415] fb60: 0000000000000010 0000000000000003 ffffff80098fe000 0000000000000000
[   12.961360] fb80: ffffff80097b6000 ffffff80097b6dc8 ffffff80098c12b8 ffffff80098c12f8
[   12.969306] fba0: ffffff8008842000 ffffff80097b6dc8 ffffffc1366cfbd0 ffffff80080e0d88
[   12.977251] fbc0: 00000000fffffffb ffffff80080e10bc ffffffc1366cfc60 ffffff80080e16a8
[   12.985196] fbe0: 0000000000000000 0000000000000003 ffffff80097b6000 ffffff80098fe9f0
[   12.993140] fc00: ffffff80097d4000 ffffff8008983802 0000000000000123 0000000000000040
[   13.001085] fc20: ffffff8008842000 ffffffc1366cc000 ffffff80089803c2 00000000ffffffff
[   13.009029] fc40: 0000000000000000 0000000000000000 ffffffc1366cfc60 0000000000040987
[   13.016974] fc60: ffffffc1366cfcc0 ffffff80080dfd08 0000000000000003 0000000000000004
[   13.024919] fc80: 0000000000000003 ffffff80098fea08 ffffffc136577ec0 ffffff80089803c2
[   13.032864] fca0: 0000000000000123 0000000000000001 0000000500000002 0000000000040987
[   13.040809] fcc0: ffffffc1366cfd00 ffffff80083a89d4 0000000000000004 ffffffc136577ec0
[   13.048754] fce0: ffffffc136610cc0 ffffffffffffffea ffffffc1366cfeb0 ffffffc136610cd8
[   13.056700] fd00: ffffffc1366cfd10 ffffff800822a614 ffffffc1366cfd40 ffffff80082295d4
[   13.064645] fd20: 0000000000000004 ffffffc136577ec0 ffffffc136610cc0 0000000021670570
[   13.072590] fd40: ffffffc1366cfd80 ffffff80081b5d10 ffffff80097b6000 ffffffc13aae4200
[   13.080536] fd60: ffffffc1366cfeb0 0000000000000004 0000000021670570 0000000000000004
[   13.088481] fd80: ffffffc1366cfe30 ffffff80081b6b20 ffffffc13aae4200 0000000000000000
[   13.096427] fda0: 0000000000000004 0000000021670570 ffffffc1366cfeb0 ffffffc13a838200
[   13.104371] fdc0: 0000000000000000 000000000000000a ffffff80097b6000 0000000000040987
[   13.112316] fde0: ffffffc1366cfe20 ffffff80081b3af0 ffffffc13a838200 0000000000000000
[   13.120261] fe00: ffffffc1366cfe30 ffffff80081b6b0c ffffffc13aae4200 0000000000000000
[   13.128206] fe20: 0000000000000004 0000000000040987 ffffffc1366cfe70 ffffff80081b7dd8
[   13.136151] fe40: ffffff80097b6000 ffffffc13aae4200 ffffffc13aae4200 fffffffffffffff7
[   13.144096] fe60: 0000000021670570 ffffffc13a8c63c0 0000000000000000 ffffff8008083180
[   13.152042] fe80: ffffffffffffff1d 0000000021670570 ffffffffffffffff 0000007f917ad9b8
[   13.159986] fea0: 0000000020000000 0000000000000015 0000000000000000 0000000000040987
[   13.167930] fec0: 0000000000000001 0000000021670570 0000000000000004 0000000000000000
[   13.175874] fee0: 0000000000000888 0000440110000000 000000000000006d 0000000000000003
[   13.183819] ff00: 0000000000000040 ffffff80ffffffc8 0000000000000000 0000000000000020
[   13.191762] ff20: 0000000000000000 0000000000000000 0000000000000001 0000000000000000
[   13.199707] ff40: 0000000000000000 0000007f917553e0 0000000000000000 0000000000000004
[   13.207651] ff60: 0000000021670570 0000007f91835480 0000000000000004 0000007f91831638
[   13.215595] ff80: 0000000000000004 00000000004b0de0 00000000004b0000 0000000000000000
[   13.223539] ffa0: 0000000000000000 0000007fc92ac8c0 0000007f9175d178 0000007fc92ac8c0
[   13.231483] ffc0: 0000007f917ad9b8 0000000020000000 0000000000000001 0000000000000040
[   13.239427] ffe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[   13.247360] Call trace:
[   13.249866] Exception stack(0xffffffc1366cf7a0 to 0xffffffc1366cf8d0)
[   13.256386] f7a0: 0000000000001000 0000007fffffffff ffffffc1366cf990 ffffff80083c0df8
[   13.264331] f7c0: 0000000060000145 ffffff80089b5001 ffffffc13ab08130 0000000000000001
[   13.272275] f7e0: 0000000000000008 ffffffc13a8568c8 0000000000000000 0000000000000000
[   13.280220] f800: ffffffc1366cf960 ffffffc1366cf960 ffffffc1366cf930 00000000ffffffd8
[   13.288165] f820: ffffff8009931ac0 4554535953425553 4544006273753d4d 3831633d45434956
[   13.296110] f840: ffff003832313a39 ffffff800845926c ffffffc1366cf880 0000000000040987
[   13.304054] f860: 0000000000000000 ffffff8009c57000 0000000000000fff 000000013a5ff000
[   13.311999] f880: 0000000000000000 ffffff800809797c ffffff80098febb1 6c2062756820746f
[   13.319944] f8a0: 65776f702074736f 0000000005f5e0ff ffffff80098feb7f 00000000fffffffe
[   13.327884] f8c0: 0000000000000030 ffffffffffffffff
[   13.332835] [<ffffff80083c0df8>] addr_in_gen_pool+0x4/0x48
[   13.338398] [<ffffff80086004d0>] xhci_mem_cleanup+0xc8/0x51c
[   13.344137] [<ffffff80085f9250>] xhci_resume+0x308/0x65c
[   13.349524] [<ffffff80085e3de8>] xhci_brcm_resume+0x84/0x8c
[   13.355174] [<ffffff80084ad040>] platform_pm_resume+0x3c/0x64
[   13.360997] [<ffffff80084b91b4>] dpm_run_callback+0x5c/0x15c
[   13.366732] [<ffffff80084b96bc>] device_resume+0xc0/0x190
[   13.372205] [<ffffff80084baa70>] dpm_resume+0x144/0x2cc
[   13.377504] [<ffffff80084bafbc>] dpm_resume_end+0x20/0x34
[   13.382980] [<ffffff80080e0d88>] suspend_devices_and_enter+0x104/0x704
[   13.389585] [<ffffff80080e16a8>] pm_suspend+0x320/0x53c
[   13.394881] [<ffffff80080dfd08>] state_store+0xbc/0xe0
[   13.400094] [<ffffff80083a89d4>] kobj_attr_store+0x14/0x24
[   13.405655] [<ffffff800822a614>] sysfs_kf_write+0x60/0x70
[   13.411128] [<ffffff80082295d4>] kernfs_fop_write+0x130/0x194
[   13.416954] [<ffffff80081b5d10>] __vfs_write+0x60/0x150
[   13.422254] [<ffffff80081b6b20>] vfs_write+0xc8/0x164
[   13.427376] [<ffffff80081b7dd8>] SyS_write+0x70/0xc8
[   13.432412] [<ffffff8008083180>] el0_svc_naked+0x34/0x38
[   13.437800] Code: 92800173 97f6fb9e 17fffff5 d1000442 (f8408c03)
[   13.444033] ---[ end trace 2effe12f909ce205 ]---

The call path leading to this problem is xhci_mem_cleanup() ->
dma_free_coherent() -> dma_free_from_pool() -> addr_in_gen_pool. If the
atomic_pool is NULL, we can't possibly have the address in the atomic
pool anyway, so guard against that.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-14 14:30:21 +02:00
Thomas Gleixner
e3ff9c3678 timekeeping: Repair ktime_get_coarse*() granularity
Jason reported that the coarse ktime based time getters advance only once
per second and not once per tick as advertised.

The code reads only the monotonic base time, which advances once per
second. The nanoseconds are accumulated on every tick in xtime_nsec up to
a second and the regular time getters take this nanoseconds offset into
account, but the ktime_get_coarse*() implementation fails to do so.

Add the accumulated xtime_nsec value to the monotonic base time to get the
proper per tick advancing coarse tinme.

Fixes: b9ff604cff ("timekeeping: Add ktime_get_coarse_with_offset")
Reported-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: Sultan Alsawaf <sultan@kerneltoast.com>
Cc: Waiman Long <longman@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1906132136280.1791@nanos.tec.linutronix.de
2019-06-14 11:51:44 +02:00
Mathieu Malaterre
1ec0cd8286 PM: hibernate: powerpc: Expose pfn_is_nosave() prototype
The declaration for pfn_is_nosave is only available in
kernel/power/power.h. Since this function can be override in arch,
expose it globally. Having a prototype will make sure to avoid warning
(sometime treated as error with W=1) such as:

  arch/powerpc/kernel/suspend.c:18:5: error: no previous prototype for 'pfn_is_nosave' [-Werror=missing-prototypes]

This moves the declaration into a globally visible header file and add
missing include to avoid a warning on powerpc.

Also remove the duplicated prototypes since not required anymore.

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-06-14 10:48:56 +02:00
YueHaibing
bc6f2a757d kernel/module: Fix mem leak in module_add_modinfo_attrs
In module_add_modinfo_attrs if sysfs_create_file
fails, we forget to free allocated modinfo_attrs
and roll back the sysfs files.

Fixes: 03e88ae1b1 ("[PATCH] fix module sysfs files reference counting")
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2019-06-14 09:31:09 +02:00
Dan Williams
50f44ee724 mm/devm_memremap_pages: fix final page put race
Logan noticed that devm_memremap_pages_release() kills the percpu_ref
drops all the page references that were acquired at init and then
immediately proceeds to unplug, arch_remove_memory(), the backing pages
for the pagemap.  If for some reason device shutdown actually collides
with a busy / elevated-ref-count page then arch_remove_memory() should
be deferred until after that reference is dropped.

As it stands the "wait for last page ref drop" happens *after*
devm_memremap_pages_release() returns, which is obviously too late and
can lead to crashes.

Fix this situation by assigning the responsibility to wait for the
percpu_ref to go idle to devm_memremap_pages() with a new ->cleanup()
callback.  Implement the new cleanup callback for all
devm_memremap_pages() users: pmem, devdax, hmm, and p2pdma.

Link: http://lkml.kernel.org/r/155727339156.292046.5432007428235387859.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: 41e94a8513 ("add devm_memremap_pages")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13 17:34:56 -10:00
Dan Williams
2e3f139e8e mm/devm_memremap_pages: introduce devm_memunmap_pages
Use the new devm_release_action() facility to allow
devm_memremap_pages_release() to be manually triggered.

Link: http://lkml.kernel.org/r/155727337088.292046.5774214552136776763.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13 17:34:56 -10:00
Paul E. McKenney
96050c68be rcu: Upgrade sync_exp_work_done() to smp_mb()
The sync_exp_work_done() function uses smp_mb__before_atomic(), but
there is no obvious atomic in the ensuing code.  The ordering is
absolutely required for grace periods to work correctly, so this
commit upgrades the smp_mb__before_atomic() to smp_mb().

Fixes: 6fba2b3767 ("rcu: Remove deprecated RCU debugfs tracing code")
Reported-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-06-13 15:33:19 -07:00
Joel Savitz
d477f8c202 cpuset: restore sanity to cpuset_cpus_allowed_fallback()
In the case that a process is constrained by taskset(1) (i.e.
sched_setaffinity(2)) to a subset of available cpus, and all of those are
subsequently offlined, the scheduler will set tsk->cpus_allowed to
the current value of task_cs(tsk)->effective_cpus.

This is done via a call to do_set_cpus_allowed() in the context of
cpuset_cpus_allowed_fallback() made by the scheduler when this case is
detected. This is the only call made to cpuset_cpus_allowed_fallback()
in the latest mainline kernel.

However, this is not sane behavior.

I will demonstrate this on a system running the latest upstream kernel
with the following initial configuration:

	# grep -i cpu /proc/$$/status
	Cpus_allowed:	ffffffff,fffffff
	Cpus_allowed_list:	0-63

(Where cpus 32-63 are provided via smt.)

If we limit our current shell process to cpu2 only and then offline it
and reonline it:

	# taskset -p 4 $$
	pid 2272's current affinity mask: ffffffffffffffff
	pid 2272's new affinity mask: 4

	# echo off > /sys/devices/system/cpu/cpu2/online
	# dmesg | tail -3
	[ 2195.866089] process 2272 (bash) no longer affine to cpu2
	[ 2195.872700] IRQ 114: no longer affine to CPU2
	[ 2195.879128] smpboot: CPU 2 is now offline

	# echo on > /sys/devices/system/cpu/cpu2/online
	# dmesg | tail -1
	[ 2617.043572] smpboot: Booting Node 0 Processor 2 APIC 0x4

We see that our current process now has an affinity mask containing
every cpu available on the system _except_ the one we originally
constrained it to:

	# grep -i cpu /proc/$$/status
	Cpus_allowed:   ffffffff,fffffffb
	Cpus_allowed_list:      0-1,3-63

This is not sane behavior, as the scheduler can now not only place the
process on previously forbidden cpus, it can't even schedule it on
the cpu it was originally constrained to!

Other cases result in even more exotic affinity masks. Take for instance
a process with an affinity mask containing only cpus provided by smt at
the moment that smt is toggled, in a configuration such as the following:

	# taskset -p f000000000 $$
	# grep -i cpu /proc/$$/status
	Cpus_allowed:	000000f0,00000000
	Cpus_allowed_list:	36-39

A double toggle of smt results in the following behavior:

	# echo off > /sys/devices/system/cpu/smt/control
	# echo on > /sys/devices/system/cpu/smt/control
	# grep -i cpus /proc/$$/status
	Cpus_allowed:	ffffff00,ffffffff
	Cpus_allowed_list:	0-31,40-63

This is even less sane than the previous case, as the new affinity mask
excludes all smt-provided cpus with ids less than those that were
previously in the affinity mask, as well as those that were actually in
the mask.

With this patch applied, both of these cases end in the following state:

	# grep -i cpu /proc/$$/status
	Cpus_allowed:	ffffffff,ffffffff
	Cpus_allowed_list:	0-63

The original policy is discarded. Though not ideal, it is the simplest way
to restore sanity to this fallback case without reinventing the cpuset
wheel that rolls down the kernel just fine in cgroup v2. A user who wishes
for the previous affinity mask to be restored in this fallback case can use
that mechanism instead.

This patch modifies scheduler behavior by instead resetting the mask to
task_cs(tsk)->cpus_allowed by default, and cpu_possible mask in legacy
mode. I tested the cases above on both modes.

Note that the scheduler uses this fallback mechanism if and only if
_every_ other valid avenue has been traveled, and it is the last resort
before calling BUG().

Suggested-by: Waiman Long <longman@redhat.com>
Suggested-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Joel Savitz <jsavitz@redhat.com>
Acked-by: Phil Auld <pauld@redhat.com>
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-06-12 11:00:08 -07:00
Valdis Klētnieks
aee450cbe4 bpf: silence warning messages in core
Compiling kernel/bpf/core.c with W=1 causes a flood of warnings:

kernel/bpf/core.c:1198:65: warning: initialized field overwritten [-Woverride-init]
 1198 | #define BPF_INSN_3_TBL(x, y, z) [BPF_##x | BPF_##y | BPF_##z] = true
      |                                                                 ^~~~
kernel/bpf/core.c:1087:2: note: in expansion of macro 'BPF_INSN_3_TBL'
 1087 |  INSN_3(ALU, ADD,  X),   \
      |  ^~~~~~
kernel/bpf/core.c:1202:3: note: in expansion of macro 'BPF_INSN_MAP'
 1202 |   BPF_INSN_MAP(BPF_INSN_2_TBL, BPF_INSN_3_TBL),
      |   ^~~~~~~~~~~~
kernel/bpf/core.c:1198:65: note: (near initialization for 'public_insntable[12]')
 1198 | #define BPF_INSN_3_TBL(x, y, z) [BPF_##x | BPF_##y | BPF_##z] = true
      |                                                                 ^~~~
kernel/bpf/core.c:1087:2: note: in expansion of macro 'BPF_INSN_3_TBL'
 1087 |  INSN_3(ALU, ADD,  X),   \
      |  ^~~~~~
kernel/bpf/core.c:1202:3: note: in expansion of macro 'BPF_INSN_MAP'
 1202 |   BPF_INSN_MAP(BPF_INSN_2_TBL, BPF_INSN_3_TBL),
      |   ^~~~~~~~~~~~

98 copies of the above.

The attached patch silences the warnings, because we *know* we're overwriting
the default initializer. That leaves bpf/core.c with only 6 other warnings,
which become more visible in comparison.

Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-12 16:51:02 +02:00
Pavankumar Kondeti
a66d955e91 cpu/hotplug: Abort disabling secondary CPUs if wakeup is pending
When "deep" suspend is enabled, all CPUs except the primary CPU are frozen
via CPU hotplug one by one. After all secondary CPUs are unplugged the
wakeup pending condition is evaluated and if pending the suspend operation
is aborted and the secondary CPUs are brought up again.

CPU hotplug is a slow operation, so it makes sense to check for wakeup
pending in the freezer loop before bringing down the next CPU. This
improves the system suspend abort latency significantly.

[ tglx: Massaged changelog and improved printk message ]

Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: iri Kosina <jkosina@suse.cz>
Cc: Mukesh Ojha <mojha@codeaurora.org>
Cc: linux-pm@vger.kernel.org
Link: https://lkml.kernel.org/r/1559536263-16472-1-git-send-email-pkondeti@codeaurora.org
2019-06-12 11:03:05 +02:00
Minwoo Im
0e51833042 genirq/affinity: Remove unused argument from [__]irq_build_affinity_masks()
The *affd argument is neither used in irq_build_affinity_masks() nor
__irq_build_affinity_masks(). Remove it.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Minwoo Im <minwoo.im@samsung.com>
Cc: linux-block@vger.kernel.org
Link: https://lkml.kernel.org/r/20190602112117.31839-1-minwoo.im.dev@gmail.com
2019-06-12 10:52:45 +02:00
Daniel Lezcano
699785f5d8 genirq/timings: Add selftest for next event computation
The circular buffers are now validated with selftests. The next interrupt
index algorithm which is the hardest part to validate needs extra coverage.

Add a selftest which uses the intervals stored in the arrays and insert all
the values except the last one. The next event computation must return the
same value as the last element which was not inserted.

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: andriy.shevchenko@linux.intel.com
Link: https://lkml.kernel.org/r/20190527205521.12091-9-daniel.lezcano@linaro.org
2019-06-12 10:47:05 +02:00
Daniel Lezcano
f52da98d90 genirq/timings: Add selftest for irqs circular buffer
After testing the per cpu interrupt circular event, make sure the per
interrupt circular buffer usage is correct.

Add tests to validate the interrupt circular buffer.

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: andriy.shevchenko@linux.intel.com
Link: https://lkml.kernel.org/r/20190527205521.12091-8-daniel.lezcano@linaro.org
2019-06-12 10:47:05 +02:00